#!/usr/bin/env perl

use strict;
use warnings;
use FindBin qw($Bin);
use lib "$Bin";
use AsmHub;
use File::Basename;

my $argc = scalar(@ARGV);

if ($argc != 3) {
  printf STDERR "usage: asmHubAugustusGene.pl asmId asmId.names.tab bbi/asmId\n";
  printf STDERR "where asmId is the assembly identifier,\n";
  printf STDERR "and asmId.names.tab is naming file for this assembly,\n";
  printf STDERR "and bbi/asmId is the path prefix to .augustus.bb.\n";
  exit 255;
}

my $asmId = shift;
my @parts = split('_', $asmId, 3);
my $accession = "$parts[0]_$parts[1]";
my $namesFile = shift;
my $bbiPrefix = shift;
my $augustusBbi = "$bbiPrefix.augustus.bb";
my $asmIdPath = &AsmHub::asmIdToPath($asmId);
my $downloadGtf = "https://hgdownload.soe.ucsc.edu/hubs/$asmIdPath/$accession/genes/$asmId.augustus.gtf.gz";

if ( ! -s $augustusBbi ) {
  printf STDERR "ERROR: can not find augustus bbi file:\n\t'%s'\n", $augustusBbi;
  exit 255;
}

my $em = "";
my $noEm = "";
my $assemblyDate = `grep -v "^#" $namesFile | cut -f9`;
chomp $assemblyDate;
my $ncbiAssemblyId = `grep -v "^#" $namesFile | cut -f10`;
chomp $ncbiAssemblyId;
my $organism = `grep -v "^#" $namesFile | cut -f5`;
chomp $organism;
my $geneCount = `bigBedInfo $augustusBbi | egrep "itemCount:|basesCovered:" | xargs echo | sed -e 's/itemCount/Gene count/; s/ basesCovered/; Bases covered/;'`;
chomp $geneCount;

print <<_EOF_

Description

This track shows ab initio predictions from the program AUGUSTUS (version 3.1) for the $assemblyDate $em${organism}$noEm/$asmId genome assembly.

The predictions are based on the genome sequence alone.

$geneCount

Data Access

Download the $asmId.augustus.gtf.gz GTF file: $downloadGtf

Methods

Statistical signal models were built for splice sites, branch-point patterns, translation start sites, and the poly-A signal. Furthermore, models were built for the sequence content of protein-coding and non-coding regions as well as for the length distributions of different exon and intron types. Detailed descriptions of most of these different models can be found in Mario Stanke's dissertation. This track shows the most likely gene structure according to a Semi-Markov Conditional Random Field model. Alternative splicing transcripts were obtained with a sampling algorithm (--alternatives-from-sampling=true --sample=100 --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=3 --temperature=2).

The different models used by Augustus were trained on a number of different species-specific gene sets, which included 1000-2000 training gene structures. The --species option allows one to choose the species used for training the models. Different training species were used for the --species option when generating these predictions for different groups of assemblies.

Assembly Group                    Training Species
Fish                              zebrafish
Birds                             chicken
Human and all other vertebrates   human
Nematodes                         caenorhabditis
Drosophila                        fly
A. mellifera                      honeybee1
A. gambiae                        culex
S. cerevisiae                     saccharomyces

This table lists the training species used for each group of assemblies. When available, the most closely related training species was used.

Credits

Thanks to the Stanke lab for providing the AUGUSTUS program. The training for the chicken version was done by Stefanie König and the training for the human and zebrafish versions was done by Mario Stanke.

References

Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008 Mar 1;24(5):637-44. PMID: 18218656

Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003 Oct;19 Suppl 2:ii215-25. PMID: 14534192

_EOF_
;
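
=pod

The Methods text above names the AUGUSTUS sampling options and the training species chosen for each assembly group. A minimal sketch of how those two pieces might be combined into one command line follows. It is not part of this script's logic: the hash %trainingSpecies, the helper augustusCommand() and the input file name chr1.fa are hypothetical names introduced for illustration, and placing the FASTA file as the last argument is an assumption about how AUGUSTUS is invoked.

```perl
use strict;
use warnings;

# Training species per assembly group, copied from the table in the
# Methods text.  The hash name and the use of group labels as keys are
# illustrative assumptions.
my %trainingSpecies = (
  'Fish'                            => 'zebrafish',
  'Birds'                           => 'chicken',
  'Human and all other vertebrates' => 'human',
  'Nematodes'                       => 'caenorhabditis',
  'Drosophila'                      => 'fly',
  'A. mellifera'                    => 'honeybee1',
  'A. gambiae'                      => 'culex',
  'S. cerevisiae'                   => 'saccharomyces',
);

# Hypothetical helper: return the argument list for one AUGUSTUS run.
# The option values are the ones quoted in the Methods text; the
# trailing FASTA file argument is an assumption.
sub augustusCommand {
  my ($group, $fastaFile) = @_;
  my $species = $trainingSpecies{$group};
  die "no training species for group '$group'\n" if (! defined($species));
  return ("augustus", "--species=$species",
    "--alternatives-from-sampling=true", "--sample=100",
    "--minexonintronprob=0.2", "--minmeanexonintronprob=0.5",
    "--maxtracks=3", "--temperature=2", $fastaFile);
}

# Build (but do not run) the command for a bird assembly.
print join(" ", augustusCommand("Birds", "chr1.fa")), "\n";
```

The helper returns a list rather than a single string so that a caller could pass it to system() without going through shell quoting.

=cut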
