Dear genome informaticians and biologists,
Accurate gene set reconstruction is now a solvable information-engineering
problem, one of collecting gene evidence and piecing it together, with
validations, rather than the uncertain gene predictions of twenty years
back. Yet the problem is not fully solved for many animal and plant gene
sets now in use, including those of model species with extensive effort.
This is the usual result I and others get for coding genes pieced together
from RNA evidence, then validated with reference/related species proteins.
SRA2Genes collects several EvidentialGene methods into a complete,
automated gene set reconstruction pipeline. It fetches public RNA-seq data
from NCBI SRA, over-assembles it into many millions of gene models by varying
assembly methods and data slices, then reduces this over-assembly to its
most accurate, non-redundant coding gene loci and alternates, followed by
annotation with reference species genes. Included are checks for
contaminants, and formatting of gene sequences for public database
The Evigene software package, including the omnibus evgpipe_sra2genes.pl, is at
evigene18jan01.tar (draft2 of evgpipe_sra2genes)
You should not expect this early draft of SRA2Genes to work fully for you
yet. If you are interested in testing it, and have some experience
in RNA assembly, I am looking for feedback on how to generalize this for
A preliminary zebrafish gene set from this Evigene SRA2Genes is at
The resulting gene set is more accurate and complete than the NCBI and
Ensembl gene sets for zebrafish, both in recovering gene evidence of homology
to related fish and vertebrate conserved proteins, and in recovering
expressed introns.
Another SRA2Genes result, for green sea urchin, is at
Likewise, recent EvidentialGene reconstructions of plant and animal genes
for Arabidopsis, Zea mays, whitefly and water flea do recover gene
evidence more accurately than popular methods including MAKER, AUGUSTUS and
related genome-gene modeling, Trinity gene assemblies, PacBio "no-assembly"
gene assemblies, as well as NCBI EGAP and Ensembl/Gramene gene set
Evigene SRA2Genes is built from components that work well on the cluster
systems that I use, courtesy of NSF-XSEDE cyberinfrastructure. It
should work for others, once the needed software components are linked (RNA
assemblers velvet/oases, idba_tran, soap, trinity, NCBI blast, vecscreen,
exonerate fastanrdb, etc.). This pipeline writes cluster shell scripts to
be run asynchronously.
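
For readers unfamiliar with this style of pipeline, here is a minimal
sketch in Python of the general idea of over-assembly by a grid of methods
and data slices, with one shell script written per job for later,
independent submission. It is not code from Evigene. The function names,
the k-mer values, the slice labels and the script layout are all
hypothetical, and each script body is only a placeholder where a real
assembler command would go. Not every assembler accepts every k-mer size,
so a real grid would be tailored per assembler.

```python
import itertools
from pathlib import Path

# Assembler labels come from the list above; the k-mer sizes and
# data-slice names are hypothetical values chosen for illustration.
ASSEMBLERS = ["velvet_oases", "idba_tran", "soap", "trinity"]
KMERS = [25, 35, 55]
DATA_SLICES = ["slice1", "slice2"]


def plan_jobs(assemblers, kmers, slices):
    """Return one job description per (assembler, kmer, slice) combination."""
    return [
        {"assembler": a, "kmer": k, "slice": s}
        for a, k, s in itertools.product(assemblers, kmers, slices)
    ]


def write_job_script(job, outdir):
    """Write one stand-alone shell script for a job and return its path."""
    name = f"asm_{job['assembler']}_k{job['kmer']}_{job['slice']}.sh"
    path = Path(outdir) / name
    lines = [
        "#!/bin/bash",
        f"# assembler={job['assembler']} kmer={job['kmer']} slice={job['slice']}",
        "# Placeholder only: a real assembler command line would go here.",
        f'echo "would run {job["assembler"]} k={job["kmer"]} on {job["slice"]}"',
    ]
    path.write_text("\n".join(lines) + "\n")
    path.chmod(0o755)  # make the script executable
    return path


def write_all_scripts(outdir):
    """Create outdir if needed and write a script for every planned job."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    jobs = plan_jobs(ASSEMBLERS, KMERS, DATA_SLICES)
    return [write_job_script(job, outdir) for job in jobs]
```

Calling write_all_scripts("asm_jobs") would write 24 small scripts (4
assemblers x 3 k-mer sizes x 2 slices). Because each script is
self-contained, they can be handed to a cluster scheduler in any order and
their outputs collected afterward for the reduction step.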
Who should consider EvidentialGene for gene reconstruction?
* genomicists who want accurate, complete and objectively reconstructed genes,
including those of you who may not believe these results, but are
interested enough to check my claims.
* model and well-supported genome projects, where curators can use these
to improve precision of high value gene information.
* new species genomes, for use as a primary gene set, including alternate
transcripts, and/or to assess gene predictions and chromosome assemblies
for accuracy.
* population-level gene set comparisons within a species
* gene/genome improvement projects, to add alternate transcripts,
un-discovered and fragmented gene models.
* transcriptome and expression projects for more accurate genes.
RNA-only reconstruction provides independent gene evidence, free of
errors and biases from chromosome assemblies and other species gene sets.
The well-known ortholog genes are reconstructed well, and the harder gene
problems of alternate transcripts, paralogs, and complex structured genes
are also usually more complete with Evigene methods.
This methodology is highly automatable, and can deal with BIG DATA, but
needs improvements. Genes built with Evigene by independent authors
include a range of plants and animals, and several of these papers provide
independent reviews of Evigene versus other methods.
-- Don Gilbert
gilbertd @ indiana.edu