Dear genomics folks,
Re: EvidentialGene project at http://eugenes.org/EvidentialGene/
EvidentialGene achieves high accuracy in gene set construction, compared
with other gene informatics methods, for fish, plants, and various arthropods.
Recently I've generated gene sets for two Anopheles mosquito species
with Evigene mRNA assembly, and they surpass the recently published* gene
sets from the VectorBase project in orthology completeness, using the same
RNA-seq data that project reports.
The pairing of the MAKER and Trinity software pipelines is now a common recipe
for genome biologists, who, I suspect, do not realize that greater accuracy is
possible and not much harder to obtain. In all cases I have tested, with
fishes, plants, and insects, Evigene produces notably more accurate and
ortho-complete gene sets. See below for mosquitoes; results for fishes are at
http://eugenes.org/EvidentialGene/vertebrates/
The EvidentialGene gene reconstruction methods have been used for
several animal and plant genome projects, where they produce gene sets
more accurate than those of peer annotation methods. There are basic
reasons these methods have high accuracy: careful, complete assembly of
the now highly accurate RNA sequences, and extensive use of protein
orthology testing to validate (accept or reject) alternate gene
constructions. Assembly of RNA sequences is similar to, but simpler than,
assembly of genomic DNA: RNA-seq read sizes are near to gene transcript
sizes, and there are no repetitive transposons or problematic intron breaks.
Accurate RNA assembly solves problems that exist for traditional genome
gene-modelling: artifacts from draft genome assemblies, from prediction
algorithms that are not gene-level accurate, and from gene models
contributed by related species.
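As a rough illustration of the orthology-testing idea above (not Evigene's
actual code), candidate transcripts pooled from several assemblers can be
reduced to one main model per locus by keeping the candidate that aligns best
to a reference protein, with CDS length as a tie-breaker. The Candidate
fields, the tie-break rule, and the assembler labels are all assumptions made
for this sketch.

```python
from typing import Dict, Iterable, NamedTuple


class Candidate(NamedTuple):
    """One assembled transcript (hypothetical record layout)."""
    transcript_id: str
    locus: str            # locus the transcript was assigned to
    assembler: str        # which assembler produced it (label only)
    ref_align_pct: float  # percent alignment to best reference protein
    cds_length: int       # coding sequence length in bases


def pick_best_per_locus(candidates: Iterable[Candidate]) -> Dict[str, Candidate]:
    """Keep, for each locus, the candidate with the highest reference
    alignment; ties are broken in favour of the longer CDS (an assumption)."""
    best: Dict[str, Candidate] = {}
    for cand in candidates:
        current = best.get(cand.locus)
        key = (cand.ref_align_pct, cand.cds_length)
        if current is None or key > (current.ref_align_pct, current.cds_length):
            best[cand.locus] = cand
    return best


if __name__ == "__main__":
    pooled = [
        Candidate("t1", "locusA", "asmA", 80.0, 900),
        Candidate("t2", "locusA", "asmB", 92.5, 850),
        Candidate("t3", "locusB", "asmA", 70.0, 600),
        Candidate("t4", "locusB", "asmC", 70.0, 750),
    ]
    for locus, cand in sorted(pick_best_per_locus(pooled).items()):
        print(locus, cand.transcript_id, cand.assembler)
```

Pooling over-assembled candidates and then filtering by homology is only one
simple way to picture the reduction step; the real classifier also separates
alternate transcripts from paralogs, which this sketch does not attempt.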
Improvements to the Evigene locus classifier, including a chromosome-assembly
map classifier, are producing better discrimination of alternate
transcripts versus paralog genes. I hope to offer an update in the coming
months that (a) improves gene locus classification (removing some
duplication, improving alternate transcript classification), and (b)
offers an initial classifier of mRNA assemblies by chromosome assembly
(i.e. genome mapping of transcript assemblies).
If you are interested in accurate animal and plant gene-ome construction from
RNA sequences, with or without a chromosome assembly, this project may be
relevant to you. I would like to work with a few collaborators who have genome
+ transcriptome data sets plus genome-modelled gene sets (e.g. from pipelines
such as MAKER, NCBI, Augustus, EvidenceModeller, etc.) to compare with
EvidentialGene results.
Don Gilbert, 2016.feb
-----------------
* Evigene vs MAKER gene set of doi: 10.1126/science.1258522
Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes
Protein homology to reference genes, 2 gene sets for 2 species of
Anopheles mosquito. For both species, published RNA-seq was assembled
with 4 gene assemblers, then reduced to locus/alternate gene sets with
Evigene (roughly 3 days' work). The RNA data sets here were about half
the recommended size, so some genes did not assemble properly.
With 100+ M read pairs instead of the 50 M provided, the completeness of
Evigene sets would be improved.
Highly conserved REFERENCE (BUSCO drosmel, nr=3038)
          Anopheles-funestus     Anopheles-albimanus
          Evigene    MAKER       Evigene    MAKER
  found   99.4%      97.7%       98.3%      97.3%
  align   87.3%      83.2%       87.3%      83.2%
  best    30%        11.8%       26.5%      12.6%
  equal        58%                    61%
Drosophila mel. model REFERENCE (nr=10902)
          Anopheles-funestus     Anopheles-albimanus
          Evigene    MAKER       Evigene    MAKER
  found   98.4%      96.1%       95.8%      95.8%
  align   87.3%      83.2%       77.5%      76.8%
  best    31.6%      15.1%       28.6%      18.6%
  equal        58%                    53%
Anopheles gambiae REFERENCE (tr total=14870, locus total=12994)
          Anopheles-funestus     Anopheles-albimanus
          Evigene    MAKER       Evigene    MAKER
  found   97.9%      96.6%       94.7%      96.3%
  align   93.1%      89.3%       86.4%      87.5%
  best    33.9%      16%         30.7%      21.2%
  equal        50%                    48%
---------------------------------------------------
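The row labels in the tables above are not defined in this note. As a rough
sketch only, assuming "found" is the percent of reference genes with any
match, "align" is the mean alignment percent over matched genes, "best" is
the percent of reference genes where one gene set aligns strictly better than
the other, and "equal" is the percent where the two tie, the rows could be
tallied as below. The function name and input layout are hypothetical and are
not part of Evigene.

```python
from typing import Dict, Iterable


def compare_gene_sets(ref_ids: Iterable[str],
                      score_a: Dict[str, float],
                      score_b: Dict[str, float]) -> Dict[str, float]:
    """Tally found/align/best/equal for two gene sets against one reference.

    score_a / score_b map a reference gene id to the alignment percent of
    its best match in gene set A / B; a missing id means no match was found.
    """
    refs = list(ref_ids)
    n = len(refs)
    found_a = found_b = best_a = best_b = equal = 0
    sum_a = sum_b = 0.0
    for ref in refs:
        a = score_a.get(ref, 0.0)
        b = score_b.get(ref, 0.0)
        if a > 0:
            found_a += 1
            sum_a += a
        if b > 0:
            found_b += 1
            sum_b += b
        if a > b:
            best_a += 1
        elif b > a:
            best_b += 1
        elif a > 0:  # same non-zero score in both sets
            equal += 1

    def pct(count: int) -> float:
        return 100.0 * count / n if n else 0.0

    return {
        "found_a": pct(found_a),
        "found_b": pct(found_b),
        "align_a": sum_a / found_a if found_a else 0.0,
        "align_b": sum_b / found_b if found_b else 0.0,
        "best_a": pct(best_a),
        "best_b": pct(best_b),
        "equal": pct(equal),
    }


if __name__ == "__main__":
    refs = ["g1", "g2", "g3", "g4"]
    set_a = {"g1": 90.0, "g2": 80.0, "g3": 70.0}
    set_b = {"g1": 90.0, "g2": 60.0, "g4": 50.0}
    for name, value in compare_gene_sets(refs, set_a, set_b).items():
        print(f"{name:8s} {value:5.1f}%")
```

Under these assumed definitions, the two "best" values and "equal" for a
species add up to roughly the share of reference genes found in at least one
gene set.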