Dear genome informaticians and genomicists,
If your interests include methods to produce accurate, complete gene sets
for animals and plants, please see the gene sets listed below and the
software used to produce them. If any of you work in the area of assessing
gene set quality, please consider these as well worth testing. I would very
much appreciate further independent validation of what appear to be better
methods than are generally used today.
* Daphnia magna water flea gene set (hybrid):
* Ixodes scapularis deer tick gene set (mRNA assembled):
* Tribolium castaneum beetle gene set (mRNA assembled):
* Apis mellifera honey bee gene set (mRNA assembled):
These EvidentialGene gene sets are generally more complete and accurate
than others for the same species and related species, as measured by
orthology.
These methods of course work with vertebrates, plants and other
eukaryotes, including a hybrid gene set for killifish that appears more
accurate than related fish species; a hybrid set for cacao chocolate
tree that likewise is an accurate plant gene set.
Several of these are 'reference-free' gene set assemblies from mRNA-seq,
built without reference to a genome assembly and without training on or
mapping from other species' genes.
Some of these are hybrid, mRNA-assembled and genome-assembly-modelled
gene sets, where discrepancies between methods in gene loci need resolving,
as best possible from available gene evidence.
The Daphnia magna hybrid gene set is coupled with a soon-to-be-released
genome assembly by the Daphnia magna Genome Consortium, which dates from
2010 but is of better quality than some of today's short-read-only
assemblies.
The gene construction methods called EvidentialGene that I've worked out
over 5 years are now to the point of producing highly accurate gene
sets, when measured with objective biological criteria of orthology
completeness, that is, the presence and fullness of protein coding genes
that share orthology with related species.
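
As an illustration only, the two parts of that criterion (presence and
fullness) can be sketched in a few lines of Python. This is not the
EvidentialGene code; the record fields, the function name, and the use of
best alignment coverage per ortholog group are all assumptions made for
this sketch.

```python
from dataclasses import dataclass


@dataclass
class OrthologHit:
    """One gene-to-ortholog-group match (hypothetical record layout)."""
    group_id: str     # ortholog group identifier
    aligned_len: int  # residues of the group's reference protein that are aligned
    ref_len: int      # reference protein length for that group


def orthology_completeness(hits, all_group_ids):
    """Return (presence, fullness) for a gene set.

    presence: fraction of expected ortholog groups with at least one hit.
    fullness: mean best alignment coverage over the groups that were found.
    """
    groups = set(all_group_ids)
    best = {}
    for h in hits:
        if h.group_id not in groups or h.ref_len <= 0:
            continue  # ignore unexpected groups and unusable records
        cover = min(h.aligned_len / h.ref_len, 1.0)
        if cover > best.get(h.group_id, 0.0):
            best[h.group_id] = cover  # keep the fullest gene per group
    presence = len(best) / len(groups) if groups else 0.0
    fullness = sum(best.values()) / len(best) if best else 0.0
    return presence, fullness


if __name__ == "__main__":
    expected = ["OG1", "OG2", "OG3", "OG4"]
    hits = [
        OrthologHit("OG1", 90, 100),
        OrthologHit("OG1", 50, 100),   # a shorter fragment of the same group
        OrthologHit("OG2", 300, 400),
        OrthologHit("OG9", 10, 10),    # not in the expected set, ignored
    ]
    presence, fullness = orthology_completeness(hits, expected)
    print(f"presence={presence:.2f} fullness={fullness:.3f}")
```

Two gene sets for the same species could be compared by running both
against the same set of expected ortholog groups; how those groups and
alignments are obtained is left open here.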
There is a SourceForge copy of these gene construction methods:
https://sourceforge.net/projects/evidentialgene/ (still partial)
For primary methods, http://arthropods.eugenes.org/EvidentialGene/
Reference-free gene sets have values different from genome-based gene
sets; an important one is that no external artifacts or errors contribute
to these genes. Any protein orthology measured has not been influenced by
gene modelling using other species (with their artifacts), and genome
An existing dogma in genome projects, that quality of a gene set is
dependent on the quality of the genome assembly, is no longer accurate.
mRNA-seq assembly now does as well as or better than genome-gene modelling.
Hybrid gene sets can be the most accurate, but also are harder to build,
because errors from the different sources are hard to reconcile. In
general, consensus orthology among species provides the hardest
biological gene evidence, but exceptions from orthology are fairly
common, and too little consensus, or consensus among genes derived from the
same sources, allows for errors.
There remain problems with hybrid gene sets; an important one is that
major public data banks are not prepared to accept them. INSDC (DDBJ,
EMBL-EBI, NCBI) databases allow two forms of gene set deposition: only
genome-based, or only RNA-assembled. We also need gene-centric
databases that can retain the most accurate combination of gene
evidence, and include those species genes not located or poorly located
on (imperfect) genome assemblies.
-- Don Gilbert, 2015-June, gilbert at indiana edu