>>One could do this by writing a program that runs blastn for all the genes
>>against the new sequence and then picks out the new coordinates for the
>>nearly identical hits. Gene duplicates etc. could make it a bit messy,
>>though.
> For bacterial genomes, BLAST is probably fast enough. For mammalian
> genomes it isn't (unless you have many hundreds of CPUs available, which
> only a few sites do).
How many times per day do you upgrade your sequence? While it is
definitely a good idea to throw a few CPUs at such a job, I don't see why
you would need hundreds of them. A BLAST search of all predicted mouse
genes against a database of about the same size runs for only a few days
on something like 10 CPUs.
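For what it's worth, the blastn approach sketched above might look roughly
like the following. This is only an illustrative sketch: the tabular output
format (blastn -outfmt 6), the identity/coverage thresholds, and the function
names are my assumptions, not anything from an existing pipeline.

```python
# Sketch: remap gene coordinates from blastn tabular output.
# Assumed column order (-outfmt 6): qseqid sseqid pident length mismatch
# gapopen qstart qend sstart send evalue bitscore. Thresholds are illustrative.

def remap_genes(blast_lines, min_identity=99.0, min_coverage=0.95,
                gene_lengths=None):
    """Pick new coordinates for genes with a single near-identical hit.

    Genes with zero or multiple qualifying hits (e.g. duplicates) are
    flagged for manual review instead of being remapped.
    """
    hits = {}
    for line in blast_lines:
        f = line.rstrip("\n").split("\t")
        qid, sid = f[0], f[1]
        pident, length = float(f[2]), int(f[3])
        sstart, send = int(f[8]), int(f[9])
        if pident < min_identity:
            continue  # not a nearly identical hit
        if gene_lengths and length < min_coverage * gene_lengths[qid]:
            continue  # hit covers too little of the gene
        hits.setdefault(qid, []).append((sid, min(sstart, send),
                                         max(sstart, send)))

    remapped, ambiguous = {}, []
    for gene, locs in hits.items():
        if len(locs) == 1:
            remapped[gene] = locs[0]
        else:
            ambiguous.append(gene)  # gene duplicates: leave to a human
    return remapped, ambiguous
```

The messy cases (duplicates, partial hits) end up in the `ambiguous` list
rather than being guessed at, which is usually what you want.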
>>Has anyone out there done this before, and do you have any tips?
>>Would be extremely grateful if you could share them!
> Most genome annotation projects have this problem, as you suggest.
> I used to work at Incyte Genomics, and while there I employed someone
> specifically to write code to solve this very problem. Unfortunately
> the code was not made publicly available.
> I can't speak for the particular problems of bacterial genomes, but in
> the human genome we were hit by the usual issues; many features are very
> difficult to remap automatically. For example, I remember trying to
> remap an STS tiling path across the coding regions of one particular
> gene from the original gene build (which was on HTG draft sequence) onto
> the final sequence that came along later.
> The problem was that this particular gene had about 12 alternative 5'
> exons, which were on average about 98% identical in sequence to each
> other. That made remapping very difficult (as well as designing unique
> STSs for that gene, of course!)
> The second problem was speed. BLAST and other DP algorithms just
> weren't fast enough. We did come up with an exact string matching
> method that was much faster, but were usually left with about 20% of
> features which the algorithm would flag up as needing human
> intervention; typically this occurred when the new version of the
> sequence contained indels relative to the original build sequence.
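The exact-match idea described above can be sketched in a few lines: look up
each feature's original sequence verbatim in the new assembly, and flag
anything absent (indels) or non-unique for human intervention. Everything
here (function and variable names, the use of a plain substring search) is
my own illustration, not the actual Incyte code.

```python
# Sketch of exact string matching for feature remapping. A feature whose
# original sequence is missing from the new build (indels) or occurs more
# than once (repeats) is flagged for human review rather than remapped.

def exact_remap(features, new_seq):
    """features: dict mapping feature name -> original feature sequence."""
    remapped, needs_review = {}, []
    for name, seq in features.items():
        first = new_seq.find(seq)
        if first == -1:
            needs_review.append(name)        # indels broke the exact match
        elif new_seq.find(seq, first + 1) != -1:
            needs_review.append(name)        # repeated sequence: ambiguous
        else:
            remapped[name] = (first, first + len(seq))
    return remapped, needs_review
```

A real implementation would of course use a proper index (suffix array,
hashing) rather than repeated linear scans, but the flagging logic is the
point here.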
> The smaller the feature, the harder it is to remap, because its
> sequence is more likely to occur elsewhere by chance. SNPs were the
> trickiest, of course, since you then have to decide how much flanking
> sequence to use to help the mapping process. The more you use, the more
> accurate the mapping gets, but the slower it runs.
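The flanking-sequence trade-off for SNPs can be made concrete with a toy
sketch: build a probe from N bases either side of the SNP and search for it
in the new assembly. A longer flank makes spurious matches less likely but
the search slower. All names and the naive search strategy below are
assumptions for illustration only.

```python
# Sketch: remap a single SNP by searching for its flanking context.
# Returns the SNP's position in the new sequence, or None if the probe
# is absent or ambiguous (i.e. the case needing human review).

def remap_snp(old_seq, snp_pos, new_seq, flank=30):
    start = max(0, snp_pos - flank)
    probe = old_seq[start:snp_pos + flank + 1]
    offset = snp_pos - start                 # SNP position within the probe
    first = new_seq.find(probe)
    if first == -1 or new_seq.find(probe, first + 1) != -1:
        return None                          # absent or ambiguous: flag it
    return first + offset
```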
> Many annotation projects currently seem to prefer re-running their
> automated annotation pipelines to trying to remap their existing
> annotation.
> You may consider this to be burying one's head in the sand, and I
> couldn't possibly comment. :-)
> Tim
> ---
--
Dr. Philipp Pagel Tel. +49-89-3187-3675
Institute for Bioinformatics / MIPS Fax. +49-89-3187-3585
GSF - National Research Center for Environment and Health
Ingolstaedter Landstrasse 1
85764 Neuherberg, Germany
---