In article <e818c15b.0307100440.58d9f00b at posting.google.com>,
Marcus Claesson <m.claesson at student.ucc.ie> wrote:
>One could do this by writing a program that blastn all the genes
>against the new sequence and then pick out the new coordinates for the
>nearly identical hits. Gene duplicates etc could make it a bit messy
For bacterial genomes, BLAST is probably fast enough. For mammalian
genomes it isn't (unless you have many hundreds of CPUs available, which
only a few sites do).
>Has anyone out there done this before, and do you have any tips?
>Would be extremely grateful if you could share them!
Most genome annotation projects have this problem, as you suggest.
I used to work at Incyte Genomics, and while there I employed someone
specifically to write code to solve this very problem. Unfortunately
the code was not made publicly available.
I can't speak for the particular problems of bacterial genomes, but in
the human genome we were hit by the usual issues: many features are very
difficult to remap automatically. For example, I remember trying to
remap an STS tiling path across the coding regions of one particular
gene from the original gene build (which was on HTG draft sequence) onto
the final sequence that came along later.
The problem was that this particular gene had about 12 alternative 5'
exons, which were on average about 98% sequence identical with each
other. This made remapping very difficult (as well as designing unique
STSs for that gene, of course!)
The second problem was speed. BLAST and other DP algorithms just
weren't fast enough. We did come up with an exact string matching
method that was much faster, but were usually left with about 20% of
features which the algorithm would flag up as needing human
intervention; typically this occurred when the new version of the
sequence contained indels relative to the original build sequence.
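To give a flavour of the exact-matching idea, here is a minimal Python sketch. It is not the Incyte code (which was never released). The function names, the status strings, and the forward-strand-only search are all assumptions made for illustration. A feature is remapped only if its sequence occurs exactly once in the new build; anything else is flagged for a human to look at.

```python
def find_all(haystack, needle):
    """Return every 0-based start of needle in haystack (overlaps allowed)."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


def remap_exact(old_seq, start, end, new_seq):
    """Remap the feature old_seq[start:end] onto new_seq by exact match.

    Hypothetical helper: searches the forward strand only and returns
    (status, new_start, new_end), with half-open 0-based coordinates.
    """
    feature = old_seq[start:end]
    hits = find_all(new_seq, feature)
    if len(hits) == 1:
        # Exactly one occurrence: safe to transfer the coordinates.
        return ("remapped", hits[0], hits[0] + len(feature))
    if not hits:
        # Typically an indel or base change in the new build.
        return ("review: no exact match", None, None)
    # Duplicated sequence, e.g. near-identical alternative exons.
    return ("review: multiple matches", None, None)
```

Called as remap_exact(old_seq, 3, 9, new_seq), it either hands back the new coordinates or a "review" status. The "no exact match" case is where indels in the new build end up, and the "multiple matches" case is where duplicated sequence ends up.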
The smaller the feature, the harder it is to remap, because a short
sequence is more likely to match elsewhere in the genome by chance. SNPs
were the trickiest, since you then have to decide how much flanking
sequence to use to help the mapping process. The more you use, the more
accurate the mapping gets, but the slower it is to run.
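The flanking-sequence trade-off can be sketched in a few lines of Python. Again, this is only an illustration under stated assumptions: min_unique_flank is a made-up name, matching is exact and forward-strand only, and the window includes the SNP base itself (so it assumes the reference allele is unchanged in the new build). The flank is widened one base at a time until the window around the SNP occurs exactly once in the new sequence.

```python
def count_hits(haystack, needle):
    """Return all 0-based starts of needle in haystack (overlaps allowed)."""
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


def min_unique_flank(old_seq, snp_pos, new_seq, max_flank):
    """Find the smallest flank that places a SNP uniquely in new_seq.

    Hypothetical helper: returns (flank, new_snp_pos), or None if no
    flank up to max_flank gives exactly one exact match.
    """
    for flank in range(1, max_flank + 1):
        left = max(0, snp_pos - flank)          # clip at sequence start
        window = old_seq[left:snp_pos + flank + 1]
        hits = count_hits(new_seq, window)
        if len(hits) == 1:
            # Offset of the SNP inside the window gives its new position.
            return (flank, hits[0] + (snp_pos - left))
    return None
```

Each extra base of flank costs another scan of the new sequence, which is the speed penalty in miniature. A None result would go on the pile for human intervention.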
Many annotation projects currently seem to prefer re-running their
automated annotation pipelines to remapping their existing
annotation.
You may consider this to be burying one's head in the sand, and I
couldn't possibly comment. :-)