In an earlier posting Stephen asked:
"The announcement by Mark Gunnell of the Genpept rel.75 reminds me
of a question I have always had about Genpept. If the translation
of the ORF of a sequence in Genbank is within the description why
can't the search programs use this information to do protein
searches instead of having another database with all of its
redundant data (i.e. Genpept)? Does Genpept have more ORF
translations than exist in the nucleic acid database's description
section of each record?"
Mark A. Gunnell replied:
"No, GenPept is derived _entirely_ from the "/translation="
features in Genbank. As an administrator of sequence data, I know
I am always looking for a way to conserve disk space, so I am
hoping the next generation of searching software will be able to
use the raw data of Genbank more intelligently. Until then, since
I'm creating GenPept anyway, I thought I'd make it available to
anyone else who needs it."
Actually, in some cases the hypothetical amino acid sequence is supplied by
the authors. Usually this sequence is the same as that generated by programs
which translate the nucleotide sequence, but there are examples where the
two do not agree. Sometimes the authors incorrectly identify splice junction
sites at intron/exon boundaries when they submit their sequence, but they
get the reading frame right when they do the translation. In this case the
hypothetical amino acid sequence serves as a check on the accuracy of the
nucleotide sequence. Another common source of error is a mistake in
transcribing the nucleotide sequence into the file submitted to Los Alamos.
The amino acid sequence might be correct but not directly obtainable from
the nucleotide sequence.
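The check described above, comparing the author-supplied amino acid sequence
with a direct translation of the nucleotide sequence, might be sketched
roughly as follows. This is only an illustration: the function names are my
own, it assumes the standard genetic code and a coding sequence that is
already spliced and in frame, and it ignores alternative start codons.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = ambiguous codon
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def first_disagreement(supplied, dna):
    """Return the index of the first residue where the supplied sequence
    differs from the direct translation, or None if the two agree."""
    derived = translate(dna)
    for index, (a, b) in enumerate(zip(supplied, derived)):
        if a != b:
            return index
    if len(supplied) != len(derived):
        return min(len(supplied), len(derived))
    return None
```

A result other than None does not say which of the two sequences is wrong; it
only flags the record for a closer look.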
A more serious problem arises when curators recognize that the nucleotide
sequence contains probable errors that shift the reading frame. There are
several examples among the heat shock genes that I look at. In such cases
the probable errors could be noted in the annotations and a "corrected"
version of the hypothetical amino acid sequence could be included in
GenPept.
There are also examples of genes where normal translation requires
frameshifting or RNA editing. It is useful to have the predicted amino acid
sequence of the protein product in such cases, even though it cannot easily
be derived from the nucleotide sequence. In addition,
some functional proteins are produced as a result of cleavage of a larger
polypeptide, and these can also be included in GenPept (with appropriate
annotation).
Of course, all of these cases could be handled without GenPept, as long as
the hypothetical amino acid sequence is attached to the nucleotide sequence
record (including those cases where the hypothetical sequence is not the
same as a direct translation of the nucleotide sequence). Your software
would then have to be capable of searching all "/translation=" entries
in the larger database as well as all "/predicted protein sequence="
entries (or whatever designation is appropriate). Isn't it easier and
faster to search a separate database?
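To give an idea of what searching the "/translation=" entries directly would
involve, here is a minimal sketch that pulls the quoted value of a qualifier
out of the text of a flat-file record. The function name and the sample
record are made up for illustration, and the sketch assumes the value is
enclosed in double quotes and may wrap over several lines. Any second
qualifier name passed in (such as a "predicted protein" designation) is
hypothetical.

```python
import re


def extract_qualifier(record_text, qualifier="translation"):
    """Return every quoted value of /<qualifier>="..." found in record_text,
    with the line breaks and indentation of wrapped values removed."""
    pattern = re.compile("/" + re.escape(qualifier) + r'="([^"]*)"')
    return ["".join(match.group(1).split())
            for match in pattern.finditer(record_text)]


# A made-up fragment of a feature table, for illustration only.
sample = '''     CDS             1..27
                     /product="example protein"
                     /translation="MKVLA
                     GHT"
'''

proteins = extract_qualifier(sample)
```

Every search would have to make this pass over the whole nucleotide database
first, which is the extra work a separate protein database avoids.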
Laurence A. Moran (Larry)