
GenPept, do we need it?

L.A. Moran lamoran at gpu.utcc.utoronto.ca
Mon Feb 15 13:06:57 EST 1993

In an earlier posting Stephen asked:

     "The announcement by Mark Gunnell of the Genpept rel.75 reminds me 
      of a question I have always had about Genpept. If the translation 
      of the ORF of a sequence in Genbank is within the description why 
      can't the search programs use this information to do protein 
      searches instead of having another database with all of its 
      redundant data (i.e. Genpept)? Does Genpept have more ORF 
      translations than exist in the nucleic acid database's description
      section of each record?"

Mark A. Gunnell replied:
     "No, GenPept is derived _entirely_ from the "/translation=" 
      features in Genbank. As an administrator of sequence data, I know 
      I am always looking for a way to conserve disk space, so I am 
      hoping the next generation of searching software will be able to 
      use the raw data of Genbank more intelligently. Until then, since
      I'm creating GenPept anyway, I thought I'd make it available to 
      anyone else who needs it."

Actually, in some cases the hypothetical amino acid sequence is supplied by
the authors. Usually this sequence is the same as the one generated by
programs that translate the nucleotide sequence, but there are examples where
the two do not agree. Sometimes the authors incorrectly identify splice
junction sites at intron/exon boundaries when they submit their sequence, but
they get the reading frame right when they do the translation. In this case
the hypothetical amino acid sequence serves as a check on the accuracy of the
nucleotide sequence. Another common error is a mistake made in copying the
nucleotide sequence into the file submitted to Los Alamos. In that case the
amino acid sequence might be correct but not directly obtainable from the
nucleotide sequence ORF.
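
The kind of check described above can be sketched in a few lines of Python.
This is only an illustration, not the procedure used by GenBank or GenPept:
the function names translate() and first_mismatch() are hypothetical, the
standard genetic code is assumed, and the coding sequence is assumed to be
already spliced and in frame.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nucleotides):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def first_mismatch(cds, supplied):
    """Return the 1-based residue position where the direct translation of
    cds first differs from the author-supplied sequence, or None if they agree."""
    derived = translate(cds)
    for pos, (a, b) in enumerate(zip(derived, supplied), start=1):
        if a != b:
            return pos
    if len(derived) != len(supplied):
        return min(len(derived), len(supplied)) + 1
    return None
```

A result of None means the supplied sequence matches the direct translation.
Any other value points at the first residue where the record deserves a
closer look, whether the fault lies in the nucleotide sequence, the splice
sites, or the supplied translation.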

A more serious problem arises when curators recognize that the nucleotide
sequence contains probable errors that shift the reading frame. There are
several examples among the heat shock genes that I look at. In such cases
the probable errors could be noted in the annotations and a "corrected"
version of the hypothetical amino acid sequence could be included in

There are also examples of genes where normal translation requires frame-
shifting or RNA editing. It is useful to have the predicted amino acid
sequence of the protein product in such cases in spite of the fact that
it cannot easily be derived from the nucleotide sequence. In addition,
some functional proteins are produced as a result of cleavage of a larger
polypeptide and these can also be included in GenPept (with appropriate

Of course, all of these problems could be handled without GenPept, as long
as the hypothetical amino acid sequence is attached to the nucleotide
sequence record (including those cases where the hypothetical sequence is
not the same as a direct translation of the nucleotide sequence). In that
case your software must be capable of searching all "/translation=" entries
in the larger database as well as all "/predicted protein sequence="
entries (or whatever designation is appropriate). Isn't it easier and
faster to search a separate database?
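
To give a sense of the extra work involved, here is a minimal sketch of
pulling every "/translation=" qualifier out of flat-file text and writing the
results as FASTA. It assumes each qualifier value is a double-quoted string
that may wrap across lines and that records end with a "//" line. The names
extract_translations() and to_fasta() are hypothetical, and the
"/predicted protein sequence=" designation mentioned above is not handled.

```python
import re

# A quoted /translation= value; [^"]* also spans wrapped continuation lines.
TRANSLATION = re.compile(r'/translation="([^"]*)"')


def extract_translations(flatfile_text):
    """Return (name, protein) pairs, one per /translation= qualifier."""
    entries = []
    for record in flatfile_text.split("\n//"):
        locus = re.search(r"^LOCUS\s+(\S+)", record, re.MULTILINE)
        if locus is None:
            continue  # trailing text after the last record terminator
        for n, match in enumerate(TRANSLATION.finditer(record), start=1):
            protein = re.sub(r"\s+", "", match.group(1))
            entries.append((f"{locus.group(1)}_{n}", protein))
    return entries


def to_fasta(entries, width=60):
    """Format (name, protein) pairs as FASTA text, wrapped at width residues."""
    lines = []
    for name, protein in entries:
        lines.append(f">{name}")
        lines.extend(protein[i:i + width] for i in range(0, len(protein), width))
    return "\n".join(lines) + "\n"
```

A search program would have to repeat this scan over the whole nucleotide
database on every run, or cache the output, which amounts to keeping a
separate protein file anyway.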

Laurence A. Moran (Larry)
