The protein sequence file, rel150.fsa_aa.gz, dated Oct 14, 2005,
for the current GenBank release 150 contains several hundred
fasta entries that have fasta headers but no peptide residue data.
For example, see any of these entries in the rel150.fsa_aa.gz file:
>gi|263833|gb|AAB24990.1| No definition line found
>gi|263835|gb|AAB24992.1| No definition line found
>gi|263837|gb|AAB24994.1| No definition line found
>gi|263839|gb|AAB24996.1| No definition line found
>gi|263841|gb|AAB24998.1| No definition line found
When we attempted to process the file with existing GCG v10.3 fasta
file handling utilities (e.g., fastatogcg), those programs became
confused because they assume that there will be at least one line of
sequence data following each sequence header. We had to remove the
null sequence entries with a preprocessing step in order to complete
the installation of release 150.
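
A preprocessing step of this kind could look roughly like the Python
sketch below. It is an illustration only, not the script we actually
used: the function names drop_empty_fasta_records and filter_gzipped_fasta
are hypothetical, and it assumes that a "null" entry is a header line
followed by no lines, or only blank lines, before the next header.

```python
import gzip


def drop_empty_fasta_records(lines):
    """Yield the lines of every FASTA record that has sequence data.

    A record is a '>' header line plus the lines up to the next header.
    Records whose body is missing or contains only blank lines are dropped.
    Any lines that appear before the first header are also dropped.
    """
    header = None
    body = []
    for line in lines:
        if line.startswith(">"):
            # A new header closes the previous record; emit it if non-empty.
            if header is not None and any(s.strip() for s in body):
                yield header
                yield from body
            header = line
            body = []
        elif header is not None:
            body.append(line)
    # Emit the final record, which has no following header to close it.
    if header is not None and any(s.strip() for s in body):
        yield header
        yield from body


def filter_gzipped_fasta(src_path, dst_path):
    """Copy a gzipped FASTA file, leaving out records with no residues."""
    with gzip.open(src_path, "rt") as fin, gzip.open(dst_path, "wt") as fout:
        fout.writelines(drop_empty_fasta_records(fin))
```

It would be run on the release file before handing the result to
fastatogcg, for example as
filter_gzipped_fasta("rel150.fsa_aa.gz", "rel150.filtered.fsa_aa.gz"),
where the output file name is made up for this example.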
We have been processing each GenBank protein release in this way
for about seven years, and this is the first time we have seen this
problem.
Cordially,
Garry Martin
Mendel Biotechnology, Inc.