Doug Eernisse DEernisse at fullerton.edu
Tue Mar 21 15:05:36 EST 1995

> readseq drops most sequence documentation because the task
> of interconverting it among sequence formats is much larger
> (and less important) than of converting sequence data.  I have 
> it as an eventual goal to teach readseq to do this, but there 
> is no time estimate for this feature.  Always keep your original
> sequence file around if you need the documentation.

I agree with Don about keeping your original file around. In my
GenBank/EMBL format conversion, I have faced an ongoing challenge
to not only extract the sequence data (solved, more or less),
but also to extract the features into a consistent spreadsheet
format. My feature extraction routines are intended for construction
of dynamic graphical gene maps that can be created and displayed
within the same utility that does the extraction (DNA Translator
HyperCard stack). The extraction routine is not intended for
import to GDE, but perhaps it could serve this purpose. It would
certainly be easier than scrolling through the original documentation,
which I have done (and am still doing) my share of, although the
extracted output would probably still need to be checked. As an example of how
GenBank seems to be a moving target, I just added 14 complete metazoan
mtDNA gene maps to DNA Translator (now there are 23 total) and I
noticed that different files retrieved from GenBank put their
"gene" names in "gene", "product" or "notes" descriptors alternately.
This can be quite frustrating to try to support. I have started
to add the Euglena gracilis cpDNA map to make it comparable to the 
other 3 cpDNA maps, but this may take some help from those who know more 
about cpDNA genomes (any volunteers?). It would also be fun to get some
comparable plasmid or virus gene map cards supported. One neat feature
of the gene maps is the ability to extract any comparable (e.g., metazoan
mtDNA) gene or inferred amino acid sequences directly to the associated
HyperCard manual alignment editor stack called "Aligner." This was
described in a 1992 CABIOS article, but having 23 metazoan mtDNAs now
makes the phylogenetic question more interesting than fly-urchins-verts.
To give you an example of the matrix format, here is a small
portion of the E. gracilis matrix (importable to Excel as "CSV" format):

C3a,2.575923,2.74495,8,+,3688,3930,CDS,3a,PSII D2-polypeptide-a
C4b,3.512558,3.534909,9,+,5029,5061,CDS,4b,PSII D2-polypeptide-b
C5c,3.789847,3.850613,10,+,5426,5513,CDS,5c,PSII D2-polypeptide-c
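
For anyone reading these rows outside of Excel, here is a minimal Python
sketch. The post does not document the columns, so every field name below
(label, map_start, map_end, index, and so on) is a hypothetical label guessed
from the three sample rows; only the values themselves come from the matrix.

```python
import csv
import io
from typing import List, NamedTuple


class MapFeature(NamedTuple):
    # All field names are hypothetical labels guessed from the sample rows;
    # the original matrix format does not name its columns.
    label: str
    map_start: float   # assumed: scaled start position on the drawn map
    map_end: float     # assumed: scaled end position on the drawn map
    index: int
    strand: str
    start: int         # sequence coordinate of the first base
    end: int           # sequence coordinate of the last base
    kind: str          # feature key, e.g. "CDS"
    short_name: str
    description: str


def parse_matrix(text: str) -> List[MapFeature]:
    """Parse comma-separated feature rows into MapFeature records."""
    features = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        label, m0, m1, idx, strand, start, end, kind, short, desc = row
        features.append(
            MapFeature(label, float(m0), float(m1), int(idx), strand,
                       int(start), int(end), kind, short, desc)
        )
    return features


SAMPLE = """\
C3a,2.575923,2.74495,8,+,3688,3930,CDS,3a,PSII D2-polypeptide-a
C4b,3.512558,3.534909,9,+,5029,5061,CDS,4b,PSII D2-polypeptide-b
C5c,3.789847,3.850613,10,+,5426,5513,CDS,5c,PSII D2-polypeptide-c
"""

features = parse_matrix(SAMPLE)
for f in features:
    # Length is computed from the inclusive sequence coordinates.
    print(f.label, f.strand, f.start, f.end, f.end - f.start + 1, f.description)
```

A row with a different number of fields would raise a ValueError at the
unpacking step, which is probably the safest behavior while the column
layout is still a guess.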

If you would like to try this utility, let me know and I will
send you a current copy.  Because I am in between versions at
the moment, it has been quite a while since I posted a version to Don's
ftp server (as ftp://ftp.bio.indiana.edu:/molbio/mac/dnastacks-1xx.hqx,
where xx is the version number). The extraction and matrix-building conversion
is performed from the "Convert" menu of a utility card, menu item
"String Nucleotide Options..." and "GenBank/EMBL to Strings" radio button.
I also use another one of my stacks called "File Combiner" so that I
can do entire folders of downloaded sequences simultaneously (in theory).
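
The problem of gene names turning up under different descriptors can be
approached with a simple fallback lookup. The sketch below is in Python, not
HyperTalk, and is not how DNA Translator does it; the function names, the
lookup order, and the example qualifier text are all made up for
illustration. It assumes qualifiers written as /key="value", with the third
descriptor spelled "note".

```python
import re
from typing import Dict, Optional

# Order in which qualifiers are tried for a name. This order is an
# assumption for illustration; adjust it to suit the records at hand.
NAME_QUALIFIERS = ("gene", "product", "note")

# Matches /key="value"; the value may span several lines.
# Doubled quotes inside a value are not handled in this sketch.
QUALIFIER_RE = re.compile(r'/(\w+)="([^"]*)"')


def parse_qualifiers(feature_text: str) -> Dict[str, str]:
    """Collect quoted qualifiers from one feature; first occurrence wins."""
    qualifiers: Dict[str, str] = {}
    for key, value in QUALIFIER_RE.findall(feature_text):
        # Collapse line breaks and runs of spaces inside wrapped values.
        qualifiers.setdefault(key, " ".join(value.split()))
    return qualifiers


def feature_name(qualifiers: Dict[str, str]) -> Optional[str]:
    """Return the first non-empty value among the candidate qualifiers."""
    for key in NAME_QUALIFIERS:
        value = qualifiers.get(key, "").strip()
        if value:
            return value
    return None


# Made-up qualifier text, one string per feature.
EXAMPLES = [
    '/gene="COI"\n/product="cytochrome oxidase subunit I"',
    '/product="cytochrome b"',
    '/note="ATPase 6"\n/codon_start=1',
]

names = [feature_name(parse_qualifiers(text)) for text in EXAMPLES]
print(names)
```

Names recovered this way would still need checking by eye, since a "note"
can hold free text that is not a gene name at all.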

Good luck,


Doug Eernisse <DEernisse at fullerton.edu>
Dept. Biological Science MH282
California State University
Fullerton, CA 92634

