wfischer at bio.indiana.edu (Will Fischer) writes:
:
: I need to extract given pieces of sequence from a set of EMBL/GenBank
: entries, using ranges defined in the features table. For example, I'd
: like to be able to extract, from a set of entries, the DNA sequence for
: each exon, or again for every complete CDS feature (all exons
:
: Surely not everyone does this manually?
:
: What software exists that can actually parse the (eminently parsible)
: joint features table format? Please post reviews of programs you have
: used, or mail me directly and I will summarize.
:
: -- Will Fischer
The question has been raised about reporting the segments of sequence
associated with a feature location (e.g., a coding region) in the GenBank
feature table. At NCBI we find it more convenient to do this using the
ASN.1 form of the same data, which is distributed on the Entrez CD-ROM and
via the Entrez network service. It is a matter of using a few function
calls from the NCBI Software Toolkit (available for most hardware/software
platforms by anonymous ftp). A demo program called getfeat.c has been part
of the toolkit distribution for some time. It retrieves a sequence entry
from the Entrez CD-ROM, finds all the coding regions, then prints out the
regions of sequence referenced by those features.
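For readers who want to see the underlying operation in isolation, here is a minimal sketch in plain C of joining the segments of a multi-interval location (for example, the exons of a coding region) into one sequence. It does not use the NCBI Software Toolkit and is not taken from getfeat.c: the Interval type and the extract_location function are hypothetical names invented for this illustration, and the sketch assumes 1-based inclusive coordinates on the forward strand only (no complement, no references to other entries).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for this sketch; not an NCBI toolkit structure. */
typedef struct {
    size_t start;   /* 1-based, inclusive, as written in a feature table */
    size_t stop;    /* 1-based, inclusive */
} Interval;

/*
 * Concatenate the pieces of seq named by ivals[0..n-1], in the order given.
 * Returns a malloc'd, NUL-terminated string that the caller must free,
 * or NULL if an interval is out of range or memory cannot be allocated.
 * Assumption: forward strand only; complement locations are not handled.
 */
char *extract_location(const char *seq, const Interval *ivals, size_t n)
{
    size_t seq_len = strlen(seq);
    size_t total = 0;
    size_t i;
    char *out;
    char *p;

    /* First pass: validate every interval and add up the output length. */
    for (i = 0; i < n; i++) {
        if (ivals[i].start < 1 ||
            ivals[i].stop < ivals[i].start ||
            ivals[i].stop > seq_len) {
            return NULL;
        }
        total += ivals[i].stop - ivals[i].start + 1;
    }

    out = malloc(total + 1);
    if (out == NULL) {
        return NULL;
    }

    /* Second pass: copy each segment, converting to 0-based offsets. */
    p = out;
    for (i = 0; i < n; i++) {
        size_t len = ivals[i].stop - ivals[i].start + 1;
        memcpy(p, seq + (ivals[i].start - 1), len);
        p += len;
    }
    *p = '\0';
    return out;
}
```

Called with the sequence "ACGTACGTAC" and the two intervals 1..3 and 7..9, this returns the six residues covered by those two segments, joined in order. The toolkit functions work from the ASN.1 location instead, so treat this only as a picture of what "the regions of sequence referenced by those features" means.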
A new demo, scantest.c, has just been added to the /demo directory. The
complete program combines a number of previously existing demos. It will
scan through all of the Entrez ASN.1 sequence files, processing each
SeqEntry and extracting the nucleotide or protein sequence or the nucleotide
coding region. The DESIRED symbol should be defined as ONLY_DNA, ONLY_PRT or
ONLY_CDS to indicate the desired result. A command-line argument version
could easily be written, and it could be extended to extract sequences based
on any feature type. The path to the Entrez: Sequences CD-ROM must be in
the ROOT entry of the [NCBI] section of the ncbi configuration file (created
by EntrezCf, the Entrez configuration program).
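As a rough illustration of the compile-time choice described above, the following sketch shows one way a DESIRED symbol could select among the three outputs. Only the names DESIRED, ONLY_DNA, ONLY_PRT and ONLY_CDS come from the description of scantest.c; the numeric values, the default, and the desired_output helper are assumptions made for this example, and the real demo may be organized differently.

```c
/* Assumed values; the actual definitions in scantest.c may differ. */
#define ONLY_DNA 1
#define ONLY_PRT 2
#define ONLY_CDS 3

/* Pick the output at compile time, e.g. with -DDESIRED=ONLY_DNA.
   The default used here is an assumption for this sketch. */
#ifndef DESIRED
#define DESIRED ONLY_CDS
#endif

/* Hypothetical helper: names the kind of sequence this build extracts. */
const char *desired_output(void)
{
#if DESIRED == ONLY_DNA
    return "nucleotide";
#elif DESIRED == ONLY_PRT
    return "protein";
#elif DESIRED == ONLY_CDS
    return "coding region";
#else
#error "DESIRED must be ONLY_DNA, ONLY_PRT or ONLY_CDS"
#endif
}
```

A command-line version would replace the preprocessor test with a run-time check of an argument, at the cost of compiling all three code paths into one program.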
You can remove the scanning sections and just pass an individual SeqEntryPtr
to the ProcessSeqEntry function. See the /doc/access.doc file or the
printed documentation for instructions on how to retrieve a SeqEntry given an
accession number or unique identifier, or how to load a SeqEntry that has
been saved to a file.
Note that although these functions do not work on the GenBank flat file, the
software toolkit contains code to produce the flat file from the ASN.1 data,
so you can easily have both views if you use the Entrez disc. The Entrez
application allows records to be saved in GenBank, FASTA or ASN.1 form.
If there is sufficient interest in this sort of functionality, we would like
to hear comments and recommendations from the community on the best way to
present and use the data. If a reasonable consensus develops, we would
consider incorporating it into an existing end-user product such as Entrez
itself.
kans at ncbi.nlm.nih.gov
National Center for Biotechnology Information