Greetings GenBank Users,
GenBank Release 134.0 is now available via ftp from the National Center
for Biotechnology Information (NCBI):
   Ftp Site            Directory   Contents
   ----------------    ---------   ---------------------------------------
   ftp.ncbi.nih.gov    genbank     GenBank Release 134.0 flatfiles
                       ncbi-asn1   ASN.1 data used to create Release 134.0
Uncompressed, the Release 134.0 flatfiles require approximately 96.94 GB
(sequence files only) or 109.9 GB (including the 'short directory' and
'index' files). The ASN.1 version requires approximately 86.36 GB. From the
release notes:
   Release   Date       Base Pairs    Entries
   133       Dec 2002   28507990166   22318883
   134       Feb 2003   29358082791   23035823
Close-of-data was 02/10/2003. Four working days were required to prepare
this release. In the six-week period between the close dates for GenBank
releases 133.0 and 134.0, GenBank grew by 850,092,625 basepairs and by
716,940 sequence records. During that same period, 64,040 records were
updated. Combined, this yields an average of about 18,600 new/updated
records per day.
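
The growth figures quoted above can be rechecked from the release table.
The following is only an arithmetic sketch in Python: the variable names
are illustrative, and treating the "six week period" as exactly 42 days is
an assumption.

```python
# Totals copied from the release table above.
bases_133, entries_133 = 28_507_990_166, 22_318_883
bases_134, entries_134 = 29_358_082_791, 23_035_823
updated_records = 64_040

# Assumption: the "six week period" is taken to be exactly 42 days.
days = 6 * 7

new_bases = bases_134 - bases_133
new_records = entries_134 - entries_133
per_day = (new_records + updated_records) / days

print(f"{new_bases:,} basepairs and {new_records:,} records added")
print(f"roughly {per_day:,.0f} new/updated records per day")
```
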
We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank .
Those who experience slow FTP transfers of large files (entire releases, the
GenBank Cumulative Update, etc.) might realize an improvement in transfer
rates from these alternate sites when traffic at the NCBI is heavy.
For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 134.0 and Upcoming Changes) have been appended below.
* * * IMPORTANT * * *
As described in the October and December 2002 release notes, the
GenBank Cumulative Update (GBCU) data products are no longer supported as
of this February 2003 release. Details about this change are available in
Section 1.3.2 of the release notes. Note that NCBI will continue to generate
the GBCU products for an additional three weeks, on an unsupported basis,
as an aid to those who need additional time to transition to our incremental
update products.
Release 134.0 data, and subsequent updates, are available now via NCBI's
Entrez and BLAST services.
New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 134.0 close-of-data, should be
available by about 10:00am EST, February 15. Please note that the new CUs will
be smaller than previous versions you might have obtained after Release 133.0
was posted.
If you encounter problems while ftp'ing or uncompressing Release 134.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .
Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev, Michael Kimelman
GenBank
NCBI/NLM/NIH
1.3 Important Changes in Release 134.0
1.3.1 Organizational changes
The total number of sequence data files increased by 10 with this release:
- the EST division is now comprised of 240 files (+5)
- the GSS division is now comprised of 66 files (+3)
- the HTC division is now comprised of 4 files (+1)
- the HTG division is now comprised of 58 files (+1)
1.3.2 * * Cumulative GenBank Update Products Discontinued * *
As of GenBank Release 134.0, the cumulative GenBank Update (GBCU)
products have been discontinued:
   ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/gbcu.aso.gz
   ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.flat.gz
   ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.fsa_nt.gz
   ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.gnp.gz
   ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.qscore.gz
   ftp://ftp.ncbi.nih.gov/genbank/daily/gpcu.fsa.gz
In the eight weeks between typical GenBank Releases, it was not uncommon
for GBCU products to approach 20% of the total database size. The flatfile
version, for example, reached sizes in excess of 17 GB in late 2002.
From a user perspective, repeatedly obtaining and processing such a
large update product makes inefficient use of both bandwidth and local
resources, compared to the much smaller incremental GbUpdate products.
And in order to reliably generate the GBCU in the face of such explosive
growth, NCBI would have to invest significant resources to increase the
performance of a large body of software.
Given these factors, plus the questionable value of an "update" product,
generated daily, and approaching 20GB in size, NCBI has discontinued
support for the GBCU products.
However, as an aid to those users who may not yet have completed
transitioning to the use of incremental update products, GBCU files
will continue to be generated, on an _unsupported_ basis, for approximately
three more weeks. After that time, the GBCU files will be removed from
the NCBI FTP site.
1.3.3 Third-Party Annotation Data Collection
Pursuant to agreements made at their 2002 Collaborative Meeting,
DDBJ/EMBL/GenBank have undertaken the collection of a new class of
sequence data: Third-Party Annotation (TPA).
The TPA data-collection complements the existing DDBJ/EMBL/GenBank
comprehensive database of primary nucleotide sequences, which typically
result from direct sequencing of cDNAs, ESTs, genomic DNAs, etc.
'Primary data' are defined to be data for which the submitting group has
done the sequencing and annotation, and hence, as owner of the data,
has privileges to update/correct the associated sequence records. In
contrast, non-primary (TPA) sequences are defined as sequences which:
a) consist exclusively of sequence data from one, or several,
previously-existing primary entries owned by other groups, or
b) consist of a mixture of previously-existing primary entries,
some owned by the TPA submitter and the rest by one or more other
groups
Complete details regarding TPA sequence submission can be found
at the NCBI website:
http://www.ncbi.nlm.nih.gov/Genbank/tpa.html
TPA categories and requirements
-------------------------------
Users can submit new annotation of single sequences or assemblies
of sequences that are owned by other groups to the TPA data
collection.
The primary sequences must be available in the DDBJ/EMBL/GenBank
databases, and submitters to the TPA database must provide the
accession numbers of the primary sequences in their TPA submission.
TPA sequences based on primary data available only in proprietary
databases are not accepted.
Some examples of data submissions accepted for TPA include:
1. analysis and re-annotation of DDBJ/EMBL/GenBank sequences
owned by other groups
2. gap-filling, in which a TPA submitter might utilize HTG or
EST data to complete an otherwise incomplete sequence
3. TPA sequences based on NCBI/Ensembl trace archive data
4. TPA sequences based on Whole Genome Shotgun (WGS) sequences
Sequences based on primary data from multiple organisms are not
accepted.
Sequences will not be accepted for TPA in lieu of an update to
primary records. A submitter who owns a primary record is expected
to update that record as new sequence is determined, or sequencing
ambiguities/errors are resolved.
Any newly-determined sequence data that is to be part of a TPA
record must first be submitted as a new primary sequence to
DDBJ/EMBL/GenBank.
The TPA dataset is intended to present sequence data and annotation
in support of actual biological discoveries that are published in
the scientific literature, without requiring that the sequence be
determined by the authors/submitters.
In order to assure that the sequence annotation is of high quality,
it is required that TPA records be associated with a study published
in a peer-reviewed journal before the data is released to the public.
TPA records include a mandatory 'PRIMARY' block, which documents the
relationships between spans of the TPA sequence and the primary
(non-TPA) sequences that contributed to it. The elements of the
PRIMARY block are:
   a) TPA_SPAN             base span on TPA sequence
   b) PRIMARY_IDENTIFIER   acc.version of contributing sequence(s)
   c) PRIMARY_SPAN         base span on contributing primary sequence
   d) COMP                 'c' indicates that the contributing sequence
                           originates from the complementary strand of
                           the primary sequence entry
Example:
   TPA_SPAN   PRIMARY_IDENTIFIER   PRIMARY_SPAN   COMP
   1-426      AC004528.1           18665-19090
   427-526    AC001234.2           1-100          c
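
For readers who process these blocks programmatically, one possible way to
read PRIMARY rows is sketched below in Python. It assumes the rows arrive as
whitespace-separated columns exactly as in the example above, without any
leading flatfile keyword; the names PrimarySpan and parse_primary_rows are
hypothetical and not part of any NCBI tool.

```python
from typing import List, NamedTuple


class PrimarySpan(NamedTuple):
    tpa_start: int
    tpa_end: int
    primary_id: str       # accession.version of the contributing sequence
    primary_start: int
    primary_end: int
    complement: bool      # True when the COMP column holds 'c'


def parse_primary_rows(text: str) -> List[PrimarySpan]:
    """Parse whitespace-separated PRIMARY rows (assumed layout, see above)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the column-header row.
        if not fields or fields[0] == "TPA_SPAN":
            continue
        if len(fields) not in (3, 4):
            raise ValueError(f"unexpected PRIMARY row: {line!r}")
        tpa_start, tpa_end = (int(x) for x in fields[0].split("-"))
        p_start, p_end = (int(x) for x in fields[2].split("-"))
        complement = len(fields) == 4 and fields[3] == "c"
        rows.append(PrimarySpan(tpa_start, tpa_end, fields[1],
                                p_start, p_end, complement))
    return rows


example = """\
TPA_SPAN   PRIMARY_IDENTIFIER   PRIMARY_SPAN   COMP
1-426      AC004528.1           18665-19090
427-526    AC001234.2           1-100          c
"""

spans = parse_primary_rows(example)
for span in spans:
    print(span)
```

In the example rows, each TPA span has the same length as its primary span,
which is a convenient sanity check to run after parsing.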
TPA data products
-----------------
TPA update products became available at the NCBI FTP site on Friday,
January 31, 2003. Daily, incremental update files for all new/updated
TPA records are located in:
ftp://ftp.ncbi.nih.gov/tpa/updates
TPA updates have filename prefixes of:
tpa_upd.YYYY.MMDD.
Filename suffixes for these updates are:
.bbs : binary Bioseq-set (ASN.1)
.gbff : GenBank flatfile
.gnp : GenPept flatfile
.fsa_nt : Nucleotide FASTA
.fsa_aa : Protein FASTA
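
As an illustration of the naming scheme, the Python sketch below builds the
expected file names for one day's update. It assumes that the prefix and
suffix are joined by a single dot and that no compression extension is
appended; neither detail is stated above, and the function name is
hypothetical.

```python
from datetime import date

TPA_UPDATES_URL = "ftp://ftp.ncbi.nih.gov/tpa/updates"
SUFFIXES = (".bbs", ".gbff", ".gnp", ".fsa_nt", ".fsa_aa")


def tpa_update_names(day: date) -> list:
    """Return candidate TPA update file names for the given day.

    Assumption: prefix 'tpa_upd.YYYY.MMDD' joined to each suffix with a
    single dot, with no compression extension.
    """
    prefix = f"tpa_upd.{day:%Y}.{day:%m%d}"
    return [prefix + suffix for suffix in SUFFIXES]


for name in tpa_update_names(date(2003, 1, 31)):
    print(f"{TPA_UPDATES_URL}/{name}")
```
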
We do not expect to generate complete releases (similar to GenBank
releases) for TPA until the volume of TPA records has substantially
increased. Until that time, a set of cumulative TPA update files
containing all TPA records is available in:
ftp://ftp.ncbi.nih.gov/tpa/release
Cumulative TPA update files have filename prefixes of:
tpa_cu.
and utilize the same filename suffixes that are listed above. Note
that the cumulative TPA products will be *discontinued* once TPA
releases are being built.
1.3.4 GSS File Header Problem
GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.
There is thus a discrepancy between the filenames and file headers of nine
GSS flatfiles in Release 134.0. Consider the gbgss56.seq file:
   GBGSS1.SEQ          Genetic Sequence Data Bank
                       February 15 2003
                       NCBI-GenBank Flat File Release 134.0
                       GSS Sequences (Part 1)
   88066 loci, 66600405 bases, from 88066 reported sequences
Here, the header still carries the filename GBGSS1.SEQ and part number 1,
though the file itself has been renumbered as 56 to follow the files dumped
from the other system.
We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
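
Users whose software relies on the header part number may want to detect
the affected files. The Python sketch below compares the part number in a
file name with the one in the header text; it assumes file names of the
form gbgssN.seq and a header line of the form "GSS Sequences (Part N)", as
in the example above, and the function names are hypothetical.

```python
import re


def gss_part_from_filename(filename: str) -> int:
    """Extract N from a name of the assumed form gbgssN.seq."""
    match = re.fullmatch(r"gbgss(\d+)\.seq", filename.lower())
    if match is None:
        raise ValueError(f"not a GSS flatfile name: {filename!r}")
    return int(match.group(1))


def gss_part_from_header(header: str) -> int:
    """Extract N from the 'GSS Sequences (Part N)' header line."""
    match = re.search(r"GSS Sequences \(Part (\d+)\)", header)
    if match is None:
        raise ValueError("no GSS part number found in header")
    return int(match.group(1))


def header_matches_filename(filename: str, header: str) -> bool:
    return gss_part_from_filename(filename) == gss_part_from_header(header)


header = """\
GBGSS1.SEQ          Genetic Sequence Data Bank
                    February 15 2003
                    NCBI-GenBank Flat File Release 134.0
                    GSS Sequences (Part 1)
88066 loci, 66600405 bases, from 88066 reported sequences
"""

print(header_matches_filename("gbgss56.seq", header))  # prints False
```
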
1.4 Upcoming Changes
1.4.1 New /mol_type qualifier
As of the April 2003 GenBank Release (135.0), a new qualifier called
/mol_type will begin to be used for source features.
This qualifier will be used to indicate the in-vivo biological state
of the sequence presented in a database record.
The preliminary definition for /mol_type is:
   Qualifier      /mol_type=
   Definition     in vivo molecule type
   Value format   "text"
   Example        /mol_type="genomic DNA"
   Comment        text limited to "genomic DNA", "genomic RNA",
                  "mRNA" (incl. EST), "tRNA", "rRNA", "snoRNA", "snRNA",
                  "scRNA", "pre-mRNA", "other RNA" (incl. synthetic),
                  "other DNA" (incl. synthetic),
                  "unassigned DNA" (incl. unknown),
                  "unassigned RNA" (incl. unknown)
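
Because the value is restricted to a fixed list, parsers can check it
directly. The Python sketch below validates a qualifier string against the
values listed in this preliminary definition; the list may change before
the qualifier is introduced, and the function name is hypothetical.

```python
# Values taken from the preliminary definition above; subject to change.
ALLOWED_MOL_TYPES = frozenset({
    "genomic DNA", "genomic RNA", "mRNA", "tRNA", "rRNA",
    "snoRNA", "snRNA", "scRNA", "pre-mRNA",
    "other RNA", "other DNA", "unassigned DNA", "unassigned RNA",
})


def mol_type_value(qualifier: str) -> str:
    """Return the value of a '/mol_type="..."' qualifier, or raise ValueError."""
    prefix = '/mol_type="'
    text = qualifier.strip()
    if not (text.startswith(prefix) and text.endswith('"')):
        raise ValueError(f"not a /mol_type qualifier: {qualifier!r}")
    value = text[len(prefix):-1]
    if value not in ALLOWED_MOL_TYPES:
        raise ValueError(f"unexpected /mol_type value: {value!r}")
    return value


print(mol_type_value('/mol_type="genomic DNA"'))
```
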
In-vivo molecule type information is already presented on the LOCUS
line of the GenBank flatfile format. However, introducing /mol_type
in the Feature Table will make the exchange of this information among
DDBJ, EMBL, and GenBank more complete and accurate.
NOTE: /mol_type will eventually be a mandatory qualifier for the source
feature, probably by June 2003.
1.4.2 New /segment qualifier
As of the April 2003 GenBank Release (135.0), a new qualifier called
/segment will begin to be used for source features.
In the absence of a more suitable way to annotate viral segments, this
information had either not been included in database entries, or had been
annotated incorrectly (e.g. using /chromosome, /map etc). This new
qualifier addresses that lack.
The preliminary definition for /segment is:
   Qualifier      /segment=
   Definition     name of viral or phage segment sequenced
   Value format   "text"
   Example        /segment="6"
1.4.3 New /locus_tag qualifier
As of the April 2003 GenBank Release (135.0), a new feature qualifier
called /locus_tag will begin to be used.
Many complete-genome sequencing projects use solely computational
methods to predict coding regions and genes. The /locus_tag qualifier
provides a method for identifying and tracking the results of such
computations, without utilizing existing qualifiers such as /gene .
These 'locus tags' are systematically assigned, and do not necessarily
reflect gene name/symbol conventions in experimental literature. Hence
the introduction of this new qualifier.
The preliminary definition for /locus_tag is:
   Qualifier:      /locus_tag
   Definition:     feature tag assigned for tracking purposes
   Value Format:   "text" (single token)
   Example:        /locus_tag="RSc0382"
                   /locus_tag="YPO0002"
   Comment:        /locus_tag can be used with any feature where /gene
                   is valid.