Greetings GenBank Users,
GenBank Release 133.0 is now available via ftp from the National Center
for Biotechnology Information (NCBI):
Ftp Site           Directory   Contents
----------------   ---------   ---------------------------------------
ftp.ncbi.nih.gov   genbank     GenBank Release 133.0 flatfiles
                   ncbi-asn1   ASN.1 data used to create Release 133.0
Uncompressed, the Release 133.0 flatfiles require roughly 94.33 GB
(sequence files only) or 107.07 GB (including the 'short directory' and
'index' files). The ASN.1 version requires roughly 84.21 GB. From the
release notes:
Release   Date       Base Pairs    Entries
132       Oct 2002   26525934656   19808101
133       Dec 2002   28507990166   22318883
Close-of-data was 12/31/2002. Four working days were required to prepare
this release. In the eight week period between close-of-data for GenBank
releases 132.0 and 133.0, GenBank grew by 1.982 billion basepairs and by
2,510,782 sequence records. During that same period, 133,297 records were
updated. Combined, this yields an average of about 44,000 new/updated
records per day.
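
The growth figures quoted above can be re-derived from the two table rows. The following is only a sketch of that arithmetic; the dictionary layout and the function name are illustrative choices, not part of any NCBI tool. The per-day average is not computed here, because it depends on exactly how many days are counted between the two close-of-data dates.

```python
# Figures copied from the release table in this announcement.
RELEASES = {
    132: {"base_pairs": 26_525_934_656, "entries": 19_808_101},
    133: {"base_pairs": 28_507_990_166, "entries": 22_318_883},
}
UPDATED_RECORDS = 133_297  # records updated during the same period


def growth(prev, curr):
    """Return (base pairs added, records added) between two releases."""
    return (
        curr["base_pairs"] - prev["base_pairs"],
        curr["entries"] - prev["entries"],
    )


bp_growth, new_records = growth(RELEASES[132], RELEASES[133])
new_or_updated = new_records + UPDATED_RECORDS

print(f"base pairs added : {bp_growth:,}")
print(f"records added    : {new_records:,}")
print(f"new or updated   : {new_or_updated:,}")
```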
The growth in the number of records is the largest experienced in
GenBank's history. The growth in the number of basepairs is the third
largest.
We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank .
Those who experience slow FTP transfers of large files (entire releases, the
GenBank Cumulative Update, etc.) might realize an improvement in transfer
rates from these alternate sites when traffic at the NCBI is heavy.
For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 133.0 and Upcoming Changes) have been appended below.
* * * IMPORTANT * * *
As described in the October release notes, the GenBank Cumulative Update
data products will be discontinued in February of 2003. We strongly urge users
of the GBCU to review Section 1.4.1 of the release notes for further
information.
Release 133.0 data, and subsequent updates, are available now via NCBI's
Entrez and Blast services.
New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 133.0 close-of-data, should be
available by about 10:00am EST, January 6. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 132.0 was
posted.
If you encounter problems while ftp'ing or uncompressing Release 133.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .
Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev, Michael Kimelman
GenBank
NCBI/NLM/NIH
1.3 Important Changes in Release 133.0
1.3.1 Organizational changes
The total number of sequence data files increased by 55 with this release:
- the EST division is now comprised of 235 files
- the GSS division is now comprised of 63 files
- the HTC division is now comprised of 3 files
- the HTG division is now comprised of 57 files
- the PAT division is now comprised of 7 files
- the PLN division is now comprised of 7 files
- the PRI division is now comprised of 24 files
- the ROD division is now comprised of 6 files
However, note that there was a special supplemental file (gbsup.seq) for
full-length-insert cDNA sequences in Release 132.0.
The fli-cDNA sequences *should* have been present in the standard
divisional files. The problem that required the use of the supplemental
file has been corrected, so this release does not include gbsup.seq .
If the removal of gbsup.seq is taken into account, the total number of
sequence data files increased by 54 .
1.3.2 New SET Data File For The ASN.1 Representation
Some phylogenetic and mutational studies involve sequences from more
than one of the 'taxonomic' divisions of GenBank. For example, a phylogenetic
study might involve sequences obtained from human (PRI) and non-primate
mammalian (MAM) sources.
Such studies, often with associated sequence alignments, are maintained
and edited as a single unit in the underlying data representation (ASN.1)
utilized by the NCBI.
When generating GenBank flatfiles from such studies, the component
sequences are processed in such a way that they are individually directed
to an appropriate divisional file. For example, the human sequences are
directed to a PRI division file and the other mammalian sequences to the MAM
division file.
In the past, we have mimicked this behavior for the ASN.1 version of
GenBank releases by splitting the studies into their components, and
splicing them into an appropriate divisional ASN.1 file (e.g., gbpri1.aso
and gbmam.aso).
This practice has clear disadvantages: the components of the studies
really should *not* be separated; and the post-processing of these special
studies adds considerable overhead to release processing.
Starting with this December 2002 release, we have ceased this practice
and have introduced a new ASN.1 data file for such multi-divisional
studies at ftp://ftp.ncbi.nih.gov/ncbi-asn1 . The new file is :
gbset.aso
1.3.3 Reduction In The Number Of ASN.1 Data Files
The sizes of many of the files for the ASN.1 version of GenBank releases
(see ftp://ftp.ncbi.nih.gov/ncbi-asn1 ) used to be well below the 250 MB limit
used for the GenBank flatfile version. For example, the PRI division ASN.1 files
were about 140 MB apiece, and the HTG division files were about 190 MB apiece,
for GenBank Release 132.0 .
This was due to the fact that the ASN.1 representation was originally
used to create the flatfile version, on a file-by-file basis, during release
generation. Since the ASN.1 version is more compact than the flatfile version,
the ASN.1 file sizes had to be less than 250 MB to yield 250 MB flatfiles.
Now that the ASN.1 and flatfile versions are created independently, the
sizes of the ASN.1 files can be increased without consequences for the
flatfiles.
Starting with this December 2002 release, the file size limit for all
ASN.1 files has been increased to 250MB, and as a result, the total number of
ASN.1 files has been significantly reduced.
1.3.4 GSS File Header Problem
GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.
There is thus a discrepancy between the filenames and file headers of nine
GSS flatfiles in Release 133.0. Consider the gbgss55.seq file:
    GBGSS1.SEQ          Genetic Sequence Data Bank
                             December 15 2002
                   NCBI-GenBank Flat File Release 133.0
                          GSS Sequences (Part 1)
    88062 loci, 66597557 bases, from 88062 reported sequences
Here, the header still gives the filename as GBGSS1.SEQ and the part number
as 1, though the file has been renamed gbgss55.seq based on the number of
files dumped from the other system.
We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
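
Until the discrepancy is resolved, users who rely on the header contents may want to detect the affected files themselves. The sketch below compares a GSS flatfile name with the name and part number found in its header lines. It assumes the header layout shown in the gbgss55.seq example above; the function name is hypothetical.

```python
import re


def header_mismatch(filename, header_lines):
    """Compare a GSS flatfile name with the name and part number in its header.

    Returns a list of human-readable discrepancies (empty if consistent).
    """
    m = re.fullmatch(r"gbgss(\d+)\.seq", filename.lower())
    if m is None:
        raise ValueError(f"not a GSS flatfile name: {filename}")
    file_part = int(m.group(1))

    problems = []

    # The first header line starts with the upper-case file name,
    # e.g. GBGSS1.SEQ
    header_name = header_lines[0].split()[0]
    if header_name.lower() != filename.lower():
        problems.append(f"header names {header_name}, file is {filename}")

    # A later line carries the part number, e.g. "GSS Sequences (Part 1)"
    for line in header_lines:
        part = re.search(r"\(Part (\d+)\)", line)
        if part and int(part.group(1)) != file_part:
            problems.append(
                f"header says part {part.group(1)}, file is part {file_part}"
            )
    return problems


header = [
    "GBGSS1.SEQ          Genetic Sequence Data Bank",
    "December 15 2002",
    "NCBI-GenBank Flat File Release 133.0",
    "GSS Sequences (Part 1)",
    "88062 loci, 66597557 bases, from 88062 reported sequences",
]
for problem in header_mismatch("gbgss55.seq", header):
    print(problem)
```

An empty list means the header and the filename agree.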
1.4 Upcoming Changes
1.4.1 * * Cumulative GenBank Update Products To Be Discontinued * *
As of GenBank Release 134.0 in February of 2003, the cumulative
GenBank Update (GBCU) products will be discontinued:
ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/gbcu.aso.gz
ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.flat.gz
ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.fsa_nt.gz
ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.gnp.gz
ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.qscore.gz
ftp://ftp.ncbi.nih.gov/genbank/daily/gpcu.fsa.gz
In the eight weeks between typical GenBank Releases, it is not uncommon
for GBCU products to approach 20% of the total database size. The flatfile
version, for example, has reached sizes in excess of 17 GB in recent weeks.
From a user perspective, repeatedly obtaining and processing such a
large update product makes inefficient use of both bandwidth and local
resources, compared to the much smaller incremental GbUpdate products.
And in order to reliably generate the GBCU in the face of such explosive
growth, NCBI would have to invest significant resources to increase the
performance of a large body of software.
Given these factors, plus the questionable value of an "update" product,
generated daily, which will soon approach 20GB in size, we have decided
that the GBCU should be discontinued. We will analyze FTP logs and
proactively contact the larger centers which utilize the GBCU, to suggest
alternate processing strategies.
If large numbers of users are unable to switch to processing incremental
updates by February 2003, there is a possibility that the date for
discontinuing the GBCU might be pushed back to April.
We will keep users informed of the timetable for this important change
via these release notes and the GenBank newsgroup. And of course, we
welcome discussions of this change via the newsgroup.
1.4.2 New /mol_type qualifier
As of the April 2003 GenBank Release (135.0), a new source feature
qualifier called /mol_type will begin to be used for source features.
This qualifier will be used to indicate the in-vivo biological state
of the sequence presented in a database record.
The preliminary definition for /mol_type is :
Qualifier       /mol_type=
Definition      in vivo molecule type
Value format    "text"
Example         /mol_type="genomic DNA"
Comment         text limited to "genomic DNA", "genomic RNA", "mRNA" (incl. EST),
                "tRNA", "rRNA", "snoRNA", "snRNA", "scRNA", "pre-mRNA",
                "other RNA" (incl. synthetic), "other DNA" (incl. synthetic),
                "unassigned DNA" (incl. unknown), "unassigned RNA" (incl. unknown)
In-vivo molecule type information is already presented on the LOCUS
line of the GenBank flatfile format. However, introducing /mol_type
in the Feature Table will make the exchange of this information among
DDBJ, EMBL, and GenBank more complete and accurate.
NOTE: /mol_type will eventually be a mandatory qualifier for the source
feature, probably by June 2003.
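
Because the Comment field restricts /mol_type to a fixed list of values, flatfile parsers can check the qualifier mechanically. The following is a minimal sketch under the preliminary definition given above. The function name is hypothetical, and the list of values may change before the qualifier is finalized.

```python
import re

# Values taken from the Comment field of the preliminary definition.
ALLOWED_MOL_TYPES = frozenset({
    "genomic DNA", "genomic RNA", "mRNA", "tRNA", "rRNA", "snoRNA",
    "snRNA", "scRNA", "pre-mRNA", "other RNA", "other DNA",
    "unassigned DNA", "unassigned RNA",
})

# Matches a qualifier line such as:   /mol_type="genomic DNA"
MOL_TYPE_RE = re.compile(r'^\s*/mol_type="([^"]*)"\s*$')


def parse_mol_type(line):
    """Return the /mol_type value on this line, or None if it is another line.

    Raises ValueError when the value is not in the allowed list.
    """
    m = MOL_TYPE_RE.match(line)
    if m is None:
        return None
    value = m.group(1)
    if value not in ALLOWED_MOL_TYPES:
        raise ValueError(f"unexpected /mol_type value: {value!r}")
    return value


print(parse_mol_type('    /mol_type="genomic DNA"'))
```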
1.4.3 New /segment qualifier
As of the April 2003 GenBank Release (135.0), a new source feature
qualifier called /segment will begin to be used for source features.
In the absence of a more suitable way to annotate viral segments, this
information had either not been included in database entries, or had been
annotated incorrectly (e.g. using /chromosome, /map etc). This new
qualifier addresses that lack.
The preliminary definition for /segment is :
Qualifier       /segment=
Definition      name of viral or phage segment sequenced
Value format    "text"
Example         /segment="6"
1.4.4 New /locus_tag qualifier
As of the April 2003 GenBank Release (135.0), a new feature
qualifier called /locus_tag will begin to be used.
Many complete-genome sequencing projects use solely computational
methods to predict coding regions and genes. The /locus_tag qualifier
provides a method for identifying and tracking the results of such
computations, without utilizing existing qualifiers such as /gene .
These 'locus tags' are systematically assigned, and do not necessarily
reflect gene name/symbol conventions in experimental literature. Hence
the introduction of a new qualifier.
The preliminary definition for /locus_tag is :
Qualifier:      /locus_tag
Definition:     feature tag assigned for tracking purposes
Value Format:   "text" (single token)
Example:        /locus_tag="RSc0382"
                /locus_tag="YPO0002"
Comment:        /locus_tag can be used with any feature where /gene is valid
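
The "single token" constraint can be checked with a short pattern. This sketch assumes that a single token means a non-empty value containing no whitespace and no embedded quotes; that reading is our interpretation of the preliminary definition, and the function name is hypothetical.

```python
import re

# Assumption: "single token" = one or more characters, none of them
# whitespace or a double quote.
LOCUS_TAG_RE = re.compile(r'^\s*/locus_tag="([^\s"]+)"\s*$')


def extract_locus_tag(line):
    """Return the locus tag on a qualifier line, or None if the line is not
    a well-formed /locus_tag qualifier."""
    m = LOCUS_TAG_RE.match(line)
    return m.group(1) if m else None


for line in ('/locus_tag="RSc0382"', '/locus_tag="YPO0002"', '/gene="recA"'):
    print(line, "->", extract_locus_tag(line))
```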
1.4.5 Third-Party Annotation and Consensus Sequences (TPA)
Pursuant to agreements made at the 2002 Collaborative Meeting,
DDBJ/EMBL/GenBank have undertaken the collection of a new class of sequence
data : Third-Party Annotation and Consensus Sequences (TPA).
The TPA data-collection will complement the existing DDBJ/EMBL/GenBank
comprehensive database of primary nucleotide sequences, which typically result
from direct sequencing of cDNAs, ESTs, genomic DNAs, etc.
'Primary data' are defined to be data for which the submitting group has done
the sequencing and annotation, and as 'owner' of these data has privileges to
update/correct the associated sequence records.
In contrast, non-primary (TPA) sequences are defined as sequences which:
a) consist exclusively of sequence data from one, or several,
previously-existing entries 'owned' by other groups, or
b) consist of a mixture of new & previously-existing sequences
TPA categories and requirements
-------------------------------
Users can submit re-annotations/re-assemblies of sequences already
present in DDBJ/EMBL/GenBank and owned by other groups to be
included in the Third Party Annotation (TPA) data-collection.
Categories of data submissions accepted for TPA include:
1. re-annotation/analysis of sequence(s) from DDBJ/EMBL/GenBank
2. mixtures of primary/non-primary sequences, including regions of
new and existing sequence (e.g. filling gaps in a sequence
with data from HTG or EST projects, or newly sequenced data)
3. TPA sequences based on NCBI/Ensembl trace archive data
4. TPA sequences based on Whole Genome Shotgun (WGS) sequences
Consensus sequences from multiple organisms are not accepted.
The TPA dataset is primarily intended as a means to present sequence
and annotation in support of actual biological discoveries, published
in the scientific literature, without requiring that every basepair
has actually been sequenced by the authors/submitters.
In order to assure that the sequence annotation is of high quality,
it is required that TPA records be associated with a study published
in a peer-reviewed journal before the data is released to the public.
Third Party Annotation (TPA) records include a mandatory 'TPA-block'
which documents the relationships between spans of the TPA sequence
and the primary (non-TPA) sequences that contributed to it. The
elements of the TPA-block are:
a) TPA_SPAN             base span on TPA sequence
b) PRIMARY_IDENTIFIER   acc.version of contributing sequence(s)
c) PRIMARY_SPAN         base span on contributing primary sequence
d) COMP                 'c' is used to indicate that the contributing
                        sequence originates from the complementary
                        strand of the primary sequence entry
Example:
TPA_SPAN   PRIMARY_IDENTIFIER   PRIMARY_SPAN   COMP
1-426      AC004528.1           18665-19090
427-526    AC001234.2           1-100          c
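
A TPA-block row of this shape can be read with simple whitespace splitting. The sketch below assumes the four-column layout of the example above, with COMP either absent or 'c'. The final format of TPA records may differ, and the type and function names are hypothetical.

```python
from typing import NamedTuple


class TpaRow(NamedTuple):
    tpa_start: int
    tpa_end: int
    primary_id: str      # accession.version of the contributing sequence
    primary_start: int
    primary_end: int
    complement: bool     # True when the COMP column holds 'c'


def parse_span(text):
    """Split a span such as '18665-19090' into two integers."""
    start, end = text.split("-")
    return int(start), int(end)


def parse_tpa_row(line):
    """Parse one data row of a TPA-block (not the column header line)."""
    fields = line.split()
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 columns: {line!r}")
    if len(fields) == 4 and fields[3] != "c":
        raise ValueError(f"unexpected COMP value: {fields[3]!r}")
    tpa_start, tpa_end = parse_span(fields[0])
    primary_start, primary_end = parse_span(fields[2])
    return TpaRow(tpa_start, tpa_end, fields[1],
                  primary_start, primary_end, len(fields) == 4)


rows = [parse_tpa_row(line) for line in (
    "1-426      AC004528.1   18665-19090",
    "427-526    AC001234.2   1-100          c",
)]
for row in rows:
    print(row)
```

In the two example rows the TPA span and the primary span cover the same number of bases. That holds for this example, but it is not stated as a general requirement.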
Preliminary exchange of TPA records among DDBJ/EMBL/GenBank is
underway. Within two months, data products will be made available at
the GenBank FTP site for TPA sequences. Details about those products,
sample records, and instructions for submission of TPA data, will
be communicated via the GenBank newsgroup:
http://net.bio.net/hypermail/genbankb/