GenBank Release 140.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Wed Feb 25 19:07:30 EST 2004

Greetings GenBank Users,

  GenBank Release 140.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 140.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 140.0

  Uncompressed, the Release 140.0 flatfiles require approximately 127 GB
(sequence files only) or 143 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires approximately 113 GB. From
the release notes:

   Release  Date       Base Pairs   Entries

   139      Dec 2003   36553368485  30968418
   140      Feb 2004   37893844733  32549400

  Close-of-data was 02/20/2004. Four business days were required to prepare
this release. In the eight week period between the close dates for GenBank
releases 139.0 and 140.0, the non-WGS portion of GenBank grew by 1,340,476,248
basepairs and by 1,580,982 sequence records. During that same period, 86,681
records were updated. Combined, this yields an average of about 26,470 new
and/or updated records per day.
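  As a rough cross-check, the growth figures above can be recomputed from
the release table. The sketch below is illustrative only. The day count is
an assumption: the announcement does not state it, and 63 days is used here
because it reproduces the quoted average, whereas 56 days (a literal eight
weeks) would give a larger number.

```python
# Figures copied from the release table in this announcement.
REL_139 = {"base_pairs": 36_553_368_485, "entries": 30_968_418}
REL_140 = {"base_pairs": 37_893_844_733, "entries": 32_549_400}
UPDATED_RECORDS = 86_681

# ASSUMPTION: number of days between the two close-of-data dates.
# Not stated in the announcement; chosen because it matches the
# quoted average of about 26,470 records per day.
ASSUMED_DAYS = 63


def growth(prev, curr):
    """Return the per-field difference between two release summaries."""
    return {key: curr[key] - prev[key] for key in prev}


def daily_average(new_records, updated_records, days):
    """Average number of new and/or updated records per day."""
    return (new_records + updated_records) / days


delta = growth(REL_139, REL_140)
avg = daily_average(delta["entries"], UPDATED_RECORDS, ASSUMED_DAYS)

print(f"new base pairs : {delta['base_pairs']:,}")
print(f"new records    : {delta['entries']:,}")
print(f"per-day average: {avg:,.2f}")
```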

  Between releases 139.0 and 140.0, the WGS component of GenBank grew by
8,280,691,017 basepairs and by 641,660 sequence records. Note that WGS
growth statistics are now available in Section 2.2.8 of the GenBank release
notes.

  We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank.
Those who experience slow FTP transfers due to a high volume of traffic at
NCBI might realize an improvement in transfer rates from these alternate sites.

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 140.0 and Upcoming Changes) have been appended below.

  *NOTE* Section 1.4.1 discusses a very important change : the removal
of sequence length limits for all classes of GenBank sequence records,
as of June 2004. We strongly encourage all users to review this section.

  Release 140.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  If you encounter problems while ftp'ing or uncompressing Release
140.0, please send email outlining your difficulties to
info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev, Michael Kimelman

1.3 Important Changes in Release 140.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 20 with this release:

  - the EST division is now comprised of 296 files (+8)
  - the GSS division is now comprised of 102 files (+4)
  - the HTC division is now comprised of 4   files (+1)
  - the HTG division is now comprised of 62  files (+1)
  - the PAT division is now comprised of 15  files (+4)
  - the VRL division is now comprised of 4   files (+1)
  - the VRT division is now comprised of 5   files (+1)

1.3.2 Filename change : genpept.fsa

  A companion file is made available with every GenBank Release that contains
all of the protein sequences for the coding regions annotated on GenBank
records, in FASTA format. The name of this file has been changed from:


  The term 'GenPept' had been used for purely historical reasons. In fact, 
GenPept is the name of a (non-FASTA) flatfile format for protein sequences,
one that closely parallels the GenBank flatfile format for DNA sequences.

  With this rename, there will be less confusion about the contents of
the file.

1.3.3 WGS Statistics

  Although the Whole Genome Shotgun sequence data are not formally a part
of GenBank Release distributions :

        ftp://ftp.ncbi.nih.gov/ncbi-asn1/wgs (ASN.1 files)
        ftp://ftp.ncbi.nih.gov/genbank/wgs   (GenBank flatfiles)

WGS sequences are a fast-growing component of the GenBank database. So as of
this release, we are providing a table of WGS growth statistics in Section
2.2.8 of the release notes. The first WGS project was processed at GenBank
in April of 2002, so the table starts with Release 129.0 .

1.3.4 GSS File Header Problem

  GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.

  There is thus a discrepancy between the filenames and file headers of twelve
GSS flatfiles in Release 140.0. Consider the gbgss91.seq file:

GBGSS1.SEQ           Genetic Sequence Data Bank
                          February 15 2004

                NCBI-GenBank Flat File Release 140.0

                           GSS Sequences (Part 1)

   88249 loci,    65634155 bases, from    88249 reported sequences

  Here, the part number in both the header filename and the "Part" line is
"1", though the file has been renamed as "91" based on the files dumped from
the other system.

  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
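  Users whose scripts rely on the file header may want to detect the
mismatch described above. The following is a minimal sketch, not an NCBI
tool: the function name and the regular expressions are assumptions based
only on the gbgss91.seq excerpt quoted in this section.

```python
import os
import re

# ASSUMPTION: the first header line begins with the upper-case file
# name (division plus optional part number), as in the excerpt above.
HEADER_NAME = re.compile(r"^\s*(GB[A-Z]+)(\d*)\.SEQ\b")
FILE_NAME = re.compile(r"^(gb[a-z]+)(\d*)\.seq$")


def header_matches_filename(path, first_header_line):
    """Return True if the header names the same division and part
    number as the file on disk, False if they disagree."""
    name = FILE_NAME.match(os.path.basename(path).lower())
    header = HEADER_NAME.match(first_header_line)
    if name is None or header is None:
        raise ValueError("unrecognized file name or header line")
    on_disk = (name.group(1).upper(), name.group(2))
    in_header = (header.group(1), header.group(2))
    return on_disk == in_header


header_line = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_matches_filename("gbgss91.seq", header_line))
```

  For the renamed file shown above the check reports a mismatch, so software
that needs the part number is safer taking it from the filename.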

1.4 Upcoming Changes

1.4.1 **Sequence Length Limitation To Be Removed In June 2004**

  At the May 2003 collaborative meeting among representatives of GenBank,
EMBL, and DDBJ, it was decided that the 350 kilobase limit on the sequence
length of database records will be removed as of June 2004.

  Individual, complete sequences are currently expected to be a maximum
of 350 kbp in length. One major reason for the existence of this limit is
as an aid to users of sequence analysis software, some of which might not
be capable of processing megabase-scale sequences.

  However, very significant exceptions to the 350 kbp limit have existed
for several years: Phase 1 (unordered, unoriented) and Phase 2 (ordered,
oriented) high-throughput genomic sequences (HTGS) generated by efforts
such as the Human Genome Project; large dispersed eukaryotic genes with
an intron/exon structure that spans more than 350 kbp; and sequences
which result from assemblies of Whole Genome Shotgun (WGS) project data.

  Given these exceptions, and the technological advances which have made
large-scale sequencing practical for an increasing number of researchers,
the collaboration has decided that the 350 kbp limit must be removed.

  As of June 2004, the length of database sequences will be limited only
by the natural structures of an organism's genome. For example, a single
record might be used to represent all of human chromosome 1, which is
approximately 245 Mbp in length.
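  Developers who want to see how their software copes with records beyond
the old limit could begin by locating them. The sketch below scans flatfile
lines one at a time and reports records longer than 350 kbp. The LOCUS line
layout in the regular expression is an assumption about the flatfile format,
and the record names and lengths in the sample are invented for illustration.

```python
import re

OLD_LIMIT_BP = 350_000

# ASSUMPTION: a LOCUS line carries the record name followed by the
# sequence length and the token "bp".
LOCUS_LINE = re.compile(r"^LOCUS\s+(\S+)\s+(\d+)\s+bp\b")


def records_over_limit(lines, limit_bp=OLD_LIMIT_BP):
    """Yield (name, length) for each record longer than limit_bp.

    Works on any iterable of lines, so an open file can be passed
    directly without reading a multi-gigabyte flatfile into memory.
    """
    for line in lines:
        match = LOCUS_LINE.match(line)
        if match and int(match.group(2)) > limit_bp:
            yield match.group(1), int(match.group(2))


# Hypothetical LOCUS lines, not real GenBank records.
sample = [
    "LOCUS       EXAMPLE1     245000000 bp    DNA     linear   PRI 01-JUN-2004",
    "ORIGIN",
    "LOCUS       EXAMPLE2        120000 bp    DNA     linear   BCT 01-JUN-2004",
]
print(list(records_over_limit(sample)))
```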

  Software developers for some of the larger commercial sequence analysis
packages were recently asked what timeframe would be appropriate for this
change. Answers ranged from "immediately", to "several months", to "one year".
So the one-year timeframe was selected, to provide ample time to implement
changes which megabase-scale sequences may require.

  Some sample records with very large sequences have been made available
so that developers can begin to test their software modifications:


  Many changes are expected after the removal of the length limit. For 
example, complete bacterial genomes (typically on the order of several
megabases) will be re-assembled into single sequence records. The submission
process for such genomes will become much more streamlined, since database
staff will no longer have to split the genomes into pieces. BLAST services
will be enhanced, so that hits reported within very large sequences will
be presented in a meaningful context.

  All such changes will be discussed more fully in future release notes,
the NCBI newsletter, and the GenBank newsgroup.

1.4.2 Rename of File 'Last.Release' and Deletion of /daily Subdirectories

The files named Last.Release which are located at:


contain the number of the GenBank release which is currently installed
on the NCBI FTP site. As of Release 142.0 in June 2004, these files will
be moved and renamed as:


The /daily subdirectories, which had been used for cumulative update
products that are no longer supported, will be deleted at that time.


- GenBank newsgroup see: http://www.bio.net/hypermail/genbankb/       
- GENBANKB e-mail: messages sent to genbankb at net.bio.net
- subscribe: e-mail biosci-server at net.bio.net with: subscribe genbankb
- unsub: e-mail biosci-server at net.bio.net with: unsubscribe genbankb      
- GenBank on the WWW, see:  http://www.ncbi.nlm.nih.gov/Genbank/

