
GenBank Release 141.0 Now Available

Roels, Steven Steven.Roels at mpi.com
Mon Apr 26 06:30:13 EST 2004

Hi Mark,
Regarding the two missing index files for GenBank 141.0:
        gbacc.idx
        gbjou.idx
Will they be released in a few days, or are they simply not going to be
released this time around?

      -----Original Message-----
      From: owner-genbankb at hgmp.mrc.ac.uk on behalf of Mark Cavanaugh
      Sent: Sat 4/24/2004 10:11 PM
      To: genbank at net.bio.net
      Subject: GenBank Release 141.0 Now Available

      Greetings GenBank Users,

        GenBank Release 141.0 is now available via ftp from the National
      Center for Biotechnology Information (NCBI):

        Ftp Site           Directory   Contents
        ----------------   ---------   ---------------------------------------
        ftp.ncbi.nih.gov   genbank     GenBank Release 141.0 [...]
                           ncbi-asn1   ASN.1 data used to create Release 141.0

        Uncompressed, the Release 141.0 flatfiles require approximately 131 GB
      (sequence files only) or 146 GB (including the 'short directory' and
      'index' files).  The ASN.1 version requires approximately 115 GB. From
      the release notes:

         Release  Date       Base Pairs   Entries

         140      Feb 2004   37893844733  32549400
         141      Apr 2004   38989342565  33676218

        Close-of-data was 04/20/2004. Five working days were required to [...]
      this release. In the eight-week period between the close dates for GenBank
      releases 140.0 and 141.0, the non-WGS portion of GenBank grew by 1,09[...]
      basepairs and by 1,126,818 sequence records. During that same period, [...]
      records were updated. Combined, this yields an average of about 28,500 new
      and/or updated records per day.
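
The growth figures can be cross-checked against the release table. The following is a minimal Python sketch, assuming the table totals cover the same non-WGS portion of GenBank that the paragraph describes. The variable and function names are illustrative and are not part of the release notes.

```python
# Totals copied from the release table (Release, Base Pairs, Entries).
releases = {
    140: {"base_pairs": 37_893_844_733, "entries": 32_549_400},
    141: {"base_pairs": 38_989_342_565, "entries": 33_676_218},
}


def growth(old, new):
    """Return the increase in base pairs and entries between two releases."""
    return {
        "base_pairs": releases[new]["base_pairs"] - releases[old]["base_pairs"],
        "entries": releases[new]["entries"] - releases[old]["entries"],
    }


delta = growth(140, 141)
DAYS = 8 * 7  # the eight week period between the two close dates
new_records_per_day = delta["entries"] / DAYS

print(f"new base pairs:      {delta['base_pairs']:,}")
print(f"new records:         {delta['entries']:,}")
print(f"new records per day: {new_records_per_day:,.0f}")
```

The entry difference agrees with the 1,126,818 new sequence records quoted in the announcement. The per-day figure here counts new records only, so it is lower than the quoted average of about 28,500, which also includes updated records.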

        Between releases 140.0 and 141.0, the WGS component of GenBank grew by
      1,954,410,330 basepairs and by 923,778 sequence records.

        *NOTE* Problems were encountered during release processing which
      prevented the generation of the gbacc.idx and gbjou.idx 'index' files
      for GenBank 141.0. Please see Section 1.3.1 of the release notes for
      further details.

        We would like to remind our users that GenBank mirrors are available
      at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genb[...]
      Those who experience slow FTP transfers due to a high volume of traffic at
      NCBI might realize an improvement in transfer rates from these alternate sites.

        For additional release information, see the README files in either of
      the directories mentioned above, and the release notes (gbrel.txt) in
      the genbank directory. Sections 1.3 and 1.4 of the release notes
      (Changes in Release 141.0 and Upcoming Changes) have been appended

        *NOTE* Section 1.4.1 discusses a very important change: the removal
      of sequence length limits for all classes of GenBank sequence records
      as of June 2004. We strongly encourage all users to review this

        Release 141.0 data, and subsequent updates, are available now via
      NCBI's Entrez and Blast services.

        If you encounter problems while ftp'ing or uncompressing Release
      141.0, please send email outlining your difficulties to
      info at ncbi.nlm.nih.gov .

      Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev, Michael Kimelman

      1.3 Important Changes in Release 141.0

      1.3.1 Two 'index' files unavailable for GenBank 141.0

          An unexpected software problem prevented the generation of two of
        the 'index' files which normally accompany GenBank releases:

              gbacc.idx
              gbjou.idx

          Resolving the problem would have delayed release processing by three
        days, with a concomitant delay in the indexing of recently processed
        records for Entrez. So it was decided to make GenBank 141.0 available
        without these index files. Our apologies for any inconvenience that this
        might cause.
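
Until the index files return, users who rely on gbacc.idx for accession lookups could build a rough substitute from the flatfiles themselves. The Python sketch below is a stopgap under stated assumptions and does not reproduce the gbacc.idx format. It assumes the standard flatfile layout, in which each record has a LOCUS line and an ACCESSION line and ends with `//`. The function name and the sample records are hypothetical.

```python
import io


def accession_index(lines):
    """Map each accession to the LOCUS name of the record that carries it.

    Assumes well-formed GenBank flatfile records: keywords sit in the first
    12 columns, ACCESSION may continue on indented lines, and '//' ends a
    record. Secondary accession ranges are stored as written.
    """
    index = {}
    locus = None
    in_accession = False
    accessions = []
    for line in lines:
        keyword = line[:12].strip()
        if keyword == "LOCUS":
            locus = line.split()[1]
            in_accession = False
        elif keyword == "ACCESSION":
            in_accession = True
            accessions = line.split()[1:]
        elif in_accession and keyword == "" and line.strip():
            accessions = line.split()  # continuation of the ACCESSION line
        else:
            in_accession = False
            if line.startswith("//"):
                locus = None
        if in_accession and locus is not None:
            for accession in accessions:
                index[accession] = locus
    return index


# Made-up records, for illustration only.
sample = io.StringIO(
    "LOCUS       AB000001     120 bp    DNA     linear   GSS 15-APR-2004\n"
    "ACCESSION   AB000001 AB999999\n"
    "VERSION     AB000001.1\n"
    "//\n"
    "LOCUS       AB000002      98 bp    DNA     linear   GSS 15-APR-2004\n"
    "ACCESSION   AB000002\n"
    "//\n"
)
index = accession_index(sample)
print(index)
```

A dictionary held in memory is only practical for a single file or a small subset. For a whole release, the same loop could write to an on-disk store instead.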

      1.3.2 Organizational changes

        The total number of sequence data files increased by 16 with this release:

        - the BCT division is now comprised of   9 files (+1)
        - the EST division is now comprised of 305 files (+9)
        - the GSS division is now comprised of 104 files (+2)
        - the PLN division is now comprised of  11 files (+1)
        - the ROD division is now comprised of  12 files (+1)
        - the STS division is now comprised of   4 files (+1)
        - the VRT division is now comprised of   6 files (+1)

      1.3.3 GSS File Header Problem

        GSS sequences at GenBank are maintained in one of two different systems,
      depending on their origin. One recent change to release processing in[...]
      the parallelization of the dumps from those systems. Because the second dump
      (for example) has no prior knowledge of exactly how many GSS files will be
      dumped from the first, it doesn't know how to number its own output [...]

        There is thus a discrepancy between the filenames and file headers for
      thirteen GSS flatfiles in Release 141.0. Consider the gbgss92.seq file:

      GBGSS1.SEQ           Genetic Sequence Data Bank
                                 April 15 2004

                     NCBI-GenBank Flat File Release 141.0

                            GSS Sequences (Part 1)

         88249 loci,    65635323 bases, from    88249 reported sequences

        Here, the filename and part number in the header both use "1", though the
      file has been renamed as "92" based on the files dumped from the other system.

        We will work to resolve this discrepancy in future releases, but its
      priority is certainly much lower than many other tasks.
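
Software that validates flatfiles could detect this mismatch by comparing the name on the first header line with the name of the file on disk. The Python sketch below uses a hypothetical helper. It assumes that the first header line begins with the upper-case file name, as in the gbgss92.seq example above.

```python
import os


def header_matches_filename(path, first_header_line):
    """Return True when the name in the header equals the on-disk file name.

    The comparison ignores case, since headers use upper case (GBGSS1.SEQ)
    while the distributed files use lower case (gbgss92.seq).
    """
    tokens = first_header_line.split()
    if not tokens:
        return False
    return tokens[0].lower() == os.path.basename(path).lower()


header = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_matches_filename("gbgss92.seq", header))  # mismatch described above
print(header_matches_filename("gbgss1.seq", header))
```

For the affected GSS files, a check like this should warn rather than fail, since the announcement says the discrepancy is known and the data are otherwise valid.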

      1.4 Upcoming Changes

      1.4.1 **Sequence Length Limitation To Be Removed In June 2004**

        At the May 2003 collaborative meeting among representatives of GenBank,
      EMBL, and DDBJ, it was decided that the 350 kilobase limit on the sequence
      length of database records will be removed as of June 2004.

        Individual, complete sequences are currently expected to be a maximum
      of 350 kbp in length. One major reason for the existence of this limit is
      as an aid to users of sequence analysis software, some of which might not
      be capable of processing megabase-scale sequences.

        However, very significant exceptions to the 350 kbp limit have existed
      for several years: Phase 1 (unordered, unoriented) and Phase 2 (ordered,
      oriented) high-throughput genomic sequences (HTGS) generated by efforts
      such as the Human Genome Project; large dispersed eukaryotic genes with
      an intron/exon structure that spans more than 350 kbp; and sequences
      which result from assemblies of Whole Genome Shotgun (WGS) project data.
        Given these exceptions, and the technological advances which have made
      large-scale sequencing practical for an increasing number of researchers,
      the collaboration has decided that the 350 kbp limit must be removed.

        As of June 2004, the length of database sequences will be limited
      by the natural structures of an organism's genome. For example, a single
      record might be used to represent all of human chromosome 1, which is
      approximately 245 Mbp in length.

        Software developers for some of the larger commercial sequence analysis
      packages were recently asked what timeframe would be appropriate for this
      change. Answers ranged from "immediately", to "several months", to "one year".
      So the one-year timeframe was selected, to provide ample time to implement
      changes which megabase-scale sequences may require.

        Some sample records with very large sequences have been made available
      so that developers can begin to test their software modifications:

              ftp://ftp.ncbi.nih.gov/genbank/LargeSeqs
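
One simple way to exercise software against such records is to find which entries in a flatfile exceed the old 350 kbp limit. The Python sketch below assumes the standard LOCUS line layout, in which the sequence length is the token immediately before `bp`. The function name, the constant, and the sample lines are hypothetical.

```python
OLD_LIMIT_BP = 350_000  # the 350 kbp limit described in this section


def oversized_records(lines, limit=OLD_LIMIT_BP):
    """Yield (locus_name, length) for LOCUS lines longer than `limit`."""
    for line in lines:
        if not line.startswith("LOCUS"):
            continue
        tokens = line.split()
        if "bp" not in tokens:
            continue  # not a nucleotide LOCUS line in the assumed layout
        position = tokens.index("bp")
        if position < 2:
            continue  # no room for both a locus name and a length
        try:
            length = int(tokens[position - 1])
        except ValueError:
            continue  # length field is not a plain integer
        if length > limit:
            yield tokens[1], length


# Made-up LOCUS lines, for illustration only.
sample = [
    "LOCUS       EXAMPLE1       1200 bp    DNA     linear   BCT 15-APR-2004",
    "LOCUS       EXAMPLE2    4600000 bp    DNA     circular BCT 15-APR-2004",
]
large = list(oversized_records(sample))
print(large)
```

Because the function is a generator that reads one line at a time, it can be pointed at an open file handle without loading a megabase-scale record into memory.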

        Many changes are expected after the removal of the length limit. For
      example, complete bacterial genomes (typically on the order of several
      megabases) will be re-assembled into single sequence records. The submission
      process for such genomes will become much more streamlined, since database
      staff will no longer have to split the genomes into pieces. BLAST services
      will be enhanced, so that hits reported within very large sequences can
      be presented in a meaningful context.

        All such changes will be discussed more fully in future release notes,
      the NCBI newsletter, and the GenBank newsgroup.

      1.4.2 Rename of File 'Last.Release' and Deletion of /daily Subdirectories

      The files named Last.Release which are located at:

              ftp://ftp.ncbi.nih.gov/genbank/daily/Last.Relea[...]
              ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/Last.Rel[...]

      contain the number of the GenBank release which is currently installed
      on the NCBI FTP site. As of Release 142.0 in June 2004, these files will
      be moved and renamed as:

              ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Numbe[...]
              ftp://ftp.ncbi.nih.gov/ncbi-asn1/GB_Release_Num[...]

      The /daily subdirectories, which had been used for cumulative update
      products that are no longer supported, will be deleted at that time.

      - gttaacaattaaagagtgtttatcgaaattcattatatagtggtttatatagaccacttc
      - GenBank newsgroup see: http://www.bio.net/hypermail/genbankb/
      - GENBANKB e-mail: messages sent to genbankb at net.bio.net
      - subscribe: e-mail biosci-server at net.bio.net with: subscribe genbankb
      - unsub: e-mail biosci-server at net.bio.net with: unsubscribe genbankb
      - GenBank on the WWW, see:  http://www.ncbi.nlm.nih.gov/Genbank/
      - problems with GENBANKB? E-mail moderator: francis at bioinformatics.ubc.ca
