Greetings GenBank Users,
GenBank Release 146.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):
Ftp Site           Directory   Contents
----------------   ---------   ---------------------------------------
ftp.ncbi.nih.gov   genbank     GenBank Release 146.0 flatfiles
                   ncbi-asn1   ASN.1 data used to create Release 146.0
Close-of-data was 02/16/2005. Five business days were required to build
Release 146.0. Uncompressed, the Release 146.0 flatfiles require approximately
162 GB (sequence files only) or 180 GB (including the 'short directory' and
'index' files). The ASN.1 version requires approximately 140 GB. From
the release notes:
   Release   Date       Base Pairs    Entries
   145       Dec 2004   44575745176   40604319
   146       Feb 2005   46849831226   42734478
In the nearly nine-week period between the close dates for GenBank Releases 145.0
and 146.0, the non-WGS portion of GenBank grew by 2,274,086,050 basepairs
and by 2,130,159 sequence records. During that same period, 489,419 records
were updated. Combined, this yields an average of about 42,250 new and/or
updated records per day.
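For readers who want to re-derive these figures, here is a minimal Python sketch of the arithmetic. The totals are copied from the table above; the variable names are ours and purely illustrative. Because the exact number of days between the two close dates is not stated here, the sketch back-calculates the span implied by the quoted daily average instead of assuming one.

```python
# Totals copied from the release table above (non-WGS portion of GenBank).
BASE_PAIRS = {"145": 44_575_745_176, "146": 46_849_831_226}
ENTRIES = {"145": 40_604_319, "146": 42_734_478}

UPDATED_RECORDS = 489_419       # records updated between the two close dates
QUOTED_DAILY_AVERAGE = 42_250   # "about 42,250" new and/or updated records per day

new_base_pairs = BASE_PAIRS["146"] - BASE_PAIRS["145"]
new_entries = ENTRIES["146"] - ENTRIES["145"]
new_or_updated = new_entries + UPDATED_RECORDS

# The exact day count is not given in the text, so back-calculate the span
# implied by the quoted average rather than assume one.
implied_days = new_or_updated / QUOTED_DAILY_AVERAGE

print(f"new base pairs:         {new_base_pairs:,}")
print(f"new sequence records:   {new_entries:,}")
print(f"new or updated records: {new_or_updated:,}")
print(f"implied span in days:   {implied_days:.1f}")
```

The implied span comes out at roughly 62 days, which is consistent with "nearly nine weeks".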
Between releases 145.0 and 146.0, the WGS component of GenBank grew by
3,067,043,856 basepairs and by 701,367 sequence records.
As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are
current (e.g., February 15 2005, 146.0). If they are not, interrupt the
remaining transfers and then request assistance from the NCBI Service Desk.
A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix platform with csh/tcsh:
    set files = `ls gb*.*`
    foreach i ($files)
        head -10 $i | grep Release
    end
Or, if the files are compressed, replace the loop body with, perhaps:
    gzcat $i | head -10 | grep Release
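On systems without csh, the same header check might be sketched in Python. This is only an illustration, not an NCBI-supplied tool: the function names and the default "Release 146.0" string are our own assumptions, and files are treated as gzip-compressed only when their names end in ".gz".

```python
import gzip
from pathlib import Path


def release_lines(path, max_lines=10):
    """Return the lines among the first `max_lines` of `path` that mention 'Release'."""
    path = Path(path)
    # Assumption: compressed files are gzip files whose names end in ".gz".
    opener = gzip.open if path.suffix == ".gz" else open
    found = []
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for line_number, line in enumerate(handle):
            if line_number >= max_lines:
                break
            if "Release" in line:
                found.append(line.rstrip("\n"))
    return found


def check_release_files(directory=".", expected="Release 146.0"):
    """Map each gb*.* file name to True if its header mentions `expected`."""
    results = {}
    for path in sorted(Path(directory).glob("gb*.*")):
        results[path.name] = any(expected in line for line in release_lines(path))
    return results
```

Calling check_release_files() in the directory that holds the transferred files returns a dictionary; any file mapped to False deserves a closer look.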
For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 146.0 and Upcoming Changes) have been appended
below.
Release 146.0 data, and subsequent updates, are available now via
NCBI's Entrez and BLAST services.
If you encounter problems while ftp'ing or uncompressing Release
146.0, please send email outlining your difficulties to:
info at ncbi.nlm.nih.gov
Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman
GenBank
NCBI/NLM/NIH
1.3 Important Changes in Release 146.0
1.3.1 Organizational changes
The total number of sequence data files increased by 37 with this release:
- the EST division is now comprised of 377 files (+22)
- the GSS division is now comprised of 138 files (+6)
- the HTG division is now comprised of 63 files (+1)
- the PLN division is now comprised of 15 files (+2)
- the ROD division is now comprised of 16 files (+1)
- the STS division is now comprised of 9 files (+4)
- the VRT division is now comprised of 8 files (+1)
1.3.2 Continuous ranges of secondary accessions
With the removal of sequence length limits, some genomes (typically
bacterial) that had been split into many pieces are gradually being
replaced by a single sequence record. U00096 is a good example.
When this happens, the accessions of the former small pieces become
secondary accessions for the single large sequence record. When each
secondary is separately listed, the ACCESSION line becomes excessively
lengthy.
As of this February 2005 GenBank Release, continuous ranges of secondary
accessions (represented by a start accession, a dash character, and an end
accession) will begin to appear, initially within the GenBank Updates. In
the case of U00096, the ACCESSION line would look like:
ACCESSION   U00096 AE000111-AE000510
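Software that parses ACCESSION lines may need to expand such ranges back into individual accessions. A minimal Python sketch follows. It assumes that both ends of a range share the same letter prefix and the same number of digits, as in the U00096 example; the function names are hypothetical and not part of any NCBI toolkit.

```python
import re

# Assumption: an accession inside a range is a letter prefix followed by digits.
_ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def expand_accession_token(token):
    """Expand 'AE000111-AE000510' into individual accessions; pass others through."""
    if "-" not in token:
        return [token]
    start, end = token.split("-", 1)
    start_match = _ACCESSION.match(start)
    end_match = _ACCESSION.match(end)
    if not start_match or not end_match:
        raise ValueError(f"unrecognized accession range: {token}")
    prefix, first = start_match.groups()
    end_prefix, last = end_match.groups()
    if prefix != end_prefix or len(first) != len(last) or int(first) > int(last):
        raise ValueError(f"inconsistent accession range: {token}")
    width = len(first)
    return [f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)]


def parse_accession_line(line):
    """Return (primary, [secondary accessions]) for one ACCESSION line."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "ACCESSION":
        raise ValueError("not an ACCESSION line")
    primary = fields[1]
    secondary = []
    for token in fields[2:]:
        secondary.extend(expand_accession_token(token))
    return primary, secondary
```

Applied to the U00096 line above, this yields U00096 as the primary accession and 400 secondary accessions, AE000111 through AE000510.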
1.3.3 GSS File Header Problem
GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped from the first, it does not know how to number its own
output files.
There is thus a discrepancy between the filenames and file headers for
twenty-four GSS flatfiles in Release 146.0. Consider gbgss115.seq:

GBGSS1.SEQ          Genetic Sequence Data Bank
                         February 15 2005

               NCBI-GenBank Flat File Release 146.0

                       GSS Sequences (Part 1)

   87937 loci,    65332512 bases, from    87937 reported sequences

Here, the filename and part number in the header both read "1", though the file
has been renamed as "115" based on the number of files dumped from the other
system. We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
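Until the discrepancy is resolved, anyone who validates headers against filenames may want to flag the affected GSS files rather than treat them as transfer errors. One possible Python sketch is shown here. It assumes, based on the example above, that the first header line begins with the uppercase file name; the function names are hypothetical.

```python
import os


def header_filename(first_header_line):
    """Return the file name at the start of a flatfile header line, lowercased."""
    fields = first_header_line.split()
    return fields[0].lower() if fields else ""


def header_matches_filename(path, first_header_line):
    """True if the name embedded in the header agrees with the name on disk."""
    name = os.path.basename(path).lower()
    # Assumption: a compressed copy carries an extra ".gz" suffix.
    if name.endswith(".gz"):
        name = name[:-3]
    return header_filename(first_header_line) == name
```

For gbgss115.seq and the header shown above, header_matches_filename returns False, which in Release 146.0 is expected for the twenty-four affected GSS files.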
1.4 Upcoming Changes
1.4.1 New ENV Division in April 2005
A new division for sequences obtained via environmental sampling methods
will be introduced with GenBank Release 147.0 in April 2005. Records in this
new division will have these characteristics:
1. ENV division code on the LOCUS line
2. ENV keyword
3. /environmental_sample qualifier in the source feature
This new division will segregate sequences for which the source organism is
unknown, or can only be inferred by sequence comparison.
Sequences from WGS projects that involve environmental sampling will *not*
be distributed via this new division. All WGS projects will continue to be
distributed using project-specific data files at the NCBI FTP site:
ftp://ftp.ncbi.nih.gov/ncbi-asn1/wgs
ftp://ftp.ncbi.nih.gov/genbank/wgs
Additional information about the new ENV division will be provided via
these release notes and the GenBank newsgroup.
1.4.2 Removal of MEDLINE linetype in April 2005
The PUBMED linetype was introduced in December of 1997, as a means of
linking references in sequence records to the PubMed biomedical literature
database, based on a PubMed ID (PMID).
Since then, we have been displaying both the PMID and its predecessor
(Medline Unique ID / MUID) for all references. For example:
LOCUS       ECOGUABA                3531 bp    DNA     linear   BCT 09-FEB-2005
DEFINITION  Escherichia coli guaBA operon operon, complete sequence.
ACCESSION   M10101 M10102
VERSION     M10101.1  GI:146274
....
REFERENCE   1  (bases 1768 to 3531)
  AUTHORS   Tiedeman,A.A., Smith,J.M. and Zalkin,H.
  TITLE     Nucleotide sequence of the guaA gene encoding GMP synthetase of
            Escherichia coli K12
  JOURNAL   J. Biol. Chem. 260 (15), 8676-8679 (1985)
  MEDLINE   85261223
   PUBMED   3894345
Subsequent to 1997, PMID article identifiers subsumed MUIDs. Some background
information about that evolution can be found at:
http://www.nlm.nih.gov/pubs/techbull/mj01/mj01_medline_ui.html
Starting with GenBank Release 147.0 in April of 2005, the older MEDLINE
linetype will be displayed in GenBank sequence records only for (very rare)
articles that lack a PMID identifier.
For the vast majority of cases, this means that the MEDLINE linetype will
no longer be displayed; only the PUBMED identifier will be presented.
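Parsers that currently rely on the MEDLINE linetype may wish to prefer PUBMED and fall back to MEDLINE only when no PMID is present. The Python sketch below illustrates one way to do that. It is not NCBI code: the function names are hypothetical, and it assumes that continuation lines are indented by at least twelve columns, as in the standard flatfile layout.

```python
def reference_ids(record_lines):
    """Collect the MEDLINE and PUBMED identifiers of each REFERENCE block."""
    references = []
    current = None
    for line in record_lines:
        stripped = line.lstrip()
        indent = len(line) - len(stripped)
        # Assumption: lines indented 12 or more columns continue the prior field.
        if not stripped or indent >= 12:
            continue
        fields = stripped.split(None, 1)
        keyword = fields[0]
        value = fields[1].split()[0] if len(fields) > 1 and fields[1].split() else None
        if keyword == "REFERENCE":
            current = {"number": value, "medline": None, "pubmed": None}
            references.append(current)
        elif current is not None and keyword in ("MEDLINE", "PUBMED"):
            current[keyword.lower()] = value
    return references


def preferred_id(reference):
    """Prefer the PMID; fall back to the MUID only when no PMID is present."""
    if reference["pubmed"]:
        return ("PUBMED", reference["pubmed"])
    if reference["medline"]:
        return ("MEDLINE", reference["medline"])
    return None
```

For the guaBA reference above, preferred_id selects the PUBMED identifier 3894345; the MEDLINE value is used only for a reference that carries no PUBMED line.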