Greetings GenBank Users,
GenBank Release 117.0 is now available via ftp from the National Center
for Biotechnology Information:
Ftp Site           Directory   Contents
----------------   ---------   ---------------------------------------
ncbi.nlm.nih.gov   genbank     GenBank Release 117.0 flatfiles
                   ncbi-asn1   ASN.1 data used to create Release 117.0
Uncompressed, the Release 117.0 flatfiles require roughly 24684 MB
(sequence files only) or 29031 MB (including the 'index' files). The
ASN.1 version requires roughly 20487 MB. From the release notes:
Release   Date       Base Pairs   Entries
116       Feb 2000   5805414935   5691170
117       Apr 2000   7376080723   6215002
In the nine-week period between close-of-data for GenBank 116.0 and
GenBank 117.0, GenBank grew by a record 1.570 billion basepairs.
Close-of-data was 04/25/2000. Eight days were required to prepare this
release. Like Release 116.0, 117.0 has been delayed by two weeks, this time
due to changes in the systems used to generate the GenBank flatfiles that
comprise releases. These changes should be complete by June, so we expect
no delays for 118.0 . The release date for 117.0 remains "April 15",
chiefly as a convenience for recording growth stats.
PLEASE NOTE: Problems were encountered once again building the author-name
index file (gbaut.idx) for GenBank 117.0 . The version now available on
our ftp site, while complete (and exceeding 3.6 GB in size), has not been
converted to the tabular format required for this file. Rather than delay
GenBank 117.0 even further, we are making the index available in its raw form.
If resources permit, we will convert gbaut.idx to its proper format; our
apologies for any inconvenience that this causes.
(See also Section 1.4.2 of the release notes, below)
For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 117.0 and Upcoming Changes) have been appended below.
Release 117.0 data are currently available via NCBI's Entrez and Blast
servers, and the 'query' email server.
New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 117.0 close-of-data, should be
available by 6:00am EDT, May 3. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 116.0 was
posted.
If you encounter problems while ftp'ing or uncompressing Release 117.0,
please send email outlining your difficulties to info@ncbi.nlm.nih.gov .
Mark Cavanaugh
GenBank
NCBI/NLM/NIH
1.3 Important Changes in Release 117.0
1.3.1 Organizational changes
Due to database growth, the EST division is now being split into fifty-five
pieces.
Due to database growth, the GSS division is now being split into nineteen
pieces.
Due to database growth, the HTG division is now being split into thirty-one
pieces.
Due to database growth, the INV division is now being split into three pieces.
Due to database growth, the VRL division is now being split into two pieces.
1.3.2 Mutation and Allele features discontinued
Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that the functionality provided by the variation, mutation, and
allele features can be represented by just a single feature, variation.
With GenBank Release 117.0, all existing mutation and allele features
have been converted to variation; mutation and allele are no longer legal
feature keys.
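For software that still holds older records, the conversion described above can be sketched as follows. This is only an illustration: the function name normalize_feature_key and the LEGACY_KEYS set are hypothetical and are not part of any NCBI tool.

```python
# Hypothetical helper: map the discontinued feature keys onto "variation".
LEGACY_KEYS = {"mutation", "allele"}


def normalize_feature_key(key):
    """Return 'variation' for the discontinued keys, else the key unchanged."""
    return "variation" if key in LEGACY_KEYS else key


# Example: keys as they might appear in a pre-117.0 feature table.
old_keys = ["CDS", "mutation", "allele", "variation"]
new_keys = [normalize_feature_key(k) for k in old_keys]
```

Only the feature key is rewritten here; qualifiers and locations are left alone.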
Reminder: complete Feature Table documentation is available at this URL:
http://www.ncbi.nlm.nih.gov/collab/FT/index.html
1.4 Upcoming Changes
1.4.1 New PUBMED linetype for REFERENCEs
Starting with GenBank Release 119.0 in August 2000, a new PUBMED
linetype will be legal for the REFERENCE block of GenBank flatfiles:
LOCUS       AF245949      558 bp    RNA             VRL       30-APR-2000
DEFINITION  Hepatitis C virus isolate P11 clone A41 polyprotein precursor,
            E1/E2 region, gene, partial cds.
ACCESSION   AF245949
VERSION     AF245949.1  GI:7670856
....
REFERENCE   1  (bases 1 to 558)
  AUTHORS   Farci,P., Shimoda,A., Coiana,A., Diaz,G., Peddis,G.,
            Melpolder,J.C., Strazzera,A., Chien,D.Y., Munoz,S.J.,
            Balestrieri,A., Purcell,R.H. and Alter,H.J.
  TITLE     The outcome of acute hepatitis C predicted by the evolution of the
            viral quasispecies
  JOURNAL   Science 288 (5464), 339-344 (2000)
  MEDLINE   20230065
   PUBMED   10764648
The PUBMED identifier is the record identifier for article abstracts
in the PubMed database:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
Abstracts in PubMed that do not fall within Medline's scope will have only
a PUBMED identifier. Similarly, abstracts that *are* in Medline's scope but
which have not yet been assigned Medline UIs will have only a PUBMED identifier.
If an abstract is present in both the PubMed and Medline databases, both Medline UI
and PubMed ID will be provided.
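A parser that already reads MEDLINE lines could pick up the new linetype along these lines. This is a minimal sketch, not NCBI code: the function name reference_ids is hypothetical, and it assumes each identifier sits on its own line as a keyword followed by whitespace and the value, as in the example above. A continuation line that happens to begin with the word MEDLINE or PUBMED would be misread, so a production parser should check column positions as well.

```python
def reference_ids(reference_lines):
    """Collect MEDLINE and PUBMED identifiers from one REFERENCE block."""
    ids = {}
    for line in reference_lines:
        parts = line.split(None, 1)  # keyword, then the rest of the line
        if len(parts) == 2 and parts[0] in ("MEDLINE", "PUBMED"):
            ids[parts[0]] = parts[1].strip()
    return ids


block = [
    "REFERENCE   1  (bases 1 to 558)",
    "  JOURNAL   Science 288 (5464), 339-344 (2000)",
    "  MEDLINE   20230065",
    "   PUBMED   10764648",
]
ids = reference_ids(block)
```

A reference that carries only one of the two identifiers simply yields a dictionary with a single key.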
1.4.2 New format for GenBank Index files
Starting with GenBank Release 119.0 in August 2000, the format of the
"index" files for releases will change from a tabular, fixed-column
format to a TAB-delimited, line-oriented format. The header information
at the start of index files will no longer be provided.
In general, index file entries consist of a line containing the indexed
term, followed by a table containing LOCUS/DIVISION/ACCESSION triplets.
For example:
GBKEY.IDX Genetic Sequence Data Bank
15 April 2000
NCBI-GenBank Flat File Release 117.0
Keyword Phrase Index
6215002 loci, 7376080723 bases, from 6215002 reported sequences
....
ZONA PELLUCIDA 2 GLYCOPROTEIN
AB000929 ROD AB000929 CATFZP2G MAM D45067 CJZPG2 PRI Y10767
DOGCZP2G MAM D45069 MRZPG2 PRI Y10690 PIGPZP2G MAM D45064
Notice that the "fixed" format is already broken, due to the presence of
eight-character accession numbers. Rather than define a new fixed format
that will break at some point in the future, and at the expense of slightly
larger files, the new index files for the above example will look like so:
ZONA PELLUCIDA 2 GLYCOPROTEIN
        AB000929        ROD     AB000929
        CATFZP2G        MAM     D45067
        CJZPG2          PRI     Y10767
        DOGCZP2G        MAM     D45069
        MRZPG2          PRI     Y10690
        PIGPZP2G        MAM     D45064
A series of LOCUS/DIVISION/ACCESSION triplets, TAB-delimited (and with
a leading TAB), one per line, will follow each indexed value.
Complete details about the changes to the index files will be provided
via the GenBank newsgroup (bionet.molbio.genbank) and future release notes.
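Since complete details are still to come, the following is only a sketch of how the line-oriented format described above might be read. It assumes that a line starting with a TAB is a LOCUS/DIVISION/ACCESSION triplet and that any other non-blank line is an indexed term; the function name parse_index is hypothetical.

```python
def parse_index(lines):
    """Group LOCUS/DIVISION/ACCESSION triplets under each indexed term."""
    index = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("\t"):
            if current is None:
                raise ValueError("triplet line found before any indexed term")
            fields = line[1:].split("\t")
            if len(fields) != 3:
                raise ValueError("expected 3 TAB-delimited fields: %r" % line)
            index[current].append(tuple(fields))
        else:
            current = line
            index.setdefault(current, [])
    return index


sample = (
    "ZONA PELLUCIDA 2 GLYCOPROTEIN\n"
    "\tAB000929\tROD\tAB000929\n"
    "\tCATFZP2G\tMAM\tD45067\n"
)
entries = parse_index(sample.splitlines())
```

Because each triplet is on its own line, accession numbers of any length can be handled without redefining column widths.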
1.4.3 STS division will be split into multiple files
The STS GenBank division (gbsts.seq) will soon be split into multiple
files, since its size exceeds 300MB. This is likely to occur by GenBank
Release 118.0 (June 2000). The resulting files for STS will be:
gbsts1.seq and gbsts2.seq .
1.4.4 File-naming convention for ASN.1 data files will be changed
Starting with GenBank Release 119.0 in August 2000, the filename
convention for the ASN.1 data files used to create GenBank flatfile
releases will be changed. These ASN.1 files can be found at the NCBI
ftp site:
ftp://ncbi.nlm.nih.gov/ncbi-asn1/
The naming convention for these files is currently:
DIV-CODE.aso.Z
For example:
bct1.aso.Z
bct2.aso.Z
This convention will be changed so that the ASN.1 filenames and the
GenBank flatfile names match more closely:
gbDIV-CODE.aso.Z
For example:
gbbct1.aso.Z
gbbct2.aso.Z
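Scripts that fetch these files by name may need a small mapping step once the change takes effect. The sketch below assumes the only difference is the "gb" prefix shown above; the function name new_asn1_name is hypothetical, and names that already start with "gb" (such as gbcu.aso.Z) are passed through unchanged.

```python
def new_asn1_name(old_name):
    """Map DIV-CODE.aso.Z to gbDIV-CODE.aso.Z (assumed prefix-only change)."""
    if not old_name.endswith(".aso.Z"):
        raise ValueError("not an ASN.1 release file name: %r" % old_name)
    if old_name.startswith("gb"):
        return old_name  # already in the new style
    return "gb" + old_name


renamed = [new_asn1_name(n) for n in ("bct1.aso.Z", "bct2.aso.Z")]
```

The pass-through rule relies on no division code beginning with "gb", which holds for the examples given here but is an assumption for any codes not listed.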
1.4.5 Selenocysteine representation
Selenocysteine residues within the protein translations of coding
region features have been represented in GenBank via the letter 'X'
and a /transl_except qualifier. At the May 1999 DDBJ/EMBL/GenBank
collaborative meeting, it was learned that IUPAC plans to adopt the
letter 'U' for selenocysteine.
DDBJ, EMBL, and GenBank will thus use this new amino acid abbreviation
in their /translation qualifiers. Although a timetable for its appearance
has not been finalized, we are mentioning this now because the introduction
of a new residue abbreviation is a fairly fundamental change.
Details about the use of 'U' will be made available via these release
notes and the GenBank newsgroup as they become available.
1.4.6 New REFERENCE type for on-line journals
Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that an effort should be made to accommodate references which are
published only on-line. Until specifications for such references are
available from library organizations, GenBank will present them in a manner
like this:
REFERENCE   1  (bases 1 to 2858)
  AUTHORS   Smith, J.
  TITLE     Cloning and expression of a phospholipase gene
  JOURNAL   Online Publication
  REMARK    Online-Journal-name; Article Identifier; URL
This format is still tentative; additional information about this new
reference type will be made available via these release notes.