Greetings GenBank Users,
GenBank Release 185.0 is now available via FTP from the
National Center for Biotechnology Information (NCBI):
Ftp Site           Directory   Contents
----------------   ---------   ---------------------------------------
ftp.ncbi.nih.gov   genbank     GenBank Release 185.0 flatfiles
                   ncbi-asn1   ASN.1 data used to create Release 185.0
Close-of-data for GenBank 185.0 occurred on 08/14/2011. Uncompressed,
the Release 185.0 flatfiles require roughly 511 GB (sequence files only)
or 550 GB (including the 'short directory', 'index' and the *.txt
files). The ASN.1 data require approximately 420 GB.
Recent statistics for non-WGS, non-CON sequences:
Release   Date       Base Pairs     Entries
184       Jun 2011   129178292958   140482268
185       Aug 2011   130671233801   142284608
Recent statistics for WGS sequences:
Release   Date       Base Pairs     Entries
184       Jun 2011   200487078184   63735078
185       Aug 2011   208315831132   64997137
During the 46 days between the close dates for GenBank Releases 184.0
and 185.0, the non-WGS/non-CON portion of GenBank grew by 1,492,940,843
basepairs and by 1,802,340 sequence records. During that same period,
970,764 records were updated. An average of 60,285 non-WGS/non-CON
records were added and/or updated per day.
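
The growth figures above can be re-derived from the statistics table. The following is only an illustrative check in Python, using the numbers quoted in this announcement; the variable names are ours:

```python
# Figures quoted in this announcement (non-WGS, non-CON sequences).
bases_184, entries_184 = 129_178_292_958, 140_482_268
bases_185, entries_185 = 130_671_233_801, 142_284_608
updated = 970_764   # records updated between the two close dates
days = 46           # days between the two close dates

new_bases = bases_185 - bases_184
new_entries = entries_185 - entries_184
# Daily average counts both added and updated records.
per_day = round((new_entries + updated) / days)

print(f"{new_bases:,} basepairs, {new_entries:,} records, {per_day:,} per day")
```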
Between releases 184.0 and 185.0, the WGS component of GenBank grew by
7,828,752,948 basepairs and by 1,262,059 sequence records.
For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 185.0 and Upcoming Changes) have been appended
below for your convenience.
** Important Notes **
* GenBank 'index' files are now provided without any EST content, and
without most GSS content. See Section 1.3.2 of the release notes for
further details.
NCBI is considering ceasing support for the index files, so we
encourage affected users to review that section and provide feedback.
Release 185.0 data, and subsequent updates, are available now via
NCBI's Entrez and BLAST services.
As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (e.g., August 15 2011, 185.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.
A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh :
    set files = `ls gb*.*`
    foreach i ($files)
        head -10 $i | grep Release
    end
Or, if the files are compressed, replace the line inside the loop with
(on systems without gzcat, zcat usually behaves the same way):
    gzcat $i | head -10 | grep Release
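
For sites that prefer a script to an interactive shell loop, the same header check might be sketched in Python as follows. This is not an NCBI-provided tool: the function names are hypothetical, and the sketch assumes that compressed files carry a .gz suffix and that the release number appears within the first ten lines of each file.

```python
import gzip
from itertools import islice
from pathlib import Path


def release_lines(path, n=10):
    """Return the lines among the first n of a file that mention 'Release'."""
    path = Path(path)
    # Assumption: compressed release files end in .gz.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="ascii", errors="replace") as fh:
        return [line.rstrip("\n") for line in islice(fh, n) if "Release" in line]


def check_release(directory=".", expected="185.0"):
    """Map each gb*.* file name to True if its header names the expected release."""
    results = {}
    for path in sorted(Path(directory).glob("gb*.*")):
        results[path.name] = any(expected in line for line in release_lines(path))
    return results
```

Calling check_release() in the download directory returns a dictionary keyed by file name; any entry that is False deserves a closer look before the data are used.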
If you encounter problems while ftp'ing or uncompressing Release
185.0, please send email outlining your difficulties to:
info@ncbi.nlm.nih.gov
Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky, Sergey Zhdanov
GenBank
NCBI/NLM/NIH/HHS
1.3 Important Changes in Release 185.0
1.3.1 Organizational changes
The total number of sequence data files increased by 19 with this release:
- the BCT division is now composed of 75 files (+3)
- the ENV division is now composed of 42 files (+2)
- the EST division is now composed of 447 files (+2)
- the GSS division is now composed of 248 files (+1)
- the HTG division is now composed of 135 files (-1)
- the INV division is now composed of 31 files (+1)
- the PAT division is now composed of 168 files (+4)
- the PLN division is now composed of 50 files (+2)
- the TSA division is now composed of 35 files (+5)
On rare occasions, the number of HTG files decreases when a significant
number of HTG records 'graduate' to Phase 3, at which point they move to
a non-HTG division.
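
As a quick consistency check, the per-division changes listed above do sum to the stated net increase of 19 files. A small Python illustration, with the counts copied from the list:

```python
# Current file counts and changes per division, as listed above.
current = {"BCT": 75, "ENV": 42, "EST": 447, "GSS": 248, "HTG": 135,
           "INV": 31, "PAT": 168, "PLN": 50, "TSA": 35}
delta = {"BCT": 3, "ENV": 2, "EST": 2, "GSS": 1, "HTG": -1,
         "INV": 1, "PAT": 4, "PLN": 2, "TSA": 5}

net_change = sum(delta.values())
# File counts implied for the previous release (184.0).
previous = {div: current[div] - delta[div] for div in current}

print(net_change)
print(previous["HTG"])
```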
The total number of 'index' files increased by 2 with this release:
- the AUT (author name) index is now composed of 89 files (+1)
- the KEY (keyword) index is now composed of 6 files (+1)
1.3.2 Changes in the content of index files
As described in the GB 153 release notes, the 'index' files which accompany
GenBank releases (see Section 3.3) are considered to be a legacy data product by
NCBI, generated mostly for historical reasons. FTP statistics from January 2005
seemed to support this: the index files were transferred only half as frequently as
the files of sequence records. The inherent inefficiencies of the index file
format also lead us to suspect that they have little serious use by the user
community, particularly for EST and GSS records.
The software that generated the index file products received little
attention over the years, and finally reached its limitations in
February 2006 (Release 152.0). The required multi-server queries which
obtained and sorted many millions of rows of terms from several different
databases simply outgrew the capacity of the hardware used for GenBank
Release generation.
Our short-term solution is to cease generating some index-file content
for all EST sequence records, and for GSS sequence records that originate
via direct submission to NCBI.
The three gbacc*.idx index files continue to reflect the entirety of the
release, including all EST and GSS records; however, the file contents are
unsorted.
These 'solutions' are really just stop-gaps, and we will likely pursue
one of two options:
a) Cease support of the 'index' file products altogether.
b) Provide new products that present some of the most useful data from
the legacy 'index' files, and cease support for other types of index data.
If you are a user of the 'index' files associated with GenBank releases, we
encourage you to make your wishes known, either via the GenBank newsgroup,
or via email to NCBI's Service Desk:
info@ncbi.nlm.nih.gov
Our apologies for any inconvenience that these changes may cause.
1.3.3 GSS File Header Problem
GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped by the first, it does not know how to number its own
output files.
There is thus a discrepancy between the filenames and file headers for
103 of the GSS flatfiles in Release 185.0. Consider gbgss146.seq :
    GBGSS1.SEQ Genetic Sequence Data Bank
    August 15 2011
    NCBI-GenBank Flat File Release 185.0
    GSS Sequences (Part 1)
    87126 loci, 64015147 bases, from 87126 reported sequences
Here, the filename and part number in the header are "1", though the file
has been renamed as "146" based on the number of files dumped from the other
system. We hope to resolve this discrepancy at some point, but the priority
is certainly much lower than many other tasks.
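
Until the headers are corrected, software that relies on the part number may need to take it from the filename instead. The sketch below shows one way to compare the two in Python. The helper names are hypothetical, and the sketch assumes that the affected files are numbered contiguously, so that the offset seen for gbgss146.seq applies to all of them.

```python
import re


def filename_part(filename):
    """Part number encoded in a name such as 'gbgss146.seq', else None."""
    m = re.fullmatch(r"gbgss(\d+)\.seq", filename)
    return int(m.group(1)) if m else None


def header_part(header_text):
    """Part number from a header line such as 'GSS Sequences (Part 1)', else None."""
    m = re.search(r"\(Part (\d+)\)", header_text)
    return int(m.group(1)) if m else None


header = """GBGSS1.SEQ Genetic Sequence Data Bank
August 15 2011
NCBI-GenBank Flat File Release 185.0
GSS Sequences (Part 1)
87126 loci, 64015147 bases, from 87126 reported sequences"""

# Number of GSS files dumped from the other system (assumed constant offset).
offset = filename_part("gbgss146.seq") - header_part(header)
# With 248 GSS files in this release (Section 1.3.1), the remainder are affected.
affected = 248 - offset

print(offset, affected)
```

Under that assumption the offset is 145, which leaves 103 affected files and agrees with the count given above.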
1.4 Upcoming Changes
1.4.1 Implementation of /whole_replicon qualifier abandoned
The introduction of a /whole_replicon qualifier was approved by the
International Nucleotide Sequence Database Collaboration during their
annual collaborative meeting in May 2010. However, implementation of the
new qualifier proved more difficult than expected, with a growing and
complex list of conditions under which /whole_replicon would *not* be
appropriate. Rather than continue to define what /whole_replicon is
not intended for, the INSDC has decided to make use of improved submission
processes which allow users to explicitly identify the "genome-level"
molecules (e.g., chromosomes) that should be shown in the topmost view
of an organism's genome. Furthermore, given the implementation of
BioProject databases within the INSD, the exchange of project data
among the INSD members will include provision for indicating, explicitly,
the sequence records which represent "genome-level" molecules. With
these plans in place, it was agreed to abandon plans for the
/whole_replicon qualifier at the May 2011 INSDC annual meeting.
1.4.2 New centromere and telomere features
Telomeres and centromeres are essential features of chromosomes and
disrupting their structure affects the viability and life span of an
organism. Centromeric sequence varies from a compact, non-repetitive,
less than 150 base pair region in S. cerevisiae to a highly repetitive
and complex region of several hundred thousand base pairs in
eukaryote genomes. The sequence at the telomeric ends is unique compared
to the rest of the chromosome and protects the chromosome ends from
recombination, fusion to other chromosomes or degradation by nucleases.
Currently, telomere and centromere features may be under-annotated because
there are no specific feature keys for them. The INSDC therefore approved
the creation of two new features at the May 2011 INSDC annual meeting:
Feature Key           centromere
Definition            region of biological interest identified as a centromere
                      and which has been experimentally characterized;
Optional qualifiers   /note="text"
Comment               the centromere feature describes the interval of DNA
                      that corresponds to a region where chromatids are held
                      and a kinetochore is formed;
Feature Key           telomere
Definition            region of biological interest identified as a telomere
                      and which has been experimentally characterized;
Optional qualifiers   /note="text"
                      /rpt_unit_seq
                      /rpt_unit_range
                      /rpt_type
                      /mobile_element
Comment               the telomere feature describes the interval of DNA
                      that corresponds to a specific structure at the end of
                      the linear eukaryotic chromosome which is required for
                      the integrity and maintenance of the end; this region is
                      unique compared to the rest of the chromosome and
                      represents the physical end of the chromosome;
These two features are intended for use when the centromere or telomere
has actually been sequenced. These two new features will be legal as
of GenBank Release 186.0 (October 15 2011).
1.4.3 New assembly_gap feature, and /gap_type and /linkage_evidence qualifiers
Complete genomes are often submitted to the INSDC via a small (or large)
set of independent sequence records, which can be assembled into chromosomes
and/or scaffolds. The CON-division records representing these scaffolds
and chromosomes are usually built using information from "AGP files"
provided by the submitter. See:
http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Update.shtml
The AGP 2.0 specification includes provisions for a variety of different
gap types, as well as information about whether a gap between two
scaffold or chromosome components is an unspanned gap or a spanned gap.
There are also biological gap types: telomere, centromere and repeat.
AGP 2.0 also supports terminology to describe the type of evidence used
to establish the linkage connecting the components on either side of a
spanned gap within a scaffold or chromosome. Unfortunately, there is no
mechanism to represent any of this information in the Feature Table.
To address this, the INSDC has decided to implement an assembly_gap
feature, and /gap_type and /linkage_evidence qualifiers, all of which
will be legal as of October 15 2011 (GenBank Release 186.0).
Preliminary definitions of the two new qualifiers are as follows:
Qualifier      /gap_type=
Definition     kind of gap connecting components, or the type of biological gaps
Value format   "TYPE"
Example        /gap_type="between scaffolds"
               /gap_type="within scaffold"
Comment        The qualifier is just for gap features. TYPE is a controlled
               vocabulary:
               "between scaffolds"
               "within scaffold"
               "telomere"
               "centromere"
               "short arm"
               "heterochromatin"
               "repeat within scaffold"
               "repeat between scaffolds"
Qualifier      /linkage_evidence=
Definition     kind of evidence establishing linkage across a gap
Value format   "TYPE"
Example        /linkage_evidence="paired-ends"
               /linkage_evidence="within_clone"
Comment        The qualifier is just for gap features of type "within
               scaffold" or "repeat within scaffold". TYPE is a controlled
               vocabulary, from the new AGP Specification version 2.0:
               "paired-ends"   - paired sequences from the two ends of a DNA
                                 fragment.
               "align_genus"   - alignment to a reference genome within the
                                 same genus.
               "align_xgenus"  - alignment to a reference genome within
                                 another genus.
               "align_trnscpt" - alignment to a transcript from the same
                                 species.
               "within_clone"  - sequence on both sides of the gap is derived
                                 from the same clone, but the gap is not
                                 spanned by paired-ends. The adjacent sequence
                                 contigs have unknown order and orientation.
               "clone_contig"  - linkage is provided by a clone contig in the
                                 tiling path (TPF). For example, a gap where
                                 there is a known clone, but there is not yet
                                 sequence for that clone.
               "map"           - linkage asserted using a non-sequence based
                                 map such as RH, linkage, fingerprint or
                                 optical.
               "strobe"        - strobe sequencing (PacBio).
               "unspecified"   - used when converting old AGPs that lack a
                                 field for linkage evidence into the new
                                 format.
Because there are existing CON-division records with gaps that are not
based on information derived from an AGP file, it was agreed that a new
feature should be introduced that will make use of these new qualifiers:
assembly_gap
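A complete definition for this feature is not yet available, but we will
inform GenBank users as soon as it is finalized. Both /gap_type and
/linkage_evidence are expected to be mandatory for the assembly_gap feature,
with /linkage_evidence required only for gaps of type "within scaffold" or
"repeat within scaffold", the types to which that qualifier applies.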
The new centromere and telomere features (see Section 1.4.2) should
only be used when the actual sequence of a centromere/telomere has been
determined. If this is not the case, then an assembly_gap feature with
a /gap_type of "centromere" or "telomere" should be used instead.
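
Submitters and tool developers who want to prepare for these changes could encode the preliminary vocabularies as a simple consistency check. The Python sketch below is only an illustration based on the preliminary definitions in this section: the function name is hypothetical, the final INSDC definitions may differ, and it does not enforce which qualifiers are mandatory.

```python
# Controlled vocabularies copied from the preliminary definitions above.
GAP_TYPES = {
    "between scaffolds", "within scaffold", "telomere", "centromere",
    "short arm", "heterochromatin", "repeat within scaffold",
    "repeat between scaffolds",
}
# "paired-ends" is spelled as in the Example line of the definition.
LINKAGE_EVIDENCE = {
    "paired-ends", "align_genus", "align_xgenus", "align_trnscpt",
    "within_clone", "clone_contig", "map", "strobe", "unspecified",
}
# Gap types for which /linkage_evidence is intended.
LINKED_GAP_TYPES = {"within scaffold", "repeat within scaffold"}


def check_assembly_gap(gap_type, linkage_evidence=None):
    """Return a list of problems (empty if none) for one assembly_gap feature."""
    problems = []
    if gap_type not in GAP_TYPES:
        problems.append(f"unknown /gap_type value: {gap_type!r}")
    if linkage_evidence is not None:
        if linkage_evidence not in LINKAGE_EVIDENCE:
            problems.append(f"unknown /linkage_evidence value: {linkage_evidence!r}")
        if gap_type not in LINKED_GAP_TYPES:
            problems.append("/linkage_evidence given for a gap type that does not take it")
    return problems


print(check_assembly_gap("within scaffold", "paired-ends"))
print(check_assembly_gap("centromere", "map"))
```

The first call reports no problems; the second flags /linkage_evidence on a "centromere" gap, which falls outside the two gap types the qualifier is meant for.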