Greetings GenBank Users,
GenBank Release 211.0 is now available via FTP from the National Center
for Biotechnology Information (NCBI):
Ftp Site            Directory   Contents
----------------    ---------   ---------------------------------------
ftp.ncbi.nih.gov    genbank     GenBank Release 211.0 flatfiles
                    ncbi-asn1   ASN.1 data used to create Release 211.0
Close-of-data for GenBank 211.0 occurred on 12/14/2015. Uncompressed,
the Release 211.0 flatfiles require roughly 749 GB (sequence files only).
The ASN.1 data require approximately 613 GB.
Recent statistics for 'traditional' sequences (including non-bulk-oriented
TSA, and excluding WGS, bulk-oriented TSA, and the CON-division):
Release   Date        Base Pairs       Entries
  210     Oct 2015    202237081559     188372017
  211     Dec 2015    203939111071     189232925
Recent statistics for WGS sequencing projects:
Release   Date        Base Pairs       Entries
  210     Oct 2015    1222635267498    309198943
  211     Dec 2015    1297865618365    317122157
Recent statistics for bulk-oriented TSA sequencing projects:
Release   Date        Base Pairs       Entries
  210     Oct 2015    70917172944      81790031
  211     Dec 2015    77583339176      87488539
During the 60 days between the close dates for GenBank Releases 210.0
and 211.0, the 'traditional' portion of GenBank grew by 1,702,029,512
basepairs and by 860,908 sequence records. During that same period,
1,626,191 records were updated. An average of 41,451 'traditional' records
were added and/or updated per day.
Between releases 210.0 and 211.0, the WGS component of GenBank grew by
75,230,350,867 basepairs and by 7,923,214 sequence records.
Between releases 210.0 and 211.0, the TSA component of GenBank grew by
6,666,166,232 basepairs and by 5,698,508 sequence records.
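The growth figures quoted above follow directly from the statistics tables. As a minimal sketch in Python (the names STATS, growth, and daily_average are illustrative only and are not part of any NCBI tool), the differences and the per-day average could be reproduced like this:

```python
# Figures copied from the statistics tables above: (base pairs, entries).
STATS = {
    "traditional": {"210": (202237081559, 188372017),
                    "211": (203939111071, 189232925)},
    "WGS":         {"210": (1222635267498, 309198943),
                    "211": (1297865618365, 317122157)},
    "bulk TSA":    {"210": (70917172944, 81790031),
                    "211": (77583339176, 87488539)},
}

DAYS_BETWEEN_CLOSES = 60      # stated in the announcement
UPDATED_RECORDS = 1_626_191   # stated in the announcement


def growth(component):
    """Return (base pairs added, records added) between releases 210 and 211."""
    old_bp, old_entries = STATS[component]["210"]
    new_bp, new_entries = STATS[component]["211"]
    return new_bp - old_bp, new_entries - old_entries


def daily_average(added, updated, days):
    """Records added and/or updated per day, truncated to a whole number."""
    return (added + updated) // days


if __name__ == "__main__":
    for name in STATS:
        bp, entries = growth(name)
        print(f"{name}: +{bp:,} basepairs, +{entries:,} records")
    _, added = growth("traditional")
    print(daily_average(added, UPDATED_RECORDS, DAYS_BETWEEN_CLOSES))
```

Integer division is used because the quoted average of 41,451 corresponds to the truncated value of 2,487,099 / 60.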
For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 211.0 and Upcoming Changes) have been appended
below for your convenience.
* * * Important * * *
A significant change is described in Section 1.4.1 of the release
notes: Removal of NCBI GI sequence identifiers from GenBank, GenPept,
and FASTA sequence formats. Users who make use of GIs in their information
systems and analysis pipelines should take particular note of that section.
Release 211.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.
As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (e.g., December 15 2015, 211.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.
A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh :
    set files = `ls gb*.*`
    foreach i ($files)
        head -10 $i | grep Release
    end
Or, if the files are compressed, perhaps:
        gzcat $i | head -10 | grep Release
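For sites that do not use csh/tcsh, the same check could be sketched in Python. This is only an illustration: the function names are hypothetical, the expected release string is passed in by the caller, and the script assumes that compressed files carry a ".gz" suffix.

```python
import glob
import gzip
import re


def read_head(path, n=10):
    """Return the first n lines of a plain or gzip-compressed file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        return [line for _, line in zip(range(n), handle)]


def release_in_header(lines):
    """Return the release number (e.g. '211.0') found in the lines, or None."""
    for line in lines:
        match = re.search(r"Release\s+(\d+\.\d+)", line)
        if match:
            return match.group(1)
    return None


def check_release_files(pattern="gb*.*", expected="211.0"):
    """Return the files whose header does not name the expected release."""
    return [path for path in sorted(glob.glob(pattern))
            if release_in_header(read_head(path)) != expected]


if __name__ == "__main__":
    for path in check_release_files():
        print("unexpected or missing release number:", path)
```

An empty result means that every matching file names the expected release within its first ten lines.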
If you encounter problems while ftp'ing or uncompressing Release
211.0, please send email outlining your difficulties to:
info@ncbi.nlm.nih.gov
Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky, Sergey Zhdanov,
GenBank
NCBI/NLM/NIH/HHS
1.3 Important Changes in Release 211.0
1.3.1 Organizational changes
The total number of sequence data files increased by 29 with this release:
- the BCT division is now composed of 216 files (+8)
- the CON division is now composed of 327 files (+3)
- the ENV division is now composed of 88 files (+2)
- the INV division is now composed of 135 files (+3)
- the PAT division is now composed of 242 files (+7)
- the PLN division is now composed of 122 files (+3)
- the PRI division is now composed of 50 files (+1)
- the ROD division is now composed of 32 files (+1)
- the TSA division is now composed of 194 files (-1)
- the VRL division is now composed of 39 files (+1)
- the VRT division is now composed of 60 files (+1)
Note: The 'loss' of gbtsa195.seq is due to improved packaging of
sequence records, which resulted in one less file for that division.
1.3.2 Expansion of /rpt_type controlled vocabulary and conversion of LTR features
The /rpt_type qualifier, often used for repeat_region features, currently
has a very limited number of allowed values:
Qualifier       /rpt_type=
Definition      organization of repeated sequence
Value format    tandem, inverted, flanking, terminal, direct, dispersed, and other
As of this GenBank Release 211.0, this controlled vocabulary has been
expanded to include seven new terms:
long_terminal_repeat
non_LTR_retrotransposon_polymeric_tract
X_element_combinatorial_repeat
Y_prime_element
telomeric_repeat
centromeric_repeat
engineered_foreign_repetitive_element
Now that "long_terminal_repeat" is a supported value, existing LTR features
can be converted into repeat_region features. The definition for LTR is:
Feature Key     LTR
Definition      long terminal repeat, a sequence directly repeated at
                both ends of a defined sequence, of the sort typically
                found in retroviruses;
LTRs are just another variety of repetitive region, so representing
them via repeat_region features with a /rpt_type="long_terminal_repeat"
qualifier simplifies the Feature Table. This conversion also takes effect
as of the December 2015 GenBank Release (211.0).
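For users who hold parsed features in their own systems, the conversion might be mirrored along the following lines. This is a sketch under stated assumptions: the dictionary layout (key, location, qualifiers) and the function name convert_ltr are hypothetical, and the treatment of an LTR feature that already carries a /rpt_type value is a guess rather than documented NCBI behavior. The vocabulary sets are copied from the lists above.

```python
# Allowed /rpt_type values before Release 211.0, as listed above.
ORIGINAL_RPT_TYPES = {
    "tandem", "inverted", "flanking", "terminal",
    "direct", "dispersed", "other",
}

# The seven terms added as of Release 211.0, as listed above.
NEW_RPT_TYPES = {
    "long_terminal_repeat",
    "non_LTR_retrotransposon_polymeric_tract",
    "X_element_combinatorial_repeat",
    "Y_prime_element",
    "telomeric_repeat",
    "centromeric_repeat",
    "engineered_foreign_repetitive_element",
}

RPT_TYPES = ORIGINAL_RPT_TYPES | NEW_RPT_TYPES


def convert_ltr(feature):
    """Return a repeat_region equivalent of an LTR feature.

    Features with any other key are returned unchanged. The input uses a
    hypothetical layout: {"key": ..., "location": ..., "qualifiers": {...}}.
    """
    if feature["key"] != "LTR":
        return feature
    qualifiers = dict(feature.get("qualifiers", {}))
    # Assumption: an existing rpt_type value, if any, is left in place.
    qualifiers.setdefault("rpt_type", "long_terminal_repeat")
    return {**feature, "key": "repeat_region", "qualifiers": qualifiers}


if __name__ == "__main__":
    ltr = {"key": "LTR", "location": "1..300", "qualifiers": {"note": "5' LTR"}}
    print(convert_ltr(ltr))
```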
1.3.3 GSS File Header Problem
GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped by the first, it does not know how to number its own
output files.
There is thus a discrepancy between the filenames and file headers for 127
of the GSS flatfiles in Release 211.0. Consider gbgss173.seq :
GBGSS1.SEQ          Genetic Sequence Data Bank
                         December 15 2015
                NCBI-GenBank Flat File Release 211.0
                       GSS Sequences (Part 1)
   87032 loci,    63853715 bases, from    87032 reported sequences
Here, the filename and the part number in the header are both "1", though the
file has been renamed as "173" based on the number of files dumped from the other
system. We hope to resolve this discrepancy at some point, but the priority
is certainly much lower than many other tasks.
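Until the discrepancy is resolved, header checks that compare the header against the filename need to tolerate it. One way to detect the mismatch is sketched below; the function names are hypothetical and the patterns assume names of the form gbgssN.seq and GBGSSN.SEQ as shown in the example above.

```python
import re


def filename_number(filename):
    """Extract N from a name such as 'gbgss173.seq' (or 'gbgss173.seq.gz')."""
    match = re.search(r"gbgss(\d+)\.seq", filename, re.IGNORECASE)
    return int(match.group(1)) if match else None


def header_file_number(first_line):
    """Extract N from a header line such as 'GBGSS1.SEQ  Genetic Sequence ...'."""
    match = re.match(r"GBGSS(\d+)\.SEQ", first_line.strip(), re.IGNORECASE)
    return int(match.group(1)) if match else None


def header_matches_filename(filename, first_line):
    """True only when both numbers are present and equal."""
    from_name = filename_number(filename)
    from_header = header_file_number(first_line)
    return from_name is not None and from_name == from_header


if __name__ == "__main__":
    line = "GBGSS1.SEQ          Genetic Sequence Data Bank"
    print(header_matches_filename("gbgss173.seq", line))
```

For the gbgss173.seq example, the filename yields 173 and the header yields 1, so the comparison reports a mismatch.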
1.4 Upcoming Changes
1.4.1 GI sequence identifiers to be removed from GenBank/GenPept/FASTA formats
As of 06/15/2016, the integer sequence identifiers known as "GIs" will no
longer be included in the GenBank, GenPept, and FASTA formats supported by
NCBI for the display of sequence records.
As first described in the Release Notes for GenBank 199.0 in December 2013,
NCBI is in the process of moving to storage solutions which utilize only
Accession.Version identifiers. See Section 1.4.2 of these release notes for
additional background information about those developments.
Although GI sequence identifiers served their purpose well for many years,
the Accession.Version system is completely equivalent (and much more
human-readable).
And given the shift to non-GI-based systems, the importance of using
Accession.Version identifiers cannot be overstated. So as an initial step, NCBI
will cease the display of GI identifiers in the flatfile and FASTA views of
all sequence records.
Previously-assigned GI identifiers will continue to exist 'behind the scenes',
and NCBI services (including URLs, APIs, etc) which accept GIs as inputs/arguments
will be supported, for those sequence records that have GIs, for the foreseeable
future.
Over the next year NCBI will identify all such services that do not yet
support Accession.Version identifiers, and add that support. Users of those
services will then be encouraged to make use of Accession.Version rather than GIs.
Of course, for those services that already support Accession.Version, NCBI
encourages users to begin transitioning away from GI as soon as is practical.
In the sample record below, nucleotide sequence AF123456 has been assigned a
GI of 6633795, and the protein translation of its coding region feature has
been assigned a GI of 6633796 :
LOCUS       AF123456                1510 bp    mRNA    linear   VRT 12-APR-2012
DEFINITION  Gallus gallus doublesex and mab-3 related transcription factor 1
            (DMRT1) mRNA, partial cds.
ACCESSION   AF123456
VERSION     AF123456.2  GI:6633795
....
     CDS             <1..936
                     /gene="DMRT1"
                     /note="cDMRT1"
                     /codon_start=1
                     /product="doublesex and mab-3 related transcription factor
                     1"
                     /protein_id="AAF19666.1"
                     /db_xref="GI:6633796"
                     /translation="PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSL
                     IAERQRVMAVQVALRRQQAQEEELGISHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPA
                     HSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSDLVVDSTYYSSFYQPSLYPYY
                     NNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQWQMKGMEN
                     RHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDS
                     GLGCLSSSESTKGDLECEPHQEPGAFAVSPVLEGE"
After June 15 2016, the GI value on the VERSION line and the GI /db_xref
qualifier for the coding region feature will no longer be displayed:
LOCUS       AF123456                1510 bp    mRNA    linear   VRT 12-APR-2012
DEFINITION  Gallus gallus doublesex and mab-3 related transcription factor 1
            (DMRT1) mRNA, partial cds.
ACCESSION   AF123456
VERSION     AF123456.2
....
     CDS             <1..936
                     /gene="DMRT1"
                     /note="cDMRT1"
                     /codon_start=1
                     /product="doublesex and mab-3 related transcription factor
                     1"
                     /protein_id="AAF19666.1"
                     /translation="PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSL
                     IAERQRVMAVQVALRRQQAQEEELGISHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPA
                     HSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSDLVVDSTYYSSFYQPSLYPYY
                     NNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQWQMKGMEN
                     RHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDS
                     GLGCLSSSESTKGDLECEPHQEPGAFAVSPVLEGE"
Similarly, the GI value will be removed from the VERSION line of the GenPept
format. Currently:
LOCUS       AAF19666                 311 aa            linear   VRT 12-APR-2012
DEFINITION  doublesex and mab-3 related transcription factor 1, partial [Gallus
            gallus].
ACCESSION   AAF19666
VERSION     AAF19666.1  GI:6633796
DBSOURCE    accession AF123456.2
....
     CDS             1..311
                     /gene="DMRT1"
                     /coded_by="AF123456.2:<1..936"
As of 06/15/2016:
LOCUS       AAF19666                 311 aa            linear   VRT 12-APR-2012
DEFINITION  doublesex and mab-3 related transcription factor 1, partial [Gallus
            gallus].
ACCESSION   AAF19666
VERSION     AAF19666.1
DBSOURCE    accession AF123456.2
....
     CDS             1..311
                     /gene="DMRT1"
                     /coded_by="AF123456.2:<1..936"
Note that the coding region feature for GenPept format has never included
the display of nucleotide GI values.
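Parsers that currently expect a GI on the VERSION line will need to treat it as optional. The sketch below shows one way to accept both the current and the future forms of the VERSION line, and to drop GI /db_xref qualifier lines when comparing old and new records. The function names and regular expressions are illustrative assumptions, not an NCBI-supplied parser.

```python
import re

# Accession.Version followed by an optional "GI:<digits>" token.
VERSION_RE = re.compile(
    r"^VERSION\s+(?P<accver>\S+\.\d+)(?:\s+GI:(?P<gi>\d+))?\s*$"
)

# A qualifier line consisting solely of a GI cross-reference.
GI_XREF_RE = re.compile(r'^\s*/db_xref="GI:\d+"\s*$')


def parse_version_line(line):
    """Return (accession_version, gi_or_None) from a VERSION line."""
    match = VERSION_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    gi = match.group("gi")
    return match.group("accver"), (int(gi) if gi else None)


def strip_gi_xrefs(lines):
    """Drop /db_xref="GI:..." qualifier lines; keep everything else."""
    return [line for line in lines if not GI_XREF_RE.match(line)]


if __name__ == "__main__":
    print(parse_version_line("VERSION     AF123456.2  GI:6633795"))
    print(parse_version_line("VERSION     AF123456.2"))
```

Keying on the Accession.Version value in both cases means that the same code can read records retrieved before and after the change.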
For FASTA format, GI values will be removed from the FASTA header/defline:
Currently:
>gi|6633795|gb|AF123456.2| Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds
CCGGCGGCGGGCAAGAAGCTGCCGCGTCTGCCCAAGTGTGCCCGCTGCCGCAACCACGGCTACTCCTCGC
CGCTGAAGGGGCACAAGCGGTTCTGCATGTGGCGGGACTGCCAGTGCAAGAAGTGCAGCCTGATCGCCGA
[....]
>gi|6633796|gb|AAF19666.1| doublesex and mab-3 related transcription factor 1, partial [Gallus gallus]
PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSLIAERQRVMAVQVALRRQQAQEEELGI
SHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPAHSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSD
LVVDSTYYSSFYQPSLYPYYNNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQ
WQMKGMENRHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDSGLGC
LSSSESTKGDLECEPHQEPGAFAVSPVLEGE
As of 06/15/2016:
>gb|AF123456.2| Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds
CCGGCGGCGGGCAAGAAGCTGCCGCGTCTGCCCAAGTGTGCCCGCTGCCGCAACCACGGCTACTCCTCGC
CGCTGAAGGGGCACAAGCGGTTCTGCATGTGGCGGGACTGCCAGTGCAAGAAGTGCAGCCTGATCGCCGA
[....]
>gb|AAF19666.1| doublesex and mab-3 related transcription factor 1, partial [Gallus gallus]
PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSLIAERQRVMAVQVALRRQQAQEEELGI
SHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPAHSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSD
LVVDSTYYSSFYQPSLYPYYNNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQ
WQMKGMENRHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDSGLGC
LSSSESTKGDLECEPHQEPGAFAVSPVLEGE
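Code that splits FASTA deflines on "|" and assumes the GI occupies the first two fields will break when the "gi|...|" prefix disappears. A defline parser that handles both forms shown above might look like the following sketch; the function name and the returned tuple layout are hypothetical, and only the two defline shapes in these examples are handled.

```python
def parse_defline(defline):
    """Return (gi_or_None, database, accession_version, title).

    Handles the two defline forms shown above:
      >gi|6633795|gb|AF123456.2| title...
      >gb|AF123456.2| title...
    """
    if not defline.startswith(">"):
        raise ValueError(f"not a FASTA defline: {defline!r}")
    identifier, _, title = defline[1:].rstrip("\n").partition(" ")
    fields = identifier.split("|")
    gi = None
    if fields[0] == "gi":
        # Current form: the GI precedes the database tag.
        gi = int(fields[1])
        fields = fields[2:]
    if len(fields) < 2:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    database, accession_version = fields[0], fields[1]
    return gi, database, accession_version, title.strip()


if __name__ == "__main__":
    old = ">gi|6633795|gb|AF123456.2| Gallus gallus (DMRT1) mRNA, partial cds"
    new = ">gb|AF123456.2| Gallus gallus (DMRT1) mRNA, partial cds"
    print(parse_defline(old))
    print(parse_defline(new))
```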
Please direct any inquiries about these changes to the NCBI Service Desk:
info@ncbi.nlm.nih.gov
1.4.2 GI sequence identifiers are being phased out at NCBI
The numeric GI sequence identifier that NCBI used to assign to all
nucleotide and protein sequences was first introduced for GenBank Release
products as of GenBank 81.0, in February 1994. See:
ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb81.release.notes
These simple, uniform, integer-based unique identifiers (which predated the
introduction of Accession.Version sequence identifiers) were crucial to the
development of NCBI's Entrez retrieval system, and have served their purpose
very well for over 20 years.
However, as NCBI considers how best to address the expected increase in the
volume of submitted sequence data, it is clear that prior practices will need
to be re-thought. As an example, imagine 100,000 pathogen-related
genomes/samples, each with 5000 proteins, most of which are common to all. We
will be moving toward solutions that represent each unique protein *once*.
The coding region protein products for each genome will likely continue to be
assigned their own Accession.Version identifiers, but (within the NCBI data
model) they will simply *reference* the unique proteins. And, they will no
longer be issued GIs of their own.
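To make the idea concrete: in the example above, 100,000 genomes with 5000 proteins each would amount to 500 million protein records if every one were stored separately. The toy sketch below shows the general shape of "store each unique protein once and reference it". It is purely illustrative and is not a description of the NCBI data model; the class name, the use of a SHA-256 digest as the key, and the accessions in the usage example are all invented for this sketch.

```python
import hashlib


class ProteinStore:
    """Toy store: one copy per unique sequence, many accessions referencing it."""

    def __init__(self):
        self._unique = {}      # digest -> protein sequence (stored once)
        self._references = {}  # Accession.Version -> digest

    def add(self, accession_version, sequence):
        """Register an accession; the sequence is stored only if it is new."""
        digest = hashlib.sha256(sequence.encode("ascii")).hexdigest()
        self._unique.setdefault(digest, sequence)
        self._references[accession_version] = digest

    def sequence_for(self, accession_version):
        """Follow the reference from an accession to its shared sequence."""
        return self._unique[self._references[accession_version]]

    def counts(self):
        """Return (unique sequences stored, accessions referencing them)."""
        return len(self._unique), len(self._references)


if __name__ == "__main__":
    store = ProteinStore()
    # Hypothetical accessions: two genomes sharing one identical protein.
    store.add("GENOME1_P1.1", "MKTAYIAKQR")
    store.add("GENOME2_P1.1", "MKTAYIAKQR")
    store.add("GENOME2_P2.1", "MSLNDRVA")
    print(store.counts())
```

Each coding region product keeps its own Accession.Version identifier, while the sequence itself is held only once.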
Such a change will likely have a significant impact on NCBI users who
utilize GIs in their own information systems and analysis pipelines, so it is
being implemented gradually. Unannotated WGS projects consisting of millions
of contigs and scaffolds, and unannotated TSA projects, are the first two
classes of records for which GIs are no longer being assigned. But the practice
will ultimately expand to include other classes of records.
If GIs are central to your operations, NCBI strongly urges that you begin
planning a switch to the use of Accession.Version identifiers instead.
The contigs and scaffolds of the ALWZ04 WGS project are good examples of
sequences that lack GIs. Below are excerpts from the flatfile representation
of the first ALWZ04 contig, and the 'singleton scaffold' which is constructed
from it. Note the absence of a GI value on the VERSION line of these two
records:
LOCUS       ALWZ040000001           1191 bp    DNA     linear   PLN 13-MAR-2015
DEFINITION  Picea glauca, whole genome shotgun sequence.
ACCESSION   ALWZ040000001 ALWZ040000000
VERSION     ALWZ040000001.1
DBLINK      BioProject: PRJNA83435
            BioSample: SAMN01120252
....
ORIGIN
        1 ctataatacc cctatgccaa acgaacccaa ttgtaaatgt aaatgcaaat gtacttaggc
       61 tggttagttg tttaatatca ttttttgtat gcaccttcca tggtataatg cgcacatgta
      121 tagcgcacta aaattatgaa gtgtgcccat tccaagatat tgcgcgtaaa aaacttaagt
      181 gtgcatgatt ttgagactag ggagactttg tgtatatgtt gtgttttata tgctggagag
      241 acaattatta ttagttagga ggattatgtt ttgtactagg caagagagcc tagatgttaa
      301 aggctagtga gcctattttt gtatatgtct catcattaat ataatacatc attgtgtgta
....
      901 ttgttgggaa ttgatttcct gaatgtgtta aactgcattg atagggatct gagaattcct
      961 ttctggccta ttgctgaagc tttggaaggg aggtggggca accgagggac tgttgagaag
     1021 agaagggtca cacttcctgg ggtgggacaa gcatgtgggg aattagggat tgcaggatgt
     1081 tagtttgaat tggcacctat gacagagtct ttcctattgt ctgagatatg tcagcttggt
     1141 taggaaaccc tttacctggg tagagtttag tcccagctcg ggggtgaccc a
//
LOCUS       ALWZ04S0000001          1191 bp    DNA     linear   CON 13-MAR-2015
DEFINITION  Picea glauca Pg-01r141201s0000001, whole genome shotgun sequence.
ACCESSION   ALWZ04S0000001 ALWZ0400000000
VERSION     ALWZ04S0000001.1
DBLINK      BioProject: PRJNA83435
            BioSample: SAMN01120252
....
CONTIG      join(ALWZ040000001.1:1..1191)
//
Sample URLs from which ALWZ04 data may be obtained include:
http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ALWZ04#contigs
http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ALWZ04#scaffolds
http://www.ncbi.nlm.nih.gov/Traces/wgs/?download=ALWZ04.gbff.1.gz
http://www.ncbi.nlm.nih.gov/Traces/wgs/?download=ALWZ04S.gbff.1.gz
ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs/wgs.ALWZ.*.gbff.gz
ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs/wgs.ALWZ.scflds.*.gbff.gz