Greetings GenBank Users,
GenBank Release 189.0 is now available via FTP from the
National Center for Biotechnology Information (NCBI):
  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 189.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 189.0
Close-of-data for GenBank 189.0 occurred on 04/15/2012. Uncompressed,
the Release 189.0 flatfiles require roughly 545 GB (sequence files only)
or 586 GB (including the 'short directory', 'index' and the *.txt files).
The ASN.1 data require approximately 446 GB.
Recent statistics for non-WGS, non-CON sequences:
  Release   Date       Base Pairs     Entries
  188       Feb 2012   137384889783   149819246
  189       Apr 2012   139266481398   151824421
Recent statistics for WGS sequences:
  Release   Date       Base Pairs     Entries
  188       Feb 2012   261370512675   78656704
  189       Apr 2012   272693351548   80905298
During the 64 days between the close dates for GenBank Releases 188.0
and 189.0, the non-WGS/non-CON portion of GenBank grew by 1,881,591,615
basepairs and by 2,005,175 sequence records. During that same period,
816,460 records were updated. An average of 44,088 non-WGS/non-CON
records were added and/or updated per day.
Between releases 188.0 and 189.0, the WGS component of GenBank grew by
11,322,838,873 basepairs and by 2,248,594 sequence records.
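The growth figures above follow directly from the two statistics tables. As a quick arithmetic check, here is a minimal Python sketch; the variable and function names are illustrative only and are not part of any NCBI tool.

```python
# Figures copied from the statistics tables above: (base pairs, entries).
NON_WGS = {"188": (137384889783, 149819246), "189": (139266481398, 151824421)}
WGS = {"188": (261370512675, 78656704), "189": (272693351548, 80905298)}
DAYS_BETWEEN_CLOSES = 64
RECORDS_UPDATED = 816460


def growth(stats):
    """Return (base-pair growth, record growth) from Release 188 to 189."""
    old_bp, old_entries = stats["188"]
    new_bp, new_entries = stats["189"]
    return new_bp - old_bp, new_entries - old_entries


non_wgs_bp, non_wgs_records = growth(NON_WGS)
wgs_bp, wgs_records = growth(WGS)

# Records added plus records updated, averaged over the 64-day interval
# (integer division, matching the whole-number figure quoted above).
per_day = (non_wgs_records + RECORDS_UPDATED) // DAYS_BETWEEN_CLOSES

print(non_wgs_bp, non_wgs_records, per_day)
print(wgs_bp, wgs_records)
```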
For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 189.0 and Upcoming Changes) have been appended
below for your convenience.
** Important Notes **
* GenBank 'index' files are now provided without any EST content, and
without most GSS content. See Section 1.3.5 of the release notes for
further details.
NCBI is considering ceasing support for the index files, so we
encourage affected users to review that section and provide feedback.
Release 189.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.
As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (e.g., April 15 2012, 189.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.
A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh :
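    set files = `ls gb*.*`
    foreach i ($files)
        head -10 $i | grep Release
    end
Or, if the files are compressed, replace the head command inside the loop
with something like this (on systems without gzcat, 'gzip -dc' is equivalent):
    gzcat $i | head -10 | grep Release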
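For sites that prefer a script over an interactive shell loop, the same check can be sketched in Python. This is only an illustration: the function names, the ".gz" suffix test, and the default expected string "Release 189.0" are assumptions made for this example, and the sketch reports files whose first ten lines do not mention the expected release.

```python
import glob
import gzip


def release_lines(path, max_lines=10):
    """Return the lines among the first max_lines of path that mention 'Release'."""
    # Assumption: compressed files carry a .gz suffix.
    opener = gzip.open if path.endswith(".gz") else open
    found = []
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for line_number, line in enumerate(handle):
            if line_number >= max_lines:
                break
            if "Release" in line:
                found.append(line.strip())
    return found


def check_release(directory=".", expected="Release 189.0"):
    """Return the gb*.* files whose header does not mention the expected release."""
    stale = []
    for path in sorted(glob.glob(f"{directory}/gb*.*")):
        if not any(expected in line for line in release_lines(path)):
            stale.append(path)
    return stale
```

Calling check_release() in the download directory returns an empty list when every file header names the expected release.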
If you encounter problems while ftp'ing or uncompressing Release
189.0, please send email outlining your difficulties to:
    info@ncbi.nlm.nih.gov
Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky, Sergey Zhdanov
GenBank
NCBI/NLM/NIH/HHS
1.3 Important Changes in Release 189.0
1.3.1 Extension of position syntax for /anticodon qualifiers
Starting with the April 2012 GenBank Release 189.0, the format
of the /anticodon qualifier has been extended to allow the use
of join() and complement() for the location of a tRNA's anticodon.
Previously, the qualifier supported only a simple continuous
basepair range. For example:
    /anticodon=(pos:34..36,aa:Phe)
But there are rare cases of intron-containing tRNAs, for which a
simple X..Y location will not suffice:
     tRNA            join(<1..5,495..544)
                     /gene="trnL"
                     /product="tRNA-Leu"
                     /note="codon recognized: UUA"
                     /anticodon=(pos:join(5,495..496),aa:Leu)
To support cases like these, the "pos" field is now allowed to make
use of the join() and complement() location operators. Anticodons are
usually three (sometimes four) bases in length, and this *remains*
true even though the join() operator could theoretically be (mis)used
to assert a much more complex/larger location than that.
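Parsers that read /anticodon values may want to confirm that an extended location still spans only three or four bases. The following Python sketch shows one way to do that under stated assumptions: the regular expressions and the function name are illustrative, only a single outer complement() and/or join() wrapper is handled, and partial-location symbols such as "<" are not supported.

```python
import re

# Assumed shape of the qualifier value: (pos:<location>,aa:<amino acid>)
ANTICODON_RE = re.compile(r"^\(pos:(?P<pos>.+),aa:(?P<aa>[A-Za-z]+)\)$")
# A single base (e.g. 5) or a simple range (e.g. 495..496).
RANGE_RE = re.compile(r"^(\d+)(?:\.\.(\d+))?$")


def anticodon_length(qualifier_value):
    """Return (amino acid, number of bases) for an /anticodon value."""
    match = ANTICODON_RE.match(qualifier_value)
    if match is None:
        raise ValueError(f"unrecognized /anticodon value: {qualifier_value}")
    pos = match.group("pos")
    # Strip an optional outer complement(...); strand does not affect length.
    if pos.startswith("complement(") and pos.endswith(")"):
        pos = pos[len("complement("):-1]
    # Strip an optional join(...), leaving comma-separated segments.
    if pos.startswith("join(") and pos.endswith(")"):
        pos = pos[len("join("):-1]
    total = 0
    for segment in pos.split(","):
        seg = RANGE_RE.match(segment)
        if seg is None:
            raise ValueError(f"unsupported location segment: {segment}")
        start = int(seg.group(1))
        end = int(seg.group(2) or start)
        total += end - start + 1
    return match.group("aa"), total


for value in ("(pos:34..36,aa:Phe)", "(pos:join(5,495..496),aa:Leu)"):
    amino_acid, length = anticodon_length(value)
    print(value, amino_acid, length, length in (3, 4))
```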
1.3.2 New representation for Transcriptome Shotgun Assembly (TSA) records
The TSA division of GenBank (see http://www.ncbi.nlm.nih.gov/genbank/tsa
for details) has grown much more quickly than expected. To accommodate the
increasing TSA submission volume, GenBank plans to use a WGS-like approach
for TSA sequencing projects. It is likely that we will start to provide
TSA project data in the new format *prior* to Release 190.0 in June 2012.
TSA projects will be assigned a four-letter project code, starting with
the letter "G" (for example, GAAA). Individual mRNA sequences within a
project will make use of the 4+2+6 accession number convention, familiar
to users of WGS data (for example, GAAA01000001). Unlike WGS, re-assembly
of the mRNAs for a TSA sequencing project is expected to be a very rare
occurrence, and we expect that the 2-digit assembly-version number will
almost always be "01" for TSA mRNAs. Similar to WGS, a TSA master record
will provide a convenient overview of a TSA project, with an 'all-zeroes'
accession number (e.g., GAAA00000000).
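The 4+2+6 convention described above can be split into its parts mechanically. Below is a minimal Python sketch; the regular expression, the function name, and the dictionary keys are assumptions made for illustration, based only on the description in this section (four letters starting with "G", a two-digit assembly version, and six digits).

```python
import re

# Assumed pattern: project code (G + 3 letters), 2-digit version, 6-digit serial.
TSA_ACCESSION_RE = re.compile(r"^(G[A-Z]{3})(\d{2})(\d{6})$")


def parse_tsa_accession(accession):
    """Split a WGS-like TSA accession into project, version, and serial number."""
    match = TSA_ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not a 4+2+6 TSA accession: {accession}")
    project, version, serial = match.groups()
    return {
        "project": project,
        "assembly_version": int(version),
        "sequence_number": int(serial),
        # The master record is described above as having an all-zeroes number.
        "is_master": version == "00" and serial == "000000",
    }


print(parse_tsa_accession("GAAA01000001"))
print(parse_tsa_accession("GAAA00000000"))
```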
TSA projects that make use of this new representation will be provided
in a separate FTP directory at the NCBI FTP site: genbank/tsa. As with
WGS, the various data files for a TSA project will be grouped by the
4-letter project code, and they will be updated independently of the
GenBank release cycle.
Plans aren't yet finalized for the 5.2 million TSA records currently
provided in the divisional gbtsa* files of a GenBank release. Ideally,
all would be converted to the new WGS-like representation, so that all
TSA records in GenBank utilize a common approach. However, the resources
for such a conversion might not be readily available, in which case
older/legacy TSA records might remain as they are now.
Further details about this new approach for handling TSA data will be
made available via these release notes and the GenBank newsgroup, as we
get closer to implementation.
1.3.3 Organizational changes
The total number of sequence data files increased by 28 with this release:
- the BCT division is now composed of 85 files (+3)
- the CON division is now composed of 167 files (+1)
- the ENV division is now composed of 53 files (+3)
- the EST division is now composed of 461 files (+6)
- the INV division is now composed of 30 files (-2)
- the MAM division is now composed of 8 files (+1)
- the PAT division is now composed of 178 files (+2)
- the PLN division is now composed of 55 files (+2)
- the PRI division is now composed of 45 files (+1)
- the TSA division is now composed of 70 files (+10)
- the VRT division is now composed of 26 files (+1)
The decrease in the number of INV-division files was a consequence of
the removal of approximately 340,000 BarCode sequence records that
lacked tentative taxonomic identification, and hence did not satisfy the
terms of the iBOL/GenBank early-release agreement.
The total number of 'index' files increased by 1 with this release:
- the AUT (author name) index is now composed of 97 files (+1)
1.3.4 Project DBLINKs transitioning to BioProject
The Genome Project Database resource at the NCBI was redesigned in
recent months, culminating in the implementation of a new BioProject
resource:
http://www.ncbi.nlm.nih.gov/bioproject
An article that describes the goals of BioProject is available:
http://www.ncbi.nlm.nih.gov/books/NBK54015/
BioProject is a collaborative effort of the International Nucleotide
Sequence Database Collaboration (INSDC), and project data are exchanged
with NCBI's partner INSDC institutions, EBI and DDBJ. A BioProject
website is also available at DDBJ:
  http://trace.ddbj.nig.ac.jp/bioproject/index_e.shtml
BioProjects are uniquely identified by BioProject Accession Numbers,
which utilize this format:
"PRJ"
"E" or "N" or "D"
one letter
one or more digits
Examples of valid BioProject accessions are PRJNA12521 and PRJEB1.
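The accession format listed above translates naturally into a regular expression. The Python sketch below is one possible reading of that format; the assumption that the "one letter" is an uppercase A-Z, and the function name itself, are illustrative rather than an official NCBI definition.

```python
import re

# "PRJ", then E/N/D, then one letter (assumed uppercase), then one or more digits.
BIOPROJECT_RE = re.compile(r"^PRJ[END][A-Z]\d+$")


def is_bioproject_accession(text):
    """Return True if text matches the BioProject accession format above."""
    return BIOPROJECT_RE.match(text) is not None


for candidate in ("PRJNA12521", "PRJEB1", "PRJNA", "60715"):
    print(candidate, is_bioproject_accession(candidate))
```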
With BioProject now in operation, we are preparing to implement links
from sequence records to the new resource. Previously, links to the
Genome Project Database were provided by numeric 'Project' DBLINKs.
Here's an example for a fungal complete-chromosome record:
LOCUS       CP002497             1110245 bp    DNA     linear   PLN 14-NOV-2011
DEFINITION  Eremothecium cymbalariae DBVPG#7215 chromosome 1, complete
            sequence.
ACCESSION   CP002497
VERSION     CP002497.1  GI:356887709
DBLINK      Project: 60715
When this link is switched to a BioProject accession, the DBLINK line
will change slightly:
LOCUS       CP002497             1110245 bp    DNA     linear   PLN 14-NOV-2011
DEFINITION  Eremothecium cymbalariae DBVPG#7215 chromosome 1, complete
            sequence.
ACCESSION   CP002497
VERSION     CP002497.1  GI:356887709
DBLINK      BioProject: PRJNA60715
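Software that reads DBLINK lines may need to accept both forms during the transition. The Python sketch below rewrites an old-style line into the new style. It assumes that a numeric Project identifier maps to the same number with a "PRJNA" prefix, as in the example above; that mapping may not hold for every project, so the prefix is left as a parameter. The function name and regular expression are illustrative only.

```python
import re

# Old-style line: "DBLINK", whitespace, "Project:", a numeric identifier.
PROJECT_LINE_RE = re.compile(r"^(DBLINK\s+)Project:\s*(\d+)\s*$")


def to_bioproject_dblink(line, prefix="PRJNA"):
    """Rewrite an old 'Project' DBLINK line (without newline) as a BioProject one.

    Lines that are not old-style Project DBLINKs are returned unchanged.
    The default prefix is an assumption taken from the example record.
    """
    match = PROJECT_LINE_RE.match(line)
    if match is None:
        return line
    return f"{match.group(1)}BioProject: {prefix}{match.group(2)}"


print(to_bioproject_dblink("DBLINK      Project: 60715"))
```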
In the coming months, many millions of sequence records will gradually
be modified, to make use of the new BioProject DBLINK. These modifications
will not be distributed via daily GenBank and RefSeq incremental-update
products.
However, the new BioProject links are gradually appearing on newly
submitted sequence records, and were present in GenBank and RefSeq
release and incremental-update products starting in December 2011.
In addition, the new BioProject links are visible via NCBI's Entrez:Nucleotide
resource.
1.3.5 Changes in the content of index files
As described in the GB 153 release notes, the 'index' files which accompany
GenBank releases (see Section 3.3) are considered to be a legacy data product by
NCBI, generated mostly for historical reasons. FTP statistics from January 2005
seemed to support this: the index files were transferred only half as frequently as
the files of sequence records. The inherent inefficiencies of the index file
format also lead us to suspect that they have little serious use by the user
community, particularly for EST and GSS records.
The software that generated the index file products received little
attention over the years, and finally reached its limitations in
February 2006 (Release 152.0). The required multi-server queries which
obtained and sorted many millions of rows of terms from several different
databases simply outgrew the capacity of the hardware used for GenBank
Release generation.
Our short-term solution is to cease generating some index-file content
for all EST sequence records, and for GSS sequence records that originate
via direct submission to NCBI.
The three gbacc*.idx index files continue to reflect the entirety of the
release, including all EST and GSS records; however, the file contents are
unsorted.
These 'solutions' are really just stop-gaps, and we will likely pursue
one of two options:
a) Cease support of the 'index' file products altogether.
b) Provide new products that present some of the most useful data from
the legacy 'index' files, and cease support for other types of index data.
If you are a user of the 'index' files associated with GenBank releases, we
encourage you to make your wishes known, either via the GenBank newsgroup,
or via email to NCBI's Service Desk:
    info@ncbi.nlm.nih.gov
Our apologies for any inconvenience that these changes may cause.
1.3.6 GSS File Header Problem
GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped by the first, it does not know how to number its own
output files.
There is thus a discrepancy between the filenames and file headers for
103 of the GSS flatfiles in Release 189.0. Consider gbgss153.seq :
GBGSS1.SEQ          Genetic Sequence Data Bank
                          April 15 2012
                NCBI-GenBank Flat File Release 189.0
                        GSS Sequences (Part 1)
   87113 loci,    63992234 bases, from    87113 reported sequences
Here, the header gives the filename as GBGSS1.SEQ and the part number as "1", though the file
has been renamed as "153" based on the number of files dumped from the other
system. We hope to resolve this discrepancy at some point, but the priority
is certainly much lower than many other tasks.
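Scripts that cross-check file headers against filenames should expect this mismatch for the affected GSS files. A minimal Python sketch of such a comparison follows; the function name and the handling of a ".gz" suffix are assumptions made for this example.

```python
import os


def header_name_mismatch(filename, first_header_line):
    """Return True when the name in the header differs from the actual filename."""
    name = os.path.basename(filename)
    # Assumption: a compressed copy keeps the original name plus ".gz".
    if name.lower().endswith(".gz"):
        name = name[:-3]
    tokens = first_header_line.split()
    header_name = tokens[0] if tokens else ""
    # Header names are upper case (GBGSS1.SEQ); filenames are lower case.
    return header_name.upper() != name.upper()


header = "GBGSS1.SEQ          Genetic Sequence Data Bank"
print(header_name_mismatch("gbgss153.seq", header))
print(header_name_mismatch("gbgss1.seq", header))
```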
1.4 Upcoming Changes
1.4.2 New /pseudogene qualifier; /pseudo will be deprecated
A new controlled-vocabulary /pseudogene qualifier has been under
discussion within the INSDC since the last collaborative INSD meeting
in May 2011. The new qualifier is intended for the annotation
of certain well-defined classes of pseudogenes and, at the same time,
to replace the poorly-defined /pseudo qualifier, which has been
used for a variety of different situations by each INSDC member.
Although a formal definition of /pseudogene is not yet available, we
do have a tentative list of the values for the new qualifier:
"processed" - the pseudogene has arisen by reverse
transcription of a mRNA into cDNA, followed by reintegration into the
genome. Therefore, it has lost any intron/exon structure, and it will
have a pseudo-polyA-tail (if a young pseudogene).
"unprocessed" - the pseudogene has arisen from a copy of
the parent gene by means other than reverse transcription. This usually
covers duplication (transposition [not retrotransposition] and perhaps
recombination) followed by accumulation of random mutations. The changes,
compared to their functional homolog, include insertions, deletions,
premature stop codons, frameshifts and a higher proportion of
non-synonymous versus synonymous substitutions.
"unitary" - the pseudogene has no parent. It is the
original gene, which is functional in some species but disrupted in some
way (indels, mutation, recombination) in another species or strain. In a
lot of cases, such changes would kill the organism, particularly with
house-keeping genes.
"allelic" - a (unitary) pseudogene that is stable in the
population but, importantly, has a functional alternative allele also
in the population. That is, one strain may have the gene, while another strain
may have the pseudogene. MHC haplotypes have allelic pseudogenes.
"unknown" - would imply that the submitter does not know
the method of pseudogenisation.
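Because the qualifier is to use a controlled vocabulary, validation amounts to a set-membership test. The Python sketch below uses the tentative value list above; both the list and the quoted /pseudogene="value" syntax assumed by the regular expression could change before the qualifier is finalized, and the names used here are illustrative.

```python
import re

# Tentative values from the list above; subject to change by the INSDC.
PSEUDOGENE_VALUES = frozenset(
    {"processed", "unprocessed", "unitary", "allelic", "unknown"}
)

# Assumed flatfile syntax for the qualifier: /pseudogene="value"
PSEUDOGENE_RE = re.compile(r'^/pseudogene="([^"]*)"$')


def pseudogene_value(qualifier_text):
    """Return the value of a /pseudogene qualifier, or raise ValueError."""
    match = PSEUDOGENE_RE.match(qualifier_text.strip())
    if match is None:
        raise ValueError(f"not a /pseudogene qualifier: {qualifier_text}")
    value = match.group(1)
    if value not in PSEUDOGENE_VALUES:
        raise ValueError(f"value not in tentative vocabulary: {value}")
    return value


print(pseudogene_value('/pseudogene="processed"'))
```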
If a final definition of /pseudogene can be arrived at within the
next few weeks, then the new qualifier would be legal as of GenBank
Release 190.0 in June of 2012. We will keep users posted about this
new qualifier via the GenBank newsgroup and these release notes.