
[Genbank-bb] GenBank 194.0 : Catalog Files Available

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Tue Feb 26 17:05:54 EST 2013


Greetings GenBank Users,

As described in the announcement for GenBank 194.0 availability,
we now provide files which catalog the contents of a release.

The genbank/catalog directory at the NCBI FTP site contains these
files:

gb194.catalog.est.txt.gz
gb194.catalog.gss.txt.gz
gb194.catalog.other.txt.gz

gb194.gene_list.gss.txt.gz
gb194.gene_list.other.txt.gz

gb194.pmid_list.est.txt.gz
gb194.pmid_list.gss.txt.gz
gb194.pmid_list.other.txt.gz

The format and content of these files are described in Section 1.3.2
of the GenBank 194.0 release notes (gbrel.txt).

Note that there is no gene_list file for EST, because EST records
at the NCBI are not annotated with anything other than source
features.

With Release 194.0, we now report RNA molecule types in the fourth
column more precisely: RNA, mRNA, rRNA, tRNA, ncRNA. In
addition, DNA sequences are now reported as "DNA", and nucleic
acid sequences of unspecified type are reported as "NA".
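
As a rough illustration of working with the fourth column, the
Python sketch below tallies molecule types in a catalog file. It
assumes the files are tab-delimited text, which is an inference
rather than something stated here; the function names are
hypothetical.

```python
import gzip
from collections import Counter

# Molecule types named in this announcement for the fourth column.
KNOWN_MOLTYPES = {"RNA", "mRNA", "rRNA", "tRNA", "ncRNA", "DNA", "NA"}

def tally_moltypes(lines):
    """Count the fourth-column value of each catalog line."""
    counts = Counter()
    for line in lines:
        # Assumption: fields are separated by tabs.
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        counts[fields[3]] += 1
    return counts

def unknown_moltypes(counts):
    """Return fourth-column values not named in the announcement."""
    return sorted(set(counts) - KNOWN_MOLTYPES)

def tally_moltypes_in_file(path):
    """Tally molecule types in a gzipped catalog file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return tally_moltypes(handle)
```

Calling tally_moltypes_in_file("gb194.catalog.other.txt.gz") on a
downloaded copy would return a Counter keyed by molecule type, and
unknown_moltypes() would flag any value outside the list above.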

We previously failed to mention that it is possible (though rare)
for one sequence to be associated with multiple BioProjects or
multiple BioSamples. In such cases, there will be multiple 
BioProject Accession Numbers (comma-separated) in the ninth field
of the catalog file or multiple BioSample Accession Numbers
(comma-separated) in the tenth field of the catalog file.

Here are two examples:

GL629710        GL629710.2      355002998       DNA     1167526 Bos taurus      9913    CON     PRJNA12555,PRJNA20275

KA307634        KA307634.1      400083130       mRNA    317     Capra hircus    9925    TSA     PRJNA170226   SAMN01086905,SAMN01086906,SAMN01086907
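
For those parsing these lines programmatically, here is a minimal
Python sketch that splits the ninth and tenth fields on commas.
Only the meanings of the fourth, eighth, ninth, and tenth columns
are stated in this message; the names given to the other columns
are guesses based on the examples above, and tab delimiting is
likewise an assumption. Section 1.3.2 of gbrel.txt is the
authoritative description.

```python
from typing import List, NamedTuple

class CatalogRecord(NamedTuple):
    accession: str          # assumed meaning of column 1
    accession_version: str  # assumed meaning of column 2
    gi: str                 # assumed meaning of column 3
    moltype: str            # column 4: molecule type
    length: int             # assumed meaning of column 5
    organism: str           # assumed meaning of column 6
    taxid: str              # assumed meaning of column 7
    division: str           # column 8: Division Code
    bioprojects: List[str]  # column 9: BioProject accessions
    biosamples: List[str]   # column 10: BioSample accessions

def split_list(value):
    """Split a comma-separated field, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]

def parse_catalog_line(line):
    """Parse one catalog line (assumed to be tab-delimited)."""
    fields = line.rstrip("\r\n").split("\t")
    # Pad so that records lacking trailing fields still parse.
    fields += [""] * (10 - len(fields))
    return CatalogRecord(
        accession=fields[0],
        accession_version=fields[1],
        gi=fields[2],
        moltype=fields[3],
        length=int(fields[4]),
        organism=fields[5],
        taxid=fields[6],
        division=fields[7],
        bioprojects=split_list(fields[8]),
        biosamples=split_list(fields[9]),
    )
```

Applied to the two examples above, this would yield two BioProject
accessions and no BioSample for GL629710, and one BioProject with
three BioSamples for KA307634.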

There is one further change planned which will impact the
eighth column (Division Code) of the catalog for GenBank 195.0.

We expect to provide two division codes, the first indicating
the sequence type (CON, EST, GSS, HTG, etc.), and the second
reflecting the division code associated with the organism's
lineage (BCT, PRI, ROD, PLN, etc.). The division codes will be
comma-separated. For standard sequence records (non-EST, non-GSS,
non-HTG, etc.), division code "STD" will be used for the sequence
type.
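
Parsers that need to handle both the current single-code field and
the planned two-code field might be sketched as follows in Python.
The function and type names are hypothetical, the example values
are made up for illustration, and the final 195.0 format may
differ from what is described here.

```python
from typing import NamedTuple, Optional

class Division(NamedTuple):
    sequence_type: Optional[str]  # e.g. CON, EST, GSS, HTG, or STD
    lineage: Optional[str]        # e.g. BCT, PRI, ROD, PLN
    legacy: Optional[str]         # single code as in Release 194.0

def split_division(field):
    """Interpret the eighth column in either format."""
    parts = [part.strip() for part in field.split(",")]
    if len(parts) == 1:
        # Single code: keep it uninterpreted.
        return Division(None, None, parts[0])
    if len(parts) == 2:
        # Planned format: sequence type first, lineage second.
        return Division(parts[0], parts[1], None)
    raise ValueError("unexpected division field: %r" % field)
```

With this sketch, a hypothetical value of "STD,BCT" would give
sequence type STD and lineage BCT, while a current value such as
"CON" would be returned unchanged in the legacy slot.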

The legacy "index" file products (gbacc*.idx, gbaut*.idx, gbgen*.idx,
gbkey*.idx, and gbsec*.idx) are now officially discontinued.
February's Release 194.0 is the last GenBank release for which
they are provided.

If you have any suggestions for changes or additions to the 
new catalog-related products, now is an exceedingly good time
to make your wishes known, via an email to the NCBI Help Desk
(info from ncbi.nlm.nih.gov).

Regards,

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS



