Greetings GenBank Users,
Starting on January 12, 2009, a new type of data file will be
made available for GenBank WGS (Whole Genome Shotgun) projects
in the WGS areas of our FTP site.
Since their inception in 2002, WGS projects have had an associated
'WGS-master' record, which summarizes the content of a project. Here
is a link to the master record for project ABRT (Philippine tarsier):
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=203287470
And here is an excerpt from that master record:
LOCUS       ABRT010000000        1201173 rc    DNA     linear   PRI 18-NOV-2008
DEFINITION  Tarsius syrichta, whole genome shotgun sequence.
ACCESSION   ABRT000000000
VERSION     ABRT000000000.1  GI:203287470
PROJECT     GenomeProject:20339
KEYWORDS    WGS.
SOURCE      Tarsius syrichta (Philippine tarsier)
....
WGS         ABRT010000001-ABRT011201173
WGS_SCAFLD  GG299110-GG500513
//
This flatfile representation of the ABRT WGS-master does *not*
conform to the specifications for normal GenBank flatfiles.
For example:
- It has neither sequence data nor a CONTIG join() statement.
- The 'rc' (record count) value on the LOCUS line represents the
number of sequence-overlap contig records in the project, rather
than a basepair count.
- It uses the undocumented linetypes 'WGS' and 'WGS_SCAFLD', which
give the accession-number ranges for the 1,201,173
sequence-overlap contig sequences in the project, and for
the 201,404 CON-division records that have been constructed
from the ABRT01 contigs.
Nonetheless, a WGS-master record has utility because it provides
an overview of many important characteristics of a WGS project,
in a simple and concise way.
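As an illustration of how the 'WGS' and 'WGS_SCAFLD' ranges relate to the
record counts quoted above, here is a minimal Python sketch. The function
names are hypothetical, and the sketch assumes that each linetype carries a
single range of the form PREFIX+digits "-" PREFIX+digits, as in the ABRT
excerpt. Other master records may not follow that assumption.

```python
import re

# An accession is assumed to be an uppercase prefix followed by digits,
# e.g. "ABRT010000001" or "GG299110" (as in the ABRT excerpt).
_ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def count_accessions(range_text):
    """Return the number of accessions in a 'FIRST-LAST' range (inclusive)."""
    first, last = range_text.split("-")
    m_first, m_last = _ACCESSION.match(first), _ACCESSION.match(last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("unexpected accession range: %r" % range_text)
    return int(m_last.group(2)) - int(m_first.group(2)) + 1


def parse_master_ranges(flatfile_text):
    """Collect record counts for the WGS and WGS_SCAFLD linetypes."""
    counts = {}
    for line in flatfile_text.splitlines():
        fields = line.split()
        # Only lines made of exactly a linetype and one range are considered.
        if len(fields) == 2 and fields[0] in ("WGS", "WGS_SCAFLD"):
            counts[fields[0]] = count_accessions(fields[1])
    return counts


excerpt = """\
KEYWORDS WGS.
WGS ABRT010000001-ABRT011201173
WGS_SCAFLD GG299110-GG500513
//
"""
print(parse_master_ranges(excerpt))
```

For the ABRT excerpt, the 'WGS' count computed this way equals the 'rc'
value on the LOCUS line.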
The ASN.1 version of WGS-master records will be placed in:
ftp://ftp.ncbi.nih.gov/ncbi-asn1/wgs
and the file naming convention will be:
wgs.XXXX.mstr.bse.gz
These files will contain a gzip-compressed, binary ASN.1 Seq-entry
value. 'XXXX' represents a four-character WGS Project Code, such as
ABYH.
The GenBank flatfile representation of WGS-master records will be
placed in:
ftp://ftp.ncbi.nih.gov/genbank/wgs
and the file naming convention will be:
wgs.XXXX.mstr.gbff.gz
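The two naming conventions above could be expressed in a script roughly as
follows. This is only a sketch: the function name and the dictionary keys
are hypothetical, and the only validation applied is the four-character
length stated above.

```python
# FTP directories as given in this announcement.
ASN1_DIR = "ftp://ftp.ncbi.nih.gov/ncbi-asn1/wgs"
FLATFILE_DIR = "ftp://ftp.ncbi.nih.gov/genbank/wgs"


def master_file_urls(project_code):
    """Build the expected WGS-master file URLs for a WGS Project Code."""
    if len(project_code) != 4:
        raise ValueError("WGS Project Codes have four characters")
    return {
        # gzip-compressed, binary ASN.1 Seq-entry
        "asn1": "%s/wgs.%s.mstr.bse.gz" % (ASN1_DIR, project_code),
        # gzip-compressed GenBank flatfile
        "flatfile": "%s/wgs.%s.mstr.gbff.gz" % (FLATFILE_DIR, project_code),
    }


urls = master_file_urls("ABYH")
print(urls["asn1"])
print(urls["flatfile"])
```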
Here is an example of the filenames that one would encounter for
the ABYH project in the /genbank/wgs area, as of January 12:
wgs.ABYH.1.gbff.gz
wgs.ABYH.1.gnp.gz
wgs.ABYH.1.qscore.gz
wgs.ABYH.mstr.gbff.gz
If you process the GenBank flatfile representation of WGS projects
and are *not* interested in WGS-masters, you may need to add a
filtering step that excludes the master files from automated FTP
transfers, because their filenames closely resemble those of the
existing WGS data files.
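One possible way to exclude the master files is sketched below in Python.
The function name and the regular expression are assumptions based only on
the naming convention described in this announcement (wgs.XXXX.mstr.*), so
please adapt them to your own transfer scripts.

```python
import re

# Matches names such as "wgs.ABYH.mstr.gbff.gz": "wgs.", a four-character
# project code, then ".mstr." (pattern assumed from the convention above).
MASTER_PATTERN = re.compile(r"^wgs\.\w{4}\.mstr\.")


def drop_master_files(filenames):
    """Return the filenames that are not WGS-master files."""
    return [name for name in filenames if not MASTER_PATTERN.match(name)]


listing = [
    "wgs.ABYH.1.gbff.gz",
    "wgs.ABYH.1.gnp.gz",
    "wgs.ABYH.1.qscore.gz",
    "wgs.ABYH.mstr.gbff.gz",
]
print(drop_master_files(listing))
```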
Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS