Greetings GenBank Users,
The gbcon133.seq.gz and gbcon315.seq.gz files of GenBank Release 209.0
contained a total of 10 records with corrupted CONTIG-line contents:
con133 : GG745414 GG745416 GG745418 GG745422 GG745423 GG745425 GG745427 GG745428
con315 : KQ257329 KQ257330
The original sizes and timestamps of the affected files were:
-r--r--r-- 1 ftp anonymous 49004277 Aug 16 19:42 gbcon133.seq.gz
-r--r--r-- 1 ftp anonymous 39983633 Aug 16 19:42 gbcon315.seq.gz
In most cases, an underlying numeric GI sequence identifier was not
converted to an Accession.Version value. However, the problem also produced
completely erroneous content (see KQ257329) and subtly incorrect
accession values (see KQ257330). Sample diffs illustrating each class
of invalid CONTIG line are provided below.
As of approximately 4:30pm EDT on Wednesday August 26 2015, the files
containing these records were patched and reinstalled at the NCBI FTP
site:
-r--r--r-- 1 ftp anonymous 49004254 Aug 26 16:31 gbcon133.seq.gz
-r--r--r-- 1 ftp anonymous 39983639 Aug 26 16:31 gbcon315.seq.gz
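For anyone maintaining a local mirror, the byte sizes listed above give a quick (if weak) way to tell whether a downloaded copy predates the patch. The following Python sketch is only an illustration: the table and the `classify` function are hypothetical names, and a matching size is a hint, not proof of file contents.

```python
import os

# Compressed sizes in bytes, copied from the FTP listings in this announcement.
KNOWN_SIZES = {
    "gbcon133.seq.gz": {"original": 49004277, "patched": 49004254},
    "gbcon315.seq.gz": {"original": 39983633, "patched": 39983639},
}

def classify(filename: str, size: int) -> str:
    """Label a file as 'original', 'patched', 'unknown', or 'not affected'."""
    sizes = KNOWN_SIZES.get(os.path.basename(filename))
    if sizes is None:
        return "not affected"  # file is not one of the two patched files
    for label, expected in sizes.items():
        if size == expected:
            return label
    return "unknown"  # size matches neither listing

def classify_local(path: str) -> str:
    """Convenience wrapper that reads the size of a file on disk."""
    return classify(path, os.path.getsize(path))
```

A copy reported as "original" should be downloaded again; one reported as "unknown" may be truncated or come from a different release.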
The ASN.1 version of GenBank 209.0 was not affected.
We would like to thank several users at Chemical Abstracts Service (www.cas.org)
for detecting this problem in the GenBank 209.0 release files. We appreciate the
scrutiny of GenBank data products that our users provide.
Independent of the CONTIG-line issue, we *also* discovered that one of the
CON-division GenBank flatfiles failed to be installed when release 209.0
was made available : gbcon324.seq.gz .
That particular file provides a CON-division representation of GenBank
records which are members of "segmented sets" (a legacy convention that
is no longer used). The file has also just been installed :
-r--r--r-- 1 ftp anonymous 2388132 Aug 26 16:31 gbcon324.seq.gz
And finally, because there was no mention of gbcon324.seq.gz in the release
notes, they have also been patched and re-installed:
-r--r--r-- 1 ftp anonymous 408593 Aug 26 16:32 gbrel.txt
--- /am/ftp-genbank/gbrel.txt 2015-08-16 19:42:26.924492000 -0400
+++ gbrel.txt 2015-08-26 16:25:17.874142000 -0400
2451. gbvrt9.seq - Other vertebrate sequence entries, part 9.
+2452. gbcon324.seq - Constructed sequence entries, part 324.
77359244 gbcon323.seq
+ 18870571 gbcon324.seq
Our apologies for any inconvenience that these problems may have caused.
Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS
Examples of CONTIG line problems in GenBank 209.0 :
LOCUS GG745414 16080 bp DNA linear CON 31-JUL-2015
DEFINITION Allomyces macrogynus ATCC 38327 genomic scaffold supercont3.87,
whole genome shotgun sequence.
ACCESSION GG745414 ACDU01000000
VERSION GG745414.1 GI:289160277
DBLINK BioProject: PRJNA20563
BioSample: SAMN02953744
< CONTIG join(gi|284183230:1..621,gap(2755),gi|284183229:1..1096,gap(2665),
< gi|284183228:1..1344,gap(2537),gi|284183227:1..839,gap(unk100),
< gi|284183226:1..739,gap(2640),gi|284183225:1..744)
---
> CONTIG join(ACDU01008927.1:1..621,gap(2755),ACDU01008928.1:1..1096,
> gap(2665),ACDU01008929.1:1..1344,gap(2537),ACDU01008930.1:1..839,
> gap(unk100),ACDU01008931.1:1..739,gap(2640),ACDU01008932.1:1..744)
LOCUS KQ257329 1032592 bp DNA linear CON 23-JUL-2015
DEFINITION Vibrio cholerae 2740-80 genomic scaffold supercont1.1, whole genome
shotgun sequence.
ACCESSION KQ257329 AAUT02000000
VERSION KQ257329.1 GI:902711342
DBLINK BioProject: PRJNA18253
BioSample: SAMN02435841
< CONTIG join(}"¨#0"¾#"Þ#521969028.1:1..393878,gap(210),
< AAUT02000002.1:1..199900,gap(2128),AAUT02000003.1:1..152904,
< gap(3592),AAUT02000004.1:1..5544,gap(1570),AAUT02000005.1:1..24957,
< gap(363),CRO19455.1:1..176535,gap(32),AAUT02000007.1:1..70979)
---
> CONTIG join(AAUT02000001.1:1..393878,gap(210),AAUT02000002.1:1..199900,
> gap(2128),AAUT02000003.1:1..152904,gap(3592),
> AAUT02000004.1:1..5544,gap(1570),AAUT02000005.1:1..24957,gap(363),
> AAUT02000006.1:1..176535,gap(32),AAUT02000007.1:1..70979)
LOCUS KQ257330 684254 bp DNA linear CON 23-JUL-2015
DEFINITION Vibrio cholerae 2740-80 genomic scaffold supercont1.2, whole genome
shotgun sequence.
ACCESSION KQ257330 AAUT02000000
VERSION KQ257330.1 GI:902711341
DBLINK BioProject: PRJNA18253
BioSample: SAMN02435841
< gap(154),AAUT02000012.1:1..12782,gap(333),GH1875755:1..609033,
---                                         ^^^^^^^^^
                                            "2+7" (two letters followed by seven
                                            digits) is not a legal accession
                                            number format.
> gap(154),AAUT02000012.1:1..12782,gap(333),AAUT02000013.1:1..609033,
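
The three classes of invalid CONTIG lines shown above (unconverted GI identifiers, garbage content, and malformed or wrong-type accessions) could be flagged with a small check along the following lines. This is a rough Python sketch and not an NCBI tool: the function name is hypothetical, and the regular expression assumes the nucleotide Accession.Version formats in use around this release (1 letter + 5 digits, 2 letters + 6 digits, or WGS-style 4 letters + 2 digits + 6 to 8 digits).

```python
import re

# Assumed nucleotide Accession.Version formats (an illustration, not a full spec):
#   1 letter + 5 digits, 2 letters + 6 digits,
#   or WGS-style 4 letters + 2 digits + 6 to 8 digits, each followed by ".version".
ACC_VER = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{2}\d{6,8})\.\d+$")

def suspicious_components(contig_value: str) -> list[str]:
    """Return identifiers in a CONTIG join() that do not look like Accession.Version."""
    text = "".join(contig_value.split())  # remove line wraps and spaces
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    bad = []
    for part in text.split(","):
        if not part or part.startswith("gap("):
            continue  # skip empty fragments and gap(...) elements
        if part.startswith("complement(") and part.endswith(")"):
            part = part[len("complement("):-1]
        ident = part.rsplit(":", 1)[0]  # text before the ":from..to" range
        if not ACC_VER.match(ident):
            bad.append(ident)
    return bad

print(suspicious_components("join(gi|284183230:1..621,gap(2755),ACDU01008928.1:1..1096)"))
# prints ['gi|284183230']
```

Under these assumptions the check also flags GH1875755 (no legal format) and CRO19455.1 (a protein-style 3+5 accession), but it cannot detect a well-formed nucleotide accession that simply points at the wrong record.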