[Genbank-bb] GenBank Release 209.0 Problems : Ten records with corrupted CONTIG lines, and one missing CON-division file

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Wed Aug 26 15:43:17 EST 2015

Greetings GenBank Users,

The gbcon133.seq.gz and gbcon315.seq.gz files of GenBank Release 209.0
contained a total of 10 records with corrupted CONTIG-line contents:

con133 : GG745414 GG745416 GG745418 GG745422 GG745423 GG745425 GG745427 GG745428
con315 : KQ257329 KQ257330

The original sizes and timestamps of the affected files were:

-r--r--r--   1 ftp      anonymous 49004277 Aug 16 19:42 gbcon133.seq.gz
-r--r--r--   1 ftp      anonymous 39983633 Aug 16 19:42 gbcon315.seq.gz

In most cases, an underlying numeric GI sequence identifier was not
converted to an Accession.Version value. But the problem also led to
completely erroneous content (see KQ257329), and to subtly incorrect
accession values (see KQ257330). Sample diffs illustrating each class
of invalid CONTIG line are provided below.
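
For users who want to screen their own copies of CON-division records,
the classes of problem described above can be flagged with a small
script. What follows is only a minimal sketch, not the check that NCBI
used. The function names (contig_components, classify, suspicious) are
hypothetical, and the accession pattern is an assumption covering
nucleotide Accession.Version shapes of 1 letter + 5 digits, 2 letters +
6 digits, and WGS-style 4 letters + 8 to 10 digits. Anything else,
including a protein-style 3+5 accession, is reported as malformed.

```python
import re

# Assumption: nucleotide Accession.Version shapes in use around this release:
# 1 letter + 5 digits, 2 letters + 6 digits, or WGS 4 letters + 8 to 10 digits.
ACC_VER = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{4}\d{8,10})\.\d+$")
GAP = re.compile(r"^gap\((?:unk)?\d*\)$")


def contig_components(contig_text):
    """Split CONTIG text (possibly spanning several lines) into its items."""
    text = "".join(contig_text.split())          # drop all whitespace
    if text.startswith("CONTIG"):
        text = text[len("CONTIG"):]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    return [item for item in text.split(",") if item]


def classify(component):
    """Label one item as 'gap', 'ok', 'unconverted_gi' or 'malformed'."""
    if GAP.match(component):
        return "gap"
    if component.startswith("complement(") and component.endswith(")"):
        component = component[len("complement("):-1]
    seq_id = component.rsplit(":", 1)[0]         # strip the 1..N range
    if seq_id.startswith("gi|"):
        return "unconverted_gi"
    if ACC_VER.match(seq_id):
        return "ok"
    return "malformed"


def suspicious(contig_text):
    """Return (component, label) pairs for every item that looks wrong."""
    labelled = [(c, classify(c)) for c in contig_components(contig_text)]
    return [(c, kind) for c, kind in labelled if kind not in ("ok", "gap")]
```

Passing the text of a CONTIG line (without any leading diff markers) to
suspicious() returns an empty list when every component is either a gap
or a well-formed Accession.Version reference.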

As of approximately 4:30pm EDT on Wednesday August 26 2015, the files
containing these records were patched and reinstalled at the NCBI FTP
site:

-r--r--r--   1 ftp      anonymous 49004254 Aug 26 16:31 gbcon133.seq.gz
-r--r--r--   1 ftp      anonymous 39983639 Aug 26 16:31 gbcon315.seq.gz

The ASN.1 version of GenBank 209.0 was not affected.

We would like to thank several users at Chemical Abstracts Service (www.cas.org)
for detecting this problem in the GenBank 209.0 release files. We appreciate the
scrutiny of GenBank data products that our users provide.

Independent of the CONTIG-line issue, we *also* discovered that one of the
CON-division GenBank flatfiles failed to be installed when release 209.0
was made available : gbcon324.seq.gz .

That particular file provides a CON-division representation of GenBank
records which are members of "segmented sets" (a legacy convention that
is no longer used). The file has also just been installed :

-r--r--r--   1 ftp      anonymous 2388132 Aug 26 16:31 gbcon324.seq.gz

And finally, because there was no mention of gbcon324.seq.gz in the release
notes, they have also been patched and re-installed: 

-r--r--r--   1 ftp      anonymous 408593 Aug 26 16:32 gbrel.txt

	--- /am/ftp-genbank/gbrel.txt   2015-08-16 19:42:26.924492000 -0400
	+++ gbrel.txt   2015-08-26 16:25:17.874142000 -0400

	 2451. gbvrt9.seq - Other vertebrate sequence entries, part 9.
	+2452. gbcon324.seq - Constructed sequence entries, part 324.

	   77359244     gbcon323.seq
	+  18870571     gbcon324.seq
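
Sites that mirror the release can compare their local copies against the
patched file sizes listed in this message. The following is a minimal
sketch under the assumption that all four files sit in a single local
directory. The names PATCHED_SIZES and stale_files are hypothetical, and
a size match is only a quick indication, not a checksum.

```python
import os

# Sizes in bytes of the patched files, as listed in this message.
PATCHED_SIZES = {
    "gbcon133.seq.gz": 49004254,
    "gbcon315.seq.gz": 39983639,
    "gbcon324.seq.gz": 2388132,
    "gbrel.txt": 408593,
}


def stale_files(mirror_dir):
    """Return names of listed files that are missing or have a different size."""
    stale = []
    for name, size in PATCHED_SIZES.items():
        path = os.path.join(mirror_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) != size:
            stale.append(name)
    return stale
```

Any file name returned by stale_files() should be downloaded again.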

Our apologies for any inconvenience that these problems may have caused.

Mark Cavanaugh

Examples of CONTIG line problems in GenBank 209.0 :

LOCUS       GG745414               16080 bp    DNA     linear   CON 31-JUL-2015
DEFINITION  Allomyces macrogynus ATCC 38327 genomic scaffold supercont3.87,
            whole genome shotgun sequence.
ACCESSION   GG745414 ACDU01000000
VERSION     GG745414.1  GI:289160277
DBLINK      BioProject: PRJNA20563
            BioSample: SAMN02953744

< CONTIG      join(gi|284183230:1..621,gap(2755),gi|284183229:1..1096,gap(2665),
<             gi|284183228:1..1344,gap(2537),gi|284183227:1..839,gap(unk100),
<             gi|284183226:1..739,gap(2640),gi|284183225:1..744)
> CONTIG      join(ACDU01008927.1:1..621,gap(2755),ACDU01008928.1:1..1096,
>             gap(2665),ACDU01008929.1:1..1344,gap(2537),ACDU01008930.1:1..839,
>             gap(unk100),ACDU01008931.1:1..739,gap(2640),ACDU01008932.1:1..744)

LOCUS       KQ257329             1032592 bp    DNA     linear   CON 23-JUL-2015
DEFINITION  Vibrio cholerae 2740-80 genomic scaffold supercont1.1, whole genome
            shotgun sequence.
ACCESSION   KQ257329 AAUT02000000
VERSION     KQ257329.1  GI:902711342
DBLINK      BioProject: PRJNA18253
            BioSample: SAMN02435841

< CONTIG      join(}"¨#0"¾#"Þ#521969028.1:1..393878,gap(210),
<             AAUT02000002.1:1..199900,gap(2128),AAUT02000003.1:1..152904,
<             gap(3592),AAUT02000004.1:1..5544,gap(1570),AAUT02000005.1:1..24957,
<             gap(363),CRO19455.1:1..176535,gap(32),AAUT02000007.1:1..70979)
> CONTIG      join(AAUT02000001.1:1..393878,gap(210),AAUT02000002.1:1..199900,
>             gap(2128),AAUT02000003.1:1..152904,gap(3592),
>             AAUT02000004.1:1..5544,gap(1570),AAUT02000005.1:1..24957,gap(363),
>             AAUT02000006.1:1..176535,gap(32),AAUT02000007.1:1..70979)

LOCUS       KQ257330              684254 bp    DNA     linear   CON 23-JUL-2015
DEFINITION  Vibrio cholerae 2740-80 genomic scaffold supercont1.2, whole genome
            shotgun sequence.
ACCESSION   KQ257330 AAUT02000000
VERSION     KQ257330.1  GI:902711341
DBLINK      BioProject: PRJNA18253
            BioSample: SAMN02435841

<             gap(154),AAUT02000012.1:1..12782,gap(333),GH1875755:1..609033,
---                                                     ^^^^^^^^^
                                                        "2+7" is not a legal
                                                        accession number format.

>             gap(154),AAUT02000012.1:1..12782,gap(333),AAUT02000013.1:1..609033,

