
FTP Products for TPA Sequences

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Thu Jan 30 17:52:05 EST 2003

Greetings GenBank Users,

As described in the GenBank 133.0 release notes:


a new class of sequence data is now being collected by GenBank, EMBL, and
DDBJ: Third-Party Annotation (TPA) data. Enclosed below is a slightly
updated version of the TPA announcement. More information about the TPA
effort can be found at the NCBI website:


Starting Friday, January 31, TPA update products will be made available
at the NCBI FTP site. Daily, incremental update files for all new/updated
TPA records will be located in:


The TPA updates will have filename prefixes of:


Filename suffixes for these updates will be:

	.bbs     : binary Bioseq-set (ASN.1)
	.gbff    : GenBank flatfile
	.gnp     : GenPept flatfile
	.fsa_nt  : Nucleotide FASTA
	.fsa_aa  : Protein FASTA
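
As a rough illustration of the suffix list above, the following Python
sketch maps a filename to its format description. The function name
`tpa_file_format` and the sample filenames are hypothetical, chosen only
for illustration; the actual TPA filename prefixes are not reproduced
here, and the sketch assumes the suffix is the final part of the name.

```python
# Suffix-to-format table copied from the list in this announcement.
TPA_SUFFIXES = {
    ".bbs": "binary Bioseq-set (ASN.1)",
    ".gbff": "GenBank flatfile",
    ".gnp": "GenPept flatfile",
    ".fsa_nt": "Nucleotide FASTA",
    ".fsa_aa": "Protein FASTA",
}


def tpa_file_format(filename):
    """Return the format description for a filename, or None if unknown.

    Hypothetical helper: assumes the suffix ends the filename.
    """
    for suffix, description in TPA_SUFFIXES.items():
        if filename.endswith(suffix):
            return description
    return None


# "example" is a placeholder prefix, not a real TPA filename prefix.
print(tpa_file_format("example.gbff"))
print(tpa_file_format("example.fsa_aa"))
```

A lookup table keeps the mapping in one place, so a new suffix would
only need one additional entry.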

We do not expect to generate complete releases (similar to GenBank
releases) for TPA until the volume of TPA records has substantially
increased. Until that time, a set of cumulative TPA update files
containing all TPA records will be made available in:


The cumulative TPA update files will have filename prefixes of:


They will utilize the same filename suffixes that are listed above.

NOTE: The cumulative TPA products will be *discontinued* once TPA
releases are being built.

Initially, the TPA records included in these update files will be
limited to those submitted to GenBank. EMBL and DDBJ TPAs will be
added no later than Friday, February 7, 2003.

READMEs for the TPA directories will also be installed by that date.

Mark Cavanaugh

TPA Announcement

  The Third-Party Annotation Data Collection

  Pursuant to agreements made at their 2002 Collaborative Meeting,
  DDBJ/EMBL/GenBank have undertaken the collection of a new class of
  sequence data: Third-Party Annotation (TPA).

  The TPA data-collection will complement the existing DDBJ/EMBL/GenBank
  comprehensive database of primary nucleotide sequences, which typically
  result from direct sequencing of cDNAs, ESTs, genomic DNAs, etc.

  'Primary data' are defined to be data for which the submitting group has
  done the sequencing and annotation, and hence, as owner of the data,
  has privileges to update/correct the associated sequence records.

  In contrast, non-primary (TPA) sequences are defined as sequences which:

  a) consist exclusively of sequence data from one, or several,
     previously-existing primary entries owned by other groups, or

  b) consist of a mixture of previously-existing primary entries,
     some owned by the TPA submitter and the rest by one or more other
     groups.

  TPA categories and requirements  

  Users can submit new annotation of single sequences or assemblies
  of sequences that are owned by other groups to the TPA data
  collection.

  The primary sequences must be available in the DDBJ/EMBL/GenBank
  databases, and submitters to the TPA database must provide the
  accession numbers of the primary sequences in their TPA submission.

  TPA sequences based on primary data available only in proprietary
  databases are not accepted.

  Some examples of data submissions accepted for TPA include:

     1. analysis and re-annotation of DDBJ/EMBL/GenBank sequences
        owned by other groups
     2. gap-filling, in which a TPA submitter might utilize HTG or
        EST data to complete an otherwise incomplete sequence
     3. TPA sequences based on NCBI/Ensembl trace archive data
     4. TPA sequences based on Whole Genome Shotgun (WGS) sequences

  Sequences based on primary data from multiple organisms are not
  accepted.

  Sequences will not be accepted for TPA in lieu of an update to
  primary records. A submitter who owns a primary record is expected
  to update that record as new sequence is determined, or sequencing
  ambiguities/errors are resolved.

  Any newly-determined sequence data that is to be part of a TPA
  record must first be submitted as a new primary sequence to
  DDBJ/EMBL/GenBank.

  The TPA dataset is intended to present sequence data and annotation
  in support of actual biological discoveries that are published in
  the scientific literature, without requiring that the sequence be
  determined by the authors/submitters.

  In order to assure that the sequence annotation is of high quality,
  it is required that TPA records be associated with a study published
  in a peer-reviewed journal before the data is released to the public.

  TPA records include a mandatory 'PRIMARY' block, which documents the
  relationships between spans of the TPA sequence and the primary
  (non-TPA) sequences that contributed to it. The elements of the
  PRIMARY block are:

  a) TPA_SPAN             base span on the TPA sequence
  b) PRIMARY_IDENTIFIER   acc.version of the contributing sequence(s)
  c) PRIMARY_SPAN         base span on the contributing primary sequence
  d) COMP                 'c' indicates that the contributing sequence
                          originates from the complementary strand of
                          the primary sequence entry

  For example:

  TPA_SPAN       PRIMARY_IDENTIFIER     PRIMARY_SPAN     COMP
  1-426          AC004528.1             18665-19090
  427-526        AC001234.2             1-100            c
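
To make the PRIMARY block layout concrete, here is a minimal Python
sketch that splits the two example rows above into their four elements.
The names `PrimaryRow`, `parse_primary_row`, and `span_length` are
hypothetical, and the sketch assumes whitespace-separated columns with
an optional trailing 'c'. The span-length comparison is only an
illustrative sanity check for these two rows, not a documented rule for
TPA records.

```python
from typing import NamedTuple


class PrimaryRow(NamedTuple):
    tpa_start: int
    tpa_end: int
    primary_id: str       # acc.version of the contributing sequence
    primary_start: int
    primary_end: int
    complement: bool      # True when the COMP column holds 'c'


def parse_span(text):
    """Split a 'start-end' base span into two integers."""
    start, end = text.split("-")
    return int(start), int(end)


def parse_primary_row(line):
    """Parse one whitespace-separated PRIMARY block row (assumed layout)."""
    fields = line.split()
    if len(fields) not in (3, 4):
        raise ValueError("expected 3 or 4 columns: %r" % line)
    tpa_start, tpa_end = parse_span(fields[0])
    primary_start, primary_end = parse_span(fields[2])
    complement = len(fields) == 4 and fields[3] == "c"
    return PrimaryRow(tpa_start, tpa_end, fields[1],
                      primary_start, primary_end, complement)


def span_length(start, end):
    """Number of bases in an inclusive span."""
    return end - start + 1


# The two example rows from the announcement.
EXAMPLE = """\
1-426          AC004528.1             18665-19090
427-526        AC001234.2             1-100            c
"""

rows = [parse_primary_row(l) for l in EXAMPLE.splitlines() if l.strip()]
for row in rows:
    print(row.primary_id,
          span_length(row.tpa_start, row.tpa_end),
          span_length(row.primary_start, row.primary_end),
          "complement" if row.complement else "forward")
```

In this example the TPA span and the primary span of each row cover the
same number of bases (426 and 100), and the second row is flagged as
coming from the complementary strand.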

