New BCM Gene-Finder service:
==========================================================================
TSSG and TSSW Recognition of human PolII promoter region and start of
transcription
==========================================================================
(TSS - Transcription Start Site)
Department of Cell Biology, Baylor College of Medicine
Analysis of uncharacterized human sequences is available
through the WWW:
http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
or by sending a file containing your sequence
(the sequence format is described below) to the email servers at the
University of Houston and, soon, at the Weizmann Institute of Science:
service@bchs.uh.edu or services@bioinformatics.weizmann.ac.il
Examples:
mail -s tssg service@bchs.uh.edu < test.seq
mail -s tssg services@bioinformatics.weizmann.ac.il < test.seq
where test.seq is a file containing the sequence.
The TSSW program is used in the same way.
METHOD DESCRIPTION (TSSG):
The algorithm predicts potential transcription start positions with a linear
discriminant function that combines characteristics describing functional motifs
and the oligonucleotide composition of these sites. TSSG uses the promoter.dat file
of selected factor binding sites (TFD, Ghosh, 1993), developed by Dan Prestridge,
to calculate the density of functional sites as in (J.Mol.Biol.,1995,249,923-932).
In addition to the parameters of Prestridge's method, we use the oligonucleotide
composition around the start of transcription, which increases the
accuracy of locating the TSS (transcription start site).
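The idea of scoring a candidate site with a linear discriminant function (LDF) can be sketched roughly as follows. This is only an illustration of the general technique: the feature names, weights and bias below are hypothetical and are not the coefficients used by TSSG; only the threshold value of 4.0 is taken from the example output further down.

```python
from dataclasses import dataclass


@dataclass
class CandidateFeatures:
    """Hypothetical feature values for one candidate TSS position."""
    motif_density: float  # density of factor binding sites near the site
    tata_score: float     # strength of a TATA box match, 0.0 if none
    oligo_score: float    # oligonucleotide composition score around the TSS


# Illustrative weights only (assumption); the real coefficients are not given here.
WEIGHTS = CandidateFeatures(motif_density=2.0, tata_score=1.5, oligo_score=3.0)
BIAS = -1.0


def ldf_score(features, weights=WEIGHTS, bias=BIAS):
    """Linear discriminant: weighted sum of the features plus a constant."""
    return (weights.motif_density * features.motif_density
            + weights.tata_score * features.tata_score
            + weights.oligo_score * features.oligo_score
            + bias)


def predict(candidates, threshold=4.0):
    """Return (position, score) for candidates scoring at or above the threshold."""
    scored = [(pos, ldf_score(f)) for pos, f in sorted(candidates.items())]
    return [(pos, score) for pos, score in scored if score >= threshold]


candidates = {
    100: CandidateFeatures(motif_density=1.0, tata_score=1.0, oligo_score=1.0),
    200: CandidateFeatures(motif_density=0.5, tata_score=0.0, oligo_score=0.5),
}
predicted = predict(candidates)
```

Raising the threshold in such a scheme trades sensitivity for fewer false positives, which is presumably why the threshold is reported in the program output.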
At a true promoter recognition level of approximately 50-55%,
the TSSG program gives about one false positive prediction per 5000 bp (this
accuracy is similar to that obtained on the test sequences with Prestridge's method).
We estimated the accuracy of TSS positioning on 10 test genes for which both
algorithms (ours and Prestridge's) found the promoter region:
Deviation of predicted TSS from the real TSS:
______________________________________________________________________
Method/deviation        |  5 b  | 50 b  | 150 b | mean of observed
                        |       |       |       | deviations
________________________|_______|_______|_______|_____________________
Prestridge's            |   0   |   3   |   7   | 81.2 bases
________________________|_______|_______|_______|_____________________
TSSG                    |   7   |   3   |   0   | 7.3 bases
________________________|_______|_______|_______|_____________________
METHOD DESCRIPTION (TSSW):
The algorithm predicts potential transcription start positions with a linear
discriminant function that combines characteristics describing functional motifs
and the oligonucleotide composition of these sites. TSSW uses a file
of selected factor binding sites from the currently maintained functional site
database of E. Wingender (J.of Biotechnology,1994, 35, 273-280).
In addition to the parameters of Prestridge's method (J.Mol.Biol.,1995,249,923-932),
we use several oligonucleotide composition characteristics around the start of
transcription and within the promoter region.
At a true promoter recognition level of approximately 50-55%,
the TSSW program gives about one false positive prediction per 4000 bp.
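As a rough worked example, the false positive rates quoted for TSSG (one per about 5000 bp) and TSSW (one per about 4000 bp) can be turned into an expected number of false predictions for a sequence of a given length. The sketch assumes that false positives scale linearly with sequence length, which the text does not state; the function name is made up for illustration.

```python
# Approximate rates quoted in the method descriptions (bp per false positive).
TSSG_BP_PER_FALSE_POSITIVE = 5000
TSSW_BP_PER_FALSE_POSITIVE = 4000


def expected_false_positives(sequence_length_bp, bp_per_false_positive):
    """Expected false positives, assuming they scale linearly with length."""
    return sequence_length_bp / bp_per_false_positive


length = 7637  # length of the HSCALCAC example sequence
tssg_expected = expected_false_positives(length, TSSG_BP_PER_FALSE_POSITIVE)
tssw_expected = expected_false_positives(length, TSSW_BP_PER_FALSE_POSITIVE)
print(f"TSSG: about {tssg_expected:.2f} false positives expected")
print(f"TSSW: about {tssw_expected:.2f} false positives expected")
```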
SUBMITTING SEQUENCES VIA EMAIL:
For email submission, sequences must be in the following format:
Name of your sequence
ccatctctgtcttgcaggacaatgccgtcttctgtctcgtggggcatcctcctgctggca
ggcctgtgctgcctggtccctgtctccctggctgaggatccccagggagatgctgcccag
aagacagatacatcccaccatgatcaggatcacccaaccttcaacaagatcacccccaac
ctggctgagttcgccttcagcctataccgccagctggcacaccagtccaacagcaccaat
atcttcttctccccagtgagcatcg...............
(Each line must be shorter than 80 characters.)
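A file in this format could be produced with a short script such as the one below. It is a minimal sketch: the function name and the default width of 60 characters per line (matching the sample above) are assumptions, and the name line is conservatively held to the same length limit as the sequence lines.

```python
def format_submission(name, sequence, width=60):
    """Return the submission text: a name line, then fixed-width sequence lines."""
    if not 0 < width < 80:
        raise ValueError("width must be between 1 and 79 characters")
    if len(name) >= 80:
        raise ValueError("name line must be shorter than 80 characters")
    # Drop any whitespace or line breaks already present in the sequence.
    cleaned = "".join(sequence.split())
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([name] + lines) + "\n"


text = format_submission("Name of your sequence", "acgt" * 40)
# To save it for mailing:
# with open("test.seq", "w") as handle:
#     handle.write(text)
```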
Send the file containing the sequence to:
service@theory.bchs.uh.edu
The subject line must be: tssg
EXAMPLE: mail -s tssg service@bchs.uh.edu < test.seq
TSSG output:
1st line - name of your sequence; 2nd and 3rd lines - the length of the
submitted sequence and the LDF threshold
4th line - the number of predicted promoter regions
Next lines - positions of predicted sites, their 'weights' and TATA box
position (if found)
The position is that of the first nucleotide of the transcript (TSS position).
The functional motifs are then listed for each predicted region;
(+) or (-) indicates the direct or complementary strand; S... is the
identifier of a particular motif in the Ghosh database.
EXAMPLE: (TSSG)
HSCALCAC 7637 bp DNA PRI 14-MAR-1995
Length of sequence- 7637
Threshold for LDF- 4.00
1 promoter(s) were predicted
Pos.: 1820 LDF- 16.65 TATA box predicted at 1804
Transcription factor binding sites:
for promoter at position - 1820
1764 (-) S00098 AACCAAT
1608 (-) S01152 AAGTGA
1741 (+) S01153 AARKGA
1608 (-) S01153 AARKGA
1657 (+) S01090 AATGA
1617 (-) S01027 ACGCCC
1577 (+) S00534 ACGTCA
1580 (-) S00534 ACGTCA
1580 (-) S01257 ACGTCAT
..............................
EXAMPLE: (TSSW)
HSCALCAC 7637 bp DNA PRI 14-MAR-1995
Length of sequence- 7637
Threshold for LDF- 4.00
2 promoter(s) were predicted
Pos.: 1834 LDF- 11.08 TATA box predicted at 1804
Pos.: 7031 LDF- 4.64 TATA box predicted at 7001
Transcription factor binding sites:
for promoter at position - 1834
1752 (+) CHICK$ACRA CCGCCC
1762 (-) HS$BAC_03 CCAAT
1764 (-) RAT$ALBU_2 AACCAAT
1757 (-) HS$APOE_08 GGGCGG
1575 (+) HS$ACHGON_ TGACGTCA
1582 (-) HS$ACHGON_ TGACGTCA
1758 (+) MOUSE$A21C ATTGG
1745 (+) MOUSE$A21C gcccagccctcccATTGGtggagacg
1609 (+) Y$CYC1_09 ctcatttggcgagcGTTGGt
1724 (+) AD$E2L_04 TGACgcA
1577 (+) AD$E4_16 ACGTCA
1580 (-) AD$E4_16 ACGTCA
1580 (-) AD$E4_18 ACGTCAT
1655 (+) HS$EGFR_15 TCAAT
..............................
HS$EGFR_15 etc. are the identifiers of particular motifs in the
Wingender database.
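Reports like the two examples above can be read back into a program with a few regular expressions. The sketch below is based only on the sample lines shown here, so the exact spacing and any lines not present in the samples are assumptions; the function and variable names are hypothetical.

```python
import re

# Patterns inferred from the sample reports; whitespace is matched loosely.
PROMOTER_RE = re.compile(
    r"Pos\.:\s*(\d+)\s+LDF-\s*([\d.]+)(?:\s+TATA box predicted at\s+(\d+))?"
)
REGION_RE = re.compile(r"for promoter at position\s*-\s*(\d+)")
MOTIF_RE = re.compile(r"^\s*(\d+)\s+\(([+-])\)\s+(\S+)\s+(\S+)\s*$")


def parse_report(text):
    """Return (promoters, motifs) extracted from a TSSG or TSSW report."""
    promoters, motifs = [], []
    current_region = None
    for line in text.splitlines():
        match = PROMOTER_RE.search(line)
        if match:
            tata = int(match.group(3)) if match.group(3) else None
            promoters.append({"pos": int(match.group(1)),
                              "ldf": float(match.group(2)),
                              "tata": tata})
            continue
        match = REGION_RE.search(line)
        if match:
            current_region = int(match.group(1))
            continue
        match = MOTIF_RE.match(line)
        if match:
            # (promoter position, motif position, strand, identifier, pattern)
            motifs.append((current_region, int(match.group(1)),
                           match.group(2), match.group(3), match.group(4)))
    return promoters, motifs


SAMPLE_REPORT = """HSCALCAC 7637 bp DNA PRI 14-MAR-1995
Length of sequence- 7637
Threshold for LDF- 4.00
2 promoter(s) were predicted
Pos.: 1834 LDF- 11.08 TATA box predicted at 1804
Pos.: 7031 LDF- 4.64 TATA box predicted at 7001
Transcription factor binding sites:
for promoter at position - 1834
1752 (+) CHICK$ACRA CCGCCC
1762 (-) HS$BAC_03 CCAAT
..............................
"""
promoters, motifs = parse_report(SAMPLE_REPORT)
```

The TATA box part of the pattern is optional because that position is reported only when a TATA box is found.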
Reference:
Solovyev V.V., Salamov A.A., Lawrence C.B. Recognition
of PolII promoter region and start of transcription position in human genes.
(1995) (in preparation).
Questions: solovyev@cmb.bcm.tmc.edu
===============================================================
The other services are
===============================================================
FGENEH - search for gene structure with exon assembly by dynamic programming
FEXH - search for 5'-, internal and 3'-exons
HEXON - search for internal exons
HSPL - search for splice sites
RNASPL - prediction of exon-exon junctions in cDNA sequences
CDSB - prediction of bacterial coding regions
HBR - recognition of human and bacterial sequences to test a library
for E. coli contamination by sequencing example clones
TSSG - recognition of human promoter regions (Ghosh/Prestridge motif data)
TSSW - recognition of human promoter regions (Wingender motif database)
POLYAH - recognition of the 3'-end cleavage and polyadenylation region
of human mRNA precursors
SSP - prediction of a-helix and b-strand in globular proteins
by segment-oriented approach.
NSSP - prediction of a-helix and b-strand segments in globular proteins
by nearest-neighbor algorithm.