Randall Smith rsmith%mbcrr at HARVARD.HARVARD.EDU
Wed Jul 25 16:52:16 EST 1990

Announcing Rel. 4.0 of the MBCRR's Protein Pattern Library 
               and Search Tool (PLSEARCH)

     The MBCRR Protein Pattern  Library  is  a  database  of
"consensus-like"  protein  sequence  patterns,  each pattern
derived from a set of homologous sequences in the SWISS-PROT
Protein  Sequence  Database.   Families  of  related protein
sequences are identified by running  the  entire  SWISS-PROT
database  against  itself  (using BLAST,  the NLM/NCBI's new
high-speed similarity search tool);  the  resulting  set  of
pair-wise  scores  is  then clustered into families using a
maximal-linkage clustering algorithm.  A  pattern  construc-
tion  algorithm  (Smith  and Smith 1990, PNAS 87:118-122) is
then used to generate a single pattern for each family;  the
patterns,  which  we  call  amino acid class covering (AACC)
patterns, are functionally equivalent  to  'regular  expres-
sion'  patterns and represent the conserved primary sequence
elements common to all members of  each  family.   This  new
release of the pattern library (based on SWISS-PROT rel. 13)
contains 5199 entries: 2026 patterns derived from all  fami-
lies  of  2 or more members (encompassing 10664 of the 13837
sequences in SWISS-PROT rel. 13)  plus  the  remaining  3173
"non-related"  sequences  (i.e. from those loci that did not
cluster into any family).
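
     As a rough illustration of what a "regular expression"-like
covering pattern looks like, the following Python sketch builds one
from a small, gap-free toy alignment.  It is not the AACC
construction algorithm of Smith and Smith (1990): the amino acid
classes in CLASSES, the function name covering_pattern, and the
three toy sequences are all assumptions made for this example, and
real families would also need gap handling.

```python
import re

# Hypothetical amino acid classes, for illustration only; these are
# NOT the class hierarchy published by Smith and Smith (1990).
CLASSES = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ", "AG"]


def covering_pattern(aligned):
    """Build a regex that covers every sequence in a gap-free alignment."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must have equal length")
    parts = []
    for column in zip(*aligned):
        residues = set(column)
        if len(residues) == 1:
            # Fully conserved position: keep the residue itself.
            parts.append(column[0])
            continue
        # First listed class that contains every residue in the column.
        cls = next((c for c in CLASSES if residues <= set(c)), None)
        # No covering class: fall back to a wildcard.
        parts.append("[" + cls + "]" if cls else ".")
    return "".join(parts)


# Toy family (made-up sequences, not taken from SWISS-PROT).
family = ["MKVLDS", "MRILET", "MKLLDA"]
pattern = covering_pattern(family)
print(pattern)
print(all(re.fullmatch(pattern, s) for s in family))
```

By construction the pattern matches every sequence it was built
from; it can also match sequences outside the toy family that share
the same conserved elements.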

     The  MBCRR  distributes  the  pattern  library  with  a
dynamic  programming-based search tool (PLSEARCH) for match-
ing and aligning newly generated protein  sequences  against
the  pattern database.  We have shown that covering patterns
can be more diagnostic for family membership than any of the
individual  sequences used to construct a pattern (see Smith
and Smith, 1990); thus pattern searches can be more sensitive
than traditional sequence vs. sequence
database search tools.
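
     The idea of matching a sequence against a pattern by dynamic
programming can be sketched in Python as below.  This is a generic
local-alignment recurrence, not PLSEARCH's actual algorithm: the
function name pattern_align_score, the representation of a pattern
as a list of residue sets (None for a wildcard), and the match,
mismatch, and gap values are all assumptions chosen for the example.

```python
def pattern_align_score(pattern, seq, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of seq against a pattern.

    Each pattern element is a set of allowed residues, or None for a
    wildcard position that accepts any residue.  Scores are
    illustrative values, not those used by PLSEARCH.
    """
    prev = [0] * (len(seq) + 1)  # scores for the previous pattern row
    best = 0
    for element in pattern:
        cur = [0]
        for j, residue in enumerate(seq, start=1):
            ok = element is None or residue in element
            diag = prev[j - 1] + (match if ok else mismatch)
            # Local alignment: never drop below zero; gaps may open in
            # either the pattern (prev[j]) or the sequence (cur[j-1]).
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


# Toy pattern, equivalent to the regex M[KRH][ILVM]L[DE].
toy_pattern = [{"M"}, set("KRH"), set("ILVM"), {"L"}, set("DE"), None]
hit = pattern_align_score(toy_pattern, "AAMRVLEGAA")
miss = pattern_align_score(toy_pattern, "GGGGGG")
print(hit, miss)
```

A sequence containing the conserved elements scores much higher
than an unrelated one, which is the basis for ranking a query
against every entry in a pattern library.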

     Also included in the package is our new  multi-sequence
alignment  program  (PIMA: Pattern-Induced Multi-Alignment).
This program is now being used routinely by the Human Retro-
virus  and  AIDS  Sequence  Database Group (Los Alamos Natl.
Labs) to multi-align HIV protein sequences for  phylogenetic

     PLSEARCH is written in 'C' and can run under both  Unix
and  VMS  operating systems; PIMA employs Unix shell scripts
and thus is currently a Unix-only implementation.

     The entire package is available electronically  and  is
free of charge to non-profit organizations (commercial users
must arrange payment of a distribution fee).  Copies can  be
obtained:

1) directly from the MBCRR via INTERNET anonymous ftp:
   mbcrr.harvard.edu;  the  package  is  in  a
   single compressed tar file in the 'plsearch' subdirectory.

2) by electronic mail from the Univ. of Houston Genbank-
   Server: genbank-server at bchs.uh.edu (INTERNET) or
   genbank-server%bchs.uh.edu at cunyvm  (BITNET/EARN).
   Send a mail message containing the line "SEND UNIX  HELP"
   to  start;  the  files are in the Unix area and are uuen-
   coded, compressed text files of approximately 300K  each.
   The  package  is  also  available  in  the  same form via
   anonymous   FTP   to   lavaca.uh.edu,   in
   ~ftp/pub/genbank-server/Unix,   as   plsrchaa,  plsrchab,
   plsrchac, etc.

or 3) by electronic mail via the EMBL File Server:
   send the message "HELP SOFTWARE"  to  netserv at embl.bitnet
   to obtain specifics on retrieving the files.

When using anonymous FTP or e-mail, be sure to transfer files
during off-hours (after 5 PM, the remote machine's local
time); when e-mailing, request only a few files at a time to
avoid filling up your mail spool area or mailbox.

Randall Smith and Temple Smith
Molecular Biology Computer Research Resource,
Galleria Level 1
Dana-Farber Cancer Institute and School of Public Health
Harvard University
44 Binney St., Boston MA 02115 USA
INTERNET: rsmith at mbcrr.harvard.edu
BITNET: rsmith%mbcrr at husc6.bitnet

