Massively Parallel Applications in Sequence Analysis

A. Parsons mbpcr at s-crim1.dl.ac.uk
Thu Apr 1 09:31:46 EST 1993

In article <CMM. at maud.ifi.uio.no> Geir Egil Hauge <geirha at ifi.uio.no> writes:
>>    I should note also, since some readers of this group may be
>>interested, that I now have a version of our parallel "platform" for
>>sequence comparison (Deshpande, Richards, and Pearson (1991) CABIOS "A
>>platform for biological sequence comparison on parallel computers"
>>7:237-247) running on networks of workstations using PVM (parallel
>>virtual machine), a freely available package for almost any machine.
>>If you are doing lots of sequence comparisons, I can provide you with 
>>PVM versions for FASTA and Smith-Waterman, with BLAST to be available 
>>in about a month.
>My shareware package dtask v1.1s (no previous versions available) for running 
>UNIX workstations in parallel when comparing biological sequences, is to be 
>released in about 1 month (as soon as it is cleared by my supervisors). 
>A sequence comparison program using the Smith-Waterman-1981 algorithm is
>included in the package.
>I have tested the package on as many as 96 UNIX workstations in parallel.
>The speed was then measured to be 42 million matrix cell updates per second,
>using an 801-residue protein query sequence against Swiss-Prot #21. (It 
>took 151 seconds.) 
>The speedup was measured to be 32 against a Sun Sparc-10 station. This is 82% 
>of "perfect speedup". The speedup will be better on heavier jobs (longer query 
>sequences) and worse on lighter jobs (shorter query sequences). The speedup 
>will also be better when fewer than 96 machines are run in parallel.
>Among the 96 machines were machines like: SUN 3/50, SUN 3/60, Sparc-2,
>Sparc-10, DEC3100, DEC5000/200, and some SGI and HP machines running System V 
>derived UNIX systems. (In version 1.1s of dtask, BSD signals are needed on 
>System V systems.) The machines have to share a common file system such as
>NFS, and must be able to do UNIX socket(2)/AF_INET communication.
>The programs are built so that they can detect when a workstation is 
>heavily used by other users, and then stop using that machine for a 
>specified time before trying it again.
>I use indexfiles in such a way that the programs are quite independent of
>library format. Only the program that creates indexfiles has to be
>altered. A program for making indexfiles from Pearson/FASTA-format
>libraries is included.
>The package, containing complete C-source, documentation and tests, will 
>be available from anonymous ftp "ftp.ifi.uio.no" in about a month. 
>Geir Egil Hauge
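
For anyone wanting to check the quoted figures against each other, here is a rough back-of-envelope sketch in Python. It uses only the numbers given in the post above. The function names are made up for illustration, and it assumes that a Smith-Waterman search updates one matrix cell per (query residue, library residue) pair and that "82% of perfect speedup" means measured speedup divided by ideal speedup. The post does not define either term, so treat the derived values as estimates only.

```python
# Figures quoted in the post above (the rate is rounded, so results are approximate).
CELLS_PER_SECOND = 42e6      # matrix cell updates per second on 96 machines
ELAPSED_SECONDS = 151        # wall-clock time for the search
QUERY_LENGTH = 801           # residues in the protein query
SPEEDUP_VS_SPARC10 = 32      # measured speedup against one Sparc-10
FRACTION_OF_PERFECT = 0.82   # "82% of perfect speedup"


def total_cell_updates(rate, seconds):
    """Total matrix cells updated during the run."""
    return rate * seconds


def implied_library_residues(cells, query_length):
    """Library size implied by one cell per (query, library) residue pair."""
    return cells / query_length


def implied_perfect_speedup(speedup, fraction):
    """Ideal speedup implied by the measured speedup and its stated efficiency."""
    return speedup / fraction


def implied_single_machine_rate(rate, speedup):
    """Cell updates per second implied for the single reference machine."""
    return rate / speedup


if __name__ == "__main__":
    cells = total_cell_updates(CELLS_PER_SECOND, ELAPSED_SECONDS)
    print("total cell updates:        %.3e" % cells)
    print("implied library residues:  %.0f" % implied_library_residues(cells, QUERY_LENGTH))
    print("implied perfect speedup:   %.1f" % implied_perfect_speedup(SPEEDUP_VS_SPARC10, FRACTION_OF_PERFECT))
    print("implied Sparc-10 rate:     %.0f cells/s" % implied_single_machine_rate(CELLS_PER_SECOND, SPEEDUP_VS_SPARC10))
```

Under those assumptions the run works out to about 6.3e9 cell updates, a library of roughly 7.9 million residues, and an ideal speedup of about 39 Sparc-10 equivalents rather than 96. That would be consistent with many of the 96 machines being slower than a Sparc-10, though the post does not say so explicitly.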

This sounds like very impressive performance, and I don't want to sound
facetious, but HOW much would a network of 96 Un*x workstations cost?  And more
to the point - how much floor space would they take up?

Also - my understanding is that with symmetric (or even asymmetric)
multiprocessing, performance starts to degrade once a certain number of
processors have been added, due to interprocessor communication and
synchronisation.
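
As a rough illustration of that point, here is a toy model in Python. It is a sketch only: it assumes run time is a fixed serial part, plus parallel work divided evenly over p processors, plus a communication cost that grows linearly with p. The model and all parameter values are invented for illustration and are not measurements of any real system.

```python
def run_time(p, work=1000.0, serial=10.0, overhead_per_proc=0.1):
    """Toy run time on p processors (all parameters are illustrative assumptions).

    serial            -- time that cannot be parallelised
    work / p          -- parallel work shared evenly across p processors
    overhead_per_proc -- communication/synchronisation cost added per processor
    """
    return serial + work / p + overhead_per_proc * p


def speedup(p, work=1000.0, serial=10.0, overhead_per_proc=0.1):
    """Speedup relative to one processor running with no communication cost."""
    return (serial + work) / run_time(p, work, serial, overhead_per_proc)


def best_processor_count(max_p=1000):
    """Processor count (1..max_p) that gives the highest speedup in this model."""
    return max(range(1, max_p + 1), key=speedup)


if __name__ == "__main__":
    for p in (1, 10, 50, 100, 200, 500):
        print("p=%4d  speedup=%6.2f" % (p, speedup(p)))
    print("best p in this toy model:", best_processor_count())
```

In this model speedup rises, peaks where the per-processor communication cost balances the shrinking share of work (at p = sqrt(work / overhead_per_proc)), and then falls. Whether a real system behaves this way depends on how its communication cost actually scales, which the model does not attempt to capture.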

I suppose what I am saying is - if more is better - and parallelism is the
preferred paradigm - then massively parallel is surely the ultimate solution??

I heard Donald Lindberg give a lecture at the Royal Society in London on Monday
and this was very much the thinking behind his talk so presumably this is the
preferred route that the NCBI/NLM are going to take?

The question I originally asked also had the caveat (which no one to date has
commented on) "How much longer can we do WITHOUT data parallel solutions for
searching the masses of data being generated by the HGMP?"

As this thread is starting to die down I will summarise all responses soon.

Tony Parsons (mbpcr at seqnet.dl.ac.uk  AKA parsons_a at snd01.pcr.co.uk)
Pfizer Central Research - Sandwich, UK
