improvement suggestions

etzold etzold at ebi.ac.uk
Mon Jul 28 08:49:28 EST 1997

Don Gilbert wrote:
> Based on using SRS for Genbank and related sequence data at IUBio and for
> the Drosophila data of quite a variety, I have several suggestions
>  -- Sequence output format
>     -- default should always be the native format, untouched by SRS (current
>       return of genbank data is a bogus format that can't be interpreted well,
>       it is missing the ORIGIN line, the sequence data is in EMBL not GENBANK
>       style; maybe part of this is icarus indexing mistakes).
>     -- offer GENBANK and PIR/CODATA output formats as primary standard
>       sequence formats

we are working on that at the moment - currently it is not possible to
return the native format - the sequence is always first converted into an
internal structure and then converted back

>  -- Query symbol neutrality
>     The symbols that SRS now requires in queries for operations and parsing
>     clash with symbols used above (in unix and http command strings) and below
>     (in biological data).  Especially because of the latter, it is difficult
>     to use escape characters to do the kinds of queries needed.

that can be hard in some cases, but i think the most problematic are the
logical and link operators ... see below

>     There should be query-time switches for getz, wgetz and such that let the
>     caller set symbols used for query parsing, including &|![]={}-.
>     At the least offer query-time symbol swapping, so that any single parsing
>     symbol can be changed to another in meaning.  The high ascii set would
>     make a good option.  But it would also be nice to allow strings, such
>     as _AND_ for &, _OR_ for |, _OPEN_PHRASE_ for [, _CLOSE_PHRASE_ for ],
>     etc. in queries.

that is quite possible, i even thought of using AND OR BUTNOT and LINKL
and LINKR for & | ! < > respectively. changing that is quite easy since
the query language is parsed by icarus - have a look at
"SRSICA/srsquery.i". I will try to include the operators for the next
release
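
As an illustration of the word-operator idea, here is a minimal sketch in Python (not Icarus) of a preprocessing step that rewrites AND, OR, BUTNOT, LINKL and LINKR to & | ! < > before a query reaches the parser. The function name, the whitespace tokenization and the `embl-des`/`embl-org` field names are assumptions made for the example; the actual change would be made in the Icarus grammar in "SRSICA/srsquery.i".

```python
import re

# Word operators and the symbols they stand for, as listed in the message.
WORD_OPERATORS = {
    "AND": "&",
    "OR": "|",
    "BUTNOT": "!",
    "LINKL": "<",
    "LINKR": ">",
}

# Matches one bracketed search term such as [embl-des:kinase].
_BRACKETED = re.compile(r"(\[[^\]]*\])")


def translate_operators(query: str) -> str:
    """Rewrite word operators to symbols, leaving bracketed terms untouched."""
    pieces = []
    for part in _BRACKETED.split(query):
        if part.startswith("["):
            # A search term: keep it verbatim, even if it contains "AND".
            pieces.append(part)
        else:
            tokens = (WORD_OPERATORS.get(tok, tok) for tok in part.split())
            pieces.append(" ".join(tokens))
    return " ".join(p for p in pieces if p)


print(translate_operators("[embl-des:kinase] AND [embl-org:human]"))
```

Because only text outside the square brackets is rewritten, a search for the literal word AND inside a term is not affected.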

>  -- Case sensitive searches
>     This should be available as a query-time, user choice for any field.
>     Perhaps there should be an index-time switch that will say if a field
>     has case sensitive potential, if it is compute expensive at query-time.

ok, that should be useful in some cases and is definitely possible

>  -- Index numeric ranges
>     For example, a map range such as "123-456" should be indexed so it
>     can be queried as a numeric range. Query such as 124, 234, 345 should
>     all match such a range. Several ranges per field must be possible.
>     In WAIS, we just stored the text string of such a field, and did a numeric
>     range test at query time.

you can do that even now ... but it may look a little strange.

index (in a 'range' index) exactly two numbers for each entry, or none:
the lowbound and the highbound, e.g. 50 and 200.

to find all the ranges that include 124, do

'[db-range#:124] & [db-range#124:]'

the assumption is that to get a match an entry needs one hit among the
numbers less than or equal to the query value and one among the numbers
greater than or equal to it - the intersection with & makes sure you have
both. if one of the boundaries is exactly 124 this still works, whether
it is the high or the low bound
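
The logic of the query above can be checked with a small simulation. This is a sketch in Python, not SRS code: the entry names and bounds are made up, and the two helper functions only mimic what `[db-range#:124]` and `[db-range#124:]` are described as returning.

```python
from typing import Dict, Set, Tuple

# Hypothetical entries, each indexed with exactly two numbers (low, high).
RANGES: Dict[str, Tuple[int, int]] = {
    "entry1": (50, 200),
    "entry2": (124, 300),
    "entry3": (10, 100),
    "entry4": (130, 400),
}


def at_most(value: int) -> Set[str]:
    """Entries with some indexed number <= value, like [db-range#:value]."""
    return {e for e, bounds in RANGES.items() if any(n <= value for n in bounds)}


def at_least(value: int) -> Set[str]:
    """Entries with some indexed number >= value, like [db-range#value:]."""
    return {e for e, bounds in RANGES.items() if any(n >= value for n in bounds)}


def ranges_containing(value: int) -> Set[str]:
    """Intersect both hit sets, as the & operator does in the query."""
    return at_most(value) & at_least(value)


print(sorted(ranges_containing(124)))
```

Here entry3 (10 to 100) has no number at or above 124 and entry4 (130 to 400) has none at or below it, so only entry1 and entry2 remain. The trick relies on there being exactly two numbers per entry: with several ranges in one entry, a low number from one range and a high number from another would also produce a match.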

>  -- Cache query results and use that for quick lookups of next page data.
>     wgetz, and other srs query drivers, offer a page of results for a given
>     query, plus additional page links.  These additional page links redo
>     the same query at a sometimes large cpu cost.  It would be nice to have
>     the full match set for each query cached (for maybe an hour, in SRSTMP:)
>     and used to serve multipage requests of same query.

ok, feasible - could make this conditional, caching only result sets up
to a certain size
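
A minimal sketch of such a cache, in Python rather than inside wgetz: the full hit list of a query is kept for a limited time and only when it is below a size limit, and later page requests are served from it. The class name, the one-hour default, the size limit and the `run_query` callback are all assumptions for the example, not part of SRS.

```python
import time
from typing import Callable, Dict, List, Tuple


class QueryResultCache:
    """Keeps full hit lists per query string, with a time limit and a size cap."""

    def __init__(self, ttl_seconds: float = 3600.0, max_set_size: int = 10000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._max_set_size = max_set_size
        self._clock = clock
        # query string -> (time stored, full list of hit ids)
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    def get_page(self, query: str, run_query: Callable[[str], List[str]],
                 page: int, page_size: int = 30) -> List[str]:
        now = self._clock()
        cached = self._store.get(query)
        if cached is not None and now - cached[0] < self._ttl:
            hits = cached[1]  # still fresh: no need to redo the query
        else:
            hits = run_query(query)
            if len(hits) <= self._max_set_size:
                self._store[query] = (now, hits)
            else:
                # Too large to keep: drop any stale copy and do not cache.
                self._store.pop(query, None)
        start = page * page_size
        return hits[start:start + page_size]
```

With this arrangement the first page request pays the full query cost and the "next page" links are served from the stored set until it expires; very large sets are simply recomputed each time.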

>  -- Relevance ranking
>     Allow fields to store word counts per record in indexes,
>     and use these counts for one form of relevance calculation.  Relevance
>     ranking can markedly improve the usability of query results, where those
>     with the most query words (or however defined as most relevant) are sorted
>     to the top of the results list.  Relevance ranking has been standard in
>     WAIS and related text indexing.

relevance ranking could be done by storing a word count in the index,
which is fine with simple queries searching for a single word. however,
the problem is with complex queries - each query part contributes a
score, so how do you compute the overall score, and how do you deal with
'but not' queries?
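
The simple case can be sketched as follows (Python, with a made-up index of word counts per entry; none of the names come from SRS). It covers only single-word queries and deliberately leaves open the question of how scores from several query parts should be combined.

```python
from typing import Dict, List, Tuple

# Hypothetical index: word -> {entry id: number of occurrences in that entry}.
WORD_COUNTS: Dict[str, Dict[str, int]] = {
    "kinase": {"entry1": 3, "entry2": 1, "entry3": 5},
    "human": {"entry2": 2, "entry3": 1},
}


def rank_single_word(word: str) -> List[Tuple[str, int]]:
    """Return (entry, count) pairs, highest count first, ties by entry id."""
    counts = WORD_COUNTS.get(word, {})
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(rank_single_word("kinase"))
```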

>  -- Lists of words to ignore in indexing
>     Use lists/files of common words to ignore at indexing (a, and, the, ...).
>     Let the icarus parsing script read such a list from common file/data and
>     apply to storing indices from any particular fields.  Maybe we
>     can do this now in the rich icarus; if so an example would be nice.

if you don't have too many you could include them into the .is file

$stoplist={a:1 and:1 the:1 ...}

#the 'ones' are arbitrary ...there has to be a value after the name

indexword:  .... {if:$stoplist.$Ct==0 $Wrt}

# $stoplist.$Ct is by default 0 if not defined
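
For readers who do not know Icarus, the same logic can be written out in Python. This is only an analogue of the snippet above, not a translation that SRS can use: a word is written to the index only when its lookup in the stop list gives the default value 0. The function name and the sample words are assumptions for the example.

```python
from typing import Dict, Iterable, List

# Stop list as in the .is example: the value 1 is arbitrary, only presence matters.
STOPLIST: Dict[str, int] = {"a": 1, "and": 1, "the": 1}


def index_words(words: Iterable[str]) -> List[str]:
    """Keep only words whose stop list lookup falls back to the default 0."""
    return [w for w in words if STOPLIST.get(w, 0) == 0]


print(index_words(["the", "kinase", "and", "a", "receptor"]))
```

Like the lookup it mirrors, this comparison is exact, so "The" with a capital letter would not be filtered unless the words are lowercased first.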

thanks for your suggestions. I think most of them will be implemented at
some stage. The only problematic one is the ranking

