>From: schwartz at latour.cs.colorado.edu (Mike Schwartz)
>Subject: Essence prototype announcement
>Message-ID: <1993Jan13.225150.12585 at colorado.edu>
>Sender: schwartz at cs.colorado.edu
>Nntp-Posting-Host: latour.cs.colorado.edu
>Organization: University of Colorado, Boulder
>Date: Wed, 13 Jan 1993 22:51:50 GMT
Essence is a resource discovery system that exploits file semantics to
index both textual and binary files. Essence generates summaries that
can be used to browse files before retrieving them across slow network
links, as well as space efficient indexes. Essence understands nested
file structures (such as uuencoded, compressed, "tar" files), and
recursively unravels such files to generate summaries for them. These
features allow Essence to be used in a number of useful settings, such
as anonymous FTP archives. The prototype generates WAIS-compatible
indexes, allowing WAIS users to take advantage of the Essence indexing
WAIS users can try Essence using the ".src" file enclosed below. This file
also describes where to get the prototype source code and a paper about this
Darren Hardy and Mike Schwartz
Dept. of Computer Science
Univ. of Colorado - Boulder
:maintainer "hardy at cs.colorado.edu"
"You can use this WAIS server to search and retrieve files from the
anonymous ftp archive on ftp.cs.colorado.edu [184.108.40.206]. We
used Essence, a resource discovery system based on semantic file
indexing, to build the WAIS index for this server. As explained below,
Essence currently only allows the retrieval of file summaries through
WAIS. To retrieve entire files, use anonymous ftp on ftp.cs.colorado.edu.
Essence exploits file semantics to index both textual and binary
files. By exploiting semantics, Essence extracts keywords that
summarize a file, and generates a compact yet representative index.
Essence understands nested file structures (such as uuencoded,
compressed, ``tar'' files), and recursively unravels such files to
generate summaries for them. Essence generates indexes that are ten
times smaller than WAIS indexes, but retain the fine-grained
information access that WAIS's full-text indexes provide.
Furthermore, Essence generates WAIS-compatible indexes, allowing WAIS
users to make use of Essence's indexing capabilities. This is one of
the ways that the Networked Resource Discovery Project at the
University of Colorado has extended the types of information that
WAIS can handle.
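The recursive unraveling of nested files described above can be
illustrated with a short Python sketch. This is not the Essence code: it
is a minimal illustration under stated assumptions. gzip stands in for
"compress" (.Z) and uuencoded layers, which are not handled here, and the
function name unravel is hypothetical. Leaf files are named in the same
"core filename, filename" form that the headlines use.

```python
import gzip
import io
import tarfile
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def unravel(name, data):
    """Yield (label, leaf_bytes) for every plain file nested inside data.

    Assumption: only gzip and tar layers are recognized; anything else
    is treated as a plain (leaf) file.
    """
    if data[:2] == GZIP_MAGIC:
        try:
            payload = gzip.decompress(data)
        except (OSError, EOFError, zlib.error):
            # Looked like gzip but was not; treat it as a plain file.
            yield name, data
            return
        # Strip the compression layer and look inside the payload.
        yield from unravel(name, payload)
        return

    try:
        archive = tarfile.open(fileobj=io.BytesIO(data), mode="r:")
    except tarfile.TarError:
        # Not a tar archive: this is a leaf file ready to be summarized.
        yield name, data
        return

    with archive:
        for member in archive:
            if member.isfile():
                payload = archive.extractfile(member).read()
                # Recurse, since a member may itself be a nested archive.
                yield from unravel(f"{name} {member.name}", payload)
```

Each leaf yielded by this generator would then be handed to a summarizer
for its file type, so an archive is summarized member by member.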
If you would like to learn more about Essence, you can obtain the
source to the Essence prototype and a paper which appears in the 1993
Winter USENIX Technical Conference, San Diego, CA, January 1993,
pp. 361-374. Both the paper and the prototype are available via
anonymous ftp from ftp.cs.colorado.edu in /pub/cs/distribs/essence.
Or search for the keyword 'Essence' using this WAIS server to find all
of the files on ftp.cs.colorado.edu that are related to Essence; you
will find the files for both the paper and the prototype.
This WAIS server was created in December 1992 by Darren R. Hardy and
Michael F. Schwartz as part of the Networked Resource Discovery
Project. You may reach them at the Department of Computer Science,
University of Colorado, Boulder, CO 80309-0430, or via email at
hardy at cs.colorado.edu and schwartz at cs.colorado.edu.
Below is some more information about the WAIS interface to Essence.
Essence exports its indexes through WAIS's search and
retrieval interface, allowing users to use tools such as
waissearch and the X Windows-based graphical user interface
xwais. In order to generate WAIS-compatible indexes,
Essence uses WAIS's indexing software to index the Essence
summary files. This mechanism generates full-text WAIS
indexes from the Essence summary files.
We modified the WAIS indexing mechanism to understand the
format of the Essence summary files, so that it generates
meaningful WAIS headlines. These headlines provide users
with a short description of a single file, usually a
filename. With Essence, headlines represent a file's core
filename, its actual filename, and its file type.
To support additional file types, WAIS must be recompiled
with new procedures that understand these file types. With
Essence, one need only write a new summarizer, add its name
to a configuration file, and add new heuristics for
identifying the file type; no recompilation is necessary.
In this sense, Essence modularizes the typed-file indexing
extensions that WAIS can use, because it removes the
keyword extraction process from WAIS and places it instead
in Essence. Essence is better suited to incorporating new
file types, and can be quickly adapted to become a
comprehensive indexing system.
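The three steps named above (write a summarizer, register its name, add
an identification heuristic) can be sketched in Python as follows. This
is a hypothetical illustration, not the Essence configuration format:
the table names SUMMARIZERS and TYPE_HEURISTICS, the two example
summarizers, and the filename-based heuristics are all assumptions made
for the example.

```python
import re


def summarize_c(text):
    """Hypothetical C summarizer: keep identifiers that precede '('."""
    return sorted(set(re.findall(r"\b([A-Za-z_]\w*)\s*\(", text)))


def summarize_text(text):
    """Hypothetical text summarizer: keep words of four or more letters."""
    return sorted({word.lower() for word in re.findall(r"[A-Za-z]{4,}", text)})


# Stand-in for the configuration file: file type name -> summarizer.
SUMMARIZERS = {
    "C": summarize_c,
    "README": summarize_text,
    "Text": summarize_text,
}

# Stand-in for the type heuristics: the first matching pattern wins.
TYPE_HEURISTICS = [
    (re.compile(r"\.[ch]$"), "C"),
    (re.compile(r"(^|/)README$"), "README"),
]


def identify(filename):
    """Guess a file type from the filename, defaulting to plain text."""
    for pattern, file_type in TYPE_HEURISTICS:
        if pattern.search(filename):
            return file_type
    return "Text"


def summarize(filename, text):
    """Return (file_type, keywords) using the registered summarizer."""
    file_type = identify(filename)
    return file_type, SUMMARIZERS[file_type](text)
```

In this sketch, supporting a new file type means adding one function and
one entry to each table; the dispatch code in summarize is unchanged.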
The following waissearch output shows an example search of
an index generated by Essence of the ftp.cs.colorado.edu
anonymous FTP file system. It shows an ordered list of the
ten files that best match the keyword netfind. Netfind is
an Internet user directory service. The headlines have up
to three fields representing the matching file: the core
filename, the filename (if different from the core
filename), and the file type.
csh% waissearch netfind
3: /cs/ftp/distribs/netfind/netfind3.10.tar.Z ServerShell/nsh.c C
4: /cs/ftp/distribs/netfind/README README
5: /cs/ftp/distribs/netfind/netfind3.10.tar.Z README README
6: /cs/ftp/distribs/netfind/netfind3.10.tar.Z Doc/netfind.1 ManPage
Consider the effectiveness of the example search shown
above. The best match is a PostScript paper that discusses
a number of techniques for distributed information systems,
with particular emphasis on techniques demonstrated by
Netfind; the second match is the same file, but found in
the compressed tar distribution ALL.PS.tar.Z. The third
match is the C source code for the interactive user
interface to Netfind. The fourth match is the README file
found in the Netfind distribution directory; the fifth
match is the same file, but found in the compressed tar
distribution netfind3.10.tar.Z. The sixth match is the
UNIX manual page for Netfind. The remaining matches are
PostScript papers in which Netfind is discussed.
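The headline layout described above (core filename, optional filename,
file type) can be split into fields with a small Python sketch. It
assumes the abbreviated "rank: fields" form shown in the listing and
that filenames contain no spaces; real waissearch output carries further
columns that are not modeled here. The names Headline and parse_headline
are hypothetical.

```python
from typing import NamedTuple, Optional


class Headline(NamedTuple):
    rank: int
    core_filename: str
    filename: Optional[str]  # None when it equals the core filename
    file_type: str


def parse_headline(line):
    """Parse one 'rank: core [filename] type' line into a Headline."""
    rank_text, rest = line.split(":", 1)
    fields = rest.split()
    if len(fields) == 2:
        # Plain file: only the core filename and the file type appear.
        core, file_type = fields
        nested = None
    elif len(fields) == 3:
        # Nested file: the middle field names the file inside the archive.
        core, nested, file_type = fields
    else:
        raise ValueError(f"unexpected headline: {line!r}")
    return Headline(int(rank_text), core, nested, file_type)
```

For example, parsing the third headline in the listing gives the core
filename of the tar distribution, the nested filename ServerShell/nsh.c,
and the file type C.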
In WAIS, a user retrieves files by selecting a matching
headline. With Essence, if the headline represents a file
hidden within a nested file (such as the first headline in the
example), the summary file is retrieved, instead of retrieving
the hidden file itself. If the headline represents a plain
file (such as the fourth headline in the example), the summary
file is also retrieved. This functionality requires allocating
storage for both the required summary files and the index.
However, it allows users to browse through remote file systems
by retrieving and viewing small summary files without having to
retrieve complete files. This is useful when trying to decide
whether to transfer large files across a slow network.