LookSee Introduction


This note gives an introduction to the FrontBase LookSee full text indexing and search facilities implemented by LookSee version 2.4.4 in FrontBase version 4. LookSee version 2.4.4 first appeared in FrontBase version 4.2.7.


Version 2.4 introduces support for multi-column indices, i.e. indices defined over more than one column of a table. In version 2.4.2, support for changing the structure of tables (eg, dropping a column) with LookSee indices is improved. Version 2.4.3 offers improved handling of XML text. Version 2.4.4 introduces the chars argument, supporting user specification of the composition of LookSee index words. Version 2.4.4 also allocates a separate string cache for each LookSee index and offers support for specifying the size of this string cache. The string cache holds index words longer than 11 characters.

Version 2.3 introduces support for Collations, controlling comparison of index words. Version 2.3.2 adds the handling of disk zone specification in index creation and update

Version 2.2 completes support for regular expressions in LookSee index queries, including full-fledged support for Unicode characters. Furthermore support for wildcards in LookSee index queries is introduced.

Version 2.1 is the first implementation for FrontBase version 4. The underlying implementation of a LookSee index is replaced, resulting in a substantial improvement in the performance of index queries.


Basic Principles

LookSee is an optional part of the FrontBase system. It allows creation and searching of full text indexes. A full text index summarises information about words in documents found in particular columns of a database table, either directly as SQL string values (possibly CLOBs), or as the names of documents (files) to be found externally, outside the database.

Once set up, a full text index will track evolution of the database: document insertions, deletes and updates will be treated on the fly.

An index may be searched as part of a database query; there is a seamless integration between normal database query functionality and text index query functionality. In particular, full text searching blends smoothly with document-related queries based on (say) regular database table columns such as information about document dates, authors, etc.

Generally, in FrontBase, text (or strings) is handled as Unicode characters, and comparison of text parts may be controlled by Collations, which define the ordering of Unicode characters. LookSee is no exception to this, which means that index words are sequences of Unicode characters, and Collations may be specified to direct comparison of index words.

HTML and XML documents may be indexed according to tagging: it is possible to specify (nested patterns of) tags to include or exclude from indexing. It is also possible to force or avoid indexing of words found in (selected) tag attributes.

A full text index basically records which words are contained in what documents, and searches of documents correspondingly talk about words and phrases (sequences of words) which must, may or may not occur in documents. LookSee is oriented towards and optimized for "exact word match", but also supports "word pattern match" in the form of wildcard notations and regular expressions. Finally, it is possible to base indexing and queries on word stems (look is the stem of all of the words look, looks and looked, for example).

LookSee indices often become fairly large. Every word occurrence in a document in a table has a representation in the underlying data-structure, and the maintenance of LookSee indices during table updates may generate quite a lot of traffic in this data-structure. FrontBase therefore offers the possibility to be able to control the exact whereabouts on disk of a LookSee index. See also the description of Storage Management in section 5.8 of the Users Guide.

Example 1

Imagine a database having a table containing abstracts of E-mails: sender, receiver, message body and timestamp. The table could have been realized by the following SQL statement:

CREATE TABLE memo_mails (
   sender   VARCHAR(1000),
   receiver VARCHAR(1000),
   body     VARCHAR(1000000),
   dateSent TIMESTAMP
);

In order to create a full text index for such a table we could issue the following SQL statement:

ALTER TABLE memo_mails ADD INDEX quickx ON body;

If, at this point, the table memo_mails already contains E-mails, those will be indexed. E-mails added later will be indexed on the fly.

Using the index we can now formulate queries such as "Who received mails from QA management (prior to June) talking about the Periscope product and invoices?"

SELECT
   receiver FROM memo_mails 
WHERE 
   sender = 'QA' AND 
   CAST(dateSent AS DATE) < DATE '2001-06-01' AND
   SATISFIES(quickx, 'Periscope & invoice');

The example suggests some observations:

Example 2

LookSee generally assumes that documents contain simple text, from which words to be indexed can be found by simple means, but may be directed to think otherwise. In particular, XML, XHTML and HTML text is supported. For example, if column doc of table docTable contains XML text, a LookSee index may be created as follows:

CREATE TABLE docTable(doc VARCHAR(1000000));
ALTER TABLE docTable ADD INDEX xindex ON doc TYPE 'xml'; 

LookSee will now investigate the tagging structure of the documents and index (only) "body text" ie., tag names and attributes are not treated as indexable material. Also, indexing may be restricted to include or skip only certain (nested) "paths" of tags:

ALTER TABLE docTable ADD INDEX xindex('only="section",skip="note"') ON doc TYPE 'xml';

Such a request would direct LookSee to only pick words embedded somewhere inside section tags.

<section>
   ... 
</section>

but would avoid (inner) material, nonetheless, embedded in <note>...</note> tags.

Documents contained directly in the database as SQL string values are, as all string values, treated as Unicode character sequences in FrontBase. This does not imply, though, that national and other special characters are represented directly as Unicode characters. Indeed, an XML document may, as an SQL string in FrontBase, well consist entirely of ASCII characters and yet embody national characters by way of for example XML's escape notation. Unlike other FrontBase (server) features, LookSee cannot avoid a logical interpretation of data (documents, in this case). The TYPE feature is used for this purpose, as just illustrated. For flexibility, it is possible to state the "document type" per document, via a (string valued) column:

CREATE TABLE docTable(anyDoc VARCHAR(1000000), docType VARCHAR(100));
ALTER TABLE docs ADD INDEX xindex ON anyDoc TYPE IN docType;

The type may be, for example, 'text' or 'html'.

LookSee Queries

LookSee queries may range from simple word matching and phrase matching to more elaborate queries. The model to keep in mind is that of the popular and well-known Google advanced search form, looking somewhat like the following:

with all of the words
with the exact phrase
with any of the words
without the words

A LookSee query is expressed as an SQL string, as an argument to SATISFIES. To allow complex queries, LookSee has a syntax for such, in the form of query combinators:

&       and
|       or
~       not
NOT     not (alternative syntax)
(...)   grouping

Thus one could write:

SATISFIES(index, 'foo & bar & ~ (boo | hiss)')

to express the same as shown in the Google-like form above. One may also use a regular expression to search for, say, documents containing words starting with the prefix "photo":

SATISFIES(index, '"^photo"') 

The Gory Details

Generating Indexes

A LookSee full text index is generated by an <alter-table> command of the following form:

<alter-table> ::= 
   ALTER TABLE <table> ADD INDEX <index-name> [ ('<method>') ] ON <doc-list> [ <collation> ] <arguments>;
<method> ::=
   [ <method-id> : ] [ <method-args> ]
<doc-list> ::=
   <doc> [ , <doc-list> ]
<doc> ::=
   [ EXTERNAL ] <column-id> [ <doc-type> ]
<doc-type> ::=
     TYPE IN <column-id>
   | TYPE <type-expr> 
<collation> ::=
     COLLATE <qualified-name>
<arguments> ::=
   [ STOP WORDS IN <origin> ]
   [ CACHE ( <min> , <max> , <percent> ) ]
   [ CHAR CACHE ( <min> , <max> , <percent> ) ]
   [ BLOCK SIZE <expr> ]
   [ DISK ZONE <diskzone-name>|DEFAULT ]
<origin> ::=
     <row-value-constructor>
   | <qualified-name>

The type of a LookSee target column must be (VAR)CHAR or CLOB.

<doc> states where to find the document text; without the EXTERNAL keyword, the column <column-id> holds the document text directly as an SQL string value, whereas with the EXTERNAL keyword, the column (which must still be an SQL string type) holds the name of an external source in which the document text is to be found 4.

The <doc-type> determines how to interpret a document being indexed, by evaluating to one of these character values: text, html, xhtml or xml. If a TYPE IN is specified, the specified column must contain one of these values and if TYPE is specified, the <type-expr> must evaluate to one of these values. If no <doc-type> is given, text is assumed. The html, xhtml and xml types are similar to each other; for these, LookSee will avoid indexing on tag element names, will interpret [X][HT]ML character escape sequences correctly, and so on. External documents will be assumed to be ISO 8859-1 encoded.

A LookSee index may be associated with exactly zero or one Collation(s): If <collation>, then the Collation denoted by the <qualified-name>, otherwise, if there is only one <doc> in the <doc-list>, then the Collation associated with that column, if any, otherwise, if all <doc>s in the <doc-list> are associated with the same Collation, then that Collation, otherwise the Collation defined by Information_Schema.CaseInsensitive, if present (unless -casefold is specified, see below).

<arguments> makes it possible to specify certain basic properties of a LookSee index:

<method-id> defaults to ls 1. The ls method views a text (document) as consisting of words. A word begins with a Unicode letter or digit and is (by default) followed by a sequence of Unicode letters, digits and underscores ('_'); the chars argument expands the set of Unicode characters that may be part of a LookSee word (although not the first character). An ls index is simply a summary of word occurrences in the indexed documents.

The ls method supports fine-tuning by selection of method arguments (parameters):

<method-id> ::= ls
<method-args> ::= <method-arg>-comma-sequence
<method-arg> ::=
     -casefold
   | stop
   | -stop
   | stem
   | -stem
   | stemalso
   | -stemalso
   | attrs
   | -attrs
   | body
   | -body
   | minwordlen=<number>
   | maxwordlen=<number>
   | chars=<characters>
   | only=<tags>
   | skip=<tags>

-casefold suppresses the use of the default Information_Schema.CaseInsensitive Collation. It has no inverse. stop activates the default stop-words (the default), whereas -stop does not. stem means reducing words to their stem form before being indexed (stemalso means indexing both the stemmed and the unstemmed forms); -stem and -stemalso suppresses stemming and is the default 3. attrs means index words in XML and HTML attributes (default is not to, corresponding to -attrs); body means to index only words in XML and HTML body sections (the default), and -body means to index words from all text sections. minwordlen and maxwordlen defines the minimum and maximum number of characters a word can have in order to be indexed (the defaults are 1 and 2560). chars is a sequence of (Unicode) characters that are to be part of LookSee index words. only and skip specify matches for XML and HTML tag paths, in the form of regular expressions (explained in detail below).

When specifying only or skip arguments, LookSee will maintain a path of [X][HT]ML tag and attribute names during indexing and check it against those <tags> arguments whenever a candidate index word is found. A word is not indexed if a skip argument matches the path, nor if an only argument does not. The path is constructed from the nested structure of tags, with an initial and separating slash ( /) characters, e.g., /HTML/BODY. If (and only if) the attrs argument is specified, [X][HT]ML attribute values will be scanned for indexable words, with : and the attribute name appended to the path, e.g., /HTML/META:keywords. For the html type, the path will be upper-case folded, for the xml and xhtml types tag and attributes names are taken literally as appearing in the document. LookSee treats only and skip arguments as regular expressions when matching against tag and attribute name paths. Since the only and skip arguments are given as FrontBase SQL strings they are in principle sequences of Unicode characters and are, indeed, treated as Level 1 Unicode Regular Expressions, which are described in full on http://www.unicode.org/unicode/reports/tr18/.

Dropping Indexes

A text index is dropped by an <alter-table> command of the following form:

<alter-table> ::=
   ALTER TABLE <table> DROP INDEX <index-name>;

An index will also be dropped whenever the table it is related to is dropped 5.

Updating Indexes

Certain properties of an index may be changed dynamically. This is achieved by an <alter-table> command of the following form:

<alter-table> ::=
   ALTER TABLE <table> UPDATE INDEX <index-name> <arguments>;

Searching Indexes

An index may be searched by using the SATISFIES predicate in the <where-clause> of a <select-statement>. It has the following form:

<satisfies-predicate> ::=
   SATISFIES ( <index-name> , '<query>' [ , '<label>' ] )
<query> ::=
     <word>
   | <word-with-wildcards>
   | "<regular-expression>"
   | <word>-sequence
   | ( <query> )
   | <query> | <query>
   | <query> & <query>
   | ~ <query>
   | NOT <query>
<wildcard> ::=
     %
   | <character-spec>
<character-spec> ::=
     [ <character-set> ]

A <word> is a sequence of Unicode characters that is matched against all index words. A <word-with-wildcards> is like a <word>, except that it may contain one or more <wildcard>s: '%' denotes any sequence of zero or more (Unicode) characters and <character-spec> specifies a set of characters as described for regular expressions (note that the brackets denote themselves as characters, not optionality). A regular expression is enclosed in double quotes ('"') and is explained below. A phrase is a sequence of simple <word>s (ie, no wildcards or regular expressions). '|' is logical or, '&' is logical and, and '~' and 'NOT' are logical not. '&' has higher precedence than '|', while '~' (and 'NOT') has higher precedence than '&'. LookSee will reject queries not corresponding to a simple, normal form having 4 optional conjuncts: a set of mandatory words, a set of alternative words, a set of unwanted words, and a phrase; this is not unlike what can be expressed as an advanced query on the Google Search Engine.

<word> is simply a sequence of such Unicode characters that may be included in a LookSee index word.

<label> must be a simple identifier; it may be referenced in RANK and DETAILS expressions.

<regular expression> is explained below.

A SATISFIES predicate must be either the only expression in the <where-clause>, or it must be part of a top-level AND expression.

Inspecting Indexes

Certain properties of an index may be inspected by so-called pseudo-tables, which are just like ordinary database tables, except that they are constructed every time that they are referenced. Two kinds of pseudo-tables are defined for a LookSee index, one for inspecting the actual words in the index, and one for inspecting the stop-words defined for the index. The tables must be defined as follows:

CREATE PSEUDO ( <index> , <type> ) TABLE <name>
   "Word" VARCHAR(1000),
   "Count" INT
);

<index> is the name of the LookSee index and <type> is either "INDEXWORDS" or "STOPWORDS". The resulting table is called <name> and may be inspected with ordinary table queries, eg:

SELECT count(*) FROM <name> WHERE 500 < "Count";

which will return the number of words in the index that occur in more than 500 documents.

Finally,

EXTRACT TABLE <name> [ WITH SIZE ];

will display information about the table identified by <name>, including information about any LookSee indexes defined.

Getting Details

For use in a <select-column-list> of a <select-statement>, the following expressions are available:

RANK ( [ '<label>' ] )
DETAILS ( '<format>' [ , '<label>' ] )

They provide details about (each) document matched by SATISFIES in the <where-clause> of the <select-statement>. A <label> must be used to pin-point a matching SATISFIES, when more than one occurs in the <where-clause>.

The DETAILS expression produces an SQL string value constructed from the given <format> literal string 6 by expansion of any expansion pattern of the following form:

%t     the type of the document
%n     the name of the document
%q     the contents of the document
%h     the count of matches
%p     the (comma-separated) (minimum) matching positions in the document
%a     the (comma-separated) all matching positions in the document
%P     the (comma-separated) (minimum) matching position/length pairs in the document
%A     the (comma-separated) all matching position/length pairs in the document
%f     document marked up with s1/s2 at first match (s1/s2 are strings in between ';' after % as in: %;s1;s2;f)
%F     document marked up with s1/s2 at all matches (s1/s2 are strings in between ';' after % as in: %;s1;s2;F)
%%     the percent character

The expanders %p and %a accept one, and %P and %A two, "options": either of

(
[
{

to control which bracketing is used for position and length information, as in the following:

%[(A

leading to square brackets at the outer "list" level, and normal round parenthesis around position/length pairs.

The RANK expression produces a positive integer number being a heuristic measure for a document's relevance for a query.

The DETAILS and RANK expressions are not supported by a multi-column LookSee index.

Regular Expressions

Regular expressions provide a powerful notation for text matching. LookSee supports the use of regular expressions in index queries, where a word must match the specified regular expression, as well as in only and skip arguments, where tag and attribute paths are matched against the regular expression. The latter is used in the examples below, but regular expressions work in the same fashion when matching individual index words.

As a first approximation, regular expression matching is simple sub-string matching. Thus, 'HTML/BODY' would match all of the following tag paths:

/HTML/BODY
/HTML/BODY/SECTION
/MYTAG/MYHTML/BODYWORK

The special characters '^' and '$' are used as "start" and "end" anchors; thus '^/HTML/BODY' would not match the following:

/MYTAG/HTML/BODYWORK

whereas '^/HTML/BODY$' would not match the following:

/HTML/BODYWORK
/HTML/BODY/SECTION

The '.' (dot) character matches any character; the '?' character means "match zero or one", '*' means "match zero or more" and '+' means "match one or more". '|' separates alternatives and '( ...)' groups characters. Thus '^/HTML/BODY(/.+)?$' will match paths having HTML and BODY as the two outermost tags. Sets of alternative characters are given between '[' and ']', as in '[01234567]', which matches any octal digit. Ranges of such characters may be abbreviated by a hyphen, as in '[0-7]'. Sets of characters may be negated by a leading '^', as in '[^a-zA-Z]', which matches any character that is not a latin letter.

It should be noted that LookSee is designed and optimized for exact word matching. Regular expression queries pay a penalty for the need to "traverse" the entire LookSee index, with the exception of "start" anchored regular expressions, which (also) perform quite well. Wildcards in query words are similar to regular expressions; if a wildcard is not the first character of a word, it corresponds to a "start" anchored regular expression and performs well.


1. LookSee has been designed so that other distinct methods could eventually be added to FrontBase, making it possible for the user to specify which method should be used for a particular index.

2. LookSee by default treats the following as stopwords:

a an and are as at be but by
etc for if in into is it its
no not of on or s such
t that the their then there these
they this to
was were will with

3. In LookSee 2.3 stemming is English stemming (only).

4. Note that a document text may be processed more than once (ie., not necessarily just when initially being indexed), and that LookSee does not check for future inconsistencies caused by (incidental) midstream updates or deletions of 'external' documents. Also note that a file is accessed on the database server, which may have a filesystem quite different from that of the client machine issuing the ALTER statement. Cave!

5. Unlike tables themselves, indexes can usually be reconstructed. Hence, inadvertent deletion of an index will in itself not lead to loss of data. This is not true, however, if an index was created from external documents which were subsequently changed or deleted.

6. Application developers should find that DETAILS formats are actually little programming platforms on which it is easy to produce feedback in the form the application would like to see it.


Edited by ph@frontbase.com. Document Date: 2007.03.05, Issue 2.4.4.