PAT Data Structure and Signature File Structure
PAT Data Structures
Text searching methods may be classified as lexicographical indices (indices that are sorted), clustering techniques, and indices based on hashing. In this chapter we discuss two new lexicographical indices for text, called PAT trees and PAT arrays. Our aim is to build an index for the text whose size is similar to or smaller than that of the text itself.
Briefly, the traditional model of text used in information retrieval is that of a set of documents. Each document is assigned a list of keywords (attributes), with optional relevance weights associated to each keyword. This model is oriented to library applications, which it serves quite well.
For more general applications, it has some problems, namely:
- A basic structure is assumed (documents and words). This may be reasonable for many applications, but not for others.
- Keywords must be extracted from the text (this is called “indexing”). This task is not trivial and is error prone, whether it is done by a person or automatically by a computer.
- Queries are restricted to keywords.
For some indices, instead of indexing a set of keywords, all words except for those deemed to be too common (called stopwords) are indexed.
We prefer a different model. We see the text as one long string. Each position in the text corresponds to a semi-infinite string (sistring), the string that starts at that position and extends arbitrarily far to the right, or to the end of the text. It is not difficult to see that any two sistrings starting at different positions are different. The main advantages of this model are:
- No structure of the text is needed, although if there is one, it can be used.
- No keywords are used. The queries are based on prefixes of sistrings, that is, on any substring of the text.
THE PAT TREE STRUCTURE
In what follows, we will use a very simple model of text. Our text, or text database, will consist of a single (possibly very long) array of characters, numbered sequentially from one onward. Whether the text is already presented as such, or whether it can be viewed as such is not relevant. To apply our algorithms it is sufficient to be able to view the entire text as an array of characters.
A semi-infinite string is the sequence of characters from this array starting at a given position and extending as far to the right as necessary. In case the semi-infinite string (sistring) extends beyond the end of the actual text, special null characters are considered to be added at its end, these characters being different from any other character in the text. The name semi-infinite is taken from the analogy with geometry, where we have semi-infinite lines: lines with one origin, but infinite in one direction. Sistrings are uniquely identified by the position where they start, and for a given, fixed text, this is simply given by an integer.
Text: Once upon a time, in a far away land . . .
- sistring 1: Once upon a time . . .
- sistring 2: nce upon a time . . .
- sistring 8: on a time, in a . . .
- sistring 11: a time, in a far . . .
- sistring 22: a far away land . . .
For example, the above sistrings will compare as follows:
22 < 11 < 2 < 8 < 1
Of the first 22 sistrings (comparing characters without regard to case), the lowest sistring is “a far away. . .” and the highest is “upon a time. . . .”
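The comparison above can be reproduced with a short sketch (Python; the case-insensitive comparison is an assumption made here so that “Once. . .” sorts among the lowercase sistrings as in the example):

```python
# Sistrings are identified by their (1-based) starting position; comparing
# two sistrings just compares the text from those positions onward.
text = "Once upon a time, in a far away land"

def sistring(pos: int) -> str:
    # Case-insensitive comparison (an assumption) to reproduce the
    # ordering stated in the example above.
    return text[pos - 1:].lower()

positions = [1, 2, 8, 11, 22]
ordered = sorted(positions, key=sistring)   # [22, 11, 2, 8, 1]
```

Note that Python's slice-beyond-the-end behavior plays the role of the null padding: a shorter sistring that is a prefix of another sorts before it.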
ALGORITHMS ON THE PAT TREE
5.3.1 Prefix Searching
- Notice that, by construction, every subtree of the PAT tree contains all the sistrings with a given prefix. Prefix searching in a PAT tree therefore consists of searching for the prefix in the tree, up to the point where we exhaust the prefix or up to the point where we reach an external node.
- At this point we need to verify whether we could have skipped bits. This is done with a single comparison against any one of the sistrings in the subtree (considering an external node as a subtree of size one). If this comparison is successful, then all the sistrings in the subtree (which share the common prefix) are the answer; otherwise there are no sistrings in the answer.
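The outcome of this search can be sketched over the equivalent PAT array (the sistring positions in lexicographical order): two binary searches delimit the contiguous block of sistrings sharing the prefix. This illustrates the result of the tree search, not its bit-level mechanics:

```python
def build_pat_array(text: str) -> list:
    # Sistring positions (0-based here) sorted lexicographically.
    return sorted(range(len(text)), key=lambda i: text[i:])

def prefix_search(text: str, pat: list, prefix: str) -> list:
    m = len(prefix)
    # Leftmost sistring whose first m characters are >= prefix.
    lo, hi = 0, len(pat)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pat[mid]:pat[mid] + m] < prefix:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Leftmost sistring whose first m characters are > prefix.
    hi = len(pat)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pat[mid]:pat[mid] + m] <= prefix:
            lo = mid + 1
        else:
            hi = mid
    return pat[start:lo]   # every position sharing the prefix
```

For example, with text `"abracadabra"` the prefix `"abr"` yields positions 0 and 7 (in PAT-array order).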
5.3.2 Proximity Searching
We define a proximity search as finding all places where a string S1 is at most a fixed number of characters (given by the user) away from another string S2. The simplest algorithm for this type of query is to search for S1 and S2. Then, sort the smaller of the two answer sets by position. Finally, we traverse the unsorted answer set, searching for every position in the sorted set and checking whether the distance between positions (and their order, if we always want S1 before S2) satisfies the proximity condition. If m1 and m2 (m1 < m2) are the respective answer set sizes, this algorithm requires O((m1 + m2) log m1) time. This is not very appealing if m1 or m2 is O(n). For the latter case (when one of the strings S1 or S2 is small), better solutions based on PAT arrays have been devised by Manber and Baeza-Yates (1991).
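The sort-then-search algorithm can be sketched as follows (the unordered variant, where either string may come first; positions are character offsets):

```python
import bisect

def proximity(ans1: list, ans2: list, k: int) -> list:
    """Pairs of positions at most k characters apart, one from each
    answer set: sort the smaller set, scan the larger, and
    binary-search for nearby positions -- O((m1 + m2) log m1)."""
    if len(ans1) <= len(ans2):
        small, big, swapped = sorted(ans1), ans2, False
    else:
        small, big, swapped = sorted(ans2), ans1, True
    pairs = []
    for p in big:
        lo = bisect.bisect_left(small, p - k)
        hi = bisect.bisect_right(small, p + k)
        for q in small[lo:hi]:
            pairs.append((p, q) if swapped else (q, p))
    return pairs
```

Requiring S1 strictly before S2 would only change the distance check inside the inner loop.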
5.3.3 Range Searching
Searching for all the strings within a certain range of values (lexicographical range) can be done equally efficiently. More precisely, range searching is defined as searching for all strings that lexicographically compare between two given strings. For example, the range “abc” . . “acc” will contain strings like “abracadabra,” “acacia,” “aboriginal,” but not “abacus” or “acrimonious.”
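With the sistrings kept in lexicographical order, the range is delimited by two binary searches; a sketch using the chapter's own example range, with a small list of plain strings standing in for sistrings:

```python
import bisect

# The example range "abc" .. "acc" over a few sample strings.
words = sorted(["abracadabra", "acacia", "aboriginal", "abacus", "acrimonious"])
lo = bisect.bisect_left(words, "abc")    # first string >= "abc"
hi = bisect.bisect_right(words, "acc")   # first string > "acc"
in_range = words[lo:hi]                  # ['aboriginal', 'abracadabra', 'acacia']
```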
5.3.4 Longest Repetition Searching
The longest repetition of a text is defined as the match between two different positions of the text where this match is the longest (the largest number of characters) in the entire text. For a given text, the longest repetition is given by the tallest internal node in the PAT tree; that is, the tallest internal node gives a pair of sistrings that match for the greatest number of characters. Here, tallest refers not only to the shape of the tree but must also take the skipped bits into account. For a given text, the longest repetition can be found while building the tree, and it is a constant; that is, it will not change unless we change the text (and hence the tree).
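In the sorted-sistring (PAT array) view, the longest repetition is the longest common prefix of two lexicographically adjacent sistrings; a naive sketch:

```python
def longest_repetition(text: str) -> str:
    # Sort sistring positions lexicographically; adjacent sistrings share
    # the longest prefixes, mirroring the tallest internal node of the tree.
    pat = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(pat, pat[1:]):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        if n > len(best):
            best = text[a:a + n]
    return best
```

This sketch sorts full sistrings, so it is quadratic in the worst case; the point is the correspondence, not efficiency.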
5.3.5 “Most Significant” or “Most Frequent” Searching
- This type of search has great practical interest, but is slightly difficult to describe. By “most significant” or “most frequent” we mean the most frequently occurring strings within the text database. For example, finding the “most frequent” trigram is finding a sequence of three letters that appears most often within our text.
- In terms of the PAT tree, and for the example of the trigrams, the number of occurrences of a trigram is given by the size of the subtree at a distance 3 characters from the root. So finding the most frequent trigram is equivalent to finding the largest subtree at a distance 3 characters from the root. This can be achieved by a simple traversal of the PAT tree which is at most O(n/average size of answer) but is usually much faster.
- Searching for the “most common” word is slightly more difficult but uses a similar algorithm. A word may be defined as any sequence of characters delimited by blank spaces. This type of search will also require a traversal, but in this case the traversal is done only within a subtree (the subtree of all sistrings starting with a blank) and does not have a constant depth; it descends to the places where the second blank of each sistring appears.
- We may also apply this algorithm within any arbitrary subtree. This is equivalent to finding the most frequently occurring trigram, word, and the like, that follows some given prefix.
- In all cases, finding the most frequent string with a certain property requires a subtree selection and then a tree traversal which is at most O(n/k) but typically much smaller. Here, k is the average size of each group of strings of the given property. Techniques similar to alpha-beta pruning can be used to improve this search.
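For the trigram case, counting subtree sizes at depth 3 amounts to counting the 3-character prefixes of all sistrings; a direct sketch in which a Counter stands in for the tree traversal:

```python
from collections import Counter

def most_frequent_trigram(text: str):
    # Each sistring contributes its first three characters; a trigram's
    # count equals the size of the corresponding depth-3 subtree.
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return counts.most_common(1)[0]   # (trigram, frequency)
```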
5.3.6 Regular Expression Searching
The main steps of the algorithm due to Baeza-Yates and Gonnet (1989) are:
(a) Convert the regular expression given as a query into a minimized deterministic finite automaton (DFA), which may take exponential space/time with respect to the query size but is independent of the size of the text (Hopcroft and Ullman 1979).
(b) Eliminate outgoing transitions from final states (see the justification in step (e)). This may allow further minimization.
(c) Convert the character DFA into a binary DFA using any suitable binary encoding of the input alphabet; each state will then have at most two outgoing transitions, one labeled 0 and one labeled 1.
(d) Simulate the binary DFA on the binary digital trie of all sistrings of the text, using the same binary encoding as in step (c). That is, associate the root of the tree with the initial state and, for any internal node associated with state i, associate its left descendant with state j if there is a transition from i to j on bit 0, and its right descendant with state k if there is a transition from i to k on bit 1 (see Figure 5.3).
(e) For every node of the index associated with a final state, accept the whole subtree and halt the search in that subtree. (For this reason, we do not need outgoing transitions from final states.)
(f) On reaching an external node, run the remainder of the automaton on the single string determined by this external node.
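A minimal sketch of the simulate-and-accept steps above, with hypothetical encodings chosen for illustration: the trie is nested dicts keyed by bit character, and the DFA maps (state, bit) pairs to states:

```python
def dfa_trie_search(trie: dict, dfa: dict, state: int, finals: set,
                    acc: list, path: str = "") -> None:
    # Walk the binary trie and the binary DFA in lockstep.
    if state in finals:
        acc.append(path)   # the whole subtree qualifies; stop descending
        return
    for bit in ("0", "1"):
        nxt = dfa.get((state, bit))    # DFA transition on this bit, if any
        child = trie.get(bit)          # trie descendant on this bit, if any
        if nxt is not None and child is not None:
            dfa_trie_search(child, dfa, nxt, finals, acc, path + bit)
```

For instance, a DFA accepting everything that starts with bits 10 (`dfa = {(0, "1"): 1, (1, "0"): 2}`, `finals = {2}`) accepts the entire subtree under path "10" and never visits the rest of the trie.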
Signature File Structure
Text retrieval methods have attracted much interest recently. There are numerous applications involving storage and retrieval of textual data:
- Electronic office filing
- Computerized libraries. For example, the U.S. Library of Congress has been pursuing the “Optical Disk Pilot Program” (Price 1984; Nofel 1986), where the goal is to digitize the documents and store them on an optical disk. A similar project is carried out at the National Library of Medicine (Thoma et al. 1985).
- Automated law (Hollaar 1979) and patent offices (Hollaar et al. 1983). The U.S. Patent and Trademark Office has been examining electronic storage and retrieval of recent patents on a system of 200 optical disks.
- Electronic storage and retrieval of articles from newspapers and magazines.
- Consumers’ databases, which contain descriptions of products in natural language.
- Electronic encyclopedias (Ewing et al. 1986; Gonnet and Tompa 1987).
- Indexing of software components to enhance reusability (Standish 1984).
- Searching databases with descriptions of DNA molecules (Lipman and Pearson 1985).
- Searching image databases, where the images are manually annotated (Christodoulakis et al. 1986). A similar approach could be used to search a database with animations, if scripted animation is used (Lewis 1989).
The main operational characteristics of all the above applications are the following two:
- Text databases are traditionally large.
- Text databases have an archival nature: there are insertions in them, but almost never deletions or updates.
Text retrieval methods form the following large classes (Faloutsos 1985): full text scanning, inversion, and signature files, the last of which we shall focus on next.
Signature files are based on the idea of the inexact filter: They provide a quick test, which discards many of the nonqualifying items. The qualifying items definitely pass the test; some additional items (“false hits” or “false drops”) may also pass it accidentally. The signature file approach works as follows: The documents are stored sequentially in the “text file.” Their “signatures” (hash-coded bit patterns) are stored in the “signature file.” When a query arrives, the signature file is scanned and many nonqualifying documents are discarded. The rest are either checked (so that the “false drops” are discarded) or they are returned to the user as they are.
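One common way to build such signatures is superimposed coding; the following is a minimal sketch (the signature width F, the bits-per-word count m, and the hash construction are illustrative assumptions, not the chapter's parameters):

```python
import hashlib

F = 64   # signature width in bits (illustrative value)
m = 3    # bits set per word (illustrative value)

def word_signature(word: str) -> int:
    # m hash functions, each setting one of F bit positions.
    sig = 0
    for i in range(m):
        h = int(hashlib.sha256(f"{word}:{i}".encode()).hexdigest(), 16)
        sig |= 1 << (h % F)
    return sig

def document_signature(words) -> int:
    sig = 0
    for w in words:
        sig |= word_signature(w)   # superimpose (OR) the word signatures
    return sig

def may_contain(doc_sig: int, word: str) -> bool:
    # If any bit of the word's signature is absent, the document
    # definitely does not contain the word; otherwise it *may*
    # (false drops are possible, false dismissals are not).
    q = word_signature(word)
    return doc_sig & q == q
```

The inexact-filter property is visible in `may_contain`: a negative answer is always correct, while a positive answer must be verified against the text if false drops are unacceptable.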
On the other hand, signature files may be slow for large databases, precisely because their response time is linear in the number of items N in the database. Thus, signature files have been used in the following environments:
1. PC-based, medium size db
2. parallel machines
3. distributed text db