There are a tremendous number of instances at IMSSS where unstructured files are searched for sets of various character strings. For some time Pentti Kanerva and I have been talking of a searching program that would write a string-specific program to search a file. The initial concept was that the generated program would be in pure PDP-10 machine code, and that it should do a minimum of back-tracking in its search. Since then we have decided two things: A) no back- tracking is necessary, and B) what we really were writing was not so much a PDP-10 program as a finite state machine to do the searching for us. The algorithm finally produced has roots in the "trie" search method (Knuth "The Art of Computer Programming" volume 3, section 6.3), and in fact for some words looks remarkably like a "trie" recognition structure (see below). In fact, the major difference between the finite state machine and a "trie" is that the finite state machine knows what to do if it fails to match a string, proceeding to the state that most completely represents its current knowledge about the input stream. MACHINE TO RECOGNIZE THE WORDS "CAB" AND "CAD" other recognized back state "A" "B" "C" "D" letters word pointer . . . (1) . . "" . (1) (2) . (1) . . "" . (2) . (3) (1) (4) . "" . (3) . . (1) . . "CAB" . (4) . . (1) . . "CAD" . "TRIE" FOR THE WORDS "CAB" AND "CAD" state "a" "b" "c" "d" "e" ... . --- --- (1) --- --- ... (1) (2) --- --- --- --- ... (2) --- "cab" --- "cad" --- ... Imagine each state represents some amount of knowledge about the input stream. As long as we follow any one string, our knowledge increases with each new character we see. However, if we see a character that does not match the next character of the string we are following, we should proceed to a state that "knows" as much of the input string as possible; it is this transition to the state of greatest knowledge that distinguishes the finite state machine of this algorithm from a simple "trie". In the previous example, the transition to maximal knowledge is demonstrated whenever any state (except ".", which should go there anyway) encounters a "C". Whenever this happens, rather than simply reporting a failure to match, as the "trie" would, the machine goes into state "(1)", which means "I know the input stream ends with the character "C". Since we are looking for all matches in the input stream, there are several special cases of search targets that must be dealt with in a non-trie fashion. All of these cases deal with search targets with common sub-strings. First, we have the case of one string being the initial sub- string of another word (as in the case of "father" and "fat"). The solution to this problem is fairly simple, we simply indicate that we have recognized the word "fat" when we receive the "t", and continue from the state that 'knows' "fat" to the state that 'knows' "fath" and so on. The second case is the case of one word being an imbedded sub- string of another word (for example,. "father" and "the"). In this case, we see that there must be more than one state that reports recognition of "the" (both the states typified by "the" and "fathe"). So the machine must be built in such a way as to 'know' that such things can happen. The third case is even nastier; imagine one string being a tail of another string ("father" again, and "her"). When we we reach the final "r" of "father", we need to report a match on two distinct words. In order to solve this problem, it would seem we need a linked list of words to report as found. Notice that the list can be a single chain, with several places looking as if they were the head as in the case of "grandfather", "father", and "her". Every time we see "father", we have also seen "her", including the case of having seen "father" as a result of having seen "grandfather". This particular piece of trickiness is represented in the machine by the column labelled "back pointer". What the machine is supposed to do when it sees it has recognized a word, and reported that fact, is to start following the back-pointers, reporting any words associated with the states it passes. The next example shows all of these conditions (except the triple match) in a machine. To make it easier to read, I have included a column on the left labelled "string known," which indicates how much of the information about the input stream is contained in each state. There is in fact, nothing in the algorithm (or my implementation of it in SAIL) that corresponds to that column. Another possibly confusing label in the example is "****", which simply is a column that represents any input character that does not otherwise have a column associated with it. MACHINE TO RECOGNIZE: (1) "FAT", (2) "FATHER", (3) "HER", (4) "THE", and (5) "HERE". INITIAL CHARACTER-STATE TABLE (pseudo-"trie") string back known state||****| F | A | T | H | E | R |pointer WORD: -------|-----||----|----|----|----|----|----|----|------|-------------- .none. | .. || .. | (1)| .. | (3)| (2)| .. | .. | (-1) | | || | | | | | | | | f | (1) || .. | .. | (4)| .. | .. | .. | .. | (-1) | h | (2) || .. | .. | .. | .. | .. | (5)| .. | (-1) | t | (3) || .. | .. | .. | .. | (6)| .. | .. | (-1) | | || | | | | | | | | fa | (4) || .. | .. | .. | (7)| .. | .. | .. | (-1) | he | (5) || .. | .. | .. | .. | .. | .. | (8)| (-1) | th | (6) || .. | .. | .. | .. | .. | (9)| .. | (-1) | | || | | | | | | | | fat | (7) || .. | .. | .. | .. |(10)| .. | .. | (-1) |W1 (fat) her | (8) || .. | .. | .. | .. | .. |(11)| .. | (-1) |W3 (her) the | (9) || .. | .. | .. | .. | .. | .. | .. | (-1) |W4 (the) | || | | | | | | | | fath |(10) || .. | .. | .. | .. | .. |(12)| .. | (-1) | here |(11) || .. | .. | .. | .. | .. | .. | .. | (-1) |W5 (here) | || | | | | | | | | fathe |(12) || .. | .. | .. | .. | .. | .. |(13)| (-1) | | || | | | | | | | | father |(13) || .. | .. | .. | .. | .. | .. | .. | (-1) |W2 (father) DISPATCH TABLE (processed "trie") (NOTE: non-trie pointers are encased in square brackets-"[]") string back known state||****| F | A | T | H | E | R |pointer WORD: -------|-----||----|----|----|----|----|----|----|-----||-------------- .none. | .. || .. | (1)| .. | (3)| (2)| .. | .. | .. | | || | | | | | | | | f | (1) || .. | [1]| (4)| [3]| [2]| .. | .. | .. | h | (2) || .. | [1]| .. | [3]| [2]| (5)| .. | .. | t | (3) || .. | [1]| .. | [3]| (6)| .. | .. | .. | | || | | | | | | | | fa | (4) || .. | [1]| .. | (7)| [2]| .. | .. | .. | he | (5) || .. | [1]| .. | [3]| [2]| .. | (8)| .. | th | (6) || .. | [1]| .. | [3]| [2]| (9)| .. | .. | | || | | | | | | | | fat | (7) || .. | [1]| .. | [3]|(10)| .. | .. | .. |W1 (fat) her | (8) || .. | [1]| .. | [3]| [2]|(11)| .. | .. |W3 (her) the | (9) || .. | [1]| .. | [3]| [2]| .. | [8]| .. |W4 (the) | || | | | | | | | | fath |(10) || .. | [1]| .. | [3]| [2]|(12)| .. | .. | here |(11) || .. | [1]| .. | [3]| [2]| .. | .. | .. |W5 (here) | || | | | | | | | | fathe |(12) || .. | [1]| .. | [3]| [2]| .. |(13)| .. |W4 (the) | || | | | | | | | | father |(13) || .. | [1]| .. | [3]| [2]|[11]| .. | (8) |W2 (father) There are essentially three phases in the searching algorithm, building the pseudo-trie, transforming the "trie" into the proper finite state machine ("cleaning"), and executing the finite state machine (the actual search). Each phase is treated as a separate algorithm, although they are almost useless without the other algorithms. The insertion phase assumes a list of non-empty words, and a function ("translate") that will translate a single character into a "character number." This function has two purposes, to allow the search to treat different characters as identical (for example, a common use would have all upper case characters equivalent to all lower case characters), and to reduce the amount of memory necessary to store the resultant "pseudo-trie" (by creating sequential numbers for the characters being included, and using a single number to represent "any other character"). It is similar to any trie insertion algorithm, but has two important differences. First, it uses one state for every character in a word, plus a separate state that means "word completed". The "word completed" state includes a table entry that indicates what word has been recognized (and will include a pointer to the reporting chain), and it allows a simple connection to another state to be built by the next algorithm. Second, the first character of each target word is inserted before the second character for each word which is inserted before the third.... This provision leaves the "trie" with an interesting property, which is used by the next algorithm. If we examine any "trie" built with these two restrictions we will discover that the string that is "known" by any state is either longer than or as long as the strings "known" by all states that have smaller state numbers. Another useful property of the "trie" comes from the fact that states are added only when needed. Every state is pointed to by some state with a smaller number. PSEUDO-TRIE building algorithm queue holds ordered triples that look like: (word-tail to insert, actual word name, and state for this word) state.table[ s, c ] will eventually hold the "trie" that is being constructed. the first index is the state number (the row in all of the figures), and the second index is the character (column in the figures) word[ s ] is the word recognized at state s (for this algorithm, this will have a value only if s is an "word recognized" state) maxi.state is the number of the largest state that currently exists name is the full name of the word currently being worked on current.word is a string containing all of the characters in "name" that have yet to be inserted in "state.table" state is the state for the word currently being worked on T1: [Initialize] state.table[ *, * ] _ 0; word[ * ] _ 0; maxi.state _ 0; state _ 0; For all non-empty words "target" Do queue <= (target, target, 0) T2: [Get the next character from the queue] If queue is empty then terminate (current.word, name, state) <= queue ch _ translate( first character of current.word ) remove first character from current.word T3: [Insert it in the state table] if state.table[ state, ch ] = 0 then state.table[ state, ch ] _ increment( maxi.state ) state _ state.table[ state, ch ] T4: [If there is more work to do on this word, place it on the queue] If length( current.word ) > 0 Then queue <= (current.word, name, state) go to T2 T5: [Otherwise, mark end of word] word[ state ] _ name go to T2 The "cleaning" phase converts the pseudo-trie produced by the insertion phase, and converts it into a finite state machine. This phase should determine where each state should proceed whenever it receives an unexpected character, and should also construct the reporting chains and their connections to the state table. The back pointer section of the machine is created here; first it is filled with values that are useful to the algorithm, and then those values are replaced with the values that should be in the finite state machine. The method is basically to iterate through the states, finding the "greatest common tail" of every state "next.state" pointed to by this state by finding out what the "greatest common tail" of this state would become if you added the character that causes the machine to go from this state to "next.state". Perhaps an example is in order, imagine that the state being scanned is associated with the string "grandf", its greatest common tail is a state associated with "f", and next.state is associated with "grandfa". Further imagine that there is a state (pointed to by the "f" state) associated with "fa", and a state (pointed to by the "fa" state) associated with "fas", but no state associated with "grandfas". Since the greatest common tail of "grandf" is "f", and adding "a" to "grandf" gets us to "grandfa", we will see what state we arrive at by supplying "f" with an "a". The result ("fa") is now declared to be "grandfa"'s back pointer. Now we loop through all of "grandfa"'s pointers, pausing on those that are as yet unset (have a value of 0). When we arrive at the "s" column we see that it is unset, so we copy the "s" entry from "grandfa"'s back-pointer state (that is, since we have no better knowledge about what to do with an "s" in state "grandfa", we see what "fa" would do with an "s", and find out that we should go to state "fas"). In order to believe this, we have to believe that states "f" and "fa" have all their pointers correct, and that the back-pointer for "grandf" does in fact point to its greatest common tail. Clearly "f" is in its final state, since it is a tail of "grandf", and has a state number less than the current state being scanned (because it has fewer characters). Hence "f" has been completely processed. The state "fa" is full of its proper entries because "f" has been scanned, hence at some previous time "fa" has had done to it exactly what is now happening to "grandfa". In a similar way, we can show that by making sure the word-recognized chain is connected properly to "grandfa" when we finish scanning its columns, everything beyond it will connect properly. CLEANING the TRIE: backpointer[ s ] is a vector of state numbers (or -1 to indicate unset) indexed by state number that indicates either (A) the state that 'knows' the "greatest common tail" for the string associated with "s", or (B) the state number of the state with the next word in the report chain. state is the position in the cleaning scan. Note that for all states less than state, all entries are fixed, and their back pointers are only looked at in sense (B). Hence, just before state gets incremented, its backpointer is converted from type (A) to (B). ch is the current character column being looked at in state next.state is simply state.table[ state, ch ], i.e. the state that would be reached if the character "ch" were received in state "state" g.c.tail is the largest state less than next.state whose associated string is a tail of the string associated with next.state CLEANING algorithm: C1: [Initialize] backpointer[ * ] _ -1 (-1 is a ficticious state that has no word, a backpointer of 0, and has 0 in all it's entries) state _ -1 C2: [Scan all states sequentially] increment( state ) if state > maxi.state then terminate C3: [Scan all characters in state] For all character values "ch" DO C4 - C6 If none left then go to C7 C4: [If the state pointed to is an unprocessed forward state then process it] next.state _ state.table[ state, ch ] If next.state > state AND backpointer( next.state ) = -1 Then do C5 Else continue C3 C5: [Find the state that represents the "greatest common tail" for next.state, and place it in next.state's backpointer] g.c.tail _ state.table[ backptr(state), ch ] backptr( next.state ) _ g.c.tail C6: [Fill empty slots in next.state with information from its g.c.tail] If word( next.state ) = 0 Then word( next.state ) _ word( g.c.tail ) For all characters "n.ch" DO If state.table[ next.state, n.ch ] = 0 Then state.table[ next.state, n.ch ] _ state.table[ g.c.tail, n.ch ] continue C3 c7: [Convert this state's backpointer from type (A) to type (B)] (Compress this state's backpointer to reveal only unique words) If word( backptr(state) ) = word( state ) OR word( backptr(state) ) = 0 Then backptr(state) _ backptr( backptr(state) ) go to C2 Finally we have the "demon" algorithm, which is basically an interpreter for our finite state machine. It is an extremely simple- minded algorithm, but even if it is rather dumb, it is very fast. As is often the case with politicians as well as algorithms, a lot of work has been done by others in order to make this fairly dense one look as if he really knows what he's doing. DEMON algorithm: D1: [Initialize] state _ 0; D2: [Get next character from file and classify it] character.class _ translate( INPUT.CHAR ); If end of file then terminate. D3: [Determine next machine state] report.head _ state _ state.table[ state, character.class ] D4: [Report the front of the report chain] If word( report.head ) = 0 then go to D2 else report( current.file.position, word( report.head ) ); D5: [Get the next element on the report chain] report.head _ backpointer( report.head ) go to D4. A CURSORY ANALYSIS OF RUNNING TIME: example will use words: "FAT", "FATHER", "HER", "WATER". total characters in words (C) = 3 + 6 + 3 + 5 = 17 distinct characters in words (D) = 3 + 3 + 0 + 1 = 7 total number of states (S) = 3 + 3 + 3 + 5 + 1 = 15 ( it is always true that S < C + 2 ) TRIE-BUILDING: proportional to total characters in words to be searched + 1 [ C + 1 = 18 ] CLEANING: proportional to 1 + number of distinct characters (if ASCII maximum is 129) * total number of states (which is less than or equal to the total number of characters in the words to be searched + 1) [ (D + 1) * S = 120 ] DEMON: proportional to the total number of characters in the file to be searched + total number of 'recognitions' made MEMORY REQUIRED: 3 + number of distinct characters (1 to indicate other characters, 1 for backpointer, and 1 for word) (if ASCII maximum is 131) * total number of states (which is less than or equal to the total number of characters in the words to be searched + 1) [ (3 + D) * S = 150 ]