There are a tremendous number of instances at IMSSS where
unstructured files are searched for sets of various character strings.  
For some time Pentti Kanerva and I have been talking of a searching
program that would write a string-specific program to search a file.
The initial concept was that the generated program would be in pure
PDP-10 machine code, and that it should do a minimum of back-tracking
in its search.  Since then we have decided two things: A) no back- 
tracking is necessary, and B) what we really were writing was not so 
much a PDP-10 program as a finite state machine to do the searching for 
us.

	The algorithm finally produced has roots in the "trie" search
method (Knuth "The Art of Computer Programming" volume 3, section 6.3), 
and in fact for some words looks remarkably like a "trie" recognition 
structure (see below).  In fact, the major difference between the 
finite state machine and a "trie" is that the finite state machine 
knows what to do if it fails to match a string, proceeding to the state 
that most completely represents its current knowledge about the input 
stream.

	MACHINE TO RECOGNIZE THE WORDS "CAB" AND "CAD"

				      other   recognized  back 
state	 "A"	"B"	"C"	"D"  letters	word	 pointer
 . 	 .	 .	(1)	 .	 .	""	   .
(1)	(2)	 .	(1)	 .	 .	""	   .
(2)	 .	(3)	(1)	(4)	 .	""	   .
(3)	 .	 .	(1)	 .	 .	"CAB"	   .
(4)	 .	 .	(1)	 .	 .	"CAD" 	   .

		"TRIE" FOR THE WORDS "CAB" AND "CAD"

	state	"a"	"b"	"c"	"d"	"e"	...
	 .	---	---	(1)	---	---	...
	(1)	(2)	---	---	---	---	...
	(2)	---    "cab"	---    "cad"	---	...


	Imagine each state represents some amount of knowledge about
the input stream.  As long as we follow any one string, our knowledge
increases with each new character we see.  However, if we see a 
character that does not match the next character of the string we are 
following, we should proceed to a state that "knows" as much of the 
input string as possible; it is this transition to the state of greatest
knowledge that distinguishes the finite state machine of this algorithm
from a simple "trie".  In the previous example, the transition to 
maximal knowledge is demonstrated whenever any state (except ".", which
should go there anyway) encounters a "C".  Whenever this happens, 
rather than simply reporting a failure to match, as the "trie" would,
the machine goes into state "(1)", which means "I know the input stream
ends with the character "C".

	Since we are looking for all matches in the input stream, there
are several special cases of search targets that must be dealt with in
a non-trie fashion.  All of these cases deal with search targets with
common sub-strings.  

	First, we have the case of one string being the initial sub-
string of another word (as in the case of "father" and "fat").  The
solution to this problem is fairly simple, we simply indicate that we
have recognized the word "fat" when we receive the "t", and continue
from the state that 'knows' "fat" to the state that 'knows' "fath" and
so on.  

	The second case is the case of one word being an imbedded sub-
string of another word (for example,. "father" and "the").  In this 
case, we see that there must be more than one state that reports 
recognition of "the" (both the states typified by "the" and "fathe"). 
So the machine must be built in such a way as to 'know' that such things
can happen.

	The third case is even nastier; imagine one string being a tail
of another string ("father" again, and "her").  When we we reach the
final "r" of "father", we need to report a match on two distinct words.
In order to solve this problem, it would seem we need a linked list of
words to report as found.  Notice that the list can be a single chain, 
with several places looking as if they were the head as in the case of
"grandfather", "father", and "her".  Every time we see "father", we have
also seen "her", including the case of having seen "father" as a result 
of having seen "grandfather".  This particular piece of trickiness is
represented in the machine by the column labelled "back pointer".  What
the machine is supposed to do when it sees it has recognized a word, and
reported that fact, is to start following the back-pointers, reporting
any words associated with the states it passes.  

	The next example shows all of these conditions (except the 
triple match) in a machine.  To make it easier to read, I have included
a column on the left labelled "string known," which indicates how much 
of the information about the input stream is contained in each state.
There is in fact, nothing in the algorithm (or my implementation of it
in SAIL) that corresponds to that column.  Another possibly confusing
label in the example is "****", which simply is a column that represents
any input character that does not otherwise have a column associated 
with it.

MACHINE TO RECOGNIZE:	
	(1) "FAT", (2) "FATHER", (3) "HER", (4) "THE", and (5) "HERE".

	INITIAL CHARACTER-STATE TABLE (pseudo-"trie")
string						   back
known	state||****|  F |  A |  T |  H |  E |  R |pointer   WORD:
-------|-----||----|----|----|----|----|----|----|------|--------------
.none. | ..  || .. | (1)| .. | (3)| (2)| .. | .. | (-1) |
       |     ||	   |	|    |	  |    |    |	 |      |
f      | (1) || .. | .. | (4)| .. | .. | .. | .. | (-1) |
h      | (2) || .. | .. | .. | .. | .. | (5)| .. | (-1) |
t      | (3) || .. | .. | .. | .. | (6)| .. | .. | (-1) |
       |     ||	   |	|    |	  |    |    |	 |      |
fa     | (4) || .. | .. | .. | (7)| .. | .. | .. | (-1) |
he     | (5) || .. | .. | .. | .. | .. | .. | (8)| (-1) |
th     | (6) || .. | .. | .. | .. | .. | (9)| .. | (-1) |
       |     ||	   |	|    |	  |    |    |	 |      |
fat    | (7) || .. | .. | .. | .. |(10)| .. | .. | (-1) |W1 (fat)
her    | (8) || .. | .. | .. | .. | .. |(11)| .. | (-1) |W3 (her)
the    | (9) || .. | .. | .. | .. | .. | .. | .. | (-1) |W4 (the)
       |     ||	   |	|    |	  |    |    |	 |      |
fath   |(10) || .. | .. | .. | .. | .. |(12)| .. | (-1) |
here   |(11) || .. | .. | .. | .. | .. | .. | .. | (-1) |W5 (here)
       |     ||	   |	|    |	  |    |    |	 |      |
fathe  |(12) || .. | .. | .. | .. | .. | .. |(13)| (-1) |
       |     ||	   |	|    |	  |    |    |	 |      |
father |(13) || .. | .. | .. | .. | .. | .. | .. | (-1) |W2 (father)

		DISPATCH TABLE (processed "trie")
	(NOTE: non-trie pointers are encased in square brackets-"[]")
string						   back
known	state||****|  F |  A |  T |  H |  E |  R |pointer   WORD:
-------|-----||----|----|----|----|----|----|----|-----||--------------
.none. | ..  || .. | (1)| .. | (3)| (2)| .. | .. |  ..  |
       |     ||	   |	|    |	  |    |    |	 |      |
f      | (1) || .. | [1]| (4)| [3]| [2]| .. | .. |  ..  |
h      | (2) || .. | [1]| .. | [3]| [2]| (5)| .. |  ..  |
t      | (3) || .. | [1]| .. | [3]| (6)| .. | .. |  ..  |
       |     ||	   |	|    |	  |    |    |	 |      |
fa     | (4) || .. | [1]| .. | (7)| [2]| .. | .. |  ..  |
he     | (5) || .. | [1]| .. | [3]| [2]| .. | (8)|  ..  |
th     | (6) || .. | [1]| .. | [3]| [2]| (9)| .. |  ..  |
       |     ||	   |	|    |	  |    |    |	 |      |
fat    | (7) || .. | [1]| .. | [3]|(10)| .. | .. |  ..  |W1 (fat)
her    | (8) || .. | [1]| .. | [3]| [2]|(11)| .. |  ..  |W3 (her)
the    | (9) || .. | [1]| .. | [3]| [2]| .. | [8]|  ..  |W4 (the)
       |     ||	   |	|    |	  |    |    |	 |      |
fath   |(10) || .. | [1]| .. | [3]| [2]|(12)| .. |  ..  |
here   |(11) || .. | [1]| .. | [3]| [2]| .. | .. |  ..  |W5 (here)
       |     ||	   |	|    |	  |    |    |	 |      |
fathe  |(12) || .. | [1]| .. | [3]| [2]| .. |(13)|  ..  |W4 (the)
       |     ||	   |	|    |	  |    |    |	 |      |
father |(13) || .. | [1]| .. | [3]| [2]|[11]| .. |  (8) |W2 (father)

	There are essentially three phases in the searching algorithm,
building the pseudo-trie, transforming the "trie" into the proper finite
state machine ("cleaning"), and executing the finite state machine (the 
actual search).  Each phase is treated as a separate algorithm, although
they are almost useless without the other algorithms.

	The insertion phase assumes a list of non-empty words, and a 
function ("translate") that will translate a single character into a
"character number."  This function has two purposes, to allow the 
search to treat different characters as identical (for example, a common
use would have all upper case characters equivalent to all lower case
characters), and to reduce the amount of memory necessary to store the
resultant "pseudo-trie" (by creating sequential numbers for the 
characters being included, and using a single number to represent "any
other character").  It is similar to any trie insertion algorithm, but
has two important differences.  First, it uses one state for every
character in a word, plus a separate state that means "word completed".
The "word completed" state includes a table entry that indicates what
word has been recognized (and will include a pointer to the reporting 
chain), and it allows a simple connection to another state to be built 
by the next algorithm.  Second, the first character of each target word 
is inserted before the second character for each word which is inserted 
before the third....  This provision leaves the "trie" with an 
interesting property, which is used by the next algorithm.  If we 
examine any "trie" built with these two restrictions we will discover 
that the string that is "known" by any state is either longer than or 
as long as the strings "known" by all states that have smaller state 
numbers.  Another useful property of the "trie" comes from the fact
that states are added only when needed.  Every state is pointed to by
some state with a smaller number.

PSEUDO-TRIE building algorithm	

queue 
	holds ordered triples that look like: 
	(word-tail to insert, actual word name, and state for this word)

state.table[ s, c ] 
	will eventually hold the "trie" that is being constructed. 
	the first index is the state number (the row in all of the 
	figures), and the second index is the character (column in 
	the figures)

word[ s ]
	is the word recognized at state s (for this algorithm, this
	will have a value only if s is an "word recognized" state)

maxi.state
	is the number of the largest state that currently exists

name
	is the full name of the word currently being worked on

current.word
	is a string containing all of the characters in "name" that
	have yet to be inserted in "state.table"

state
	is the state for the word currently being worked on

T1: [Initialize]
	state.table[ *, * ] _ 0;
	word[ * ] _ 0;
	maxi.state _ 0;
	state _ 0;
	For all non-empty words "target"
		Do queue <= (target, target, 0)

T2: [Get the next character from the queue]
	If queue is empty then terminate
	(current.word, name, state) <= queue
	ch _ translate( first character of current.word )
	remove first character from current.word 

T3: [Insert it in the state table]
	if state.table[ state, ch ] = 0
		then state.table[ state, ch ] _ increment( maxi.state )
	state _ state.table[ state, ch ] 

T4: [If there is more work to do on this word, place it on the queue]
	If length( current.word ) > 0 
		Then queue <= (current.word, name, state) go to T2 

T5: [Otherwise, mark end of word]
	word[ state ] _ name
	go to T2

	The "cleaning" phase converts the pseudo-trie produced by the
insertion phase, and converts it into a finite state machine.  This 
phase should determine where each state should proceed whenever it 
receives an unexpected character, and should also construct the 
reporting chains and their connections to the state table.  The back
pointer section of the machine is created here; first it is filled with
values that are useful to the algorithm, and then those values are 
replaced with the values that should be in the finite state machine.
The method is basically to iterate through the states, finding the
"greatest common tail" of every state "next.state" pointed to by this 
state by finding out what the "greatest common tail" of this state
would become if you added the character that causes the machine to go 
from this state to "next.state".  

	Perhaps an example is in order, imagine that the state being 
scanned is associated with the string "grandf", its greatest common 
tail is a state associated with "f", and next.state is associated with 
"grandfa".  Further imagine that there is a state (pointed to by the 
"f" state) associated with "fa", and a state (pointed to by the "fa" 
state) associated with "fas", but no state associated with "grandfas". 
Since the greatest common tail of "grandf" is "f", and adding "a" to 
"grandf" gets us to "grandfa", we will see what state we arrive at by 
supplying "f" with an "a".  The result ("fa") is now declared to be 
"grandfa"'s back pointer.  Now we loop through all of "grandfa"'s 
pointers, pausing on those that are as yet unset (have a value of 0).  
When we arrive at the "s" column we see that it is unset, so we copy 
the "s" entry from "grandfa"'s back-pointer state (that is, since we 
have no better knowledge about what to do with an "s" in state 
"grandfa", we see what "fa" would do with an "s", and find out that we 
should go to state "fas").  

	In order to believe this, we have to believe that states "f" 
and "fa" have all their pointers correct, and that the back-pointer for 
"grandf" does in fact point to its greatest common tail.  Clearly "f" 
is in its final state, since it is a tail of "grandf", and has a state 
number less than the current state being scanned (because it has fewer 
characters).  Hence "f" has been completely processed.  The state "fa" 
is full of its proper entries because "f" has been scanned, hence at 
some previous time "fa" has had done to it exactly what is now happening
to "grandfa".  In a similar way, we can show that by making sure the 
word-recognized chain is connected properly to "grandfa" when we finish 
scanning its columns, everything beyond it will connect properly.

CLEANING the TRIE: 

backpointer[ s ] 
	is a vector of state numbers (or -1 to indicate unset) indexed 
	by state number that indicates either 
	(A) the state that 'knows' the "greatest common tail" for the 
	    string associated with "s", or 
	(B) the state number of the state with the next word in the 
	    report chain.

state
	is the position in the cleaning scan.  Note that for all states
	less than state, all entries are fixed, and their back pointers
	are only looked at in sense (B).  Hence, just before state gets 
	incremented, its backpointer is converted from type (A) to (B).

ch
	is the current character column being looked at in state

next.state
	is simply state.table[ state, ch ], i.e. the state that would
	be reached if the character "ch" were received in state "state"

g.c.tail
	is the largest state less than next.state whose associated 
	string is a tail of the string associated with next.state

CLEANING algorithm:

C1: [Initialize]
	backpointer[ * ] _ -1 (-1 is a ficticious state that has no 
				word, a backpointer of 0, and has 0 in 
				all it's entries)
	state _ -1

C2: [Scan all states sequentially]
	increment( state )
	if state > maxi.state then terminate

C3: [Scan all characters in state]
	For all character values "ch" DO C4 - C6 
	If none left then go to C7

C4: [If the state pointed to is an unprocessed forward state 
     then process it]
	next.state _ state.table[ state, ch ] 
	If next.state > state	AND	backpointer( next.state ) = -1
		Then do C5
		Else continue C3

C5: [Find the state that represents the "greatest common tail" for 
     next.state, and place it in next.state's backpointer]
	g.c.tail _ state.table[ backptr(state), ch ]
	backptr( next.state ) _ g.c.tail

C6: [Fill empty slots in next.state with information from its g.c.tail]
	If word( next.state ) = 0 
		Then word( next.state ) _ word( g.c.tail )
	For all characters "n.ch" DO
		If state.table[ next.state, n.ch ] = 0
		    Then state.table[ next.state, n.ch ] _
				state.table[ g.c.tail, n.ch ] 
	continue C3

c7: [Convert this state's backpointer from type (A) to type (B)]
    (Compress this state's backpointer to reveal only unique words)
	If word( backptr(state) ) = word( state ) OR
	   word( backptr(state) ) = 0 
	  Then backptr(state) _ backptr( backptr(state) )
	go to C2

	Finally we have the "demon" algorithm, which is basically an
interpreter for our finite state machine.  It is an extremely simple-
minded algorithm, but even if it is rather dumb, it is very fast.  As
is often the case with politicians as well as algorithms, a lot of work 
has been done by others in order to make this fairly dense one look 
as if he really knows what he's doing.

DEMON algorithm:

D1: [Initialize]
	state _ 0;

D2: [Get next character from file and classify it]
	character.class _ translate( INPUT.CHAR );
		If end of file then terminate.

D3: [Determine next machine state]
	report.head _ state _ state.table[ state, character.class ]

D4: [Report the front of the report chain]
	If word( report.head ) = 0 then go to D2
	   else report( current.file.position, word( report.head ) );

D5: [Get the next element on the report chain]
	report.head _ backpointer( report.head )
	go to D4. 

A CURSORY ANALYSIS OF RUNNING TIME:

example will use words: "FAT", "FATHER", "HER", "WATER".
total characters in words 	(C) = 3 + 6 + 3 + 5	= 17 
distinct characters in words 	(D) = 3 + 3 + 0 + 1	=  7 
total number of states 		(S) = 3 + 3 + 3 + 5 + 1 = 15 
( it is always true that S < C + 2 )

	TRIE-BUILDING: proportional to 
		total characters in words to be searched + 1
			[ C + 1 = 18 ]
	CLEANING: proportional to 
		1 + number of distinct characters 
			(if ASCII maximum is 129) *
		total number of states (which is less than or equal to 
					the total number of characters 
					in the words to be searched + 1)
			[ (D + 1) * S = 120 ]
	DEMON: proportional to 
		the total number of characters in the file
			to be searched  + 
		total number of 'recognitions' made

	MEMORY REQUIRED:
		3 + number of distinct characters 
			(1 to indicate other characters, 
			 1 for backpointer, and 1 for word)
			(if ASCII maximum is 131) *
		total number of states (which is less than or equal to 
					the total number of characters 
					in the words to be searched + 1)
			[ (3 + D) * S = 150 ]