XSEARCH a fast multiple substring search utility -Scott Daniels This program has grown out of an algorithm developed by Scott Daniels and Pentti Kanerva of the Institute for Mathematical Studies in the Social Sciences, Stanford University, California. The algorithm was previously developed (and subsequently published) by Alfred Aho and Margaret J. Corasick of Bell Laboratories. The two developments of the algorithm were independent, but they are substantially the same algorithm. Briefly, the program achieves its speed because it pre- processes the strings to be searched for, and builds a Finite State Machine which does the actual searching. This document is intended solely to explain the use of the "XSEARCH" program; for a detailed description of the algorithm employed, see either: LOTS:XSEARCH.ALGORITHM, SUMEX:XSEARCH.ALGORITHM, IMSSS:XSEARCH.ALG, (same text as above) or "Efficient String Matching: An Aid to Bibliographic Search" (the Aho-Corasick article) in the June 1975 issue (volume 18 number 6) of Communications of the ACM. -------------- -- --- --- In order for the program to perform its search, it needs answers to three general questions: (A) What files are being searched? (B) What substrings are being searched for? (C) How (and where) are the results of the search to be shown? All of the questions asked of the user pertain to one of these three categories, and there has been an attempt to supply both rational grouping of questions and reasonable default answers to as many questions as possible. The user may get help at various points by typing the character "?" in response to any question, to which the program responds with a short description of the possible inputs at that point, and then repeats the question. The majority of this document is a fairly precise catalog of program features, and it may be simpler in many cases to try using the progam, hitting a question mark if you are confused by the questions, and accepting the default if it still seems unclear what the program wants from you. The first few pages of this document will be devoted to sample runs of the program, and the last part will explain precisely what each of the program's questions is asking for, what is involved in that sequence, and what possible responses can be made. ******** sample #1 search for document and real in XSEARCH.SAI ******** @XSEARCH.SAV;6 SUBSTRING SEARCH ROUTINE, ? FOR HELP FILES TO SEARCH: XSEARCH.SAI;9, TARGET 1) DOCUMENT TARGET 2) REAL TARGET 3) UPPER = LOWER ? YES//$ OUTPUT GOES TO: * TTY: // TYPE ^Q TO ABORT ANY PARTICULAR FILE SEARCH. XSEARCH.SAI;9 SEARCH OF FILE XSEARCH.SAI;9 4.35 REAL, BKJFN( '101 ); ! GO CHECK IF REALLY LINE FEED; ^L COUNT OF MATCHES (RECOGNITION COUNT = 1) 1) "DOCUMENT" 0 2) "REAL" 1 ^L ...DONE... CONTINUE TO START OVER @ ******** sample #2 search for previous words in CS.SAI and XSEARCH.SAI;7 as well as xsearch.sai;9 ******** @CONTINUE SUBSTRING SEARCH ROUTINE, ? FOR HELP DEFAULT LIST: XSEARCH.SAI;0 FILES TO SEARCH: (CONTINUED) : CS.SAI;1, XSEARCH.SAI;7;, OLD WORD LIST: 1) DOCUMENT 2) REAL TARGET 3) $ OUTPUT GOES TO: * TTY: // TYPE ^Q TO ABORT ANY PARTICULAR FILE SEARCH. XSEARCH.SAI;9 SEARCH OF FILE XSEARCH.SAI;9 4.35 REAL, BKJFN( '101 ); ! GO CHECK IF REALLY LINE FEED; ^L CS.SAI;1 SEARCH OF FILE CS.SAI;1 4.35 REAL, BKJFN( '101 ); ! GO CHECK IF REALLY LINE FEED; ^L XSEARCH.SAI;7 SEARCH OF FILE XSEARCH.SAI;7 4.35 REAL, BKJFN( '101 ); ! GO CHECK IF REALLY LINE FEED; ^L COUNT OF MATCHES (RECOGNITION COUNT = 3) 1) "DOCUMENT" 0 2) "REAL" 3 ^L ...DONE... CONTINUE TO START OVER @ ******** sample #3 search for previous words and the word bkjfn, in the same files as before, this time pay attention to character case ******** @CONTINUE SUBSTRING SEARCH ROUTINE, ? FOR HELP DEFAULT LIST: XSEARCH.SAI;0 CS.SAI;0 XSEARCH.SAI;7 FILES TO SEARCH: (CONTINUED) : OLD WORD LIST: 1) DOCUMENT 2) REAL TARGET 3) BKJFN TARGET 4) UPPER = LOWER ? YES//N DEFAULT EXPRESSION: 1 V 2 V 3 WHICH MEANS: DOCUMENT OR REAL OR BKJFN EXPRESSION: CREATE .PL FILES? NO// OUTPUT GOES TO: * TTY: // TYPE ^Q TO ABORT ANY PARTICULAR FILE SEARCH. XSEARCH.SAI;9 SEARCH OF FILE XSEARCH.SAI;9 2.13 BKJFN, BKJFN( '101 ); 4.44 BKJFN, ELSE BEGIN BKJFN( '101 ); RETURN( 0 ); END; 21.12 BKJFN, BKJFN( '101 ); ^L CS.SAI;1 SEARCH OF FILE CS.SAI;1 2.15 BKJFN, ELSE BEGIN BKJFN( '101 ); RETURN( 0 ); END; 4.44 BKJFN, ELSE BEGIN BKJFN( '101 ); RETURN( 0 ); END; 23.13 BKJFN, BKJFN( '101 ); ^L XSEARCH.SAI;7 SEARCH OF FILE XSEARCH.SAI;7 2.15 BKJFN, ELSE BEGIN BKJFN( '101 ); RETURN( 0 ); END; 4.44 BKJFN, ELSE BEGIN BKJFN( '101 ); RETURN( 0 ); END; 20.13 BKJFN, BKJFN( '101 ); ^L COUNT OF MATCHES (RECOGNITION COUNT = 9) 1) "DOCUMENT" 0 2) "REAL" 0 3) "BKJFN" 9 ^L ...DONE... CONTINUE TO START OVER @ ******** sample #4 search for previous words (any case), but only find lines with both "real" and "bkjfn" on them (same files as before) ******** @CONTINUE SUBSTRING SEARCH ROUTINE, ? FOR HELP DEFAULT LIST: XSEARCH.SAI;0 CS.SAI;0 XSEARCH.SAI;7 FILES TO SEARCH: (CONTINUED) : OLD WORD LIST: 1) DOCUMENT 2) REAL 3) BKJFN TARGET 4) UPPER = LOWER ? YES// DEFAULT EXPRESSION: 1 V 2 V 3 WHICH MEANS: DOCUMENT OR REAL OR BKJFN EXPRESSION: 2 & 3 CREATE .PL FILES? NO// OUTPUT GOES TO: * TTY: // TYPE ^Q TO ABORT ANY PARTICULAR FILE SEARCH. XSEARCH.SAI;9 SEARCH OF FILE XSEARCH.SAI;9 4.35 BKJFN, REAL, BKJFN( '101 ); ! GO CHECK IF REALLY LINE FEED; ^L CS.SAI;1 SEARCH OF FILE CS.SAI;1 4.35 BKJFN, REAL, BKJFN( '101 ); ! GO CHECK IF REALLY LINE FEED; ^L XSEARCH.SAI;7 SEARCH OF FILE XSEARCH.SAI;7 4.35 BKJFN, REAL, BKJFN( '101 ); ! GO CHECK IF REALLY LINE FEED; ^L COUNT OF MATCHES (RECOGNITION COUNT = 3) 1) "DOCUMENT" 0 2) "REAL" 3 3) "BKJFN" 12 ^L ...DONE... CONTINUE TO START OVER @ ******** sample #5 demonstrate file search abort ******** @CONTINUE SUBSTRING SEARCH ROUTINE, ? FOR HELP DEFAULT LIST: XSEARCH.SAI;0 CS.SAI;0 XSEARCH.SAI;7 FILES TO SEARCH: (CONTINUED) : $ OUTPUT GOES TO: * TTY: // TYPE ^Q TO ABORT ANY PARTICULAR FILE SEARCH. XSEARCH.SAI;9 SEARCH OF FILE XSEARCH.SAI;9 4.35 BKJFN, REAL, BKJFN( '101 ); ! GO CHECK IF REALLY LINE FEED; ...ABORTED...AT 13.3 ^L CS.SAI;1 SEARCH OF FILE CS.SAI;1 4.35 BKJFN, REAL, BKJFN( '101 ); ! GO CHECK IF REALLY LINE FEED; ...ABORTED...AT 14.9 ^L XSEARCH.SAI;7 SEARCH OF FILE XSEARCH.SAI;7 4.35 BKJFN, REAL, BKJFN( '101 ); ! GO CHECK IF REALLY LINE FEED; ^L COUNT OF MATCHES (RECOGNITION COUNT = 3) 1) "DOCUMENT" 0 2) "REAL" 3 3) "BKJFN" 10 ^L ...DONE... CONTINUE TO START OVER @ ******** sample #6 demonstrate deferred file list request ******** @XSEARCH.SAV;6 SUBSTRING SEARCH ROUTINE, ? FOR HELP FILES TO SEARCH: TARGET 1) TWO TARGET 2) $ FILES TO BE SEARCHED: CS.SAI;1, $ OUTPUT GOES TO: * TTY: // TYPE ^Q TO ABORT ANY PARTICULAR FILE SEARCH. CS.SAI;1 SEARCH OF FILE CS.SAI;1 23.2 TWO, SIMPLE INTEGER PROCEDURE GETWORDS( STRING ARRAY WORDLIST ); 23.3 TWO, BEGIN "GETWORDS" STRING REPLY; 23.83 TWO, END "GETWORDS"; 24.21 TWO, IF ESCOFF THEN CHANGEDWORDS _ GETWORDS( WORDLIST ); 24.25 TWO, ELSE CHANGEDWORDS _ GETWORDS( WORDLIST ); ^L COUNT OF MATCHES (RECOGNITION COUNT = 5) 1) "TWO" 5 ^L ...DONE... CONTINUE TO START OVER @ OVERVIEW OF THE PROGRAM'S QUESTIONS GROUP A: What files are being searched? (1) PROMPT: Files to search: INPUT: a list of text files to be searched GROUP B: What substrings are being searched for? (2) PROMPT: Target n) INPUT: A string to be searched for. (3) PROMPT: Upper = lower ? INPUT: Either "Y" or "N" depending upon whether the case (upper or lower) of the target must match the case of any recognized substring. GROUP C: How (and where) are the results of the search shown? (4) PROMPT: Expression: INPUT: A pseudo-boolean expression that indicates whether a line should be regarded as being "recognized". (5) PROMPT: Create .PL files? INPUT: Either "Y" or "N" depending upon whether an editor interface file should be created for "recognized" lines. (6) PROMPT: Output goes to: * INPUT: a destination for copies of "recognized" lines as well as some extra information pertaining to the search and the lines "recognized". (1) Files to search: Performing a search without searching any files is meaningless, but some people find it more convenient to specify what it is they are searching for before defining where they are searching. Hence, an empty file list given to this question will cause the program to postpone this question until after question 4 ("Expression:"). Initially no default is possible, but once the user has placed any files in the list that list becomes the default for this question. If a default list does exist, the list will be shown as in the example, and any filenames typed by the user will added to the end of the list. The user is also provided with commands to eliminate files from the end of the list, see the list as it stands currently, and to erase the list entirely. Since a person may often want to search a particular set of files, (for example the source files for the SAIL language) a provision has been made to allow a part of the file list to be taken from a file. An "indirect file" must have full file names (no recognition allowed), and each file name must be separated from the others by at least one delimiter (space, tab, comma, carriage return, line feed, and page mark are all acceptable delimiters). Indirect files are read only when needed during the search, and they may reference other indirect files. There is no check to see that references to indirect files do not create a loop, and if such a condition occurs, the program will merrily continue following the loop until the user stops it by typing control-C. Possible responses to "Files to search:" a file name to add a file to the list (can have *'s, TENEX recognition available), @file name to add an indirect file to the list (can have *'s, TENEX recognition available), "^" to recover the default list, ^X to erase and re-input the entire list, ^W to remove last file from existing list, ^S to show existing list, LF to continue file list input on next line, CR to accept list as is, ESC to accept list as is, and default further options, "?" to get help message (2) Target n) Just as there is no point in searching an empty list of files, it is pointless to search a set of files for nothing. The response to this prompt is the strings you wish to search for. The actual prompt has a word number instead of the "n", and you will find this question repeated until there is a least one string to search for, and you provide no more targets for the search. Initially there is no default for this section, but after the first response to this question, the default list of strings becomes whatever list was last active. The string you type in here is taken literally, which means if you type a TAB, the character "Q", and then a CR, the search string you have specified is a two-character string which consists of a TAB and a "Q". The standard TENEX line editing works here, and several characters can end an input (ESC, CR, and most control characters). In order to allow searching for strings that include characters that TENEX treats in some bizarre fashion, another mode of input is allowed (see below). Because of the internal representation of a string, there is a maximum of 100 target strings. Possible responses to "Target n)" a string to add the string to the target list (*) @file name to add a list of specially formatted words from a file (**) "^" to go back to the first question. ^X to erase the existing list, ^W to erase the last word of the existing list, ^S to show the existing list, CR to keep the existing list, ESC to keep the existing list and accept subsequent defaults, "?" to get the help message. * In order to input a string that begins with one of the special characters (@, ^, ?, etc.) the following trick can be used: type the character "X", the DEL key, and then begin typing the string that you want. ** Strings input from a file should be in a format reminiscent of SAIL constant string expressions. Each string is followed by a semicolon, and the end of the string list is indicated by an extra semicolon. The following format is used to specify the strings: (1) A double-quote (") then the string itself, using two consecutive double-quotes to indicate a double-quote that is actually a part of the string, then a double- quote to indicate the end of the string. (2) An apostrophe (') immediately followed by the octal number which corresponds to the desired ASCII code. (3) A decimal number which corresponds to the desired ASCII character. (4) An up-arrow (^) followed by a character to indicate a control character. (5) Any of these representations, an ampersand (&) to indicate concatenation, and finally any of these representations. SAMPLE set of strings as specified on a file: "string 1" ; "strin" & "g 2"; '163 & '164 & '162 & "ing" & 32 & "3"; """a quoted ("") string" & '42; "bells: " & ^G & '7; ; represents the following strings: 1) string 1 2) string 2 3) string 3 4) "a quoted (") string" 5) bells:  Note: When you respond @TTY: (a trick, primarily to be used by the fearless few who will try anything), it may be necessary to type an extra (third) semicolon at the end in order to complete the last input statement. (3) Upper = lower ? This question provides a method for treating upper and lower case letters as either distinct characters, or as if they were equivalent (eventually the user will be allowed to specify other sets of equivalent characters). The initial default is to have them equivalent, but once this question has been answered, the default is the last answer to the question. Possible responses to "Upper = lower ?" "Y" to treat upper and lower case characters as equivalent (i.e. "a" is the same as "A") "N" to treat upper and lower case characters as distinct (i.e. "a" is not matched by "A") CR or LF to accept the default ESC to accept the default and subsequent defaults "^" to go back to the first question "?" to get the help message (4) Expression: This input provides a method for "recognizing" only those lines with a particular combination of strings on them. The pseudo-boolean expression is evaluated for each line which contains at least one of the target strings, and if the result is "TRUE", the line is considered "recognized". The notation used involves the numbers of the strings in the target list as boolean values, the value being "TRUE" if the string associated with that number at the time of the search occurs at least once within the line, and "FALSE" otherwise. Enough open parentheses to match any unmatched close parentheses are assumed to exist just in front of the typed expression, and similarly all unmatched open parentheses will be presumed matched by imaginary close parentheses just after the end of the expression. The expression evaluator will do unpredictable things, however, if you fail to alternate operators and expressions.(For example, "1 & (3) & 2" is fine, but "1 (& 3) & 2" will not necessarily produce reasonable results.) The "&" operator is viewed as boolean multiplication, while "V" is boolean addition, with normal operator precedence holding (any AND'ed group is treated as if it were surrounded by parentheses). The default provided is a pseudo- boolean that will "recognize" a line if at least one of the words is on it, but the last expression that you typed will remain the default if the number of target strings does not change. The expression is called a "pseudo-boolean" expression since it has only "AND" and "INCLUSIVE-OR" operators, with no provision for a "NOT" or "exclusive-or", and no way to create an expression that would allow those functions. Note: if two or more of the words in the search list are the same, only the one with the largest associated number will ever be found during the search. Remember this fact when you write the expression. Here is a strict definition of what expressions are allowed: PBE := n | PBE & PBE | PBE V PBE | ( PBE ) where n means any number between 1 and the number of targets being searched for. Spaces, tabs, and line feeds may be freely dropped into the expression with no particular problem (exceptions: line feed has a special meaning if it is the first character of an expression, and nothing is allowed between digits of a number). Possible responses to "Expression:" a pseudo-boolean expression as defined above CR or LF accept current expression and go on to next question ESC accept current expression and default further options "^" go back to the first question ^R (CTRL-R) retype current expression (input compatible form) ^S (CTRL-S) retype current expression in an expanded form "?" (5) Create .PL files? These are editor interface files, which indicate page and line positions in a file to the editor. Basically they allow the editor to position itself at specified lines, and (after the command $+G) proceed to the next "recognized" line. For further information about these files, see documentation about program TV.SAV . The initial default for this question is "NO", and further defaults are the same as the previous response to the question. Possible responses to "Create .PL files?" "Y" to create a ".PL" file for any file with at least one line "recognized" "N" to prevent creation of ".PL" files. CR or LF to accept the default ESC to accept this and all subsequent defaults, "^" to go back to the first question. "?" to get a help message (6) Output goes to: * Every search causes an output that includes such things as the list of strings being searched for, the number of times each string was found, the number of lines that were "recognized", and a page for each file searched which contains the name of the file searched and every line that was "recognized" along with a list of the strings that were found on that line (in the order they were found). If the output is not going to the terminal (TTY:), the various counts will also be displayed on the terminal. If the output file specified already exists the output will be added to the end of the file, but the recognition is set to create a new version if the version number is defaulted. This question initially defaults to the terminal (TTY:), and will default to whatever the last output file was on subsequent searches. Note: This question is always asked just before the search. Possible responses to "Output goes to: *" CR or LF to accept this default, ESC to accept this and all subsequent defaults, "^" to go back to the first question. a TENEX file name to choose another output file (the results of the search will be appended to the file if it already exists)