'\" 	@(#)Document: V2-00-10
'\" 	@(#)Subject: V3.0 SW Dev - Lex
'\" 	@(#)Writer: Stewart Konzen
'\" 	@(#)Work: 5 weeks
'\" 	@(#)Target: 1/1
'\" 	@(#)Notes: 
'\" 	@(#)Mods: 
.ds :? LEX: A LEXICAL ANALYZER
.nr H1 9
.PH "'Lex''Lex'"
.H 2 "Introduction"
.Hc %
.I Lex
is a program generator designed for
lexical processing of character input streams.
It accepts a high-level, problem oriented specification
for character string matching, and
produces a C program that recognizes regular expressions.
The regular expressions are specified by the user in the
source specifications given to 
.I lex .
The 
.I lex
code recognizes these expressions
in an input stream and partitions the input stream into
strings matching the expressions.  
At the boundaries between strings, program sections
provided by the user are executed.
The 
.I lex
source file associates the regular expressions and the
program fragments.
As each expression appears in the input to the program written by 
.I lex ,
the corresponding fragment is executed.
.P
The user supplies the additional code
beyond expression matching
needed to complete the task, possibly
including code written by other generators.
The program that recognizes the expressions is generated in C,
the same language as the user's program fragments.
.I Lex
is not a complete language, 
but rather a generator representing
a new language feature added on top of the C
programming language.
.P
.I Lex
turns the user's expressions and actions
(called
.I source
in this section) into a C program named
.I yylex .
The
.I yylex
program recognizes expressions in a stream (called
.I input
here)
and performs the specified actions 
for each expression as it is detected.
.P
For a trivial example, consider a program to delete from the input
all blanks or tabs at the ends of lines.
.DS I
%%
[ \et]+$	;
.DE
is all that is required.
The program
contains a %% delimiter to mark the beginning of the rules, and
one rule.
This rule contains a regular expression
that matches one or more
instances of the characters blank or tab
(written \et for visibility, 
in accordance with the C language convention)
just prior to the end of a line.
The brackets indicate the character
class made of blank and tab; the + indicates 
.Q "one or more ..." ;
and the dollar sign ($) indicates 
.Q "end of line" . 
No action is specified,
so the program generated by 
.I lex
(yylex) will ignore these characters.
Everything else will be copied.
To change any remaining
string of blanks or tabs to a single blank,
add another rule:
.DS I
%%
[ \et]+$	;
[ \et]+	printf(" ");
.DE
The finite automaton generated for this
source scans for both rules at once,
observes at the termination of the string of blanks or tabs
whether or not there is a newline character, and 
then executes the desired rule's action.
The first rule matches all strings of blanks or tabs
at the end of lines, and the second
rule matches all remaining strings of blanks or tabs.
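.P
The effect of these two rules on a single line can also be sketched by hand in C.
This is only an illustration under stated assumptions, not the code that
.I lex
generates: the function name
.I squeeze_line
is hypothetical, the line is assumed to arrive without its newline,
and the tab is written as its ASCII code, 9.
.DS I
```c
/*
 * Hypothetical helper: copy one line (without its newline) from in to out,
 * dropping blanks and tabs at the end of the line and replacing every
 * other run of blanks or tabs with a single blank.
 * out must have room for at least as many characters as in, plus one.
 */
void squeeze_line(const char *in, char *out)
{
    while (*in) {
        if (*in == ' ' || *in == 9) {          /* 9 is the ASCII tab */
            while (*in == ' ' || *in == 9)     /* consume the whole run */
                in++;
            if (*in)                           /* run was not at end of line */
                *out++ = ' ';
        } else {
            *out++ = *in++;                    /* everything else is copied */
        }
    }
    *out = 0;                                  /* terminate the result */
}
```
.DE
As in the rules, the decision between the two cases is made only
when the end of the run of blanks or tabs has been seen.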
.P
.I Lex
can be used alone for simple transformations, or
for analysis and statistics gathering on a lexical level.
.I Lex
can also be used with a parser generator
to perform the lexical analysis phase; it is especially
easy to interface 
.I lex
and YACC.
.I Lex
programs recognize only regular expressions;
YACC writes parsers that accept a large class 
of context free grammars,
but that require a lower level analyzer 
to recognize input tokens.
Thus, a combination of 
.I lex
and YACC is often appropriate.
When used as a preprocessor for a later parser generator,
.I lex
is used to partition the input stream,
and the parser generator assigns structure to
the resulting pieces.
Additional programs,
written by other generators
or by hand, can
be added easily to programs written by 
.I lex .
YACC users
will realize that the name
.I yylex
is what YACC expects its lexical analyzer to be named,
so that the use of this name by 
.I lex
simplifies interfacing.
.P
.I Lex
generates a deterministic finite automaton 
from the regular expressions in the source.
The automaton is interpreted, rather than compiled, in order
to save space.
The result is still a fast analyzer.
In particular, the time taken by a 
.I lex
program
to recognize and partition an input stream is
proportional to the length of the input.
The number of 
.I lex
rules or
the complexity of the rules is
not important in determining speed,
unless rules which include
forward context require a significant amount of rescanning.
What does increase with the number and complexity of rules
is the size of the finite
automaton, and therefore the size of the program
generated by 
.I lex .
.P
In the program written by 
.I lex ,
the user's fragments
(representing the
.I actions
to be performed as each regular expression
is found)
are gathered
as cases of a switch.
The automaton interpreter directs the control flow.
Opportunity is provided for the user to insert either
declarations or additional statements in the routine containing
the actions, or to
add subroutines outside this action routine.
.P
.I Lex
is not limited to source that can
be interpreted on the basis of one character lookahead.
For example,
if there are two rules, one looking for
.Q "ab"
and another for 
.Q "abcdefg" ,
and the input stream is 
.Q "abcdefh" ,
.I lex
will recognize
.Q "ab"
and leave the input pointer just before
.Q "cd" .
Such backup is more costly
than the processing of simpler languages.
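.P
The idea of remembering the last complete match and then backing up
can be sketched in C for the special case of literal strings.
The sketch is an assumption-laden illustration:
.I lex
itself interprets a finite automaton rather than comparing strings,
and the name
.I match_with_backup
and its arguments are hypothetical.
.DS I
```c
#include <string.h>

/*
 * Hypothetical sketch of "remember the last match, then back up".
 * pats holds n literal patterns.  Returns the length of the longest
 * pattern that is a prefix of input, or 0 if none matches.
 * *scanned receives the number of input characters accepted before
 * the scan had to stop.
 */
int match_with_backup(const char *input, const char *const pats[], int n,
                      int *scanned)
{
    int pos = 0;        /* characters accepted so far */
    int last = 0;       /* length of the longest complete match so far */
    int alive = 1;      /* can some pattern still be extended? */

    while (alive) {
        int i;

        alive = 0;
        for (i = 0; i < n; i++) {
            size_t len = strlen(pats[i]);

            if (len < (size_t)pos || strncmp(pats[i], input, pos) != 0)
                continue;                 /* pattern already ruled out */
            if (len == (size_t)pos)
                last = pos;               /* a complete match ends here */
            else if (input[pos] != 0 && pats[i][pos] == input[pos])
                alive = 1;                /* one more character fits */
        }
        if (alive)
            pos++;
    }
    *scanned = pos;     /* pos - last characters were accepted in vain */
    return last;
}
```
.DE
With the patterns
.Q "ab"
and
.Q "abcdefg"
and the input
.Q "abcdefh" ,
the sketch accepts six characters before giving up and reports a match of length two,
so four characters have to be returned to the input.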
.H 2 "Command Usage"
There are two steps in
compiling a 
.I lex
source program.
First, the 
.I lex
source must be turned into a generated program
in the host general purpose language.
Then this program must be compiled and loaded, usually with
a library of 
.I lex
subroutines.
The generated program
is in a file named
.I lex.yy.c .
The I/O library is defined in terms of the C standard library.
.P
The library is accessed by the loader flag
.I \-lln .
So an appropriate
set of commands is
.DS I
lex source
cc lex.yy.c \-lln
.DE
The resulting program is placed on the usual file
.I a.out
for later execution.
To use 
.I lex
with YACC see 
the section 
.Q "Lex and YACC"
and Chapter 10,
.I "YACC: A Compiler-Compiler" .
Although the default 
.I lex
I/O routines use the C standard library,
the 
.I lex
automata themselves do not do so.
If private versions of
.I input ,
.I output
and
.I unput
are given, the library can be avoided.
.H 2 "Lex Source Format"
The general format of 
.I lex
source is:
.DS I
{definitions}
%%
{rules}
%%
{user subroutines}
.DE
where the definitions and the user subroutines
are often omitted.
The second %% is optional, but the first is required
to mark the beginning of the rules.
The absolute minimum 
.I lex
program is thus
.DS I
%%
.DE
(no definitions, no rules) which translates into a program
that copies the input to the output unchanged.
.P
In the
.I lex
program format shown above, the
.I rules
represent the user's control
decisions. 
They make up a table in which the left column
contains
.I "regular expressions"
and the right column contains
.I actions ,
program fragments to be executed when the expressions
are recognized.
Thus the following individual rule might appear:
.DS I
integer	printf("found keyword INT");
.DE
This looks for the string
.I integer
in the input stream and
prints the message
.DS I
found keyword INT
.DE
whenever it appears in the input text.
In this example the C library function
.IR printf ()
is used to print the string.
The end of the lex regular expression 
is indicated by the first blank or tab character.
If the action is merely a single C expression,
it can be given on the right side of the line; if it is
compound, or takes more than a line, it should be enclosed in
braces.
As a slightly more useful example, suppose it is desired to
change a number of words from British to American spelling.
.I Lex
rules such as
.DS I
colour	printf("color");
mechanise	printf("mechanize");
petrol	printf("gas");
.DE
would be a start.  
These rules are not quite enough, since the word
.Q "petroleum"
would become
.Q "gaseum" ;
a way of dealing with such problems
is described in a later section.
.H 2 "Lex Regular Expressions"
A regular expression specifies a set of strings to be matched.
It contains text characters (that match the corresponding
characters in the strings being compared)
and operator characters (these specify
repetitions, choices, and other features).
The letters of the alphabet and the digits are
always text characters. 
Thus, the regular expression
.DS I
integer
.DE
matches the string
.Q "integer"
wherever it appears
and the expression
.DS I
a57D
.DE
looks for the string
.Q "a57D" .
.P
The operator characters are
.DS I
" \e [ ] ^ - ? . * + | ( ) $ / { } % < >
.DE
If any of these characters are to be used literally,
they need to be quoted individually with a backslash
(\|\e\|) or as a group within quotation marks
(\|"\|).
The quotation mark operator (")
indicates that whatever is contained between a pair of 
quotation marks
is to be taken as text characters.
Thus
.DS I
xyz"++"
.DE
matches the string
.I xyz++
when it appears.  
Note that a part of a string may be quoted.
It is harmless but unnecessary to quote an ordinary
text character; the expression
.DS I
"xyz++"
.DE
is the same as the one above.
Thus by quoting every nonalphanumeric character
being used as a text character, you need not memorize
the above list of current operator characters. 
.P
An operator character may also be turned into a text character
by preceding it with a backslash (\e) as in
.DS I
xyz\e+\e+
.DE
which
is another, less readable, equivalent of the above expressions.
The quoting mechanism can also be used to get a blank into
an expression; normally, as explained above, blanks or tabs end
a rule.
Any blank character not contained within brackets
must be quoted.
Several normal C escapes with the backslash (\|\e\|)
are recognized: 
.DS I
\en newline
\et tab
\eb backspace
\e\e backslash
.DE
.P
Since newline is illegal in an expression, a 
.Q "\en" 
must be used;
it is not required to escape tab and backspace.
Every character but blank, tab, newline and the list above is always
a text character.
.H 3 "Character Classes"
Classes of characters can be specified using brackets: [ and ].
The construction
.DS I
[abc]
.DE
matches a single character, which may be
.Q "a" ,
.Q "b" ,
or
.Q "c" .
Within square brackets,
most operator meanings are ignored.
Only three characters are special:
these are the backslash (\e), the dash (-), and the caret (\|^\|).
The dash character indicates ranges.  
For example
.DS I
[a-z0-9<>_]
.DE
indicates the character class containing all the lowercase letters,
the digits,
the angle brackets, and underline.
Ranges may be given in either order.
Using the dash between any pair of characters that are
not both uppercase letters, both lowercase letters, or both digits
is implementation dependent and causes a warning message.
If it is desired to include the
dash in a character class, it should be first or last; 
thus
.DS I
[-+0-9]
.DE
matches all the digits and the plus and minus signs.
.P
In character classes,
the caret (^) operator must appear as the first character
after the left bracket; it indicates that the resulting string
is to be complemented with respect to the computer character set.
Thus
.DS I
[^abc]
.DE
matches all characters except 
.Q "a" , 
.Q "b" ,
or 
.Q "c" ,
including all special or control characters; or
.DS I
[^a-zA-Z]
.DE
is any character which is not a letter.
The backslash (\|\e\|) provides an escape mechanism
within character class brackets, so that characters
can be entered literally by preceding them with this
character.
.H 3 "Arbitrary Character"
To match almost any character, the period (\|.\|)
designates the class of all characters except a newline.
Escaping into octal is possible although nonportable.
For example
.DS I
[\e40-\e176]
.DE
matches all printable characters in the ASCII character set, from octal
40 (blank) to octal 176 (tilde).
.H 3 "Optional Expressions"
The question mark (?) operator indicates
an optional element of an expression.
Thus
.DS I
ab?c
.DE
matches either 
.Q "ac"
or 
.Q "abc" .
Note that the meaning of the question mark here
differs from its meaning in the shell.
.H 3 "Repeated Expressions"
Repetitions of classes are indicated by the asterisk
(*) and plus (+) operators.
For example
.DS I
a*
.DE
matches any number of consecutive
.Q "a"
characters, including zero; while
.Q "a+"
matches one or more instances of 
.Q "a" .
For example,
.DS I
[a-z]+
.DE
matches all strings of lowercase letters, and
.DS I
[A-Za-z][A-Za-z0-9]*
.DE
matches all alphanumeric strings with a leading alphabetic character;
this is a typical expression for recognizing identifiers in
computer languages.
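.P
For readers more familiar with C than with regular expressions,
the identifier expression above corresponds roughly to the following hand-written test.
This is a sketch, not generated code; the name
.I identifier_length
is hypothetical, and the default C locale is assumed, so that
.I isalpha
and
.I isalnum
accept only the letters and digits named in the expression.
.DS I
```c
#include <ctype.h>

/*
 * Hypothetical helper: return the number of leading characters of s
 * that match [A-Za-z][A-Za-z0-9]*, or 0 if s does not start with a letter.
 */
int identifier_length(const char *s)
{
    int n = 0;

    if (!isalpha((unsigned char)s[0]))
        return 0;                          /* first character must be a letter */
    while (isalnum((unsigned char)s[n]))
        n++;                               /* letters and digits may follow */
    return n;
}
```
.DE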
.H 3 "Alternation and Grouping"
The vertical bar (\||\|) operator
indicates alternation.
For example
.DS I
(ab|cd)
.DE
matches either 
.Q "ab"
or 
.Q "cd" .
Note that parentheses are used for grouping, although
they are not necessary at the outside level.
For example
.DS I
ab|cd
.DE
would have sufficed in the preceding example.
Parentheses should be used for more complex expressions,
such as
.DS I
(ab|cd+)?(ef)*
.DE
which matches such strings as
.Q "abefef" , 
.Q "efefef" , 
.Q "cdef" ,
and 
.Q "cddd" ,
but not 
.Q "abc" , 
.Q "abcd" ,
or 
.Q "abcdef" .
.H 3 "Context Sensitivity"
.I Lex
recognizes a small amount of surrounding context.  
The two simplest operators for this are
the caret (\|^\|) and the dollar sign ($).
If the first character of an expression is a caret, then
the expression is only matched at the beginning
of a line (after a newline character, or at the beginning of
the input stream).
This can never conflict with the other meaning of
the caret, complementation of character classes, 
since complementation only applies within brackets.
If the very last character is a dollar sign,
the expression is only matched at the end of a line (when
immediately followed by newline).
The latter operator is a special case of the slash (/)
operator, which indicates trailing context.
The expression
.DS I
ab/cd
.DE
matches the string
.Q "ab" ,
but only if followed by
.Q "cd" .
Thus
.DS I
ab$
.DE
is the same as
.DS I
ab/\en
.DE
Left context is handled in 
.I lex
by specifying
.I "start conditions"
as explained in the section on
.Q "Left Context Sensitivity" .
If a rule is only to be executed
when the 
.I lex
automaton interpreter is in start condition
.Q "x" ,
the rule should be prefixed with the condition name enclosed in angle brackets:
.DS I
<x>
.DE
If we considered 
.Q "being at the beginning of a line"
to be
start condition ONE, then the caret (\|^\|) operator
would be equivalent to
.DS I
<ONE>
.DE
Start conditions are explained more fully later.
.H 3 "Repetitions and Definitions"
The curly braces ({ and }) specify
either repetitions (if they enclose numbers)
or definition expansion (if they enclose a name).  
For example
.DS I
{digit}
.DE
looks for a predefined string named
.Q "digit"
and inserts it at that point in the expression.
The definitions are given in the first part of the 
.I lex
input, before the rules.
In contrast,
.DS I
a{1,5}
.DE
looks for 1 to 5 occurrences of
the character 
.Q "a" .
.P
Finally, an initial percent sign 
.RI (\| % \|)
is special, since it is the
separator for 
.I lex
source segments.
.H 2 "Lex Actions"
When an expression matches a string of text
in the input, 
.I lex
executes the corresponding action.  
This section describes
some features of 
.I lex
which aid in writing actions.  
Note that there is a default action, which
consists of copying the input to the output.  
This is performed on all strings not otherwise matched.  
Thus the 
.I lex
user who wishes to absorb the entire input, without
producing any output, must provide rules to match everything.
When 
.I lex
is being used with YACC, this is the normal situation.
You may consider that actions are what is done instead of
copying the input to the output; thus, in general,
a rule which merely copies can be omitted.
.P
One of the simplest things that can be done is to ignore the input.   
Specifying a C null statement 
.Q ";"
as an action causes this result.  
A frequent rule is
.DS I
[ \et\en]	;
.DE
which causes the three spacing characters (blank, tab, and newline)
to be ignored.
.P
Another easy way to avoid writing actions 
is to use the repeat action character, 
.Q "\||\|" , 
which indicates that the action for this rule is the action
for the next rule.
The previous example could also have been written
.DS I
" "		|
"\et"		|
"\en"		;
.DE
with the same result, although in a different style.
The quotes around 
.Q "\en"
and 
.Q "\et"
are not required.
.P
In more complex actions, you
often want to know the actual text 
that matched some expression like:
.DS I
[a\-z]+
.DE
.I Lex
leaves this text in an external character array named
.Q "yytext" .
.R
Thus, to print the name found,
a rule like
.DS I
[a-z]+	printf("%s", yytext);
.DE
prints the string in 
.Q "yytext" .
The C function
.I printf
accepts a format argument and data to be printed;
in this case, the format is 
.Q "print string"
where the percent sign (%) indicates
data conversion, and
the 
.Q s
indicates string type,
and the data are the characters in 
.Q "yytext" .
So this just places the matched string on the output.
This action is so common that it may be written as ECHO.
For example
.DS I
[a-z]+	ECHO;
.DE
is the same as the preceding example.
Since the default action is just to
print the characters found, one might ask why
give a rule, like this one, which merely specifies
the default action?
Such rules are often required
to avoid matching some other rule
that is not desired.  
For example, if there is a rule that matches
.I read
it will normally match the instances of
.I read
contained in
.I bread
or
.I readjust ;
to avoid this,
a rule of the form
.DS I
[a\-z]+
.DE
is needed.
This is explained further below.
.P
Sometimes it is more convenient to know the end of what
has been found; hence 
.I lex
also provides a count
of the number of characters matched
in the variable, 
.Q "yyleng" .
To count both the number of words and 
the number of characters in words in the input, 
you might write
.DS I
[a\-zA\-Z]+	{words++; chars += yyleng;}
.DE
which accumulates in the variable
.Q "words"
the number of words recognized, and in
.Q "chars"
the number of characters in those words.
The last character in the string matched can
be accessed with:
.DS I
yytext[yyleng\-1]
.DE
.P
Occasionally, a 
.I lex
action may decide that a rule has not recognized the correct
span of characters.
Two routines are provided to aid with this situation.
First,
.I yymore () 
can be called to indicate 
that the next input expression recognized is to be
tacked on to the end of this input.  
Normally, the next input string will overwrite the current
entry in
.Q "yytext" .
Second,
.IR yyless (n)
may be called to indicate that not all the characters matched
by the currently successful expression are wanted right now.
The argument 
.Q "n"
indicates the number of characters in
.Q "yytext"
to be retained.
Further characters previously matched are
returned to the input.  
This provides the same sort of
lookahead offered by the slash (/) operator,
but in a different form.
.P
.I "Example:"
Consider a language that defines
a string as a set of characters 
between quotation marks ("), and provides that
to include a quotation mark in a string,
it must be preceded by a backslash (\e).  
The regular expression that matches this is somewhat confusing,
so that it might be preferable to write
.DS I
\e"[^"]*	{
	if (yytext[yyleng-1] == '\e\e')
	     yymore();
	else
	     ... normal user processing
	}
.DE
which, when faced with a string such as
.DS I
"abc\e"def"
.DE
will first match
the five characters
.DS I
"abc\e
.DE
and then the call to
.IR yymore ()
will cause the next part of the string,
.DS I
"def
.DE
to be tacked on the end.
Note that the final quotation mark terminating the
string should be picked
up in the code labeled 
.Q "normal user processing" .
.P
The function
.I
yyless()
.R
might be used to reprocess
text in various circumstances.  
Consider the problem in the older C syntax 
of resolving the ambiguity of 
.Q "=\-a" .
Suppose it is desired to treat this as 
.Q "=\-\0a"
and to print a message.  
A rule might be
.DS I
=-[a-zA-Z]	{
	printf("Operator (=-) ambiguous\en");
	yyless(yyleng-1);
	... action for =- ...
	}
.DE
which prints a message, returns the letter after the
operator to the input stream, and treats the operator as 
.Q "=\(mi" .
.P
Alternatively it might be desired to treat this as 
.Q "=  \(mia" .
To do this, just return the minus
sign as well as the letter to the input.
The following performs the interpretation:
.DS I
=-[a-zA-Z]	{
	printf("Operator (=-) ambiguous\en");
	yyless(yyleng-2);
	... action for = ...
	}
.DE
Note that the expressions for the two cases might more easily
be written
.DS I
=-/[A-Za-z]
.DE
in the first case and
.DS I
=/-[A-Za-z]
.DE
in the second:
no backup would be required in the rule action.
It is not necessary to recognize the whole identifier
to observe the ambiguity.
The
possibility of 
.Q "=\(mi3" ,
however, makes
.DS I
=-/[^ \et\en]
.DE
a still better rule.
.P
In addition to these routines, 
.I lex
also permits
access to the I/O routines
it uses.
They include:
.AL 1
.LI 
.I input()
which returns the next input character;
.LI 
.I output(c)
which writes the character
.I c
on the output; and
.LI 
.I unput(c)
which pushes the character
.I c
back onto the input stream to be read later by
.I input().
.LE
.P
By default these routines are provided as macro definitions,
but the user can override them and supply private versions.
These routines
define the relationship between external files and
internal characters, and must all be retained
or modified consistently.
They may be redefined, to
cause input or output to be transmitted to or from strange
places, including other programs or internal memory;
but the character set used must be consistent in all routines;
a value of zero returned by
.I
input
.R
must mean end-of-file; and
the relationship between
.I
unput
.R
and
.I
input
.R
must be retained
or the 
lookahead will not work.
.I Lex
does not look ahead at all if it does not have to,
but every rule containing a slash (\|/\|) or
ending in one of the following characters
implies lookahead:
.DS I
+  \(**  ?  $
.DE
Lookahead is also necessary to match 
an expression that is a prefix
of another expression.
See below for a discussion of the character set used by 
.I lex .
The standard 
.I lex
library imposes a 100 character limit on backup.
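.P
As an illustration of the consistency requirements above,
the following C sketch reads characters from a string in memory.
All of the names
.RI ( mem_source ,
.I mem_input
and
.I mem_unput )
are hypothetical, and the sketch does not show how such routines
would be substituted for the macros supplied by
.I lex ;
it only shows a matching pair in which zero means end-of-file
and pushed-back characters are read again in the right order.
.DS I
```c
#define PUSHBACK_MAX 100            /* mirrors the backup limit quoted above */

static const char *src = "";        /* text still to be read */
static char pushed[PUSHBACK_MAX];   /* characters returned by mem_unput */
static int npushed = 0;

/* Hypothetical: select the string that mem_input will read from. */
void mem_source(const char *s)
{
    src = s;
    npushed = 0;
}

/* Return the next character, or 0 at the end of the string (end-of-file). */
int mem_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushed[--npushed];
    if (*src == 0)
        return 0;
    return (unsigned char)*src++;
}

/* Push c back so that the next mem_input returns it; -1 if there is no room. */
int mem_unput(int c)
{
    if (npushed >= PUSHBACK_MAX)
        return -1;
    pushed[npushed++] = (char)c;
    return 0;
}
```
.DE
Characters pushed back are returned most recent first,
which is the relationship on which lookahead depends.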
.P
Another 
.I lex
library routine that you sometimes want to redefine is
.IR yywrap()
which is called whenever 
.I lex
reaches an end-of-file.
If
.I yywrap
returns a 1, 
.I lex
continues with the normal wrapup on end of input.
Sometimes, however, it is convenient to arrange for more
input to arrive from a new source.
In this case, the user should provide a
.I yywrap
that arranges for new input and returns 0.
This instructs 
.I lex
to continue processing.
The default
.I yywrap
always returns 1.
.P
This routine is also a convenient place
to print tables, summaries, etc. at the end
of a program.  
Note that it is not
possible to write a normal rule that recognizes
end-of-file; the only access to this condition is
through
.IR yywrap ().
In fact, unless a private version of
.IR input()
is supplied a file containing nulls cannot be handled,
since a value of 0 returned by
.I input
is taken to be end-of-file.
.H 3 "Ambiguous Source Rules"
.I Lex
can handle ambiguous specifications.
When more than one expression can match the current input, 
.I lex
chooses as follows:
.BL
.LI 
The longest match is preferred.
.LI 
Among rules that match the same number of characters,
the first given rule is preferred.
.LE
.P
For example, suppose the following rules are given:
.DS I
integer	keyword action ...;
[a-z]+	identifier action ...;
.DE
If the input is
.I integers ,
it is taken as an identifier, because 
.DS I
[a\-z]+
.DE
matches 8 characters while
.DS I
integer
.DE
matches only 7.
If the input is
.Q integer ,
both rules match 7 characters, and
the keyword rule is selected because it was given first.
Anything shorter (e.g.,
.Q int \|) 
does not match the expression
.Q integer ,
so the identifier interpretation is used.
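.P
The two preference rules can be stated compactly in C.
The function below is a hypothetical sketch (the name
.I choose_rule
and the array convention are assumptions made for the example);
it is given the number of characters each rule would match
and returns the rule that would be selected.
.DS I
```c
/*
 * Hypothetical sketch of the two preference rules.
 * len[i] is the number of characters matched by rule i (0 = no match),
 * with rules numbered in the order they appear in the source.
 * Returns the index of the chosen rule, or -1 if no rule matched.
 */
int choose_rule(const int len[], int nrules)
{
    int best = -1;
    int i;

    for (i = 0; i < nrules; i++) {
        /* strictly longer wins; on a tie the earlier rule is kept */
        if (len[i] > 0 && (best < 0 || len[i] > len[best]))
            best = i;
    }
    return best;
}
```
.DE
For the input
.Q integers
the lengths are 7 and 8, and the second rule is chosen;
for
.Q integer
both lengths are 7, and the first rule is chosen.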
.P
The principle of preferring the longest
match makes certain constructions dangerous, 
such as the following:
.DS I
\&.*
.DE
For example
.DS I
\&'.*'
.DE
might seem a good way of recognizing a string in single quotes.
But it is an invitation for the program to read far
ahead, looking for a distant single quote.
Presented with the input
.DS I
\&\'first\' quoted string here, \'second\' here
.DE
the above expression matches
.DS I
\&\'first\' quoted string here, \'second\'
.DE
which is probably not what was wanted.
A better rule is of the form
.DS I
\&'[^\'\en]*'
.DE
which, on the above input, stops
after \'first\'.
The consequences of errors like this are mitigated by the fact
that the dot (\|.\|) operator does not match a newline.
Therefore, no more than one line is ever
matched by such expressions.
Don't try to defeat this with expressions like
.DS I
(.|\en)+
.DE
or their equivalents:
the 
.I lex
generated program will try to read
the entire input file, causing internal buffer overflows.
.P
Note that 
.I lex
is normally partitioning
the input stream, not searching for all possible matches
of each expression.
This means that each character is accounted for
once and only once.
For example, suppose it is desired to count occurrences of both 
.Q "she"
and 
.Q "he"
in an input text.
Some 
.I lex
rules to do this might be
.DS I
she	s++;
he	h++;
\en	|
\&.	;
.DE
where the last 
two rules ignore everything besides 
.Q "he"
and 
.Q "she" .
Remember that the period (\|.\|) does not include the
newline.
Since 
.Q "she"
includes 
.Q "he" , 
.I lex
will normally
.I not
recognize the instances of 
.Q "he"
included in 
.Q "she" ,
since once it has passed a 
.Q "she"
those characters are gone.
.P
Sometimes the user would like to override this choice.  
The action REJECT
means 
.Q "go do the next alternative" .
It causes whatever rule was second choice after the current
rule to be executed.
The position of the input pointer is adjusted accordingly.
Suppose the user really wants 
to count the included instances of 
.Q "he" :
.DS I
she	{s++; REJECT;}
he	{h++; REJECT;}
\en	|
\&.	;
.DE
These rules are one way of changing the previous example
to do just that.
After counting each expression, it is rejected; whenever appropriate,
the other expression will then be counted.  
In this example, of course, the user could note that 
.Q "she"
includes 
.Q "he" ,
but not vice versa, and omit the REJECT action on 
.Q "he" ;
in other cases, however, it
would not be possible to tell
which input characters were in both classes.
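.P
The difference between partitioning the input and counting every occurrence
can be sketched in C as follows.
The function
.I count_she_he
and its
.I reject
argument are hypothetical; the sketch imitates the counts produced by the two
sets of rules above, not the mechanism by which
.I lex
produces them.
.DS I
```c
#include <string.h>

/*
 * Hypothetical sketch: count "she" and "he" in text.
 * With reject == 0 the text is partitioned, so a "he" inside "she"
 * is not seen; with reject != 0 every occurrence is counted.
 */
void count_she_he(const char *text, int reject, int *s, int *h)
{
    *s = *h = 0;
    while (*text) {
        if (strncmp(text, "she", 3) == 0) {
            (*s)++;
            if (reject)
                text++;          /* give back "he" so that it is seen too */
            else
                text += 3;       /* all three characters are consumed */
        } else if (strncmp(text, "he", 2) == 0) {
            (*h)++;
            text += 2;           /* nothing counted can start at the "e" */
        } else {
            text++;              /* anything else is ignored */
        }
    }
}
```
.DE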
.P
Consider the two rules
.DS I
a[bc]+	{ ... ; REJECT;}
a[cd]+	{ ... ; REJECT;}
.DE
If the input is
.Q "ab" ,
only the first rule matches,
and on
.Q "ad"
only the second matches.
The input string
.Q "accb"
matches the first rule for four characters
and then the second rule for three characters.
In contrast, the input 
.Q "accd"
agrees with
the second rule for four characters and then the first
rule for three.
.P
In general, REJECT is useful whenever
the purpose of 
.I lex
is not to partition the input
stream but to detect all examples of some items
in the input, and the instances of these items
may overlap or include each other.
Suppose a digram table of the input is desired.
Normally the digrams overlap; that is, the word
.Q "the"
is considered to contain both 
.Q "th"
and 
.Q "he" .
Assuming a two-dimensional array named
.I digram
to be incremented, the appropriate
source is
.DS I
%%
[a-z][a-z]	{digram[yytext[0]][yytext[1]]++; REJECT;}
\&.	;
\en	;
.DE
where the REJECT is necessary to pick up
a letter pair beginning at every character, rather than at every
other character.
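.P
The same table can be built by a short hand-written loop, shown here for comparison.
The sketch is hypothetical in its details: the function name
.I count_digrams
is invented, the table is indexed by letter position rather than by
character code, and a character set in which the lowercase letters are
contiguous (such as ASCII) is assumed.
.DS I
```c
#include <ctype.h>

/*
 * Hypothetical sketch: tally every overlapping pair of lowercase letters.
 * digram must be a 26 x 26 table, zeroed by the caller; it is indexed by
 * letter position rather than by character code as in the rule above.
 * Assumes a character set in which a through z are contiguous.
 */
void count_digrams(const char *text, int digram[26][26])
{
    /* advance one character at a time, so that pairs overlap */
    for (; text[0] != 0 && text[1] != 0; text++) {
        if (islower((unsigned char)text[0]) &&
            islower((unsigned char)text[1]))
            digram[text[0] - 'a'][text[1] - 'a']++;
    }
}
```
.DE
Advancing by one character on every step plays the part that REJECT
plays in the rule.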
.P
Remember that REJECT does not rescan the input. 
Instead it remembers the results of the previous scan.  
This means that if a rule with trailing context is found, 
and REJECT executed, you must not have used
.I unput
to change the characters forthcoming
from the input stream.
This is the only restriction on the user's ability to manipulate
the not-yet-processed input.
.H 3 "Left Context Sensitivity"
Sometimes it is desirable to have several sets of lexical rules
to be applied at different times in the input.
For example, a compiler preprocessor might distinguish
preprocessor statements and analyze them differently
from ordinary statements.
This requires sensitivity
to prior context, and there are several ways of handling
such problems.
The caret (^) operator, 
for example, is a prior context operator, recognizing 
immediately preceding left context just as the
dollar sign ($) recognizes
immediately following right context.
Adjacent left context could be extended, 
to produce a facility similar to
that for adjacent right context, but it is unlikely
to be as useful, since often the relevant left context
appeared some time earlier, such as at the beginning of a line.
.P
This section describes three means of dealing
with different environments: 
.AL 1
.LI 
The use of flags,
when only a few rules change from one environment to another
.LI
The use of
.I "start conditions"
with rules
.LI
The use of multiple lexical analyzers running together.
.LE
.P
In each case, there are rules that recognize the need to change the
environment in which the
following input text is analyzed, and set some parameter
to reflect the change.  
This may be a flag explicitly tested by
the user's action code; such a flag is the simplest way of dealing
with the problem, since 
.I lex
is not involved at all.
It may be more convenient, however,
to have 
.I lex
remember the flags as initial conditions on the rules.
Any rule may be associated with a start condition.  It will only
be recognized when 
.I lex
is in that start condition.
The current start condition may be changed at any time.
Finally, if the sets of rules for the different environments
are very dissimilar,
clarity may be best achieved by writing several distinct lexical
analyzers, and switching from one to another as desired.
.P
Consider the following problem: copy the input to the output,
changing the word 
.Q "magic"
to 
.Q "first"
on every line that began with the letter 
.Q "a" , 
changing 
.Q "magic"
to 
.Q "second" 
on every line that began with the letter 
.Q "b" , 
and changing 
.Q "magic"
to 
.Q "third"
on every line that began with the letter 
.Q "c" .  
All other words and all other lines
are left unchanged.
.P
These rules are so simple that the easiest way
to do this job is with a flag:
.DS I
	int flag;
%%
^a	{flag = \'a\'; ECHO;}
^b	{flag = \'b\'; ECHO;}
^c	{flag = \'c\'; ECHO;}
\en	{flag =  0 ; ECHO;}
magic	{
	switch (flag)
	{
	case \'a\': printf("first"); break;
	case \'b\': printf("second"); break;
	case \'c\': printf("third"); break;
	default: ECHO; break;
	}
	}
.DE
This should be adequate.
.P
To handle the same problem with start conditions, each
start condition must be introduced to 
.I lex
in the definitions section with a line reading
.DS I
%Start	name1 name2 ...
.DE
where the conditions may be named in any order.
The word 
.Q "Start"
may be abbreviated to 
.Q "s"
or 
.Q "S" .
The conditions may be referenced at the
head of a rule with angle brackets.
For example
.DS I
<name1>expression
.DE
is a rule that is only recognized when 
.I lex
is in the start condition 
.Q "name1" .
To enter a start condition,
execute the action statement
.DS I
BEGIN name1;
.DE
which changes the start condition to 
.I name1.
To return to the initial state, the statement
.DS I
BEGIN 0;
.DE
resets the initial condition
of the 
.I lex
automaton interpreter.
A rule may be active in several
start conditions;
for example:
.DS I
<name1,name2,name3>
.DE
is a legal prefix.  
Any rule not beginning with the
<> prefix operator is always active.
.P
The same example as before can be written:
.DS I
%START AA BB CC
%%
^a	{ECHO; BEGIN AA;}
^b	{ECHO; BEGIN BB;}
^c	{ECHO; BEGIN CC;}
\en	{ECHO; BEGIN 0;}
<AA>magic	printf("first");
<BB>magic	printf("second");
<CC>magic	printf("third");
.DE
where the logic is exactly the same as in the previous
method of handling the problem, but 
.I lex
does the work rather than the user's code.
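.P
For comparison, the same logic can be written by hand in C, without
.I lex .
The following is only a sketch: the names magic_word and rewrite_line
are hypothetical and are not part of
.I lex
or its library.
It rewrites one line at a time (without the newline)
and truncates output that does not fit in the caller's buffer.

```c
#include <string.h>

/* Hypothetical helper, not part of lex: choose the replacement for
   "magic" from the first character of the line. */
static const char *magic_word(int first)
{
    switch (first) {
    case 'a': return "first";
    case 'b': return "second";
    case 'c': return "third";
    default:  return "magic";     /* unchanged, the same effect as ECHO */
    }
}

/* Copy one line from in to out, replacing every occurrence of "magic".
   out must hold at least outsize bytes; excess output is dropped. */
void rewrite_line(const char *in, char *out, size_t outsize)
{
    const char *word = magic_word(in[0]);
    size_t used = 0;

    if (outsize == 0)
        return;
    while (*in != 0) {
        const char *piece = in;   /* default: copy one character */
        size_t n = 1;

        if (strncmp(in, "magic", 5) == 0) {
            piece = word;
            n = strlen(word);
            in += 5;
        } else {
            in += 1;
        }
        if (used + n >= outsize)  /* keep room for the terminator */
            break;
        memcpy(out + used, piece, n);
        used += n;
    }
    out[used] = 0;
}
```

The switch in magic_word plays the part of the three start conditions;
every other first character behaves like the initial state.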
.H 2 "Lex Source Definitions"
Remember the format of the 
.I lex
source:
.DS I
{definitions}
%%
{rules}
%%
{user routines}
.DE
So far only the rules have been described.  
You will need additional options, though, 
to define variables for use in your program and for use by 
.I lex .
These can go either in the definitions section
or in the rules section.
.P
Remember that 
.I lex
is turning the rules into a program.
Any source not intercepted by 
.I lex
is copied into the generated program.  
There are three classes of such things:
.AL 1
.LI 
Any line that begins with a blank or tab and is not part of a 
.I lex
rule or action is copied into the 
.I lex
generated program.
Such source input prior to the first %% delimiter will be external
to any function in the code; if it appears 
immediately after the first %%,
it appears in an appropriate place for declarations
in the function written by 
.I lex
which contains the actions.
This material must look like program fragments,
and should precede the first 
.I lex
rule.
.P
As a side effect of the above, lines that begin with a blank
or tab, and which contain a comment,
are passed through to the generated program.
This can be used to include comments in either the 
.I lex
source or the generated code.  
The comments should follow the conventions of the 
C language.
.LI 
Anything included between lines containing
only 
.Q "%{"
and 
.Q "%}"
is copied out as above.  
The delimiters are discarded.
This format permits entering text like preprocessor statements that
must begin in column 1,
or copying lines that do not look like programs.
.LI 
Anything after the second 
.Q "%%"
delimiter, regardless of formats, 
is copied out after the 
.I lex
output.
.LE
.P
Definitions intended for 
.I lex
are given before the first 
.Q "%%"
delimiter.  
Any line in this section
not contained between 
.Q "%{"
and 
.Q "%}" ,
and beginning
in column 1, is assumed to define 
.I lex
substitution strings.
The format of such lines is
.DS I
name translation
.DE
and it
causes the string given as a translation to
be associated with the name.
The name and translation
must be separated by at 
least one blank or tab, and the name must begin with a letter.
The translation can then be called out
by the {name} syntax in a rule.
Using {D} for the digits and {E} for an exponent field,
for example, might abbreviate rules to recognize numbers:
.DS I
D	[0-9]
E	[DEde][-+]?{D}+
%%
{D}+	printf("integer");
{D}+"."{D}*({E})?	|
{D}*"."{D}+({E})?	|
{D}+{E}		printf("real");
.DE
Note the first two rules for real numbers;
both require a decimal point and contain
an optional exponent field,
but the first requires at least one digit before the
decimal point and the second requires at least one
digit after the decimal point.
To correctly handle the problem
posed by a FORTRAN expression such as
.Q "35.EQ.I" ,
which does not contain a real number, a context-sensitive
rule such as
.DS I
[0-9]+/"."EQ	printf("integer");
.DE
could be used in addition to the normal rule for integers.
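.P
To make the meaning of the D and E definitions concrete,
the rules for integers and reals can be checked by hand in C.
This is a sketch only: the names digits, exponent and classify are
hypothetical, and the function tests whether a whole string is a number,
whereas 
.I lex
looks for the longest match at the current input position.

```c
#include <ctype.h>

enum { NOT_NUMBER, INTEGER, REAL };

/* Number of leading digits at s: the {D}* part of a rule. */
static int digits(const char *s)
{
    int n = 0;

    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Length of an exponent field [DEde][-+]?{D}+ at s, or 0 if none. */
static int exponent(const char *s)
{
    int n, d;

    if (s[0] != 'D' && s[0] != 'E' && s[0] != 'd' && s[0] != 'e')
        return 0;
    n = 1;
    if (s[n] == '-' || s[n] == '+')
        n++;
    d = digits(s + n);
    return d > 0 ? n + d : 0;
}

/* Classify a complete string according to the four rules. */
int classify(const char *s)
{
    int before = digits(s);
    int after, e;
    const char *p = s + before;

    if (*p == '.') {
        after = digits(p + 1);
        if (before + after == 0)      /* a lone point is not a number */
            return NOT_NUMBER;
        p += 1 + after;
        p += exponent(p);             /* the exponent is optional here */
        return *p == 0 ? REAL : NOT_NUMBER;
    }
    if (before == 0)
        return NOT_NUMBER;
    e = exponent(p);
    if (e > 0)                        /* {D}+{E} with no decimal point */
        return p[e] == 0 ? REAL : NOT_NUMBER;
    return *p == 0 ? INTEGER : NOT_NUMBER;
}
```

Under this sketch the string 35.EQ.I is not a number as a whole,
because the letters after the point do not form an exponent field.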
.P
The definitions
section may also contain other commands, including 
a character set table,
a list of start conditions, 
or adjustments to the default size of arrays within 
.I lex
itself for larger source programs.
These possibilities are discussed below under 
.Q "Summary of Source Format" .
.H 2 "Lex and YACC"
If you want to use 
.I lex
with YACC, 
note that what 
.I lex
writes is a program named
.IR yylex (),
the name required by YACC for its analyzer.
Normally, the default main program in the 
.I lex
library
calls this routine, but if YACC is loaded, and its main
program is used, YACC will call
.IR yylex ().
In this case, each 
.I lex
rule should end with
.DS I
return(token);
.DE
where the appropriate token value is returned.
An easy way to get access
to YACC's names for tokens is to
compile the 
.I lex
output file as part of
the YACC output file by placing the line
.DS I
# include "lex.yy.c"
.DE
in the last section of YACC input.
Supposing the grammar to be
named 
.Q "good"
and the lexical rules to be named 
.Q "better" ,
the \*(x1 command sequence can just be:
.DS I
yacc good
lex better
cc y.tab.c -ly -lln
.DE
The YACC library (\-ly) should be loaded before the 
.I lex
library,
to obtain a main program which invokes the YACC parser.
The generation of 
.I lex
and YACC programs can be done in either order.
.P
As a trivial problem, consider copying an input file while
adding 3 to every positive number divisible by 7.
Here is a suitable 
.I lex
source program to do just that:
.DS I
%%
	int k;
[0-9]+	{
	k = atoi(yytext);
	if (k%7 == 0)
	     printf("%d", k+3);
	else
	     printf("%d",k);
	}
.DE
The rule [0\-9]+ recognizes strings of digits;
.IR atoi ()
converts the digits to binary
and stores the result in
.Q "k" .
The remainder operator (%) is used to check whether
.Q "k"
is divisible by 7; 
if it is, it is incremented by 3 as it is written out.
It may be objected that this program will alter such
input items as 49.63 or X7.
Furthermore, it increments the absolute value
of all negative numbers divisible by 7.
To avoid this, just add a few more rules after the active one,
as here:
.DS I
%%
	int k;
-?[0-9]+	{
	k = atoi(yytext);
	printf("%d", k%7 == 0 ? k+3 : k);
	}
-?[0-9.]+		ECHO;
[A-Za-z][A-Za-z0-9]+	ECHO;
.DE
Numerical strings containing a decimal point
or preceded by a letter will be picked up by
one of the last two rules, and not changed.
The
.B if\-else
has been replaced by
a C conditional expression to save space;
the form 
.Q "a?b:c"
means: if 
.Q "a"
then 
.Q "b"
else 
.Q "c" .
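.P
The arithmetic in the action can be tried on its own in C.
In this sketch the name adjusted_value is hypothetical;
it takes the matched text as an argument
and returns the value that the action would print.

```c
#include <stdlib.h>

/* Hypothetical helper: the value printed for the matched digit string. */
int adjusted_value(const char *text)
{
    int k = atoi(text);

    /* conditional expression: add 3 only when k is divisible by 7 */
    return k % 7 == 0 ? k + 3 : k;
}
```

Under this sketch 0 counts as divisible by 7,
and a negative multiple of 7 moves toward zero when 3 is added.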
.P
For an example of statistics gathering, here
is a program which makes histograms of word lengths,
where a word is defined as a string of letters.
.DS I
	int lengs[100];
%%
[a-z]+	lengs[yyleng]++;
\&.	|
\en	;
%%
yywrap()
{
int i;
printf("Length  No. words\en");
for(i=0; i<100; i++)
     if (lengs[i] > 0)
          printf("%5d%10d\en",i,lengs[i]);
return(1);
}
.DE
This program accumulates the histogram, 
while producing no output.  
At the end of the input it prints the table.
The final statement
.IR return (1);
indicates that 
.I lex
is to perform wrapup.  
If
.IR yywrap ()
returns zero (false)
it implies that further input is available
and the program is
to continue reading and processing.
Providing a
.IR yywrap ()
that never returns true causes an infinite loop.
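.P
The counting done by the rules above can also be sketched in plain C.
The name count_word_lengths is hypothetical.
Unlike the 
.I lex
program, this sketch counts a word too long for the array
in the last element rather than indexing past the end.

```c
#include <stddef.h>

/* Add the lengths of the lowercase words in text to lengs[].
   size is the number of elements in lengs. */
void count_word_lengths(const char *text, int lengs[], int size)
{
    int len = 0;                  /* length of the word being scanned */
    size_t i;

    for (i = 0; ; i++) {
        char c = text[i];

        if (c >= 'a' && c <= 'z') {
            len++;                /* still inside a word */
        } else {
            if (len > 0)
                lengs[len < size ? len : size - 1]++;
            len = 0;
            if (c == 0)           /* end of the text */
                break;
        }
    }
}
```

The table could then be printed with the same loop that appears in
the yywrap routine above.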
.P
As a larger example,
here are some parts of a program written 
to convert double precision FORTRAN to single precision FORTRAN.
Because FORTRAN does not distinguish between 
upper- and lowercase letters,
this routine begins by defining 
a set of classes including both cases of each letter:
.DS I
a	[aA]
b	[bB]
c	[cC]
\&.	.
\&.	.
\&.	.
z	[zZ]
.DE
An additional class recognizes white space:
.DS I
W	[ \et]*
.DE
The first rule changes
.Q "double precision"
to 
.Q "real" , 
or 
.Q "DOUBLE PRECISION"
to 
.Q "REAL" .
.DS I
{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
     printf(yytext[0]=='d'? "real" : "REAL");
     }
.DE
Care is taken throughout this program to preserve the case
of the original program.
The conditional operator is used to
select the proper form of the keyword.
The next rule copies continuation card indications to
avoid confusing them with constants:
.DS I
^"     "[^ 0]	ECHO;
.DE
In the regular expression, the quotes surround the
blanks.
It is interpreted as
.Q "beginning of line, then five blanks, then anything but blank or zero."
Note the two different meanings of the caret (^) here.
There follow some rules to change double precision
constants to ordinary floating constants.
.DS I
[0-9]+{W}{d}{W}[+-]?{W}[0-9]+           |
[0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+     |
"."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+     {
     /* convert constants; assumes char *p is declared elsewhere */
     for(p=yytext; *p != 0; p++)
          {
          if (*p == 'd' || *p == 'D')
               *p += 'e' - 'd';
          }
     ECHO;
     }
.DE
.P
After the floating point constant is recognized, it is
scanned by the
.B for
loop to find the letter 
.Q d 
or 
.Q D .
The program then adds
.I \(fme\(fm\-\(fmd\(fm 
which converts it to the next letter of the alphabet.
The modified constant, now single precision,
is written out again.
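.P
The conversion step can be isolated as a small C function.
This is a sketch; the name to_single is hypothetical,
and it assumes a character set such as ASCII in which each of
d and D is immediately followed by e and E.

```c
/* Change every d or D in text to e or E, in place.
   Adding the same offset to either letter preserves its case. */
void to_single(char *text)
{
    char *p;

    for (p = text; *p != 0; p++)
        if (*p == 'd' || *p == 'D')
            *p += 'e' - 'd';
}
```

Applied to the constant 1.5D+3, this sketch should yield 1.5E+3.
.P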
There follows a series of names that must be respelled to remove
their initial 
.Q "d" .
By using the array
.Q "yytext"
the same action suffices 
for all the names (only a sample of
a rather long list is given here).
.DS I
{d}{s}{i}{n}	|
{d}{c}{o}{s}	|
{d}{s}{q}{r}{t}	|
{d}{a}{t}{a}{n}	|
\&...
{d}{f}{l}{o}{a}{t}	printf("%s",yytext+1);
.DE
Another list of names 
must have initial 
.Q "d"
changed to initial 
.Q "a" :
.DS I
{d}{l}{o}{g}	|
{d}{l}{o}{g}10	|
{d}{m}{i}{n}1	|
{d}{m}{a}{x}1	{
	yytext[0] += \'a\' - \'d\';
	ECHO;
	}
.DE
And one routine
must have initial 
.Q "d"
changed to initial 
.Q "r" :
.DS I
{d}1{m}{a}{c}{h}	{yytext[0] += \'r\'  - \'d\';
		ECHO;
		}
.DE
To avoid such names as 
.Q "dsinx"
being detected as instances of 
.Q "dsin" , 
some final rules pick up longer words as identifiers
and copy some surviving characters:
.DS I
[A-Za-z][A-Za-z0-9]*	|
[0-9]+			|
\en			|
\&.	ECHO;
.DE
.P
Note that this program is not complete; 
it does not deal with the spacing problems in FORTRAN or
with the use of keywords as identifiers.
.H 2 "Character Sets"
The programs generated by 
.I lex
handle character I/O only through the routines
.I input ,
.I output
and
.I unput .
Thus the character representation
provided in these routines
is accepted by 
.I lex
and employed to return values in
.Q "yytext" .
For internal use
a character is represented as a small integer
which, if the standard library is used,
has a value equal to the integer value of the bit
pattern representing the character on the host computer.
Normally, the letter 
.Q "a"
is represented in the same form as the character constant: 
.DS I
\'a\'
.DE
If this interpretation is changed, by providing I/O
routines which translate the characters,
.I lex
must be told about
it, by giving a translation table.
This table must be in the definitions section,
and must be bracketed by lines containing  only
.Q "%T" .
The table contains lines of the form
.DS I
{integer} {character string}
.DE
which indicate the value associated with each character.
For example:
.DS I
%T
 1	Aa
 2	Bb
\&...
26	Zz
27	\en
28	+
29	-
30	0
31	1
\&...
39	9
%T
.DE
This table maps the lowercase and 
uppercase letters together into the integers 1 through 26,
newline into 27, plus (+) and minus (\-) into 28 and 29, and the
digits into 30 through 39.
Note the escape for newline.
If a table is supplied, every character that is to appear either
in the rules or in any valid input must be included in the table.
No character may be assigned the number 0, and no character may be
assigned a larger number than the size of the hardware character set.
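.P
The mapping described by the example table can be pictured
as an array indexed by input character.
The following C sketch is not taken from 
.I lex
itself; the name build_table is hypothetical, and the code assumes
an ASCII host, where the letters and digits are contiguous
and newline has the value 10.
An entry of 0 marks a character that is not in the table.

```c
#include <string.h>

/* Fill table[] with the codes from the example %T table. */
void build_table(unsigned char table[256])
{
    int i;

    memset(table, 0, 256);        /* 0: character not in the table */
    for (i = 0; i < 26; i++) {
        /* both cases of a letter share one code, 1 through 26 */
        table['A' + i] = (unsigned char)(i + 1);
        table['a' + i] = (unsigned char)(i + 1);
    }
    table[10] = 27;               /* newline, assuming ASCII */
    table['+'] = 28;
    table['-'] = 29;
    for (i = 0; i < 10; i++)      /* digits: 30 through 39 */
        table['0' + i] = (unsigned char)(30 + i);
}
```

Input routines that translate characters in this way
would return table[c] in place of c.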
.H 2 "Summary of Source Format"
The general form of a 
.I lex
source file is:
.DS I
{definitions}
%%
{rules}
%%
{user subroutines}
.DE
The definitions section contains
a combination of:
.AL 1
.LI 
Definitions, in the form 
.Q "name space translation" 
.LI 
Included code, in the form 
.Q "space code" 
.LI 
Included code, in the form
.DS I
%{
code
%}
.DE
.LI 
Start conditions, given in the form
.DS I
%S name1 name2 ...
.DE
.LI 
Character set tables, in the form
.DS I
%T
number space character-string
\&...
%T
.DE
.LI 
Changes to internal array sizes, in the form
.DS I
%x\0\0nnn
.DE
where 
.I nnn 
is a decimal integer representing an array size
and 
.Q "x"
selects the parameter as follows:
.DS I
Letter	Parameter
p	positions
n	states
e	tree nodes
a	transitions
k	packed character classes
o	output array size
.DE
.LE
.P
Lines in the rules section have the form:
.DS I
.I 
expression  action
.R
.DE
where the action may be continued on succeeding
lines by using braces to delimit it.
.P
Regular expressions in 
.I lex
use the following
operators:
.tr @"
.VL 10 2
.LI x
The character "x".
.LI "@x@"
An "x", even if x is an operator.
.LI \ex
An "x", even if x is an operator.
.LI [xy]
The character x or y.
.LI [x\-z]
The characters x, y or z.
.LI [^x]
Any character but x.
.LI \&.
Any character but newline.
.LI ^x
An x at the beginning of a line.
.LI <y>x
An x when 
.I lex
is in start condition y.
.LI x$
An x at the end of a line.
.LI x?
An optional x.
.LI x\(**
0,1,2, ... instances of x.
.LI x+
1,2,3, ... instances of x.
.LI x|y
An x or a y.
.LI (x)	
An x.
.LI x/y	
An x but only if followed by y.
.LI {xx}
The translation of xx from the definitions section.
.LI x{m,n}	
.I m
through 
.I n 
occurrences of x.
.LE
.tr @@
.TC 2 1 5 0
