1.ds re \fBre2c\fP
2.ds le \fBlex\fP
3.ds rx regular expression
4.ds lx \fIl\fP-expression
5.TH RE2C 1 "8 April 1994" "Version 0.5"
6\"$Log: re2c.1,v $
7\"Revision 1.1  2002/04/07 22:27:06  peter
8\"Initial revision
9\"
10\"Revision 1.2  1994/04/16  15:50:32  peterr
11\"Fix bug in simple example.
12\"
13\"Revision 1.1  1994/04/08  15:39:09  peterr
14\"Initial revision
15\"
16.SH NAME
17re2c \- convert regular expressions to C/C++
18
19.SH SYNOPSIS
20\*(re [\fB-esb\fP] \fIname\fP
21
22.SH DESCRIPTION
23\*(re is a preprocessor that generates C-based recognizers from regular
24expressions.
25The input to \*(re consists of C/C++ source interleaved with
26comments of the form \fC/*!re2c\fP ... \fC*/\fP which contain
27scanner specifications.
28In the output these comments are replaced with code that, when
29executed, will find the next input token and then execute
30some user-supplied token-specific code.
31
32For example, given the following code
33
34.in +3
35.nf
36#define NULL            ((char*) 0)
37char *scan(char *p){
38char *q;
39#define YYCTYPE         char
40#define YYCURSOR        p
41#define YYLIMIT         p
42#define YYMARKER        q
43#define YYFILL(n)
44/*!re2c
45        [0-9]+          {return YYCURSOR;}
46        [\\000-\\377]     {return NULL;}
47*/
48}
49.fi
50.in -3
51
52\*(re will generate
53
54.in +3
55.nf
56/* Generated by re2c on Sat Apr 16 11:40:58 1994 */
57#line 1 "simple.re"
58#define NULL            ((char*) 0)
59char *scan(char *p){
60char *q;
61#define YYCTYPE         char
62#define YYCURSOR        p
63#define YYLIMIT         p
64#define YYMARKER        q
65#define YYFILL(n)
66{
67        YYCTYPE yych;
68        unsigned int yyaccept;
69        goto yy0;
70yy1:    ++YYCURSOR;
71yy0:
72        if((YYLIMIT - YYCURSOR) < 2) YYFILL(2);
73        yych = *YYCURSOR;
74        if(yych <= '/') goto yy4;
75        if(yych >= ':') goto yy4;
76yy2:    yych = *++YYCURSOR;
77        goto yy7;
78yy3:
79#line 10
80        {return YYCURSOR;}
81yy4:    yych = *++YYCURSOR;
82yy5:
83#line 11
84        {return NULL;}
85yy6:    ++YYCURSOR;
86        if(YYLIMIT == YYCURSOR) YYFILL(1);
87        yych = *YYCURSOR;
88yy7:    if(yych <= '/') goto yy3;
89        if(yych <= '9') goto yy6;
90        goto yy3;
91}
92#line 12
93
94}
95.fi
96.in -3
97
98.SH OPTIONS
99\*(re provides the following options:
100.TP
101\fB-e\fP
102Cross-compile from an ASCII platform to an EBCDIC one. 
103.TP
104\fB-s\fP
105Generate nested \fCif\fPs for some \fCswitch\fPes.  Many compilers need this
106assist to generate better code.
107.TP
108\fB-b\fP
109Implies \fB-s\fP.  Use bit vectors as well in the attempt to coax better
110code out of the compiler.  Most useful for specifications with more than a
111few keywords (e.g. for most programming languages).
112
113.SH "INTERFACE CODE"
114Unlike other scanner generators, \*(re does not generate complete scanners:
115the user must supply some interface code.
116In particular, the user must define the following macros:
117.TP
118\fCYYCHAR\fP
119Type used to hold an input symbol.
120Usually \fCchar\fP or \fCunsigned char\fP.
121.TP
122\fCYYCURSOR\fP
123\*(lx of type \fC*YYCHAR\fP that points to the current input symbol.
124The generated code advances \fCYYCURSOR\fP as symbols are matched.
125On entry, \fCYYCURSOR\fP is assumed to point to the first character of the
126current token.  On exit, \fCYYCURSOR\fP will point to the first character of
127the following token.
128.TP
129\fCYLIMIT\fP
130Expression of type \fC*YYCHAR\fP that marks the end of the buffer
131(\fCYLIMIT[-1]\fP is the last character in the buffer).
132The generated code repeatedly compares \fCYYCURSOR\fP to \fCYLIMIT\fP
133to determine when the buffer needs (re)filling.
134.TP
135\fCYYMARKER\fP
136\*(lx of type \fC*YYCHAR\fP.
137The generated code saves backtracking information in \fCYYMARKER\fP.
138.TP
139\fCYYFILL(\fP\fIn\fP\fC)\fP
140The generated code "calls" \fCYYFILL\fP when the buffer needs
141(re)filling:  at least \fIn\fP additional characters should
142be provided.  \fCYYFILL\fP should adjust \fCYYCURSOR\fP, \fCYYLIMIT\fP and
143\fCYYMARKER\fP as needed.  Note that for typical programming languages
144\fIn\fP will be the length of the longest keyword plus one.
145
146.SH "SCANNER SPECIFICATIONS"
147Each scanner specification consists of a set of \fIrules\fP and name
148definitions.
149Rules consist of a regular expression along with a block of C/C++ code that
150is to be executed when the associated regular expression is matched.
151Name definitions are of the form
152``\fIname\fP \fC=\fP \fIregular expression\fP\fC;\fP''.
153
154.SH "SUMMARY OF RE2C REGULAR EXPRESSIONS"
155.TP
156\fC"foo"\fP
157the literal string \fCfoo\fP.
158ANSI-C escape sequences can be used.
159.TP
160\fC[xyz]\fP
161a "character class"; in this case,
162the \*(rx matches either an '\fCx\fP', a '\fCy\fP', or a '\fCz\fP'.
163.TP
164\fC[abj-oZ]\fP
165a "character class" with a range in it;
166matches an '\fCa\fP', a '\fCb\fP', any letter from '\fCj\fP' through '\fCo\fP',
167or a '\fCZ\fP'.
168.TP
169\fIr\fP\fC\e\fP\fIs\fP
170match any \fIr\fP which isn't an \fIs\fP. \fIr\fP and \fIs\fP must be regular expressions
171which can be expressed as character classes.
172.TP
173\fIr\fP\fC*\fP
174zero or more \fIr\fP's, where \fIr\fP is any regular expression
175.TP
176\fC\fIr\fP\fC+\fP
177one or more \fIr\fP's
178.TP
179\fC\fIr\fP\fC?\fP
180zero or one \fIr\fP's (that is, "an optional \fIr\fP")
181.TP
182name
183the expansion of the "name" definition (see above)
184.TP
185\fC(\fP\fIr\fP\fC)\fP
186an \fIr\fP; parentheses are used to override precedence
187(see below)
188.TP
189\fIrs\fP
190an \fIr\fP followed by an \fIs\fP ("concatenation")
191.TP
192\fIr\fP\fC|\fP\fIs\fP
193either an \fIr\fP or an \fIs\fP
194.TP
195\fIr\fP\fC/\fP\fIs\fP
196an \fIr\fP but only if it is followed by an \fIs\fP. The s is not part of
197the matched text. This type of \*(rx is called "trailing context".
198.LP
199The regular expressions listed above are grouped according to
200precedence, from highest precedence at the top to lowest at the bottom.
201Those grouped together have equal precedence.
202
203.SH "A LARGER EXAMPLE"
204.LP
205.in +3
206.nf
207#include <stdlib.h>
208#include <stdio.h>
209#include <fcntl.h>
210#include <string.h>
211
212#define ADDEQ   257
213#define ANDAND  258
214#define ANDEQ   259
215#define ARRAY   260
216#define ASM     261
217#define AUTO    262
218#define BREAK   263
219#define CASE    264
220#define CHAR    265
221#define CONST   266
222#define CONTINUE        267
223#define DECR    268
224#define DEFAULT 269
225#define DEREF   270
226#define DIVEQ   271
227#define DO      272
228#define DOUBLE  273
229#define ELLIPSIS        274
230#define ELSE    275
231#define ENUM    276
232#define EQL     277
233#define EXTERN  278
234#define FCON    279
235#define FLOAT   280
236#define FOR     281
237#define FUNCTION        282
238#define GEQ     283
239#define GOTO    284
240#define ICON    285
241#define ID      286
242#define IF      287
243#define INCR    288
244#define INT     289
245#define LEQ     290
246#define LONG    291
247#define LSHIFT  292
248#define LSHIFTEQ        293
249#define MODEQ   294
250#define MULEQ   295
251#define NEQ     296
252#define OREQ    297
253#define OROR    298
254#define POINTER 299
255#define REGISTER        300
256#define RETURN  301
257#define RSHIFT  302
258#define RSHIFTEQ        303
259#define SCON    304
260#define SHORT   305
261#define SIGNED  306
262#define SIZEOF  307
263#define STATIC  308
264#define STRUCT  309
265#define SUBEQ   310
266#define SWITCH  311
267#define TYPEDEF 312
268#define UNION   313
269#define UNSIGNED        314
270#define VOID    315
271#define VOLATILE        316
272#define WHILE   317
273#define XOREQ   318
274#define EOI     319
275
276typedef unsigned int uint;
277typedef unsigned char uchar;
278
279#define BSIZE   8192
280
281#define YYCTYPE         uchar
282#define YYCURSOR        cursor
283#define YYLIMIT         s->lim
284#define YYMARKER        s->ptr
285#define YYFILL(n)       {cursor = fill(s, cursor);}
286
287#define RET(i)  {s->cur = cursor; return i;}
288
289typedef struct Scanner {
290    int                 fd;
291    uchar               *bot, *tok, *ptr, *cur, *pos, *lim, *top, *eof;
292    uint                line;
293} Scanner;
294
295uchar *fill(Scanner *s, uchar *cursor){
296    if(!s->eof){
297        uint cnt = s->tok - s->bot;
298        if(cnt){
299            memcpy(s->bot, s->tok, s->lim - s->tok);
300            s->tok = s->bot;
301            s->ptr -= cnt;
302            cursor -= cnt;
303            s->pos -= cnt;
304            s->lim -= cnt;
305        }
306        if((s->top - s->lim) < BSIZE){
307            uchar *buf = (uchar*)
308                malloc(((s->lim - s->bot) + BSIZE)*sizeof(uchar));
309            memcpy(buf, s->tok, s->lim - s->tok);
310            s->tok = buf;
311            s->ptr = &buf[s->ptr - s->bot];
312            cursor = &buf[cursor - s->bot];
313            s->pos = &buf[s->pos - s->bot];
314            s->lim = &buf[s->lim - s->bot];
315            s->top = &s->lim[BSIZE];
316            free(s->bot);
317            s->bot = buf;
318        }
319        if((cnt = read(s->fd, (char*) s->lim, BSIZE)) != BSIZE){
320            s->eof = &s->lim[cnt]; *(s->eof)++ = '\\n';
321        }
322        s->lim += cnt;
323    }
324    return cursor;
325}
326
327int scan(Scanner *s){
328        uchar *cursor = s->cur;
329std:
330        s->tok = cursor;
331/*!re2c
332any     = [\\000-\\377];
333O       = [0-7];
334D       = [0-9];
335L       = [a-zA-Z_];
336H       = [a-fA-F0-9];
337E       = [Ee] [+-]? D+;
338FS      = [fFlL];
339IS      = [uUlL]*;
340ESC     = [\\\\] ([abfnrtv?'"\\\\] | "x" H+ | O+);
341*/
342
343/*!re2c
344        "/*"                    { goto comment; }
345        
346        "auto"                  { RET(AUTO); }
347        "break"                 { RET(BREAK); }
348        "case"                  { RET(CASE); }
349        "char"                  { RET(CHAR); }
350        "const"                 { RET(CONST); }
351        "continue"              { RET(CONTINUE); }
352        "default"               { RET(DEFAULT); }
353        "do"                    { RET(DO); }
354        "double"                { RET(DOUBLE); }
355        "else"                  { RET(ELSE); }
356        "enum"                  { RET(ENUM); }
357        "extern"                { RET(EXTERN); }
358        "float"                 { RET(FLOAT); }
359        "for"                   { RET(FOR); }
360        "goto"                  { RET(GOTO); }
361        "if"                    { RET(IF); }
362        "int"                   { RET(INT); }
363        "long"                  { RET(LONG); }
364        "register"              { RET(REGISTER); }
365        "return"                { RET(RETURN); }
366        "short"                 { RET(SHORT); }
367        "signed"                { RET(SIGNED); }
368        "sizeof"                { RET(SIZEOF); }
369        "static"                { RET(STATIC); }
370        "struct"                { RET(STRUCT); }
371        "switch"                { RET(SWITCH); }
372        "typedef"               { RET(TYPEDEF); }
373        "union"                 { RET(UNION); }
374        "unsigned"              { RET(UNSIGNED); }
375        "void"                  { RET(VOID); }
376        "volatile"              { RET(VOLATILE); }
377        "while"                 { RET(WHILE); }
378        
379        L (L|D)*                { RET(ID); }
380        
381        ("0" [xX] H+ IS?) | ("0" D+ IS?) | (D+ IS?) |
382        (['] (ESC|any\\[\\n\\\\'])* ['])
383                                { RET(ICON); }
384        
385        (D+ E FS?) | (D* "." D+ E? FS?) | (D+ "." D* E? FS?)
386                                { RET(FCON); }
387        
388        (["] (ESC|any\\[\\n\\\\"])* ["])
389                                { RET(SCON); }
390        
391        "..."                   { RET(ELLIPSIS); }
392        ">>="                   { RET(RSHIFTEQ); }
393        "<<="                   { RET(LSHIFTEQ); }
394        "+="                    { RET(ADDEQ); }
395        "-="                    { RET(SUBEQ); }
396        "*="                    { RET(MULEQ); }
397        "/="                    { RET(DIVEQ); }
398        "%="                    { RET(MODEQ); }
399        "&="                    { RET(ANDEQ); }
400        "^="                    { RET(XOREQ); }
401        "|="                    { RET(OREQ); }
402        ">>"                    { RET(RSHIFT); }
403        "<<"                    { RET(LSHIFT); }
404        "++"                    { RET(INCR); }
405        "--"                    { RET(DECR); }
406        "->"                    { RET(DEREF); }
407        "&&"                    { RET(ANDAND); }
408        "||"                    { RET(OROR); }
409        "<="                    { RET(LEQ); }
410        ">="                    { RET(GEQ); }
411        "=="                    { RET(EQL); }
412        "!="                    { RET(NEQ); }
413        ";"                     { RET(';'); }
414        "{"                     { RET('{'); }
415        "}"                     { RET('}'); }
416        ","                     { RET(','); }
417        ":"                     { RET(':'); }
418        "="                     { RET('='); }
419        "("                     { RET('('); }
420        ")"                     { RET(')'); }
421        "["                     { RET('['); }
422        "]"                     { RET(']'); }
423        "."                     { RET('.'); }
424        "&"                     { RET('&'); }
425        "!"                     { RET('!'); }
426        "~"                     { RET('~'); }
427        "-"                     { RET('-'); }
428        "+"                     { RET('+'); }
429        "*"                     { RET('*'); }
430        "/"                     { RET('/'); }
431        "%"                     { RET('%'); }
432        "<"                     { RET('<'); }
433        ">"                     { RET('>'); }
434        "^"                     { RET('^'); }
435        "|"                     { RET('|'); }
436        "?"                     { RET('?'); }
437
438
439        [ \\t\\v\\f]+           { goto std; }
440
441        "\\n"
442            {
443                if(cursor == s->eof) RET(EOI);
444                s->pos = cursor; s->line++;
445                goto std;
446            }
447
448        any
449            {
450                printf("unexpected character: %c\\n", *s->tok);
451                goto std;
452            }
453*/
454
455comment:
456/*!re2c
457        "*/"                    { goto std; }
458        "\\n"
459            {
460                if(cursor == s->eof) RET(EOI);
461                s->tok = s->pos = cursor; s->line++;
462                goto comment;
463            }
464        any                     { goto comment; }
465*/
466}
467
468main(){
469    Scanner in;
470    int t;
471    memset((char*) &in, 0, sizeof(in));
472    in.fd = 0;
473    while((t = scan(&in)) != EOI){
474/*
475        printf("%d\\t%.*s\\n", t, in.cur - in.tok, in.tok);
476        printf("%d\\n", t);
477*/
478    }
479    close(in.fd);
480}
481.fi
482.in -3
483
484.SH "SEE ALSO"
485.LP
486flex(1), lex(1).
487
488.SH FEATURES
489.LP
490\*(re does not provide a default action:
491the generated code assumes that the input
492will consist of a sequence of tokens.
493Typically this can be dealt with by adding a rule such as the one for
494unexpected characters in the example above.
495.LP
496The user must arrange for a sentinel token to appear at the end of input
497(and provide a rule for matching it):
498\*(re does not provide an \fC<<EOF>>\fP expression.
499If the source is from a null-byte terminated string, a
500rule matching a null character will suffice.  If the source is from a
501file then the approach taken in the example can be used: pad the input with
502a newline (or some other character that can't appear within another token);
503upon recognizing such a character check to see if it is the sentinel
504and act accordingly.
505.LP
506\*(re does not provide start conditions:  use a separate scanner
507specification for each start condition (as illustrated in the above example).
508.LP
509No [^x].  Use difference instead.
510.SH BUGS
511.LP
512Only fixed length trailing context can be handled.
513.LP
514The maximum value appearing as a parameter \fIn\fP to \fCYYFILL\fP is not
515provided to the generated code (this value is needed for constructing
516the interface code).
517Note that this value is usually relatively small: for
518typical programming languages \fIn\fP will be the length of the longest
519keyword plus one.
520.LP
521Difference only works for character sets.
522.LP
523The \*(re internal algorithms need documentation.
524
525.SH AUTHOR
526.LP
527Please send bug reports, fixes and feedback to:
528.LP
529.nf
530Peter Bumbulis
531Computer Systems Group
532University of Waterloo
533Waterloo, Ontario
534N2L 3G1
535Internet:  peterr@csg.uwaterloo.ca
536.fi
537