1.ds re \fBre2c\fP 2.ds le \fBlex\fP 3.ds rx regular expression 4.ds lx \fIl\fP-expression 5.TH RE2C 1 "8 April 1994" "Version 0.5" 6\"$Log: re2c.1,v $ 7\"Revision 1.1 2002/04/07 22:27:06 peter 8\"Initial revision 9\" 10\"Revision 1.2 1994/04/16 15:50:32 peterr 11\"Fix bug in simple example. 12\" 13\"Revision 1.1 1994/04/08 15:39:09 peterr 14\"Initial revision 15\" 16.SH NAME 17re2c \- convert regular expressions to C/C++ 18 19.SH SYNOPSIS 20\*(re [\fB-esb\fP] \fIname\fP 21 22.SH DESCRIPTION 23\*(re is a preprocessor that generates C-based recognizers from regular 24expressions. 25The input to \*(re consists of C/C++ source interleaved with 26comments of the form \fC/*!re2c\fP ... \fC*/\fP which contain 27scanner specifications. 28In the output these comments are replaced with code that, when 29executed, will find the next input token and then execute 30some user-supplied token-specific code. 31 32For example, given the following code 33 34.in +3 35.nf 36#define NULL ((char*) 0) 37char *scan(char *p){ 38char *q; 39#define YYCTYPE char 40#define YYCURSOR p 41#define YYLIMIT p 42#define YYMARKER q 43#define YYFILL(n) 44/*!re2c 45 [0-9]+ {return YYCURSOR;} 46 [\\000-\\377] {return NULL;} 47*/ 48} 49.fi 50.in -3 51 52\*(re will generate 53 54.in +3 55.nf 56/* Generated by re2c on Sat Apr 16 11:40:58 1994 */ 57#line 1 "simple.re" 58#define NULL ((char*) 0) 59char *scan(char *p){ 60char *q; 61#define YYCTYPE char 62#define YYCURSOR p 63#define YYLIMIT p 64#define YYMARKER q 65#define YYFILL(n) 66{ 67 YYCTYPE yych; 68 unsigned int yyaccept; 69 goto yy0; 70yy1: ++YYCURSOR; 71yy0: 72 if((YYLIMIT - YYCURSOR) < 2) YYFILL(2); 73 yych = *YYCURSOR; 74 if(yych <= '/') goto yy4; 75 if(yych >= ':') goto yy4; 76yy2: yych = *++YYCURSOR; 77 goto yy7; 78yy3: 79#line 10 80 {return YYCURSOR;} 81yy4: yych = *++YYCURSOR; 82yy5: 83#line 11 84 {return NULL;} 85yy6: ++YYCURSOR; 86 if(YYLIMIT == YYCURSOR) YYFILL(1); 87 yych = *YYCURSOR; 88yy7: if(yych <= '/') goto yy3; 89 if(yych <= '9') goto yy6; 90 goto yy3; 91} 92#line 12 93 94} 95.fi 96.in -3 97 98.SH OPTIONS 99\*(re provides the following options: 100.TP 101\fB-e\fP 102Cross-compile from an ASCII platform to an EBCDIC one. 103.TP 104\fB-s\fP 105Generate nested \fCif\fPs for some \fCswitch\fPes. Many compilers need this 106assist to generate better code. 107.TP 108\fB-b\fP 109Implies \fB-s\fP. Use bit vectors as well in the attempt to coax better 110code out of the compiler. Most useful for specifications with more than a 111few keywords (e.g. for most programming languages). 112 113.SH "INTERFACE CODE" 114Unlike other scanner generators, \*(re does not generate complete scanners: 115the user must supply some interface code. 116In particular, the user must define the following macros: 117.TP 118\fCYYCHAR\fP 119Type used to hold an input symbol. 120Usually \fCchar\fP or \fCunsigned char\fP. 121.TP 122\fCYYCURSOR\fP 123\*(lx of type \fC*YYCHAR\fP that points to the current input symbol. 124The generated code advances \fCYYCURSOR\fP as symbols are matched. 125On entry, \fCYYCURSOR\fP is assumed to point to the first character of the 126current token. On exit, \fCYYCURSOR\fP will point to the first character of 127the following token. 128.TP 129\fCYLIMIT\fP 130Expression of type \fC*YYCHAR\fP that marks the end of the buffer 131(\fCYLIMIT[-1]\fP is the last character in the buffer). 132The generated code repeatedly compares \fCYYCURSOR\fP to \fCYLIMIT\fP 133to determine when the buffer needs (re)filling. 134.TP 135\fCYYMARKER\fP 136\*(lx of type \fC*YYCHAR\fP. 137The generated code saves backtracking information in \fCYYMARKER\fP. 138.TP 139\fCYYFILL(\fP\fIn\fP\fC)\fP 140The generated code "calls" \fCYYFILL\fP when the buffer needs 141(re)filling: at least \fIn\fP additional characters should 142be provided. \fCYYFILL\fP should adjust \fCYYCURSOR\fP, \fCYYLIMIT\fP and 143\fCYYMARKER\fP as needed. Note that for typical programming languages 144\fIn\fP will be the length of the longest keyword plus one. 145 146.SH "SCANNER SPECIFICATIONS" 147Each scanner specification consists of a set of \fIrules\fP and name 148definitions. 149Rules consist of a regular expression along with a block of C/C++ code that 150is to be executed when the associated regular expression is matched. 151Name definitions are of the form 152``\fIname\fP \fC=\fP \fIregular expression\fP\fC;\fP''. 153 154.SH "SUMMARY OF RE2C REGULAR EXPRESSIONS" 155.TP 156\fC"foo"\fP 157the literal string \fCfoo\fP. 158ANSI-C escape sequences can be used. 159.TP 160\fC[xyz]\fP 161a "character class"; in this case, 162the \*(rx matches either an '\fCx\fP', a '\fCy\fP', or a '\fCz\fP'. 163.TP 164\fC[abj-oZ]\fP 165a "character class" with a range in it; 166matches an '\fCa\fP', a '\fCb\fP', any letter from '\fCj\fP' through '\fCo\fP', 167or a '\fCZ\fP'. 168.TP 169\fIr\fP\fC\e\fP\fIs\fP 170match any \fIr\fP which isn't an \fIs\fP. \fIr\fP and \fIs\fP must be regular expressions 171which can be expressed as character classes. 172.TP 173\fIr\fP\fC*\fP 174zero or more \fIr\fP's, where \fIr\fP is any regular expression 175.TP 176\fC\fIr\fP\fC+\fP 177one or more \fIr\fP's 178.TP 179\fC\fIr\fP\fC?\fP 180zero or one \fIr\fP's (that is, "an optional \fIr\fP") 181.TP 182name 183the expansion of the "name" definition (see above) 184.TP 185\fC(\fP\fIr\fP\fC)\fP 186an \fIr\fP; parentheses are used to override precedence 187(see below) 188.TP 189\fIrs\fP 190an \fIr\fP followed by an \fIs\fP ("concatenation") 191.TP 192\fIr\fP\fC|\fP\fIs\fP 193either an \fIr\fP or an \fIs\fP 194.TP 195\fIr\fP\fC/\fP\fIs\fP 196an \fIr\fP but only if it is followed by an \fIs\fP. The s is not part of 197the matched text. This type of \*(rx is called "trailing context". 198.LP 199The regular expressions listed above are grouped according to 200precedence, from highest precedence at the top to lowest at the bottom. 201Those grouped together have equal precedence. 202 203.SH "A LARGER EXAMPLE" 204.LP 205.in +3 206.nf 207#include <stdlib.h> 208#include <stdio.h> 209#include <fcntl.h> 210#include <string.h> 211 212#define ADDEQ 257 213#define ANDAND 258 214#define ANDEQ 259 215#define ARRAY 260 216#define ASM 261 217#define AUTO 262 218#define BREAK 263 219#define CASE 264 220#define CHAR 265 221#define CONST 266 222#define CONTINUE 267 223#define DECR 268 224#define DEFAULT 269 225#define DEREF 270 226#define DIVEQ 271 227#define DO 272 228#define DOUBLE 273 229#define ELLIPSIS 274 230#define ELSE 275 231#define ENUM 276 232#define EQL 277 233#define EXTERN 278 234#define FCON 279 235#define FLOAT 280 236#define FOR 281 237#define FUNCTION 282 238#define GEQ 283 239#define GOTO 284 240#define ICON 285 241#define ID 286 242#define IF 287 243#define INCR 288 244#define INT 289 245#define LEQ 290 246#define LONG 291 247#define LSHIFT 292 248#define LSHIFTEQ 293 249#define MODEQ 294 250#define MULEQ 295 251#define NEQ 296 252#define OREQ 297 253#define OROR 298 254#define POINTER 299 255#define REGISTER 300 256#define RETURN 301 257#define RSHIFT 302 258#define RSHIFTEQ 303 259#define SCON 304 260#define SHORT 305 261#define SIGNED 306 262#define SIZEOF 307 263#define STATIC 308 264#define STRUCT 309 265#define SUBEQ 310 266#define SWITCH 311 267#define TYPEDEF 312 268#define UNION 313 269#define UNSIGNED 314 270#define VOID 315 271#define VOLATILE 316 272#define WHILE 317 273#define XOREQ 318 274#define EOI 319 275 276typedef unsigned int uint; 277typedef unsigned char uchar; 278 279#define BSIZE 8192 280 281#define YYCTYPE uchar 282#define YYCURSOR cursor 283#define YYLIMIT s->lim 284#define YYMARKER s->ptr 285#define YYFILL(n) {cursor = fill(s, cursor);} 286 287#define RET(i) {s->cur = cursor; return i;} 288 289typedef struct Scanner { 290 int fd; 291 uchar *bot, *tok, *ptr, *cur, *pos, *lim, *top, *eof; 292 uint line; 293} Scanner; 294 295uchar *fill(Scanner *s, uchar *cursor){ 296 if(!s->eof){ 297 uint cnt = s->tok - s->bot; 298 if(cnt){ 299 memcpy(s->bot, s->tok, s->lim - s->tok); 300 s->tok = s->bot; 301 s->ptr -= cnt; 302 cursor -= cnt; 303 s->pos -= cnt; 304 s->lim -= cnt; 305 } 306 if((s->top - s->lim) < BSIZE){ 307 uchar *buf = (uchar*) 308 malloc(((s->lim - s->bot) + BSIZE)*sizeof(uchar)); 309 memcpy(buf, s->tok, s->lim - s->tok); 310 s->tok = buf; 311 s->ptr = &buf[s->ptr - s->bot]; 312 cursor = &buf[cursor - s->bot]; 313 s->pos = &buf[s->pos - s->bot]; 314 s->lim = &buf[s->lim - s->bot]; 315 s->top = &s->lim[BSIZE]; 316 free(s->bot); 317 s->bot = buf; 318 } 319 if((cnt = read(s->fd, (char*) s->lim, BSIZE)) != BSIZE){ 320 s->eof = &s->lim[cnt]; *(s->eof)++ = '\\n'; 321 } 322 s->lim += cnt; 323 } 324 return cursor; 325} 326 327int scan(Scanner *s){ 328 uchar *cursor = s->cur; 329std: 330 s->tok = cursor; 331/*!re2c 332any = [\\000-\\377]; 333O = [0-7]; 334D = [0-9]; 335L = [a-zA-Z_]; 336H = [a-fA-F0-9]; 337E = [Ee] [+-]? D+; 338FS = [fFlL]; 339IS = [uUlL]*; 340ESC = [\\\\] ([abfnrtv?'"\\\\] | "x" H+ | O+); 341*/ 342 343/*!re2c 344 "/*" { goto comment; } 345 346 "auto" { RET(AUTO); } 347 "break" { RET(BREAK); } 348 "case" { RET(CASE); } 349 "char" { RET(CHAR); } 350 "const" { RET(CONST); } 351 "continue" { RET(CONTINUE); } 352 "default" { RET(DEFAULT); } 353 "do" { RET(DO); } 354 "double" { RET(DOUBLE); } 355 "else" { RET(ELSE); } 356 "enum" { RET(ENUM); } 357 "extern" { RET(EXTERN); } 358 "float" { RET(FLOAT); } 359 "for" { RET(FOR); } 360 "goto" { RET(GOTO); } 361 "if" { RET(IF); } 362 "int" { RET(INT); } 363 "long" { RET(LONG); } 364 "register" { RET(REGISTER); } 365 "return" { RET(RETURN); } 366 "short" { RET(SHORT); } 367 "signed" { RET(SIGNED); } 368 "sizeof" { RET(SIZEOF); } 369 "static" { RET(STATIC); } 370 "struct" { RET(STRUCT); } 371 "switch" { RET(SWITCH); } 372 "typedef" { RET(TYPEDEF); } 373 "union" { RET(UNION); } 374 "unsigned" { RET(UNSIGNED); } 375 "void" { RET(VOID); } 376 "volatile" { RET(VOLATILE); } 377 "while" { RET(WHILE); } 378 379 L (L|D)* { RET(ID); } 380 381 ("0" [xX] H+ IS?) | ("0" D+ IS?) | (D+ IS?) | 382 (['] (ESC|any\\[\\n\\\\'])* [']) 383 { RET(ICON); } 384 385 (D+ E FS?) | (D* "." D+ E? FS?) | (D+ "." D* E? FS?) 386 { RET(FCON); } 387 388 (["] (ESC|any\\[\\n\\\\"])* ["]) 389 { RET(SCON); } 390 391 "..." { RET(ELLIPSIS); } 392 ">>=" { RET(RSHIFTEQ); } 393 "<<=" { RET(LSHIFTEQ); } 394 "+=" { RET(ADDEQ); } 395 "-=" { RET(SUBEQ); } 396 "*=" { RET(MULEQ); } 397 "/=" { RET(DIVEQ); } 398 "%=" { RET(MODEQ); } 399 "&=" { RET(ANDEQ); } 400 "^=" { RET(XOREQ); } 401 "|=" { RET(OREQ); } 402 ">>" { RET(RSHIFT); } 403 "<<" { RET(LSHIFT); } 404 "++" { RET(INCR); } 405 "--" { RET(DECR); } 406 "->" { RET(DEREF); } 407 "&&" { RET(ANDAND); } 408 "||" { RET(OROR); } 409 "<=" { RET(LEQ); } 410 ">=" { RET(GEQ); } 411 "==" { RET(EQL); } 412 "!=" { RET(NEQ); } 413 ";" { RET(';'); } 414 "{" { RET('{'); } 415 "}" { RET('}'); } 416 "," { RET(','); } 417 ":" { RET(':'); } 418 "=" { RET('='); } 419 "(" { RET('('); } 420 ")" { RET(')'); } 421 "[" { RET('['); } 422 "]" { RET(']'); } 423 "." { RET('.'); } 424 "&" { RET('&'); } 425 "!" { RET('!'); } 426 "~" { RET('~'); } 427 "-" { RET('-'); } 428 "+" { RET('+'); } 429 "*" { RET('*'); } 430 "/" { RET('/'); } 431 "%" { RET('%'); } 432 "<" { RET('<'); } 433 ">" { RET('>'); } 434 "^" { RET('^'); } 435 "|" { RET('|'); } 436 "?" { RET('?'); } 437 438 439 [ \\t\\v\\f]+ { goto std; } 440 441 "\\n" 442 { 443 if(cursor == s->eof) RET(EOI); 444 s->pos = cursor; s->line++; 445 goto std; 446 } 447 448 any 449 { 450 printf("unexpected character: %c\\n", *s->tok); 451 goto std; 452 } 453*/ 454 455comment: 456/*!re2c 457 "*/" { goto std; } 458 "\\n" 459 { 460 if(cursor == s->eof) RET(EOI); 461 s->tok = s->pos = cursor; s->line++; 462 goto comment; 463 } 464 any { goto comment; } 465*/ 466} 467 468main(){ 469 Scanner in; 470 int t; 471 memset((char*) &in, 0, sizeof(in)); 472 in.fd = 0; 473 while((t = scan(&in)) != EOI){ 474/* 475 printf("%d\\t%.*s\\n", t, in.cur - in.tok, in.tok); 476 printf("%d\\n", t); 477*/ 478 } 479 close(in.fd); 480} 481.fi 482.in -3 483 484.SH "SEE ALSO" 485.LP 486flex(1), lex(1). 487 488.SH FEATURES 489.LP 490\*(re does not provide a default action: 491the generated code assumes that the input 492will consist of a sequence of tokens. 493Typically this can be dealt with by adding a rule such as the one for 494unexpected characters in the example above. 495.LP 496The user must arrange for a sentinel token to appear at the end of input 497(and provide a rule for matching it): 498\*(re does not provide an \fC<<EOF>>\fP expression. 499If the source is from a null-byte terminated string, a 500rule matching a null character will suffice. If the source is from a 501file then the approach taken in the example can be used: pad the input with 502a newline (or some other character that can't appear within another token); 503upon recognizing such a character check to see if it is the sentinel 504and act accordingly. 505.LP 506\*(re does not provide start conditions: use a separate scanner 507specification for each start condition (as illustrated in the above example). 508.LP 509No [^x]. Use difference instead. 510.SH BUGS 511.LP 512Only fixed length trailing context can be handled. 513.LP 514The maximum value appearing as a parameter \fIn\fP to \fCYYFILL\fP is not 515provided to the generated code (this value is needed for constructing 516the interface code). 517Note that this value is usually relatively small: for 518typical programming languages \fIn\fP will be the length of the longest 519keyword plus one. 520.LP 521Difference only works for character sets. 522.LP 523The \*(re internal algorithms need documentation. 524 525.SH AUTHOR 526.LP 527Please send bug reports, fixes and feedback to: 528.LP 529.nf 530Peter Bumbulis 531Computer Systems Group 532University of Waterloo 533Waterloo, Ontario 534N2L 3G1 535Internet: peterr@csg.uwaterloo.ca 536.fi 537