#
# Secret Labs' Regular Expression Engine
#
# re-compatible interface for the sre matching engine
#
# Copyright (c) 1998-2001 by Secret Labs AB. All rights reserved.
#
# This version of the SRE library can be redistributed under CNRI's
# Python 1.6 license. For any other use, please contact Secret Labs
# AB (info@pythonware.com).
#
# Portions of this engine have been developed in cooperation with
# CNRI. Hewlett-Packard provided funding for 1.6 integration and
# other compatibility work.
#

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl. It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves. You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.

The special sequences consist of "\\" and a character from the list
below. If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9].
    \D       Matches any non-digit character; equivalent to the set [^0-9].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v].
    \S       Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.

This module exports the following functions:
    match    Match a regular expression pattern to the beginning of a string.
    search   Search a string for the presence of a pattern.
    sub      Substitute occurrences of a pattern found in a string.
    subn     Same as sub, but also return the number of substitutions made.
    split    Split a string by the occurrences of a pattern.
    findall  Find all occurrences of a pattern in a string.
    finditer Return an iterator yielding a match object for each match.
    compile  Compile a pattern into a RegexObject.
    purge    Clear the regular expression cache.
    escape   Backslash all non-alphanumerics in a string.

Some of the functions in this module take flags as optional parameters:
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     Make \w, \W, \b, \B, dependent on the Unicode locale.

This module also defines an exception 'error'.

"""

import sys
import sre_compile
import sre_parse

# public symbols
__all__ = [ "match", "search", "sub", "subn", "split", "findall",
    "compile", "purge", "template", "escape", "I", "L", "M", "S", "X",
    "U", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE", "error" ]

__version__ = "2.2.1"

# flags
I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode locale
M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments

# sre extensions (experimental, don't rely on these)
T = TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE # disable backtracking
DEBUG = sre_compile.SRE_FLAG_DEBUG # dump pattern after compilation

# sre exception
error = sre_compile.error

# --------------------------------------------------------------------
# public interface

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl. repl can be either a string or a callable;
    if a string, backslash escapes in it are processed. If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl. number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the match object and must
    return a replacement string to be used."""
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings."""
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

if sys.hexversion >= 0x02020000:
    __all__.append("finditer")
    def finditer(pattern, string, flags=0):
        """Return an iterator over all non-overlapping matches in the
        string. For each match, the iterator returns a match object.

        Empty matches are included in the result."""
        return _compile(pattern, flags).finditer(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

def purge():
    "Clear the regular expression cache"
    _cache.clear()
    _cache_repl.clear()

def template(pattern, flags=0):
    "Compile a template pattern, returning a pattern object"
    return _compile(pattern, flags|T)

_alphanum = frozenset(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

def escape(pattern):
    "Escape all non-alphanumeric characters in pattern."
    s = list(pattern)
    alphanum = _alphanum
    for i, c in enumerate(pattern):
        if c not in alphanum:
            if c == "\000":
                s[i] = "\\000"
            else:
                s[i] = "\\" + c
    return pattern[:0].join(s)

# --------------------------------------------------------------------
# internals

_cache = {}
_cache_repl = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 100

def _compile(*key):
    # internal: compile pattern
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None:
        return p
    pattern, flags = key
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError('Cannot process flags argument with a compiled pattern')
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError, "first argument must be string or compiled pattern"
    try:
        p = sre_compile.compile(pattern, flags)
    except error, v:
        raise error, v # invalid expression
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p

def _compile_repl(*key):
    # internal: compile replacement pattern
    p = _cache_repl.get(key)
    if p is not None:
        return p
    repl, pattern = key
    try:
        p = sre_parse.parse_template(repl, pattern)
    except error, v:
        raise error, v # invalid expression
    if len(_cache_repl) >= _MAXCACHE:
        _cache_repl.clear()
    _cache_repl[key] = p
    return p

def _expand(pattern, match, template):
    # internal: match.expand implementation hook
    template = sre_parse.parse_template(template, pattern)
    return sre_parse.expand_template(template, match)

def _subx(pattern, template):
    # internal: pattern.sub/subn implementation helper
    template = _compile_repl(template, pattern)
    if not template[0] and len(template[1]) == 1:
        # literal replacement
        return template[1][0]
    def filter(match, template=template):
        return sre_parse.expand_template(template, match)
    return filter

# register myself for pickling

import copy_reg

def _pickle(p):
    return _compile, (p.pattern, p.flags)

copy_reg.pickle(_pattern_type, _pickle, _compile)

# --------------------------------------------------------------------
# experimental stuff (see python-dev discussions for details)

class Scanner:
    def __init__(self, lexicon, flags=0):
        from sre_constants import BRANCH, SUBPATTERN
        self.lexicon = lexicon
        # combine phrases into a compound pattern
        p = []
        s = sre_parse.Pattern()
        s.flags = flags
        for phrase, action in lexicon:
            p.append(sre_parse.SubPattern(s, [
                (SUBPATTERN, (len(p)+1, sre_parse.parse(phrase, flags))),
                ]))
        s.groups = len(p)+1
        p = sre_parse.SubPattern(s, [(BRANCH, (None, p))])
        self.scanner = sre_compile.compile(p)
    def scan(self, string):
        result = []
        append = result.append
        match = self.scanner.scanner(string).match
        i = 0
        while 1:
            m = match()
            if not m:
                break
            j = m.end()
            if i == j:
                break
            action = self.lexicon[m.lastindex-1][1]
            if hasattr(action, '__call__'):
                self.match = m
                action = action(self, m.group())
            if action is not None:
                append(action)
            i = j
        return result, string[i:]
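
# --------------------------------------------------------------------
# usage sketch (not part of the original module)
#
# A minimal sketch of how the experimental Scanner class above might be
# used as a tokenizer. It assumes the standard library's re.Scanner
# behaves like the class defined here. The token labels ("INT", "NAME",
# "OP") and the sample input are made up for illustration.

```python
import re

# Each lexicon entry pairs a phrase (pattern) with an action. A callable
# action receives (scanner, matched_text); an action of None discards
# the matched text.
lexicon = [
    (r"\d+", lambda scanner, text: ("INT", int(text))),
    (r"[a-zA-Z_]\w*", lambda scanner, text: ("NAME", text)),
    (r"[+\-*/=]", lambda scanner, text: ("OP", text)),
    (r"\s+", None),  # skip whitespace
]

scanner = re.Scanner(lexicon)

# scan() returns the collected action results plus the unscanned tail.
tokens, remainder = scanner.scan("x = 40 + 2 ?tail")
print(tokens)
print(repr(remainder))
```

# scan() stops at the first position where no phrase matches, so a
# non-empty remainder signals input that the lexicon could not tokenize.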