#
# Secret Labs' Regular Expression Engine
#
# re-compatible interface for the sre matching engine
#
# Copyright (c) 1998-2001 by Secret Labs AB.  All rights reserved.
#
# This version of the SRE library can be redistributed under CNRI's
# Python 1.6 license.  For any other use, please contact Secret Labs
# AB (info@pythonware.com).
#
# Portions of this engine have been developed in cooperation with
# CNRI.  Hewlett-Packard provided funding for 1.6 integration and
# other compatibility work.
#

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.
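
A minimal sketch of a few of the special characters above, assuming this
module is importable under the name re.  The sample strings and variable
names are made up for illustration only.

```python
import re  # assumption: this module is installed as "re"

# Greedy "*" takes as much as it can; "*?" stops at the first chance.
greedy = re.match(r"<.*>", "<a><b>").group()       # '<a><b>'
lazy = re.match(r"<.*?>", "<a><b>").group()        # '<a>'

# A named group, then (?P=name) to require the same text again.
m = re.search(r"(?P<word>\w+) (?P=word)", "it is is fine")
repeated = m.group("word")                         # 'is'

# Lookahead checks what follows without consuming it.
amount = re.search(r"\d+(?= USD)", "cost: 42 USD").group()   # '42'

# Conditional: ">" is required only if the optional "<" group matched.
cond = re.compile(r"(<)?\w+(?(1)>)$")
balanced = cond.match("<user>") is not None        # True
bare = cond.match("user") is not None              # True
unclosed = cond.match("<user") is not None         # False
```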

The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9].
    \D       Matches any non-digit character; equivalent to the set [^0-9].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v].
    \S       Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.
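
The special sequences can be tried out with a short sketch like the one
below.  It assumes the module is importable as re; the input strings are
illustrative only.

```python
import re  # assumption: this module is installed as "re"

# \b anchors at a word boundary, so the "cat" inside "concatenate" is skipped.
words = re.findall(r"\bcat\b", "cat concatenate cat.")      # ['cat', 'cat']

# \w, \s and \d are shorthand for common character sets.
parts = re.match(r"(\w+)\s+(\d+)", "port   8080").groups()  # ('port', '8080')

# \1 refers back to whatever group 1 matched.
doubled = re.search(r"(\w)\1", "hello").group()             # 'll'

# \A and \Z match only at the very start and very end of the string,
# so a trailing newline makes the second match fail.
whole = re.match(r"\A\d+\Z", "12345") is not None           # True
with_newline = re.match(r"\A\d+\Z", "12345\n") is not None  # False
```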

This module exports the following functions:
    match    Match a regular expression pattern to the beginning of a string.
    search   Search a string for the presence of a pattern.
    sub      Substitute occurrences of a pattern found in a string.
    subn     Same as sub, but also return the number of substitutions made.
    split    Split a string by the occurrences of a pattern.
    findall  Find all occurrences of a pattern in a string.
    finditer Return an iterator yielding a match object for each match.
    compile  Compile a pattern into a RegexObject.
    purge    Clear the regular expression cache.
    escape   Backslash all non-alphanumerics in a string.
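
As a rough illustration of how the exported functions differ, the sketch
below runs each of them over one made-up input string.  It assumes the
module is importable as re; the variable names are hypothetical.

```python
import re  # assumption: this module is installed as "re"

text = "a1b22c333"

at_start = re.match(r"\d+", text)               # None: text starts with "a"
first = re.search(r"\d+", text).group()         # '1'
all_nums = re.findall(r"\d+", text)             # ['1', '22', '333']
spans = [m.span() for m in re.finditer(r"\d+", text)]
pieces = re.split(r"\d+", text)                 # ['a', 'b', 'c', '']
replaced = re.sub(r"\d+", "#", text)            # 'a#b#c#'
result, count = re.subn(r"\d+", "#", text)      # 'a#b#c#' and 3

# compile() returns a pattern object exposing the same operations as methods.
digits = re.compile(r"\d+")
same_nums = digits.findall(text)

# escape() backslashes the ".", so it is matched literally.
escaped = re.escape("a.b")
literal_only = re.match(escaped, "axb") is None  # True

# purge() only empties the internal cache; results are unaffected.
re.purge()
```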

Some of the functions in this module take flags as optional parameters:
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     Make \w, \W, \b, \B dependent on the Unicode locale.
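
A small sketch of what some of these flags change, assuming the module is
importable as re.  The patterns and inputs are made up for illustration.

```python
import re  # assumption: this module is installed as "re"

# IGNORECASE, passed as an argument or set inline with (?i).
by_flag = re.match(r"hello", "HeLLo", re.I) is not None      # True
inline = re.match(r"(?i)hello", "HELLO") is not None         # True

# MULTILINE: "^" also matches right after each newline.
one_line = re.findall(r"^\w+", "one\ntwo\nthree")            # ['one']
per_line = re.findall(r"^\w+", "one\ntwo\nthree", re.M)      # ['one', 'two', 'three']

# DOTALL: "." also matches the newline.
without_s = re.match(r"a.b", "a\nb") is not None             # False
with_s = re.match(r"a.b", "a\nb", re.S) is not None          # True

# VERBOSE: whitespace and "#" comments inside the pattern are ignored.
number = re.compile(r'''
    \d+   # integer part
    \.    # decimal point
    \d*   # optional fraction
''', re.X)
matched = number.match("3.14").group()                       # '3.14'
```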

This module also defines an exception 'error'.

"""

import sys
import sre_compile
import sre_parse

# public symbols
__all__ = [ "match", "search", "sub", "subn", "split", "findall",
    "compile", "purge", "template", "escape", "I", "L", "M", "S", "X",
    "U", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE", "error" ]

__version__ = "2.2.1"

# flags
I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode locale
M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments

# sre extensions (experimental, don't rely on these)
T = TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE # disable backtracking
DEBUG = sre_compile.SRE_FLAG_DEBUG # dump pattern after compilation

# sre exception
error = sre_compile.error

# --------------------------------------------------------------------
# public interface

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl.  number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the match object and must
    return a replacement string to be used."""
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings."""
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more groups are present in the pattern, return a
    list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

if sys.hexversion >= 0x02020000:
    __all__.append("finditer")
    def finditer(pattern, string, flags=0):
        """Return an iterator over all non-overlapping matches in the
        string.  For each match, the iterator returns a match object.

        Empty matches are included in the result."""
        return _compile(pattern, flags).finditer(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

def purge():
    "Clear the regular expression cache"
    _cache.clear()
    _cache_repl.clear()

def template(pattern, flags=0):
    "Compile a template pattern, returning a pattern object"
    return _compile(pattern, flags|T)

_alphanum = frozenset(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

def escape(pattern):
    "Escape all non-alphanumeric characters in pattern."
    s = list(pattern)
    alphanum = _alphanum
    for i, c in enumerate(pattern):
        if c not in alphanum:
            if c == "\000":
                s[i] = "\\000"
            else:
                s[i] = "\\" + c
    return pattern[:0].join(s)

# --------------------------------------------------------------------
# internals

_cache = {}
_cache_repl = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 100

def _compile(*key):
    # internal: compile pattern
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None:
        return p
    pattern, flags = key
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError('Cannot process flags argument with a compiled pattern')
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    try:
        p = sre_compile.compile(pattern, flags)
    except error, v:
        raise error, v # invalid expression
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p

def _compile_repl(*key):
    # internal: compile replacement pattern
    p = _cache_repl.get(key)
    if p is not None:
        return p
    repl, pattern = key
    try:
        p = sre_parse.parse_template(repl, pattern)
    except error, v:
        raise error, v # invalid expression
    if len(_cache_repl) >= _MAXCACHE:
        _cache_repl.clear()
    _cache_repl[key] = p
    return p

def _expand(pattern, match, template):
    # internal: match.expand implementation hook
    template = sre_parse.parse_template(template, pattern)
    return sre_parse.expand_template(template, match)

def _subx(pattern, template):
    # internal: pattern.sub/subn implementation helper
    template = _compile_repl(template, pattern)
    if not template[0] and len(template[1]) == 1:
        # literal replacement
        return template[1][0]
    def filter(match, template=template):
        return sre_parse.expand_template(template, match)
    return filter

# register myself for pickling

import copy_reg

def _pickle(p):
    return _compile, (p.pattern, p.flags)

copy_reg.pickle(_pattern_type, _pickle, _compile)

# --------------------------------------------------------------------
# experimental stuff (see python-dev discussions for details)

class Scanner:
    def __init__(self, lexicon, flags=0):
        from sre_constants import BRANCH, SUBPATTERN
        self.lexicon = lexicon
        # combine phrases into a compound pattern
        p = []
        s = sre_parse.Pattern()
        s.flags = flags
        for phrase, action in lexicon:
            p.append(sre_parse.SubPattern(s, [
                (SUBPATTERN, (len(p)+1, sre_parse.parse(phrase, flags))),
                ]))
        s.groups = len(p)+1
        p = sre_parse.SubPattern(s, [(BRANCH, (None, p))])
        self.scanner = sre_compile.compile(p)
    def scan(self, string):
        result = []
        append = result.append
        match = self.scanner.scanner(string).match
        i = 0
        while 1:
            m = match()
            if not m:
                break
            j = m.end()
            if i == j:
                break
            action = self.lexicon[m.lastindex-1][1]
            if hasattr(action, '__call__'):
                self.match = m
                action = action(self, m.group())
            if action is not None:
                append(action)
            i = j
        return result, string[i:]