"""Beautiful Soup
Elixir and Tonic
"The Screen-Scraper's Friend"
http://www.crummy.com/software/BeautifulSoup/

Beautiful Soup parses a (possibly invalid) XML or HTML document into a
tree representation. It provides methods and Pythonic idioms that make
it easy to navigate, search, and modify the tree.
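
As a rough illustration of what navigating such a tree means, here is a
minimal sketch of nodes that keep parent and sibling links, in the
spirit of the PageElement class defined in this module. The Node class
and the document_order helper are hypothetical names invented for this
example; they are not part of Beautiful Soup.

```python
class Node(object):
    # Hypothetical stand-in for a parsed element; not a Beautiful Soup class.
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.contents = []
        self.previousSibling = None
        self.nextSibling = None

    def append(self, child):
        # Link the new child to its parent and to the current last child.
        child.parent = self
        if self.contents:
            child.previousSibling = self.contents[-1]
            self.contents[-1].nextSibling = child
        self.contents.append(child)
        return child


def document_order(node):
    # Yield the node itself, then every descendant, depth first.
    yield node
    for child in node.contents:
        for descendant in document_order(child):
            yield descendant


html = Node('html')
body = html.append(Node('body'))
first = body.append(Node('p'))
second = body.append(Node('p'))
first.append(Node('b'))

names = [n.name for n in document_order(html)]
```

Walking this sketch in document order visits html, body, the first p,
b, and then the second p, while first.nextSibling is the second p.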

A well-formed XML/HTML document yields a well-formed data
structure. An ill-formed XML/HTML document yields a correspondingly
ill-formed data structure. If your document is only locally
well-formed, you can use this library to find and process the
well-formed part of it.

Beautiful Soup works with Python 2.2 and up. It has no external
dependencies, but you'll have more success at converting data to UTF-8
if you also install these three packages:

* chardet, for auto-detecting character encodings
  http://chardet.feedparser.org/
* cjkcodecs and iconv_codec, which add more encodings to the ones supported
  by stock Python.
  http://cjkpython.i18n.org/

Beautiful Soup defines classes for two main parsing strategies:

 * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific
   language that kind of looks like XML.

 * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid
   or invalid. This class has web browser-like heuristics for
   obtaining a sensible parse tree in the face of common HTML errors.

Beautiful Soup also defines a class (UnicodeDammit) for autodetecting
the encoding of an HTML or XML document, and converting it to
Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.
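
One simple way to picture the conversion to Unicode is to try a list of
candidate encodings in order and keep the first one that decodes
cleanly. The sketch below does only that. The to_unicode function and
its default candidate list are assumptions made for this example, and
UnicodeDammit's real detection logic is more involved.

```python
def to_unicode(data, candidates=('utf-8', 'windows-1252')):
    # Hypothetical helper, not part of Beautiful Soup: return the decoded
    # text and the first candidate encoding that decodes without error,
    # or (None, None) if every candidate fails.
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


# The word 'cafe' with an accented final letter, encoded two ways.
as_utf8 = bytes(bytearray([0x63, 0x61, 0x66, 0xC3, 0xA9]))
as_cp1252 = bytes(bytearray([0x63, 0x61, 0x66, 0xE9]))

text_a, encoding_a = to_unicode(as_utf8)
text_b, encoding_b = to_unicode(as_cp1252)
```

Order matters in a scheme like this: a permissive single-byte encoding
placed first would accept almost any input, so stricter encodings such
as UTF-8 are tried before it.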

For more than you ever wanted to know about Beautiful Soup, see the
documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html

Here, have some legalese:

Copyright (c) 2004-2010, Leonard Richardson

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  * Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the following
    disclaimer in the documentation and/or other materials provided
    with the distribution.

  * Neither the name of the the Beautiful Soup Consortium and All
    Night Kosher Bakery nor the names of its contributors may be
    used to endorse or promote products derived from this software
    without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.

"""
from __future__ import generators

__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "3.2.1"
__copyright__ = "Copyright (c) 2004-2012 Leonard Richardson"
__license__ = "New-style BSD"

from sgmllib import SGMLParser, SGMLParseError
import codecs
import markupbase
import types
import re
import sgmllib
try:
  from htmlentitydefs import name2codepoint
except ImportError:
  name2codepoint = {}
try:
    set
except NameError:
    from sets import Set as set

#These hacks make Beautiful Soup able to parse XML with namespaces
sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match

DEFAULT_OUTPUT_ENCODING = "utf-8"

def _match_css_class(str):
    """Build a RE to match the given CSS class."""
    return re.compile(r"(^|.*\s)%s($|\s)" % str)

# First, the classes that represent markup elements.

class PageElement(object):
    """Contains the navigational information for some part of the page
    (either a tag or a piece of text)"""

    def _invert(h):
        "Cheap function to invert a hash."
        i = {}
        for k,v in h.items():
            i[v] = k
        return i

    XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'",
                                      "quot" : '"',
                                      "amp" : "&",
                                      "lt" : "<",
                                      "gt" : ">" }

    XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS)

    def setup(self, parent=None, previous=None):
        """Sets up the initial relations between this element and
        other elements."""
        self.parent = parent
        self.previous = previous
        self.next = None
        self.previousSibling = None
        self.nextSibling = None
        if self.parent and self.parent.contents:
            self.previousSibling = self.parent.contents[-1]
            self.previousSibling.nextSibling = self

    def replaceWith(self, replaceWith):
        """Replaces this element with the given element or string."""
        oldParent = self.parent
        myIndex = self.parent.index(self)
        # No index adjustment is needed here when replaceWith is an
        # earlier sibling of this element: insert() compensates for
        # the shift caused by extracting it.
        self.extract()
        oldParent.insert(myIndex, replaceWith)

    def replaceWithChildren(self):
        """Replaces this element with its own children."""
        myParent = self.parent
        myIndex = self.parent.index(self)
        self.extract()
        reversedChildren = list(self.contents)
        reversedChildren.reverse()
        for child in reversedChildren:
            myParent.insert(myIndex, child)

    def extract(self):
        """Destructively rips this element out of the tree."""
        if self.parent:
            try:
                del self.parent.contents[self.parent.index(self)]
            except ValueError:
                pass

        #Find the two elements that would be next to each other if
        #this element (and any children) hadn't been parsed. Connect
        #the two.
        lastChild = self._lastRecursiveChild()
        nextElement = lastChild.next

        if self.previous:
            self.previous.next = nextElement
        if nextElement:
            nextElement.previous = self.previous
        self.previous = None
        lastChild.next = None

        self.parent = None
        if self.previousSibling:
            self.previousSibling.nextSibling = self.nextSibling
        if self.nextSibling:
            self.nextSibling.previousSibling = self.previousSibling
        self.previousSibling = self.nextSibling = None
        return self

    def _lastRecursiveChild(self):
        "Finds the last element beneath this object to be parsed."
        lastChild = self
        while hasattr(lastChild, 'contents') and lastChild.contents:
            lastChild = lastChild.contents[-1]
        return lastChild

    def insert(self, position, newChild):
        """Inserts the given element or string at the given position
        in this element's contents."""
        if isinstance(newChild, basestring) \
            and not isinstance(newChild, NavigableString):
            newChild = NavigableString(newChild)

        position = min(position, len(self.contents))
        if hasattr(newChild, 'parent') and newChild.parent is not None:
            # We're 'inserting' an element that's already one
            # of this object's children.
            if newChild.parent is self:
                index = self.index(newChild)
                if index < position:
                    # Furthermore we're moving it further down the
                    # list of this object's children. That means that
                    # when we extract this element, our target index
                    # will jump down one.
                    position = position - 1
            newChild.extract()

        newChild.parent = self
        previousChild = None
        if position == 0:
            newChild.previousSibling = None
            newChild.previous = self
        else:
            previousChild = self.contents[position-1]
            newChild.previousSibling = previousChild
            newChild.previousSibling.nextSibling = newChild
            newChild.previous = previousChild._lastRecursiveChild()
        if newChild.previous:
            newChild.previous.next = newChild

        newChildsLastElement = newChild._lastRecursiveChild()

        if position >= len(self.contents):
            newChild.nextSibling = None

            parent = self
            parentsNextSibling = None
            while not parentsNextSibling:
                parentsNextSibling = parent.nextSibling
                parent = parent.parent
                if not parent: # This is the last element in the document.
                    break
            if parentsNextSibling:
                newChildsLastElement.next = parentsNextSibling
            else:
                newChildsLastElement.next = None
        else:
            nextChild = self.contents[position]
            newChild.nextSibling = nextChild
            if newChild.nextSibling:
                newChild.nextSibling.previousSibling = newChild
            newChildsLastElement.next = nextChild

        if newChildsLastElement.next:
            newChildsLastElement.next.previous = newChildsLastElement
        self.contents.insert(position, newChild)

    def append(self, tag):
        """Appends the given tag to the contents of this tag."""
        self.insert(len(self.contents), tag)

    def findNext(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the first item that matches the given criteria and
        appears after this Tag in the document."""
        return self._findOne(self.findAllNext, name, attrs, text, **kwargs)

    def findAllNext(self, name=None, attrs={}, text=None, limit=None,
                    **kwargs):
        """Returns all items that match the given criteria and appear
        after this Tag in the document."""
        return self._findAll(name, attrs, text, limit, self.nextGenerator,
                             **kwargs)

    def findNextSibling(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the closest sibling to this Tag that matches the
        given criteria and appears after this Tag in the document."""
        return self._findOne(self.findNextSiblings, name, attrs, text,
                             **kwargs)

    def findNextSiblings(self, name=None, attrs={}, text=None, limit=None,
                         **kwargs):
        """Returns the siblings of this Tag that match the given
        criteria and appear after this Tag in the document."""
        return self._findAll(name, attrs, text, limit,
                             self.nextSiblingGenerator, **kwargs)
    fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x

    def findPrevious(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the first item that matches the given criteria and
        appears before this Tag in the document."""
        return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs)

    def findAllPrevious(self, name=None, attrs={}, text=None, limit=None,
                        **kwargs):
        """Returns all items that match the given criteria and appear
        before this Tag in the document."""
        return self._findAll(name, attrs, text, limit, self.previousGenerator,
                             **kwargs)
    fetchPrevious = findAllPrevious # Compatibility with pre-3.x

    def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the closest sibling to this Tag that matches the
        given criteria and appears before this Tag in the document."""
        return self._findOne(self.findPreviousSiblings, name, attrs, text,
                             **kwargs)

    def findPreviousSiblings(self, name=None, attrs={}, text=None,
                             limit=None, **kwargs):
        """Returns the siblings of this Tag that match the given
        criteria and appear before this Tag in the document."""
        return self._findAll(name, attrs, text, limit,
                             self.previousSiblingGenerator, **kwargs)
    fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x

    def findParent(self, name=None, attrs={}, **kwargs):
        """Returns the closest parent of this Tag that matches the given
        criteria."""
        # NOTE: We can't use _findOne because findParents takes a different
        # set of arguments.
        r = None
        l = self.findParents(name, attrs, 1)
        if l:
            r = l[0]
        return r

    def findParents(self, name=None, attrs={}, limit=None, **kwargs):
        """Returns the parents of this Tag that match the given
        criteria."""

        return self._findAll(name, attrs, None, limit, self.parentGenerator,
                             **kwargs)
    fetchParents = findParents # Compatibility with pre-3.x
338e64493206b76bce6e3af0598790ec076094e8c37Simran Basi
    #These methods do the real heavy lifting.

    def _findOne(self, method, name, attrs, text, **kwargs):
        r = None
        l = method(name, attrs, text, 1, **kwargs)
        if l:
            r = l[0]
        return r

    def _findAll(self, name, attrs, text, limit, generator, **kwargs):
        "Iterates over a generator looking for things that match."

        if isinstance(name, SoupStrainer):
            strainer = name
        # (Possibly) special case some findAll*(...) searches
        elif text is None and not limit and not attrs and not kwargs:
            # findAll*(True)
            if name is True:
                return [element for element in generator()
                        if isinstance(element, Tag)]
            # findAll*('tag-name')
            elif isinstance(name, basestring):
                return [element for element in generator()
                        if isinstance(element, Tag) and
                        element.name == name]
            else:
                strainer = SoupStrainer(name, attrs, text, **kwargs)
        # Build a SoupStrainer
        else:
            strainer = SoupStrainer(name, attrs, text, **kwargs)
        results = ResultSet(strainer)
        g = generator()
        while True:
            try:
                i = g.next()
            except StopIteration:
                break
            if i:
                found = strainer.search(i)
                if found:
                    results.append(found)
                    if limit and len(results) >= limit:
                        break
        return results

    #These Generators can be used to navigate starting from both
    #NavigableStrings and Tags.
    def nextGenerator(self):
        i = self
        while i is not None:
            i = i.next
            yield i

    def nextSiblingGenerator(self):
        i = self
        while i is not None:
            i = i.nextSibling
            yield i

    def previousGenerator(self):
        i = self
        while i is not None:
            i = i.previous
            yield i

    def previousSiblingGenerator(self):
        i = self
        while i is not None:
            i = i.previousSibling
            yield i

    def parentGenerator(self):
        i = self
        while i is not None:
            i = i.parent
            yield i

    # Utility methods
    def substituteEncoding(self, str, encoding=None):
        encoding = encoding or "utf-8"
        return str.replace("%SOUP-ENCODING%", encoding)

    def toEncoding(self, s, encoding=None):
        """Encodes an object to a string in some encoding, or to Unicode
        if no encoding is given."""
        if isinstance(s, unicode):
            if encoding:
                s = s.encode(encoding)
        elif isinstance(s, str):
            if encoding:
                s = s.encode(encoding)
            else:
                s = unicode(s)
        else:
            if encoding:
                s = self.toEncoding(str(s), encoding)
            else:
                s = unicode(s)
        return s

    BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
                                           + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
                                           + ")")

    def _sub_entity(self, x):
        """Used with a regular expression to substitute the
        appropriate XML entity for an XML special character."""
        return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";"


class NavigableString(unicode, PageElement):

    def __new__(cls, value):
        """Create a new NavigableString.

        When unpickling a NavigableString, this method is called with
        the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
        passed in to the superclass's __new__ or the superclass won't know
        how to handle non-ASCII characters.
        """
        if isinstance(value, unicode):
            return unicode.__new__(cls, value)
        return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)

    def __getnewargs__(self):
        return (NavigableString.__str__(self),)

    def __getattr__(self, attr):
        """text.string gives you text. This is for backwards
        compatibility for Navigable*String, but for CData* it lets you
        get the string without the CData wrapper."""
        if attr == 'string':
            return self
        else:
            raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)

    def __unicode__(self):
        return str(self).decode(DEFAULT_OUTPUT_ENCODING)

    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        # Substitute outgoing XML entities.
        data = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, self)
        if encoding:
            return data.encode(encoding)
        else:
            return data

class CData(NavigableString):

    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return "<![CDATA[%s]]>" % NavigableString.__str__(self, encoding)

class ProcessingInstruction(NavigableString):
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        output = self
        if "%SOUP-ENCODING%" in output:
            output = self.substituteEncoding(output, encoding)
        return "<?%s?>" % self.toEncoding(output, encoding)

class Comment(NavigableString):
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return "<!--%s-->" % NavigableString.__str__(self, encoding)

class Declaration(NavigableString):
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return "<!%s>" % NavigableString.__str__(self, encoding)

class Tag(PageElement):

    """Represents a found HTML tag with its attributes and contents."""

    def _convertEntities(self, match):
        """Used in a call to re.sub to replace HTML, XML, and numeric
        entities with the appropriate Unicode characters. If HTML
        entities are being converted, any unrecognized entities are
        escaped."""
        x = match.group(1)
        if self.convertHTMLEntities and x in name2codepoint:
            return unichr(name2codepoint[x])
        elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS:
            if self.convertXMLEntities:
                return self.XML_ENTITIES_TO_SPECIAL_CHARS[x]
            else:
                return u'&%s;' % x
        elif len(x) > 0 and x[0] == '#':
            # Handle numeric entities
            if len(x) > 1 and x[1] == 'x':
                return unichr(int(x[2:], 16))
            else:
                return unichr(int(x[1:]))

        elif self.escapeUnrecognizedEntities:
            return u'&amp;%s;' % x
        else:
            return u'&%s;' % x

    def __init__(self, parser, name, attrs=None, parent=None,
                 previous=None):
        "Basic constructor."

        # We don't actually store the parser object: that lets extracted
        # chunks be garbage-collected
        self.parserClass = parser.__class__
        self.isSelfClosing = parser.isSelfClosingTag(name)
        self.name = name
        if attrs is None:
            attrs = []
        elif isinstance(attrs, dict):
            attrs = attrs.items()
        self.attrs = attrs
        self.contents = []
        self.setup(parent, previous)
        self.hidden = False
        self.containsSubstitutions = False
        self.convertHTMLEntities = parser.convertHTMLEntities
        self.convertXMLEntities = parser.convertXMLEntities
        self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities

        # Convert any HTML, XML, or numeric entities in the attribute values.
        convert = lambda(k, val): (k,
                                   re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);",
                                          self._convertEntities,
                                          val))
        self.attrs = map(convert, self.attrs)

    def getString(self):
        """Returns this tag's only child if that child is a
        NavigableString; otherwise returns None."""
        if (len(self.contents) == 1
            and isinstance(self.contents[0], NavigableString)):
            return self.contents[0]

    def setString(self, string):
        """Replace the contents of the tag with a string"""
        self.clear()
        self.append(string)

    string = property(getString, setString)

    def getText(self, separator=u""):
        """Returns all strings found beneath this tag, each stripped of
        surrounding whitespace, joined with 'separator'."""
        if not len(self.contents):
            return u""
        stopNode = self._lastRecursiveChild().next
        strings = []
        current = self.contents[0]
        while current is not stopNode:
            if isinstance(current, NavigableString):
                strings.append(current.strip())
            current = current.next
        return separator.join(strings)

    text = property(getText)

    def get(self, key, default=None):
        """Returns the value of the 'key' attribute for the tag, or
        the value given for 'default' if it doesn't have that
        attribute."""
        return self._getAttrMap().get(key, default)

    def clear(self):
        """Extract all children."""
        for child in self.contents[:]:
            child.extract()

    def index(self, element):
        for i, child in enumerate(self.contents):
            if child is element:
                return i
        raise ValueError("Tag.index: element not in tag")

    def has_key(self, key):
        return self._getAttrMap().has_key(key)

    def __getitem__(self, key):
        """tag[key] returns the value of the 'key' attribute for the tag,
        and throws an exception if it's not there."""
        return self._getAttrMap()[key]

    def __iter__(self):
        "Iterating over a tag iterates over its contents."
        return iter(self.contents)

    def __len__(self):
        "The length of a tag is the length of its list of contents."
        return len(self.contents)

    def __contains__(self, x):
        return x in self.contents

    def __nonzero__(self):
        "A tag always evaluates as true, even if it has no contents."
        return True

    def __setitem__(self, key, value):
        """Setting tag[key] sets the value of the 'key' attribute for the
        tag."""
        self._getAttrMap()
        self.attrMap[key] = value
        found = False
        for i in range(0, len(self.attrs)):
            if self.attrs[i][0] == key:
                self.attrs[i] = (key, value)
                found = True
        if not found:
            self.attrs.append((key, value))
        self._getAttrMap()[key] = value

    def __delitem__(self, key):
        "Deleting tag[key] deletes all 'key' attributes for the tag."
        # Iterate over a copy so that removing an item does not cause
        # the next one to be skipped. We don't break because bad HTML
        # can define the same attribute multiple times.
        for item in self.attrs[:]:
            if item[0] == key:
                self.attrs.remove(item)
        self._getAttrMap()
        if self.attrMap.has_key(key):
            del self.attrMap[key]

    def __call__(self, *args, **kwargs):
        """Calling a tag like a function is the same as calling its
        findAll() method. E.g. tag('a') returns a list of all the A tags
        found within this tag."""
        return self.findAll(*args, **kwargs)

    def __getattr__(self, tag):
        #print "Getattr %s.%s" % (self.__class__, tag)
        if len(tag) > 3 and tag.endswith('Tag'):
            return self.find(tag[:-3])
        elif tag.find('__') != 0:
            return self.find(tag)
        raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag)

    def __eq__(self, other):
        """Returns true iff this tag has the same name, the same attributes,
        and the same contents (recursively) as the given tag.

        NOTE: right now this will return false if two tags have the
        same attributes in a different order. Should this be fixed?"""
        if other is self:
            return True
        if (not hasattr(other, 'name')
            or not hasattr(other, 'attrs')
            or not hasattr(other, 'contents')
            or self.name != other.name
            or self.attrs != other.attrs
            or len(self) != len(other)):
            return False
        for i in range(0, len(self.contents)):
            if self.contents[i] != other.contents[i]:
                return False
        return True

    def __ne__(self, other):
        """Returns true iff this tag is not identical to the other tag,
        as defined in __eq__."""
        return not self == other

689e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
690e64493206b76bce6e3af0598790ec076094e8c37Simran Basi        """Renders this tag as a string."""
691e64493206b76bce6e3af0598790ec076094e8c37Simran Basi        return self.__str__(encoding)
692e64493206b76bce6e3af0598790ec076094e8c37Simran Basi
693e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    def __unicode__(self):
694e64493206b76bce6e3af0598790ec076094e8c37Simran Basi        return self.__str__(None)
695e64493206b76bce6e3af0598790ec076094e8c37Simran Basi
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING,
                prettyPrint=False, indentLevel=0):
        """Returns a string or Unicode representation of this tag and
        its contents. To get Unicode, pass None for encoding.

        NOTE: since Python's HTML parser consumes whitespace, this
        method is not certain to reproduce the whitespace present in
        the original string."""

        encodedName = self.toEncoding(self.name, encoding)

        attrs = []
        if self.attrs:
            for key, val in self.attrs:
                fmt = '%s="%s"'
                if isinstance(val, basestring):
                    if self.containsSubstitutions and '%SOUP-ENCODING%' in val:
                        val = self.substituteEncoding(val, encoding)

                    # The attribute value either:
                    #
                    # * Contains no embedded double quotes or single quotes.
                    #   No problem: we enclose it in double quotes.
                    # * Contains embedded single quotes. No problem:
                    #   double quotes work here too.
                    # * Contains embedded double quotes. No problem:
                    #   we enclose it in single quotes.
                    # * Embeds both single _and_ double quotes. This
                    #   can't happen naturally, but it can happen if
                    #   you modify an attribute value after parsing
                    #   the document. Now we have a bit of a
                    #   problem. We solve it by enclosing the
                    #   attribute in single quotes, and escaping any
                    #   embedded single quotes to XML entities.
                    if '"' in val:
                        fmt = "%s='%s'"
                        if "'" in val:
                            # TODO: replace with apos when
                            # appropriate.
                            val = val.replace("'", "&squot;")

                    # Now we're okay w/r/t quotes. But the attribute
                    # value might also contain angle brackets, or
                    # ampersands that aren't part of entities. We need
                    # to escape those to XML entities too.
                    val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val)

                attrs.append(fmt % (self.toEncoding(key, encoding),
                                    self.toEncoding(val, encoding)))
        close = ''
        closeTag = ''
        if self.isSelfClosing:
            close = ' /'
        else:
            closeTag = '</%s>' % encodedName

        indentTag, indentContents = 0, 0
        if prettyPrint:
            indentTag = indentLevel
            space = (' ' * (indentTag-1))
            indentContents = indentTag + 1
        contents = self.renderContents(encoding, prettyPrint, indentContents)
        if self.hidden:
            s = contents
        else:
            s = []
            attributeString = ''
            if attrs:
                attributeString = ' ' + ' '.join(attrs)
            if prettyPrint:
                s.append(space)
            s.append('<%s%s%s>' % (encodedName, attributeString, close))
            if prettyPrint:
                s.append("\n")
            s.append(contents)
            if prettyPrint and contents and contents[-1] != "\n":
                s.append("\n")
            if prettyPrint and closeTag:
                s.append(space)
            s.append(closeTag)
            if prettyPrint and closeTag and self.nextSibling:
                s.append("\n")
            s = ''.join(s)
        return s

    def decompose(self):
        """Recursively destroys the contents of this tree."""
        self.extract()
        if len(self.contents) == 0:
            return
        current = self.contents[0]
        while current is not None:
            next = current.next
            if isinstance(current, Tag):
                del current.contents[:]
            current.parent = None
            current.previous = None
            current.previousSibling = None
            current.next = None
            current.nextSibling = None
            current = next

    def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING):
        """Renders this tag as a pretty-printed string in the given
        encoding."""
        return self.__str__(encoding, True)

    def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                       prettyPrint=False, indentLevel=0):
        """Renders the contents of this tag as a string in the given
        encoding. If encoding is None, returns a Unicode string."""
        s = []
        for c in self:
            text = None
            if isinstance(c, NavigableString):
                text = c.__str__(encoding)
            elif isinstance(c, Tag):
                s.append(c.__str__(encoding, prettyPrint, indentLevel))
            if text and prettyPrint:
                text = text.strip()
            if text:
                if prettyPrint:
                    s.append(" " * (indentLevel-1))
                s.append(text)
                if prettyPrint:
                    s.append("\n")
        return ''.join(s)

    #Soup methods

    def find(self, name=None, attrs={}, recursive=True, text=None,
             **kwargs):
        """Return only the first child of this Tag matching the given
        criteria."""
        r = None
        l = self.findAll(name, attrs, recursive, text, 1, **kwargs)
        if l:
            r = l[0]
        return r
    findChild = find

    def findAll(self, name=None, attrs={}, recursive=True, text=None,
                limit=None, **kwargs):
        """Extracts a list of Tag objects that match the given
        criteria.  You can specify the name of the Tag and any
        attributes you want the Tag to have.

        The value of a key-value pair in the 'attrs' map can be a
        string, a list of strings, a regular expression object, or a
        callable that takes a string and returns whether or not the
        string matches for some custom definition of 'matches'. The
        same is true of the tag name."""
        generator = self.recursiveChildGenerator
        if not recursive:
            generator = self.childGenerator
        return self._findAll(name, attrs, text, limit, generator, **kwargs)
    findChildren = findAll

    # Pre-3.x compatibility methods
    first = find
    fetch = findAll

    def fetchText(self, text=None, recursive=True, limit=None):
        return self.findAll(text=text, recursive=recursive, limit=limit)

    def firstText(self, text=None, recursive=True):
        return self.find(text=text, recursive=recursive)

    #Private methods

    def _getAttrMap(self):
        """Initializes a map representation of this tag's attributes,
        if not already initialized."""
        if not getattr(self, 'attrMap'):
            self.attrMap = {}
            for (key, value) in self.attrs:
                self.attrMap[key] = value
        return self.attrMap

    #Generator methods
    def childGenerator(self):
        # Just use the iterator from the contents
        return iter(self.contents)

    def recursiveChildGenerator(self):
        if not len(self.contents):
            # A bare return ends the generator without yielding anything.
            return
        stopNode = self._lastRecursiveChild().next
        current = self.contents[0]
        while current is not stopNode:
            yield current
            current = current.next


# Next, a couple of classes to represent queries and their results.
class SoupStrainer:
    """Encapsulates a number of ways of matching a markup element (tag or
    text)."""

    def __init__(self, name=None, attrs={}, text=None, **kwargs):
        self.name = name
        if isinstance(attrs, basestring):
            kwargs['class'] = _match_css_class(attrs)
            attrs = None
        if kwargs:
            if attrs:
                attrs = attrs.copy()
                attrs.update(kwargs)
            else:
                attrs = kwargs
        self.attrs = attrs
        self.text = text

    def __str__(self):
        if self.text:
            return self.text
        else:
            return "%s|%s" % (self.name, self.attrs)

    def searchTag(self, markupName=None, markupAttrs={}):
        found = None
        markup = None
        if isinstance(markupName, Tag):
            markup = markupName
            markupAttrs = markup
        callFunctionWithTagData = callable(self.name) \
                                and not isinstance(markupName, Tag)

        if (not self.name) \
               or callFunctionWithTagData \
               or (markup and self._matches(markup, self.name)) \
               or (not markup and self._matches(markupName, self.name)):
            if callFunctionWithTagData:
                match = self.name(markupName, markupAttrs)
            else:
                match = True
                markupAttrMap = None
                for attr, matchAgainst in self.attrs.items():
                    if not markupAttrMap:
                        if hasattr(markupAttrs, 'get'):
                            markupAttrMap = markupAttrs
                        else:
                            markupAttrMap = {}
                            for k,v in markupAttrs:
                                markupAttrMap[k] = v
                    attrValue = markupAttrMap.get(attr)
                    if not self._matches(attrValue, matchAgainst):
                        match = False
                        break
            if match:
                if markup:
                    found = markup
                else:
                    found = markupName
        return found

    def search(self, markup):
        #print 'looking for %s in %s' % (self, markup)
        found = None
        # If given a list of items, scan it for a text element that
        # matches.
        if hasattr(markup, "__iter__") \
                and not isinstance(markup, Tag):
            for element in markup:
                if isinstance(element, NavigableString) \
                       and self.search(element):
                    found = element
                    break
        # If it's a Tag, make sure its name or attributes match.
        # Don't bother with Tags if we're searching for text.
        elif isinstance(markup, Tag):
            if not self.text:
                found = self.searchTag(markup)
        # If it's text, make sure the text matches.
        elif isinstance(markup, NavigableString) or \
                 isinstance(markup, basestring):
            if self._matches(markup, self.text):
                found = markup
        else:
            raise Exception("I don't know how to match against a %s"
                            % markup.__class__)
        return found

    def _matches(self, markup, matchAgainst):
        #print "Matching %s against %s" % (markup, matchAgainst)
        result = False
        if matchAgainst is True:
            result = markup is not None
        elif callable(matchAgainst):
            result = matchAgainst(markup)
        else:
            #Custom match methods take the tag as an argument, but all
            #other ways of matching match the tag name as a string.
            if isinstance(markup, Tag):
                markup = markup.name
            if markup and not isinstance(markup, basestring):
                markup = unicode(markup)
            #Now we know that markup is either a string, or None.
            if hasattr(matchAgainst, 'match'):
                # It's a regexp object.
                result = markup and matchAgainst.search(markup)
            elif hasattr(matchAgainst, '__iter__'): # list-like
                result = markup in matchAgainst
            elif hasattr(matchAgainst, 'items'):
                result = markup.has_key(matchAgainst)
            elif matchAgainst and isinstance(markup, basestring):
                if isinstance(markup, unicode):
                    matchAgainst = unicode(matchAgainst)
                else:
                    matchAgainst = str(matchAgainst)

            if not result:
                result = matchAgainst == markup
        return result

class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""
    def __init__(self, source):
        list.__init__(self)
        self.source = source

# Now, some helper functions.

def buildTagMap(default, *args):
    """Turns a list of maps, lists, or scalars into a single map.
    Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
    RESET_NESTING_TAGS maps out of lists and partial maps."""
    built = {}
    for portion in args:
        if hasattr(portion, 'items'):
            #It's a map. Merge it.
            for k,v in portion.items():
                built[k] = v
        elif hasattr(portion, '__iter__'): # is a list
            #It's a list. Map each item to the default.
            for k in portion:
                built[k] = default
        else:
            #It's a scalar. Map it to the default.
            built[portion] = default
    return built

1037e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# Now, the parser classes.
1038e64493206b76bce6e3af0598790ec076094e8c37Simran Basi
1039e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass BeautifulStoneSoup(Tag, SGMLParser):
1040e64493206b76bce6e3af0598790ec076094e8c37Simran Basi
1041e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    """This class contains the basic parser and search code. It defines
1042e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    a parser that knows nothing about tag behavior except for the
1043e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    following:
1044e64493206b76bce6e3af0598790ec076094e8c37Simran Basi
1045e64493206b76bce6e3af0598790ec076094e8c37Simran Basi      You can't close a tag without closing all the tags it encloses.
1046e64493206b76bce6e3af0598790ec076094e8c37Simran Basi      That is, "<foo><bar></foo>" actually means
1047e64493206b76bce6e3af0598790ec076094e8c37Simran Basi      "<foo><bar></bar></foo>".
1048e64493206b76bce6e3af0598790ec076094e8c37Simran Basi
1049e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    [Another possible explanation is "<foo><bar /></foo>", but since
1050e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    this class defines no SELF_CLOSING_TAGS, it will never use that
1051e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    explanation.]
1052e64493206b76bce6e3af0598790ec076094e8c37Simran Basi
1053e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    This class is useful for parsing XML or made-up markup languages,
1054e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    or when BeautifulSoup makes an assumption counter to what you were
1055e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    expecting."""
1056e64493206b76bce6e3af0598790ec076094e8c37Simran Basi
1057e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    SELF_CLOSING_TAGS = {}
1058e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    NESTABLE_TAGS = {}
1059e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    RESET_NESTING_TAGS = {}
1060e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    QUOTE_TAGS = {}
1061e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    PRESERVE_WHITESPACE_TAGS = []
1062e64493206b76bce6e3af0598790ec076094e8c37Simran Basi
1063e64493206b76bce6e3af0598790ec076094e8c37Simran Basi    MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'),
1064e64493206b76bce6e3af0598790ec076094e8c37Simran Basi                       lambda x: x.group(1) + ' />'),
1065e64493206b76bce6e3af0598790ec076094e8c37Simran Basi                      (re.compile('<!\s+([^<>]*)>'),
1066e64493206b76bce6e3af0598790ec076094e8c37Simran Basi                       lambda x: '<!' + x.group(1) + '>')
1067e64493206b76bce6e3af0598790ec076094e8c37Simran Basi                      ]
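    # Illustration only: a minimal Python 3 sketch of how a list of
    # (RE object, replace method) pairs like MARKUP_MASSAGE might be applied.
    # The names MASSAGE_SKETCH and massage_sketch are hypothetical; the two
    # patterns are copied from the list above and written as raw strings.

```python
import re

MASSAGE_SKETCH = [
    # "<br/>" becomes "<br />": put a space before the self-closing slash.
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    # "<! --x-->" becomes "<!--x-->": drop whitespace after "<!".
    (re.compile(r"<!\s+([^<>]*)>"), lambda m: "<!" + m.group(1) + ">"),
]


def massage_sketch(markup, rules=MASSAGE_SKETCH):
    """Apply each (pattern, replacement) pair to the markup, in order."""
    for pattern, replacement in rules:
        markup = pattern.sub(replacement, markup)
    return markup


fixed = massage_sketch("<p>one<br/>two<! --note--></p>")
print(fixed)
```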

    ROOT_TAG_NAME = u'[document]'

    HTML_ENTITIES = "html"
    XML_ENTITIES = "xml"
    XHTML_ENTITIES = "xhtml"
    # TODO: This only exists for backwards-compatibility
    ALL_ENTITIES = XHTML_ENTITIES

    # Used when determining whether a text node is all whitespace and
    # can be replaced with a single space. A text node that contains
    # fancy Unicode spaces (usually non-breaking) should be left
    # alone.
    STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }

    def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
                 markupMassage=True, smartQuotesTo=XML_ENTITIES,
                 convertEntities=None, selfClosingTags=None, isHTML=False):
        """The Soup object is initialized as the 'root tag', and the
        provided markup (which can be a string or a file-like object)
        is fed into the underlying parser.

        sgmllib will process most bad HTML, and the BeautifulSoup
        class has some tricks for dealing with some HTML that kills
        sgmllib, but Beautiful Soup can nonetheless choke or lose data
        if your data uses self-closing tags or declarations
        incorrectly.

        By default, Beautiful Soup uses regexes to sanitize input,
        avoiding the vast majority of these problems. If the problems
        don't apply to you, pass in False for markupMassage, and
        you'll get better performance.

        The default parser massage techniques fix the two most common
        instances of invalid HTML that choke sgmllib:

         <br/> (No space between the tag name and the closing '/>')
         <! --Comment--> (Extraneous whitespace in declaration)

        You can pass in a custom list of (RE object, replace method)
        tuples to get Beautiful Soup to scrub your input the way you
        want."""

        self.parseOnlyThese = parseOnlyThese
        self.fromEncoding = fromEncoding
        self.smartQuotesTo = smartQuotesTo
        self.convertEntities = convertEntities
        # Set the rules for how we'll deal with the entities we
        # encounter
        if self.convertEntities:
            # It doesn't make sense to convert encoded characters to
            # entities even while you're converting entities to Unicode.
            # Just convert it all to Unicode.
            self.smartQuotesTo = None
            if convertEntities == self.HTML_ENTITIES:
                self.convertXMLEntities = False
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = True
            elif convertEntities == self.XHTML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = False
            elif convertEntities == self.XML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = False
                self.escapeUnrecognizedEntities = False
        else:
            self.convertXMLEntities = False
            self.convertHTMLEntities = False
            self.escapeUnrecognizedEntities = False

        self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
        SGMLParser.__init__(self)

        if hasattr(markup, 'read'):        # It's a file-type object.
            markup = markup.read()
        self.markup = markup
        self.markupMassage = markupMassage
        try:
            self._feed(isHTML=isHTML)
        except StopParsing:
            pass
        self.markup = None                 # The markup can now be GCed
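    # Illustration only: the convertEntities branches above restated as a
    # lookup table, in Python 3. ENTITY_FLAGS_SKETCH and entity_flags_sketch
    # are hypothetical names. One deliberate difference: for an unrecognized
    # truthy value the constructor leaves the three attributes unset, while
    # this sketch falls back to all False.

```python
HTML_ENTITIES = "html"
XML_ENTITIES = "xml"
XHTML_ENTITIES = "xhtml"

# mode -> (convertXMLEntities, convertHTMLEntities, escapeUnrecognizedEntities)
ENTITY_FLAGS_SKETCH = {
    HTML_ENTITIES: (False, True, True),
    XHTML_ENTITIES: (True, True, False),
    XML_ENTITIES: (True, False, False),
}


def entity_flags_sketch(convert_entities):
    """Return the three conversion flags for a convertEntities value."""
    return ENTITY_FLAGS_SKETCH.get(convert_entities, (False, False, False))


print(entity_flags_sketch(HTML_ENTITIES))
```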

    def convert_charref(self, name):
        """This method fixes a bug in Python's SGMLParser."""
        try:
            n = int(name)
        except ValueError:
            return
        if not 0 <= n <= 127 : # ASCII ends at 127, not 255
            return
        return self.convert_codepoint(n)

    def _feed(self, inDocumentEncoding=None, isHTML=False):
        # Convert the document to Unicode.
        markup = self.markup
        if isinstance(markup, unicode):
            if not hasattr(self, 'originalEncoding'):
                self.originalEncoding = None
        else:
            dammit = UnicodeDammit\
                     (markup, [self.fromEncoding, inDocumentEncoding],
                      smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
            markup = dammit.unicode
            self.originalEncoding = dammit.originalEncoding
            self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
        if markup:
            if self.markupMassage:
                if not hasattr(self.markupMassage, "__iter__"):
                    self.markupMassage = self.MARKUP_MASSAGE
                for fix, m in self.markupMassage:
                    markup = fix.sub(m, markup)
                # TODO: We get rid of markupMassage so that the
                # soup object can be deepcopied later on. Some
                # Python installations can't copy regexes. If anyone
                # was relying on the existence of markupMassage, this
                # might cause problems.
                del(self.markupMassage)
        self.reset()

        SGMLParser.feed(self, markup)
        # Close out any unfinished strings and close all the open tags.
        self.endData()
        while self.currentTag.name != self.ROOT_TAG_NAME:
            self.popTag()

    def __getattr__(self, methodName):
        """This method routes method call requests to either the SGMLParser
        superclass or the Tag superclass, depending on the method name."""
        #print "__getattr__ called on %s.%s" % (self.__class__, methodName)

        if methodName.startswith('start_') or methodName.startswith('end_') \
               or methodName.startswith('do_'):
            return SGMLParser.__getattr__(self, methodName)
        elif not methodName.startswith('__'):
            return Tag.__getattr__(self, methodName)
        else:
            raise AttributeError

    def isSelfClosingTag(self, name):
        """Returns true iff the given string is the name of a
        self-closing tag according to this parser."""
        return self.SELF_CLOSING_TAGS.has_key(name) \
               or self.instanceSelfClosingTags.has_key(name)

    def reset(self):
        Tag.__init__(self, self, self.ROOT_TAG_NAME)
        self.hidden = 1
        SGMLParser.reset(self)
        self.currentData = []
        self.currentTag = None
        self.tagStack = []
        self.quoteStack = []
        self.pushTag(self)

    def popTag(self):
        tag = self.tagStack.pop()

        #print "Pop", tag.name
        if self.tagStack:
            self.currentTag = self.tagStack[-1]
        return self.currentTag

    def pushTag(self, tag):
        #print "Push", tag.name
        if self.currentTag:
            self.currentTag.contents.append(tag)
        self.tagStack.append(tag)
        self.currentTag = self.tagStack[-1]

    def endData(self, containerClass=NavigableString):
        if self.currentData:
            currentData = u''.join(self.currentData)
            if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
                not set([tag.name for tag in self.tagStack]).intersection(
                    self.PRESERVE_WHITESPACE_TAGS)):
                if '\n' in currentData:
                    currentData = '\n'
                else:
                    currentData = ' '
            self.currentData = []
            if self.parseOnlyThese and len(self.tagStack) <= 1 and \
                   (not self.parseOnlyThese.text or \
                    not self.parseOnlyThese.search(currentData)):
                return
            o = containerClass(currentData)
            o.setup(self.currentTag, self.previous)
            if self.previous:
                self.previous.next = o
            self.previous = o
            self.currentTag.contents.append(o)
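    # Illustration only: a minimal Python 3 sketch of the whitespace test in
    # endData. collapse_whitespace_sketch is a hypothetical name; the sketch
    # assumes non-empty input and leaves out the PRESERVE_WHITESPACE_TAGS
    # check. The table is the same one as STRIP_ASCII_SPACES.

```python
STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 32: None}


def collapse_whitespace_sketch(text):
    """Collapse a string of only ASCII whitespace to a newline or a space."""
    # translate() deletes every code point mapped to None, so an empty
    # result means the text held nothing but tab, LF, FF, CR, or space.
    if text.translate(STRIP_ASCII_SPACES) == "":
        return "\n" if "\n" in text else " "
    # Anything else, including a non-breaking space, is left alone.
    return text


print(repr(collapse_whitespace_sketch(" \t \n ")))
```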


    def _popToTag(self, name, inclusivePop=True):
        """Pops the tag stack up to and including the most recent
        instance of the given tag. If inclusivePop is false, pops the tag
        stack up to but *not* including the most recent instance of
        the given tag."""
        #print "Popping to %s" % name
        if name == self.ROOT_TAG_NAME:
            return

        numPops = 0
        mostRecentTag = None
        for i in range(len(self.tagStack)-1, 0, -1):
            if name == self.tagStack[i].name:
                numPops = len(self.tagStack)-i
                break
        if not inclusivePop:
            numPops = numPops - 1

        for i in range(0, numPops):
            mostRecentTag = self.popTag()
        return mostRecentTag
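    # Illustration only: a minimal Python 3 sketch of the pop count used by
    # _popToTag, on a plain list of tag names with the root first.
    # pop_to_tag_sketch is a hypothetical name, and unlike the method it
    # returns nothing.

```python
ROOT_TAG_NAME = "[document]"


def pop_to_tag_sketch(stack, name, inclusive=True):
    """Pop `stack` back to the most recent occurrence of `name`."""
    if name == ROOT_TAG_NAME:
        return
    num_pops = 0
    # Search from the innermost tag outward; index 0 (the root) is skipped.
    for i in range(len(stack) - 1, 0, -1):
        if stack[i] == name:
            num_pops = len(stack) - i
            break
    if not inclusive:
        num_pops -= 1
    # A zero or negative count leaves the stack untouched.
    for _ in range(num_pops):
        stack.pop()


stack = [ROOT_TAG_NAME, "foo", "bar"]
pop_to_tag_sketch(stack, "foo")  # "</foo>" closes <bar>, then <foo>
print(stack)
```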

    def _smartPop(self, name):

        """We need to pop up to the previous tag of this type, unless
        one of this tag's nesting reset triggers comes between this
        tag and the previous tag of this type, OR unless this tag is a
        generic nesting trigger and another generic nesting trigger
        comes between this tag and the previous tag of this type.

        Examples:
         <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
         <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
         <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.

         <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
         <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
         <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
        """

        nestingResetTriggers = self.NESTABLE_TAGS.get(name)
        isNestable = nestingResetTriggers is not None
        isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
        popTo = None
        inclusive = True
        for i in range(len(self.tagStack)-1, 0, -1):
            p = self.tagStack[i]
            if (not p or p.name == name) and not isNestable:
                #Non-nestable tags get popped to the top or to their
                #last occurrence.
                popTo = name
                break
            if (nestingResetTriggers is not None
                and p.name in nestingResetTriggers) \
                or (nestingResetTriggers is None and isResetNesting
                    and self.RESET_NESTING_TAGS.has_key(p.name)):

                #If we encounter one of the nesting reset triggers
                #peculiar to this tag, or we encounter another tag
                #that causes nesting to reset, pop up to but not
                #including that tag.
                popTo = p.name
                inclusive = False
                break
            p = p.parent
        if popTo:
            self._popToTag(popTo, inclusive)
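    # Illustration only: a minimal Python 3 sketch of the decision made by
    # _smartPop, on a plain list of tag names with the root first. The two
    # rule tables are hypothetical stand-ins chosen to reproduce the
    # docstring examples; this class itself defines empty tables.

```python
# Hypothetical rules: tag -> tags that reset its nesting.
NESTABLE_SKETCH = {
    "ul": [],
    "li": ["ul", "ol"],
    "table": [],
    "tr": ["table"],
    "td": ["tr"],
}
# Hypothetical set of tags that act as generic nesting resets.
RESET_NESTING_SKETCH = {"p", "ul", "li", "table", "tr", "td"}


def smart_pop_target_sketch(stack, name):
    """Return (tag_name, inclusive) for the pop, or None for no pop."""
    triggers = NESTABLE_SKETCH.get(name)
    is_nestable = triggers is not None
    is_reset_nesting = name in RESET_NESTING_SKETCH
    # Walk from the innermost open tag outward, skipping the root at index 0.
    for i in range(len(stack) - 1, 0, -1):
        open_name = stack[i]
        if open_name == name and not is_nestable:
            # A non-nestable tag closes its previous occurrence.
            return (name, True)
        if (triggers is not None and open_name in triggers) or (
            triggers is None
            and is_reset_nesting
            and open_name in RESET_NESTING_SKETCH
        ):
            # A reset trigger ends the search: pop up to, not including, it.
            return (open_name, False)
    return None


print(smart_pop_target_sketch(["[document]", "li", "ul", "li"], "li"))
```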

    def unknown_starttag(self, name, attrs, selfClosing=0):
        #print "Start tag %s: %s" % (name, attrs)
        if self.quoteStack:
            #This is not a real tag.
            #print "<%s> is not real!" % name
            attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs])
            self.handle_data('<%s%s>' % (name, attrs))
            return
        self.endData()

        if not self.isSelfClosingTag(name) and not selfClosing:
            self._smartPop(name)

        if self.parseOnlyThese and len(self.tagStack) <= 1 \
               and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
            return

        tag = Tag(self, name, attrs, self.currentTag, self.previous)
        if self.previous:
            self.previous.next = tag
        self.previous = tag
        self.pushTag(tag)
        if selfClosing or self.isSelfClosingTag(name):
            self.popTag()
        if name in self.QUOTE_TAGS:
            #print "Beginning quote (%s)" % name
            self.quoteStack.append(name)
            self.literal = 1
        return tag

    def unknown_endtag(self, name):
        #print "End tag %s" % name
        if self.quoteStack and self.quoteStack[-1] != name:
            #This is not a real end tag.
            #print "</%s> is not real!" % name
            self.handle_data('</%s>' % name)
            return
        self.endData()
        self._popToTag(name)
        if self.quoteStack and self.quoteStack[-1] == name:
            self.quoteStack.pop()
            self.literal = (len(self.quoteStack) > 0)

    def handle_data(self, data):
        self.currentData.append(data)

    def _toStringSubclass(self, text, subclass):
        """Adds a certain piece of text to the tree as a NavigableString
        subclass."""
        self.endData()
        self.handle_data(text)
        self.endData(subclass)

    def handle_pi(self, text):
        """Handle a processing instruction as a ProcessingInstruction
        object, possibly one with a %SOUP-ENCODING% slot into which an
        encoding will be plugged later."""
        if text[:3] == "xml":
            text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
        self._toStringSubclass(text, ProcessingInstruction)

    def handle_comment(self, text):
        "Handle comments as Comment objects."
        self._toStringSubclass(text, Comment)

    def handle_charref(self, ref):
        "Handle character references as data."
        if self.convertEntities:
            data = unichr(int(ref))
        else:
            data = '&#%s;' % ref
        self.handle_data(data)
1402e64493206b76bce6e3af0598790ec076094e8c37Simran Basi
    def handle_entityref(self, ref):
        """Handle entity references as data, possibly converting known
        HTML and/or XML entity references to the corresponding Unicode
        characters."""
        data = None
        if self.convertHTMLEntities:
            try:
                data = unichr(name2codepoint[ref])
            except KeyError:
                pass

        if not data and self.convertXMLEntities:
            data = self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)

        if not data and self.convertHTMLEntities and \
            not self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref):
                # TODO: We've got a problem here. We're told this is
                # an entity reference, but it's not an XML entity
                # reference or an HTML entity reference. Nonetheless,
                # the logical thing to do is to pass it through as an
                # unrecognized entity reference.
                #
                # Except: when the input is "&carol;" this function
                # will be called with input "carol". When the input is
                # "AT&T", this function will be called with input
                # "T". We have no way of knowing whether a semicolon
                # was present originally, so we don't know whether
                # this is an unknown entity or just a misplaced
                # ampersand.
                #
                # The more common case is a misplaced ampersand, so I
                # escape the ampersand and omit the trailing semicolon.
                data = "&amp;%s" % ref
        if not data:
            # This case is different from the one above, because we
            # haven't already gone through a supposedly comprehensive
            # mapping of entities to Unicode characters. We might not
            # have gone through any mapping at all. So the chances are
            # very high that this is a real entity, and not a
            # misplaced ampersand.
            data = "&%s;" % ref
        self.handle_data(data)

    def handle_decl(self, data):
        "Handle DOCTYPEs and the like as Declaration objects."
        self._toStringSubclass(data, Declaration)

    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
            k = self.rawdata.find(']]>', i)
            if k == -1:
                k = len(self.rawdata)
            data = self.rawdata[i+9:k]
            j = k+3
            self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j

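# Illustrative sketch only, not part of the parser: standalone versions
# of two decisions made by handle_entityref and parse_declaration above.
# It is written for Python 3 (unlike this module), and XML_ENTITIES,
# resolve_entity and split_cdata are hypothetical names. XML_ENTITIES
# holds the five standard XML entities, which is an assumption and not
# necessarily the same table as XML_ENTITIES_TO_SPECIAL_CHARS.

```python
from html.entities import name2codepoint

# Assumed stand-in for the parser's XML entity table.
XML_ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}


def resolve_entity(ref, convert_html=False, convert_xml=False):
    """Return the text that replaces the entity reference named ref."""
    data = None
    if convert_html and ref in name2codepoint:
        data = chr(name2codepoint[ref])
    if not data and convert_xml:
        data = XML_ENTITIES.get(ref)
    if not data and convert_html and ref not in XML_ENTITIES:
        # Unknown even after the full HTML lookup: most likely a stray
        # ampersand, so escape it and drop the semicolon.
        data = "&amp;%s" % ref
    if not data:
        # No comprehensive lookup was done: keep it as an entity.
        data = "&%s;" % ref
    return data


def split_cdata(rawdata, i):
    """Return (text, next_index) for a CDATA section starting at i."""
    k = rawdata.find("]]>", i)
    if k == -1:
        # Unterminated section: treat the rest of the input as CDATA.
        k = len(rawdata)
    return rawdata[i + 9:k], k + 3
```

# With HTML conversion on, the "T" from "AT&T" comes back as "&amp;T";
# with no conversion at all, "carol" comes back as "&carol;".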
class BeautifulSoup(BeautifulStoneSoup):

    """This parser knows the following facts about HTML:

    * Some tags have no closing tag and should be interpreted as being
      closed as soon as they are encountered.

    * The text inside some tags (e.g. 'script') may contain tags which
      are not really part of the document and which should be parsed
      as text, not tags. If you want to parse the text as tags, you can
      always fetch it and parse it explicitly.

    * Tag nesting rules:

      Most tags can't be nested at all. For instance, the occurrence of
      a <p> tag should implicitly close the previous <p> tag.

       <p>Para1<p>Para2
        should be transformed into:
       <p>Para1</p><p>Para2

      Some tags can be nested arbitrarily. For instance, the occurrence
      of a <blockquote> tag should _not_ implicitly close the previous
      <blockquote> tag.

       Alice said: <blockquote>Bob said: <blockquote>Blah
        should NOT be transformed into:
       Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

      Some tags can be nested, but the nesting is reset by the
      interposition of other tags. For instance, a <tr> tag should
      implicitly close the previous <tr> tag within the same <table>,
      but not close a <tr> tag in another table.

       <table><tr>Blah<tr>Blah
        should be transformed into:
       <table><tr>Blah</tr><tr>Blah
        but,
       <tr>Blah<table><tr>Blah
        should NOT be transformed into
       <tr>Blah<table></tr><tr>Blah

    Differing assumptions about tag nesting rules are a major source
    of problems with the BeautifulSoup class. If BeautifulSoup is not
    treating as nestable a tag your page author treats as nestable,
    try ICantBelieveItsBeautifulSoup, MinimalSoup, or
    BeautifulStoneSoup before writing your own subclass."""

    def __init__(self, *args, **kwargs):
        if 'smartQuotesTo' not in kwargs:
            kwargs['smartQuotesTo'] = self.HTML_ENTITIES
        kwargs['isHTML'] = True
        BeautifulStoneSoup.__init__(self, *args, **kwargs)

    SELF_CLOSING_TAGS = buildTagMap(None,
                                    ('br' , 'hr', 'input', 'img', 'meta',
                                    'spacer', 'link', 'frame', 'base', 'col'))

    PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])

    QUOTE_TAGS = {'script' : None, 'textarea' : None}

    #According to the HTML standard, each of these inline tags can
    #contain another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
                            'center')

    #According to the HTML standard, these block tags can contain
    #another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')

    #Lists can contain other lists, but there are restrictions.
    NESTABLE_LIST_TAGS = { 'ol' : [],
                           'ul' : [],
                           'li' : ['ul', 'ol'],
                           'dl' : [],
                           'dd' : ['dl'],
                           'dt' : ['dl'] }

    #Tables can contain other tables, but there are restrictions.
    NESTABLE_TABLE_TAGS = {'table' : [],
                           'tr' : ['table', 'tbody', 'tfoot', 'thead'],
                           'td' : ['tr'],
                           'th' : ['tr'],
                           'thead' : ['table'],
                           'tbody' : ['table'],
                           'tfoot' : ['table'],
                           }

    NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre')

    #If one of these tags is encountered, all tags up to the next tag of
    #this type are popped.
    RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
                                     NON_NESTABLE_BLOCK_TAGS,
                                     NESTABLE_LIST_TAGS,
                                     NESTABLE_TABLE_TAGS)

    NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
                                NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)

    # Used to detect the charset in a META tag; see start_meta
    CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)

    def start_meta(self, attrs):
        """Beautiful Soup can detect a charset included in a META tag,
        try to convert the document to that charset, and re-parse the
        document from the beginning."""
        httpEquiv = None
        contentType = None
        contentTypeIndex = None
        tagNeedsEncodingSubstitution = False

        for i in range(0, len(attrs)):
            key, value = attrs[i]
            key = key.lower()
            if key == 'http-equiv':
                httpEquiv = value
            elif key == 'content':
                contentType = value
                contentTypeIndex = i

        if httpEquiv and contentType: # It's an interesting meta tag.
            match = self.CHARSET_RE.search(contentType)
            if match:
                if (self.declaredHTMLEncoding is not None or
                    self.originalEncoding == self.fromEncoding):
                    # An HTML encoding was sniffed while converting
                    # the document to Unicode, or an HTML encoding was
                    # sniffed during a previous pass through the
                    # document, or an encoding was specified
                    # explicitly and it worked. Rewrite the meta tag.
                    def rewrite(match):
                        return match.group(1) + "%SOUP-ENCODING%"
                    newAttr = self.CHARSET_RE.sub(rewrite, contentType)
                    attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                               newAttr)
                    tagNeedsEncodingSubstitution = True
                else:
                    # This is our first pass through the document.
                    # Go through it again with the encoding information.
                    newCharset = match.group(3)
                    if newCharset and newCharset != self.originalEncoding:
                        self.declaredHTMLEncoding = newCharset
                        self._feed(self.declaredHTMLEncoding)
                        raise StopParsing
        tag = self.unknown_starttag("meta", attrs)
        if tag and tagNeedsEncodingSubstitution:
            tag.containsSubstitutions = True

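# Illustrative sketch only, not part of the parser: what CHARSET_RE
# extracts from a META content value, and how the rewrite step in
# start_meta swaps in the %SOUP-ENCODING% placeholder. The helper names
# declared_charset and with_placeholder are hypothetical.

```python
import re

# Same pattern as BeautifulSoup.CHARSET_RE.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)


def declared_charset(content):
    """Return the charset named in a META content value, or None."""
    match = CHARSET_RE.search(content)
    if match:
        return match.group(3)
    return None


def with_placeholder(content):
    """Replace the declared charset with the %SOUP-ENCODING% slot."""
    def rewrite(match):
        # group(1) is the "; charset=" prefix, kept as written.
        return match.group(1) + "%SOUP-ENCODING%"
    return CHARSET_RE.sub(rewrite, content)
```

# For the content value "text/html; charset=ISO-8859-1", group(3) is the
# charset name and group(1) is everything from the semicolon through
# "charset=".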
class StopParsing(Exception):
    """Raised by start_meta to abandon the current pass once the
    document has been re-fed with a newly declared encoding."""

class ICantBelieveItsBeautifulSoup(BeautifulSoup):

    """The BeautifulSoup class is oriented towards skipping over
    common HTML errors like unclosed tags. However, sometimes it makes
    errors of its own. For instance, consider this fragment:

     <b>Foo<b>Bar</b></b>

    This is perfectly valid (if bizarre) HTML. However, the
    BeautifulSoup class will implicitly close the first b tag when it
    encounters the second 'b'. It will think the author wrote
    "<b>Foo<b>Bar", and didn't close the first 'b' tag, because
    there's no real-world reason to bold something that's already
    bold. When it encounters '</b></b>' it will close two more 'b'
    tags, for a grand total of three tags closed instead of two. This
    can throw off the rest of your document structure. The same is
    true of a number of other tags, listed below.

    It's much more common for someone to forget to close a 'b' tag
    than to actually use nested 'b' tags, and the BeautifulSoup class
    handles the common case. This class handles the not-so-common
    case: where you can't believe someone wrote what they did, but
    it's valid HTML and BeautifulSoup screwed up by assuming it
    wouldn't be."""

    I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \
     ('em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong',
      'cite', 'code', 'dfn', 'kbd', 'samp', 'var', 'b')

    I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ('noscript',)

    NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS)

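# Illustrative sketch only, not the parser's own algorithm: a simplified
# model of how a table like NESTABLE_TAGS decides which open tags a new
# start tag implicitly closes. The stack holds plain tag names, and
# open_tag and open_all are hypothetical helpers. It assumes the same
# table shape as above: a missing key means "not nestable", and a list
# value names the tags that reset the nesting.

```python
def open_tag(stack, name, nestable, reset=()):
    """Pop the tags that `name` implicitly closes, then push `name`."""
    triggers = nestable.get(name)
    is_nestable = triggers is not None
    pop_to, inclusive = None, True
    for open_name in reversed(stack):
        if open_name == name and not is_nestable:
            # A non-nestable tag closes its own previous occurrence.
            pop_to = name
            break
        if (is_nestable and open_name in triggers) or (
                not is_nestable and name in reset and open_name in reset):
            # Nesting resets here: close everything above this tag,
            # but leave the tag itself open.
            pop_to, inclusive = open_name, False
            break
    if pop_to is not None:
        index = len(stack) - 1 - stack[::-1].index(pop_to)
        del stack[index if inclusive else index + 1:]
    stack.append(name)
    return stack


def open_all(names, nestable, reset=()):
    """Feed a sequence of start tags through open_tag."""
    stack = []
    for name in names:
        open_tag(stack, name, nestable, reset)
    return stack
```

# With 'b' absent from the table, a second <b> closes the first one;
# with 'b' mapped to [], as in the class above, both stay open.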
class MinimalSoup(BeautifulSoup):
    """The MinimalSoup class is for parsing HTML that contains
    pathologically bad markup. It makes no assumptions about tag
    nesting, but it does know which tags are self-closing, that
    <script> tags contain JavaScript and should not be parsed, that
    META tags may contain encoding information, and so on.

    This also makes it better for subclassing than BeautifulStoneSoup
    or BeautifulSoup."""

    RESET_NESTING_TAGS = buildTagMap('noscript')
    NESTABLE_TAGS = {}

class BeautifulSOAP(BeautifulStoneSoup):
    """This class will push a tag with only a single string child into
    the tag's parent as an attribute. The attribute's name is the tag
    name, and the value is the string child. An example should give
    the flavor of the change:

    <foo><bar>baz</bar></foo>
     =>
    <foo bar="baz"><bar>baz</bar></foo>

    You can then access fooTag['bar'] instead of fooTag.barTag.string.

    This is, of course, useful for scraping structures that tend to
    use subelements instead of attributes, such as SOAP messages. Note
    that it modifies its input, so don't print the modified version
    out.

    I'm not sure how many people really want to use this class; let me
    know if you do. Mainly I like the name."""

    def popTag(self):
        """Before popping a tag whose only child is a string, copy that
        string onto the parent as an attribute named after the tag,
        unless the parent already has an attribute of that name."""
        if len(self.tagStack) > 1:
            tag = self.tagStack[-1]
            parent = self.tagStack[-2]
            parent._getAttrMap()
            if (isinstance(tag, Tag) and len(tag.contents) == 1 and
                isinstance(tag.contents[0], NavigableString) and
                tag.name not in parent.attrMap):
                parent[tag.name] = tag.contents[0]
        BeautifulStoneSoup.popTag(self)

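# Illustrative sketch only, not part of the parser: the promotion that
# BeautifulSOAP.popTag performs, modelled on plain dicts instead of Tag
# objects. The node layout (name / attrs / children) and the function
# promote_single_strings are hypothetical. The real class does this
# work as each tag is popped during parsing, not in a later pass.

```python
def promote_single_strings(node):
    """Copy each single-string child into its parent's attrs, in place."""
    for child in node["children"]:
        if not isinstance(child, dict):
            continue  # a bare string has nothing to promote
        promote_single_strings(child)
        only_text = (len(child["children"]) == 1
                     and isinstance(child["children"][0], str))
        # Never overwrite an attribute the parent already has.
        if only_text and child["name"] not in node["attrs"]:
            node["attrs"][child["name"]] = child["children"][0]
    return node
```

# For <foo><bar>baz</bar></foo> this leaves foo with attrs
# {'bar': 'baz'} and keeps the <bar> child in place.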
#Enterprise class names! It has come to our attention that some people
#think the names of the Beautiful Soup parser classes are too silly
#and "unprofessional" for use in enterprise screen-scraping. We feel
#your pain! For such-minded folk, the Beautiful Soup Consortium And
#All-Night Kosher Bakery recommends renaming this file to
#"RobustParser.py" (or, in cases of extreme enterprisiness,
#"RobustParserBeanInterface.class") and using the following
#enterprise-friendly class aliases:
class RobustXMLParser(BeautifulStoneSoup):
    pass
class RobustHTMLParser(BeautifulSoup):
    pass
class RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup):
    pass
class RobustInsanelyWackAssHTMLParser(MinimalSoup):
    pass
class SimplifyingSOAPParser(BeautifulSOAP):
    pass

######################################################
#
# Bonus library: Unicode, Dammit
#
# This class forces XML data into a standard format (usually to UTF-8
# or Unicode).  It is heavily based on code from Mark Pilgrim's
# Universal Feed Parser. It does not rewrite the XML or HTML to
# reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi
# (XML) and BeautifulSoup.start_meta (HTML).
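#
# Usage sketch (illustrative only, not from the original source; the
# sample byte string and the choice of iso-8859-1 are assumptions):
#
#   dammit = UnicodeDammit("Sacr\xe9 bleu!", ["iso-8859-1"])
#   dammit.unicode           # the markup decoded to a Unicode string
#   dammit.originalEncoding  # the codec that worked, or None if none did
#
# Encodings passed in overrideEncodings are tried first, then any
# declared or sniffed encoding, then chardet (if installed), and
# finally utf-8 and windows-1252.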

# Autodetects character encodings.
# Download from http://chardet.feedparser.org/
try:
    import chardet
#    import chardet.constants
#    chardet.constants._debug = 1
except ImportError:
    chardet = None

# cjkcodecs and iconv_codec make Python know about more character encodings.
# Both are available from http://cjkpython.i18n.org/
# They're built in if you use Python 2.4.
try:
    import cjkcodecs.aliases
except ImportError:
    pass
try:
    import iconv_codec
except ImportError:
    pass

class UnicodeDammit:
    """A class for detecting the encoding of a *ML document and
    converting it to a Unicode string. If the source encoding is
    windows-1252, can replace MS smart quotes with their HTML or XML
    equivalents."""

    # This dictionary maps commonly seen values for "charset" in HTML
    # meta tags to the corresponding Python codec names. It only covers
    # values that aren't in Python's aliases and can't be determined
    # by the heuristics in find_codec.
    CHARSET_ALIASES = { "macintosh" : "mac-roman",
                        "x-sjis" : "shift-jis" }

    def __init__(self, markup, overrideEncodings=[],
                 smartQuotesTo='xml', isHTML=False):
        self.declaredHTMLEncoding = None
        self.markup, documentEncoding, sniffedEncoding = \
                     self._detectEncoding(markup, isHTML)
        self.smartQuotesTo = smartQuotesTo
        self.triedEncodings = []
        if markup == '' or isinstance(markup, unicode):
            self.originalEncoding = None
            self.unicode = unicode(markup)
            return

        u = None
        for proposedEncoding in overrideEncodings:
            u = self._convertFrom(proposedEncoding)
            if u: break
        if not u:
            for proposedEncoding in (documentEncoding, sniffedEncoding):
                u = self._convertFrom(proposedEncoding)
                if u: break

        # If no luck and we have auto-detection library, try that:
        if not u and chardet and not isinstance(self.markup, unicode):
            u = self._convertFrom(chardet.detect(self.markup)['encoding'])

        # As a last resort, try utf-8 and windows-1252:
        if not u:
            for proposed_encoding in ("utf-8", "windows-1252"):
                u = self._convertFrom(proposed_encoding)
                if u: break

        self.unicode = u
        if not u: self.originalEncoding = None

    def _subMSChar(self, orig):
        """Changes a MS smart quote character to an XML or HTML
        entity."""
        sub = self.MS_CHARS.get(orig)
        if isinstance(sub, tuple):
            if self.smartQuotesTo == 'xml':
                sub = '&#x%s;' % sub[1]
            else:
                sub = '&%s;' % sub[0]
        return sub

    def _convertFrom(self, proposed):
        proposed = self.find_codec(proposed)
        if not proposed or proposed in self.triedEncodings:
            return None
        self.triedEncodings.append(proposed)
        markup = self.markup

        # Convert smart quotes to HTML or XML entities if coming from
        # an encoding that might have them.
        if self.smartQuotesTo and proposed.lower() in ("windows-1252",
                                                       "iso-8859-1",
                                                       "iso-8859-2"):
            markup = re.compile("([\x80-\x9f])").sub \
                     (lambda x: self._subMSChar(x.group(1)),
                      markup)

        try:
            # print "Trying to convert document to %s" % proposed
            u = self._toUnicode(markup, proposed)
            self.markup = u
            self.originalEncoding = proposed
        except Exception, e:
            # print "That didn't work!"
            # print e
            return None
        #print "Correct encoding: %s" % proposed
        return self.markup

    def _toUnicode(self, data, encoding):
        '''Given a string and its encoding, decodes the string into Unicode.
        %encoding is a string recognized by encodings.aliases'''

        # strip Byte Order Mark (if present)
        if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
               and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16be'
            data = data[2:]
        elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
                 and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16le'
            data = data[2:]
        elif data[:3] == '\xef\xbb\xbf':
            encoding = 'utf-8'
            data = data[3:]
        elif data[:4] == '\x00\x00\xfe\xff':
            encoding = 'utf-32be'
            data = data[4:]
        elif data[:4] == '\xff\xfe\x00\x00':
            encoding = 'utf-32le'
            data = data[4:]
        newdata = unicode(data, encoding)
        return newdata

    def _detectEncoding(self, xml_data, isHTML=False):
        """Given a document, tries to detect its XML encoding."""
        xml_encoding = sniffed_xml_encoding = None
        try:
            if xml_data[:4] == '\x4c\x6f\xa7\x94':
                # EBCDIC
                xml_data = self._ebcdic_to_ascii(xml_data)
            elif xml_data[:4] == '\x00\x3c\x00\x3f':
                # UTF-16BE
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
                     and (xml_data[2:4] != '\x00\x00'):
                # UTF-16BE with BOM
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x3f\x00':
                # UTF-16LE
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
                     (xml_data[2:4] != '\x00\x00'):
                # UTF-16LE with BOM
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\x00\x3c':
                # UTF-32BE
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x00\x00':
                # UTF-32LE
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\xfe\xff':
                # UTF-32BE with BOM
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\xff\xfe\x00\x00':
                # UTF-32LE with BOM
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
            elif xml_data[:3] == '\xef\xbb\xbf':
                # UTF-8 with BOM
                sniffed_xml_encoding = 'utf-8'
                xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
            else:
                sniffed_xml_encoding = 'ascii'
        except:
            xml_encoding_match = None
        xml_encoding_match = re.compile(
            '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
        if not xml_encoding_match and isHTML:
            regexp = re.compile('<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)
            xml_encoding_match = regexp.search(xml_data)
        if xml_encoding_match is not None:
            xml_encoding = xml_encoding_match.groups()[0].lower()
            if isHTML:
                self.declaredHTMLEncoding = xml_encoding
            if sniffed_xml_encoding and \
               (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
                                 'iso-10646-ucs-4', 'ucs-4', 'csucs4',
                                 'utf-16', 'utf-32', 'utf_16', 'utf_32',
                                 'utf16', 'u16')):
                xml_encoding = sniffed_xml_encoding
        return xml_data, xml_encoding, sniffed_xml_encoding


    def find_codec(self, charset):
        return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
               or (charset and self._codec(charset.replace("-", ""))) \
               or (charset and self._codec(charset.replace("-", "_"))) \
               or charset

    def _codec(self, charset):
        if not charset: return charset
        codec = None
        try:
            codecs.lookup(charset)
            codec = charset
        except (LookupError, ValueError):
            pass
        return codec

    EBCDIC_TO_ASCII_MAP = None
    def _ebcdic_to_ascii(self, s):
        c = self.__class__
        if not c.EBCDIC_TO_ASCII_MAP:
            emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
                    16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
                    128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
                    144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
                    32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
                    38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
                    45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
                    186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
                    195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
                    201,202,106,107,108,109,110,111,112,113,114,203,204,205,
                    206,207,208,209,126,115,116,117,118,119,120,121,122,210,
                    211,212,213,214,215,216,217,218,219,220,221,222,223,224,
                    225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
                    73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
                    82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
                    90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
                    250,251,252,253,254,255)
            import string
            c.EBCDIC_TO_ASCII_MAP = string.maketrans( \
            ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
        return s.translate(c.EBCDIC_TO_ASCII_MAP)

    MS_CHARS = { '\x80' : ('euro', '20AC'),
                 '\x81' : ' ',
                 '\x82' : ('sbquo', '201A'),
                 '\x83' : ('fnof', '192'),
                 '\x84' : ('bdquo', '201E'),
                 '\x85' : ('hellip', '2026'),
                 '\x86' : ('dagger', '2020'),
                 '\x87' : ('Dagger', '2021'),
                 '\x88' : ('circ', '2C6'),
                 '\x89' : ('permil', '2030'),
                 '\x8A' : ('Scaron', '160'),
                 '\x8B' : ('lsaquo', '2039'),
                 '\x8C' : ('OElig', '152'),
                 '\x8D' : '?',
                 '\x8E' : ('#x17D', '17D'),
                 '\x8F' : '?',
                 '\x90' : '?',
                 '\x91' : ('lsquo', '2018'),
                 '\x92' : ('rsquo', '2019'),
                 '\x93' : ('ldquo', '201C'),
                 '\x94' : ('rdquo', '201D'),
                 '\x95' : ('bull', '2022'),
                 '\x96' : ('ndash', '2013'),
                 '\x97' : ('mdash', '2014'),
                 '\x98' : ('tilde', '2DC'),
                 '\x99' : ('trade', '2122'),
                 '\x9a' : ('scaron', '161'),
                 '\x9b' : ('rsaquo', '203A'),
                 '\x9c' : ('oelig', '153'),
                 '\x9d' : '?',
                 '\x9e' : ('#x17E', '17E'),
                 # 0x9F in windows-1252 is U+0178; the hex code was
                 # previously empty, which produced "&#x;" in XML mode.
                 '\x9f' : ('Yuml', '178'),}

#######################################################################


#By default, act as an HTML pretty-printer.
if __name__ == '__main__':
    import sys
    soup = BeautifulSoup(sys.stdin)
    print soup.prettify()