10bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch"""Beautiful Soup
20bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochElixir and Tonic
30bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch"The Screen-Scraper's Friend"
40bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochhttp://www.crummy.com/software/BeautifulSoup/
50bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
60bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochBeautiful Soup parses a (possibly invalid) XML or HTML document into a
70bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochtree representation. It provides methods and Pythonic idioms that make
80bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochit easy to navigate, search, and modify the tree.
90bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochA well-formed XML/HTML document yields a well-formed data
110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochstructure. An ill-formed XML/HTML document yields a correspondingly
120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochill-formed data structure. If your document is only locally
130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochwell-formed, you can use this library to find and process the
140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochwell-formed part of it.
150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochBeautiful Soup works with Python 2.2 and up. It has no external
170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochdependencies, but you'll have more success at converting data to UTF-8
180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochif you also install these three packages:
190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch* chardet, for auto-detecting character encodings
210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch  http://chardet.feedparser.org/
220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch* cjkcodecs and iconv_codec, which add more encodings to the ones supported
230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch  by stock Python.
240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch  http://cjkpython.i18n.org/
250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochBeautiful Soup defines classes for two main parsing strategies:
270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific
290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch   language that kind of looks like XML.
300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid
320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch   or invalid. This class has web browser-like heuristics for
330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch   obtaining a sensible parse tree in the face of common HTML errors.
340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochBeautiful Soup also defines a class (UnicodeDammit) for autodetecting
360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochthe encoding of an HTML or XML document, and converting it to
370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochUnicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.
380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochFor more than you ever wanted to know about Beautiful Soup, see the
400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochdocumentation:
410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochhttp://www.crummy.com/software/BeautifulSoup/documentation.html
420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochHere, have some legalese:
440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochCopyright (c) 2004-2009, Leonard Richardson
460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochAll rights reserved.
480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochRedistribution and use in source and binary forms, with or without
500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochmodification, are permitted provided that the following conditions are
510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochmet:
520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch  * Redistributions of source code must retain the above copyright
540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    notice, this list of conditions and the following disclaimer.
550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch  * Redistributions in binary form must reproduce the above
570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    copyright notice, this list of conditions and the following
580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    disclaimer in the documentation and/or other materials provided
590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    with the distribution.
600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch  * Neither the name of the the Beautiful Soup Consortium and All
620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    Night Kosher Bakery nor the names of its contributors may be
630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    used to endorse or promote products derived from this software
640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    without specific prior written permission.
650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochLIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochA PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochCONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochEXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochPROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochPROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochLIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochNEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochSOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.
770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch"""
790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochfrom __future__ import generators
800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch__author__ = "Leonard Richardson (leonardr@segfault.org)"
820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch__version__ = "3.1.0.1"
830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch__copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch__license__ = "New-style BSD"
850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochimport codecs
870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochimport markupbase
880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochimport types
890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochimport re
900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochfrom HTMLParser import HTMLParser, HTMLParseError
910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochtry:
920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    from htmlentitydefs import name2codepoint
930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochexcept ImportError:
940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    name2codepoint = {}
950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochtry:
960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    set
970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochexcept NameError:
980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    from sets import Set as set
990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#These hacks make Beautiful Soup able to parse XML with namespaces
1010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochmarkupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match
1020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben MurdochDEFAULT_OUTPUT_ENCODING = "utf-8"
1040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# First, the classes that represent markup elements.
1060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochdef sob(unicode, encoding):
1080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """Returns either the given Unicode string or its encoding."""
1090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    if encoding is None:
1100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return unicode
1110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    else:
1120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return unicode.encode(encoding)
1130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass PageElement:
1150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """Contains the navigational information for some part of the page
1160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    (either a tag or a piece of text)"""
1170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def setup(self, parent=None, previous=None):
1190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Sets up the initial relations between this element and
1200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        other elements."""
1210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.parent = parent
1220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.previous = previous
1230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.next = None
1240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.previousSibling = None
1250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.nextSibling = None
1260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.parent and self.parent.contents:
1270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.previousSibling = self.parent.contents[-1]
1280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.previousSibling.nextSibling = self
1290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def replaceWith(self, replaceWith):
1310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        oldParent = self.parent
1320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        myIndex = self.parent.contents.index(self)
1330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if hasattr(replaceWith, 'parent') and replaceWith.parent == self.parent:
1340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # We're replacing this element with one of its siblings.
1350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            index = self.parent.contents.index(replaceWith)
1360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if index and index < myIndex:
1370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # Furthermore, it comes before this element. That
1380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # means that when we extract it, the index of this
1390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # element will change.
1400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                myIndex = myIndex - 1
1410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.extract()
1420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        oldParent.insert(myIndex, replaceWith)
1430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def extract(self):
1450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Destructively rips this element out of the tree."""
1460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.parent:
1470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            try:
1480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.parent.contents.remove(self)
1490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            except ValueError:
1500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                pass
1510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #Find the two elements that would be next to each other if
1530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #this element (and any children) hadn't been parsed. Connect
1540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #the two.
1550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        lastChild = self._lastRecursiveChild()
1560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        nextElement = lastChild.next
1570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.previous:
1590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.previous.next = nextElement
1600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if nextElement:
1610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            nextElement.previous = self.previous
1620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.previous = None
1630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        lastChild.next = None
1640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.parent = None
1660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.previousSibling:
1670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.previousSibling.nextSibling = self.nextSibling
1680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.nextSibling:
1690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.nextSibling.previousSibling = self.previousSibling
1700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.previousSibling = self.nextSibling = None
1710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self
1720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _lastRecursiveChild(self):
1740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        "Finds the last element beneath this object to be parsed."
1750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        lastChild = self
1760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        while hasattr(lastChild, 'contents') and lastChild.contents:
1770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            lastChild = lastChild.contents[-1]
1780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return lastChild
1790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def insert(self, position, newChild):
1810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if (isinstance(newChild, basestring)
1820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            or isinstance(newChild, unicode)) \
1830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            and not isinstance(newChild, NavigableString):
1840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChild = NavigableString(newChild)
1850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
1860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        position =  min(position, len(self.contents))
1870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if hasattr(newChild, 'parent') and newChild.parent != None:
1880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # We're 'inserting' an element that's already one
1890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # of this object's children.
1900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if newChild.parent == self:
1910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                index = self.find(newChild)
1920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if index and index < position:
1930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # Furthermore we're moving it further down the
1940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # list of this object's children. That means that
1950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # when we extract this element, our target index
1960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # will jump down one.
1970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    position = position - 1
1980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChild.extract()
1990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        newChild.parent = self
2010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        previousChild = None
2020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if position == 0:
2030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChild.previousSibling = None
2040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChild.previous = self
2050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
2060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            previousChild = self.contents[position-1]
2070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChild.previousSibling = previousChild
2080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChild.previousSibling.nextSibling = newChild
2090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChild.previous = previousChild._lastRecursiveChild()
2100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if newChild.previous:
2110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChild.previous.next = newChild
2120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        newChildsLastElement = newChild._lastRecursiveChild()
2140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if position >= len(self.contents):
2160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChild.nextSibling = None
2170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            parent = self
2190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            parentsNextSibling = None
2200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            while not parentsNextSibling:
2210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                parentsNextSibling = parent.nextSibling
2220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                parent = parent.parent
2230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if not parent: # This is the last element in the document.
2240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    break
2250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if parentsNextSibling:
2260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                newChildsLastElement.next = parentsNextSibling
2270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            else:
2280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                newChildsLastElement.next = None
2290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
2300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            nextChild = self.contents[position]
2310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChild.nextSibling = nextChild
2320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if newChild.nextSibling:
2330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                newChild.nextSibling.previousSibling = newChild
2340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChildsLastElement.next = nextChild
2350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if newChildsLastElement.next:
2370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            newChildsLastElement.next.previous = newChildsLastElement
2380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.contents.insert(position, newChild)
2390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def append(self, tag):
2410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Appends the given tag to the contents of this tag."""
2420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.insert(len(self.contents), tag)
2430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def findNext(self, name=None, attrs={}, text=None, **kwargs):
2450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns the first item that matches the given criteria and
2460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        appears after this Tag in the document."""
2470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._findOne(self.findAllNext, name, attrs, text, **kwargs)
2480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def findAllNext(self, name=None, attrs={}, text=None, limit=None,
2500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    **kwargs):
2510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns all items that match the given criteria and appear
2520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        after this Tag in the document."""
2530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._findAll(name, attrs, text, limit, self.nextGenerator,
2540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                             **kwargs)
2550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def findNextSibling(self, name=None, attrs={}, text=None, **kwargs):
2570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns the closest sibling to this Tag that matches the
2580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        given criteria and appears after this Tag in the document."""
2590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._findOne(self.findNextSiblings, name, attrs, text,
2600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                             **kwargs)
2610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def findNextSiblings(self, name=None, attrs={}, text=None, limit=None,
2630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                         **kwargs):
2640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns the siblings of this Tag that match the given
2650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        criteria and appear after this Tag in the document."""
2660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._findAll(name, attrs, text, limit,
2670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                             self.nextSiblingGenerator, **kwargs)
2680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x
2690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def findPrevious(self, name=None, attrs={}, text=None, **kwargs):
2710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns the first item that matches the given criteria and
2720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        appears before this Tag in the document."""
2730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs)
2740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def findAllPrevious(self, name=None, attrs={}, text=None, limit=None,
2760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        **kwargs):
2770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns all items that match the given criteria and appear
2780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        before this Tag in the document."""
2790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._findAll(name, attrs, text, limit, self.previousGenerator,
2800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           **kwargs)
2810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    fetchPrevious = findAllPrevious # Compatibility with pre-3.x
2820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs):
2840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns the closest sibling to this Tag that matches the
2850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        given criteria and appears before this Tag in the document."""
2860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._findOne(self.findPreviousSiblings, name, attrs, text,
2870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                             **kwargs)
2880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def findPreviousSiblings(self, name=None, attrs={}, text=None,
2900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                             limit=None, **kwargs):
2910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns the siblings of this Tag that match the given
2920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        criteria and appear before this Tag in the document."""
2930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._findAll(name, attrs, text, limit,
2940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                             self.previousSiblingGenerator, **kwargs)
2950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x
2960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
2970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def findParent(self, name=None, attrs={}, **kwargs):
2980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns the closest parent of this Tag that matches the given
2990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        criteria."""
3000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # NOTE: We can't use _findOne because findParents takes a different
3010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # set of arguments.
3020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        r = None
3030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        l = self.findParents(name, attrs, 1)
3040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if l:
3050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            r = l[0]
3060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return r
3070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def findParents(self, name=None, attrs={}, limit=None, **kwargs):
3090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns the parents of this Tag that match the given
3100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        criteria."""
3110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._findAll(name, attrs, None, limit, self.parentGenerator,
3130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                             **kwargs)
3140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    fetchParents = findParents # Compatibility with pre-3.x
3150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #These methods do the real heavy lifting.
3170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _findOne(self, method, name, attrs, text, **kwargs):
3190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        r = None
3200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        l = method(name, attrs, text, 1, **kwargs)
3210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if l:
3220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            r = l[0]
3230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return r
3240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _findAll(self, name, attrs, text, limit, generator, **kwargs):
3260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        "Iterates over a generator looking for things that match."
3270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if isinstance(name, SoupStrainer):
3290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            strainer = name
3300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
3310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # Build a SoupStrainer
3320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            strainer = SoupStrainer(name, attrs, text, **kwargs)
3330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        results = ResultSet(strainer)
3340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        g = generator()
3350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        while True:
3360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            try:
3370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                i = g.next()
3380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            except StopIteration:
3390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                break
3400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if i:
3410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                found = strainer.search(i)
3420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if found:
3430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    results.append(found)
3440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    if limit and len(results) >= limit:
3450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        break
3460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return results
3470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #These Generators can be used to navigate starting from both
3490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #NavigableStrings and Tags.
3500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def nextGenerator(self):
3510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        i = self
3520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        while i:
3530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            i = i.next
3540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            yield i
3550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def nextSiblingGenerator(self):
3570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        i = self
3580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        while i:
3590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            i = i.nextSibling
3600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            yield i
3610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def previousGenerator(self):
3630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        i = self
3640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        while i:
3650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            i = i.previous
3660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            yield i
3670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def previousSiblingGenerator(self):
3690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        i = self
3700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        while i:
3710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            i = i.previousSibling
3720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            yield i
3730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def parentGenerator(self):
3750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        i = self
3760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        while i:
3770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            i = i.parent
3780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            yield i
3790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # Utility methods
3810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def substituteEncoding(self, str, encoding=None):
3820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        encoding = encoding or "utf-8"
3830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return str.replace("%SOUP-ENCODING%", encoding)
3840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
3850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def toEncoding(self, s, encoding=None):
3860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Encodes an object to a string in some encoding, or to Unicode.
3870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        ."""
3880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if isinstance(s, unicode):
3890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if encoding:
3900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s = s.encode(encoding)
3910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif isinstance(s, str):
3920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if encoding:
3930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s = s.encode(encoding)
3940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            else:
3950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s = unicode(s)
3960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
3970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if encoding:
3980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s  = self.toEncoding(str(s), encoding)
3990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            else:
4000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s = unicode(s)
4010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return s
4020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass NavigableString(unicode, PageElement):
4040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __new__(cls, value):
4060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Create a new NavigableString.
4070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        When unpickling a NavigableString, this method is called with
4090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
4100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        passed in to the superclass's __new__ or the superclass won't know
4110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        how to handle non-ASCII characters.
4120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """
4130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if isinstance(value, unicode):
4140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return unicode.__new__(cls, value)
4150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
4160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __getnewargs__(self):
4180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return (unicode(self),)
4190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __getattr__(self, attr):
4210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """text.string gives you text. This is for backwards
4220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        compatibility for Navigable*String, but for CData* it lets you
4230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        get the string without the CData wrapper."""
4240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if attr == 'string':
4250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return self
4260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
4270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
4280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def encode(self, encoding=DEFAULT_OUTPUT_ENCODING):
4300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.decode().encode(encoding)
4310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def decodeGivenEventualEncoding(self, eventualEncoding):
4330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self
4340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass CData(NavigableString):
4360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def decodeGivenEventualEncoding(self, eventualEncoding):
4380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return u'<![CDATA[' + self + u']]>'
4390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass ProcessingInstruction(NavigableString):
4410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def decodeGivenEventualEncoding(self, eventualEncoding):
4430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        output = self
4440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if u'%SOUP-ENCODING%' in output:
4450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            output = self.substituteEncoding(output, eventualEncoding)
4460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return u'<?' + output + u'?>'
4470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass Comment(NavigableString):
4490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def decodeGivenEventualEncoding(self, eventualEncoding):
4500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return u'<!--' + self + u'-->'
4510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass Declaration(NavigableString):
4530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def decodeGivenEventualEncoding(self, eventualEncoding):
4540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return u'<!' + self + u'>'
4550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass Tag(PageElement):
4570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """Represents a found HTML tag with its attributes and contents."""
4590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _invert(h):
4610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        "Cheap function to invert a hash."
4620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        i = {}
4630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        for k,v in h.items():
4640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            i[v] = k
4650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return i
4660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'",
4680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                      "quot" : '"',
4690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                      "amp" : "&",
4700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                      "lt" : "<",
4710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                      "gt" : ">" }
4720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS)
4740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _convertEntities(self, match):
4760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Used in a call to re.sub to replace HTML, XML, and numeric
4770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        entities with the appropriate Unicode characters. If HTML
4780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        entities are being converted, any unrecognized entities are
4790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        escaped."""
4800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        x = match.group(1)
4810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.convertHTMLEntities and x in name2codepoint:
4820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return unichr(name2codepoint[x])
4830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS:
4840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if self.convertXMLEntities:
4850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                return self.XML_ENTITIES_TO_SPECIAL_CHARS[x]
4860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            else:
4870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                return u'&%s;' % x
4880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif len(x) > 0 and x[0] == '#':
4890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # Handle numeric entities
4900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if len(x) > 1 and x[1] == 'x':
4910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                return unichr(int(x[2:], 16))
4920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            else:
4930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                return unichr(int(x[1:]))
4940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
4950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif self.escapeUnrecognizedEntities:
4960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return u'&amp;%s;' % x
4970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
4980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return u'&%s;' % x
4990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __init__(self, parser, name, attrs=None, parent=None,
5010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 previous=None):
5020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        "Basic constructor."
5030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # We don't actually store the parser object: that lets extracted
5050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # chunks be garbage-collected
5060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.parserClass = parser.__class__
5070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.isSelfClosing = parser.isSelfClosingTag(name)
5080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.name = name
5090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if attrs == None:
5100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            attrs = []
5110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.attrs = attrs
5120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.contents = []
5130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.setup(parent, previous)
5140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.hidden = False
5150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.containsSubstitutions = False
5160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.convertHTMLEntities = parser.convertHTMLEntities
5170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.convertXMLEntities = parser.convertXMLEntities
5180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities
5190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        def convert(kval):
5210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            "Converts HTML, XML and numeric entities in the attribute value."
5220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            k, val = kval
5230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if val is None:
5240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                return kval
5250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return (k, re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);",
5260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                              self._convertEntities, val))
5270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.attrs = map(convert, self.attrs)
5280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def get(self, key, default=None):
5300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns the value of the 'key' attribute for the tag, or
5310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        the value given for 'default' if it doesn't have that
5320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        attribute."""
5330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._getAttrMap().get(key, default)
5340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def has_key(self, key):
5360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._getAttrMap().has_key(key)
5370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __getitem__(self, key):
5390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """tag[key] returns the value of the 'key' attribute for the tag,
5400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        and throws an exception if it's not there."""
5410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._getAttrMap()[key]
5420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __iter__(self):
5440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        "Iterating over a tag iterates over its contents."
5450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return iter(self.contents)
5460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __len__(self):
5480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        "The length of a tag is the length of its list of contents."
5490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return len(self.contents)
5500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __contains__(self, x):
5520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return x in self.contents
5530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __nonzero__(self):
5550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        "A tag is non-None even if it has no contents."
5560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return True
5570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __setitem__(self, key, value):
5590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Setting tag[key] sets the value of the 'key' attribute for the
5600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        tag."""
5610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self._getAttrMap()
5620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.attrMap[key] = value
5630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        found = False
5640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        for i in range(0, len(self.attrs)):
5650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if self.attrs[i][0] == key:
5660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.attrs[i] = (key, value)
5670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                found = True
5680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not found:
5690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.attrs.append((key, value))
5700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self._getAttrMap()[key] = value
5710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __delitem__(self, key):
5730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        "Deleting tag[key] deletes all 'key' attributes for the tag."
5740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        for item in self.attrs:
5750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if item[0] == key:
5760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.attrs.remove(item)
5770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                #We don't break because bad HTML can define the same
5780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                #attribute multiple times.
5790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self._getAttrMap()
5800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if self.attrMap.has_key(key):
5810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                del self.attrMap[key]
5820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __call__(self, *args, **kwargs):
5840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Calling a tag like a function is the same as calling its
5850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        findAll() method. Eg. tag('a') returns a list of all the A tags
5860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        found within this tag."""
5870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return apply(self.findAll, args, kwargs)
5880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __getattr__(self, tag):
5900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #print "Getattr %s.%s" % (self.__class__, tag)
5910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3:
5920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return self.find(tag[:-3])
5930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif tag.find('__') != 0:
5940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return self.find(tag)
5950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag)
5960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
5970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __eq__(self, other):
5980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns true iff this tag has the same name, the same attributes,
5990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        and the same contents (recursively) as the given tag.
6000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        NOTE: right now this will return false if two tags have the
6020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        same attributes in a different order. Should this be fixed?"""
6030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other):
6040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return False
6050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        for i in range(0, len(self.contents)):
6060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if self.contents[i] != other.contents[i]:
6070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                return False
6080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return True
6090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __ne__(self, other):
6110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns true iff this tag is not identical to the other tag,
6120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        as defined in __eq__."""
6130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return not self == other
6140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
6160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Renders this tag as a string."""
6170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.decode(eventualEncoding=encoding)
6180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
6200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                           + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
6210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                           + ")")
6220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _sub_entity(self, x):
6240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Used with a regular expression to substitute the
6250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        appropriate XML entity for an XML special character."""
6260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";"
6270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __unicode__(self):
6290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.decode()
6300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __str__(self):
6320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.encode()
6330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def encode(self, encoding=DEFAULT_OUTPUT_ENCODING,
6350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               prettyPrint=False, indentLevel=0):
6360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.decode(prettyPrint, indentLevel, encoding).encode(encoding)
6370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def decode(self, prettyPrint=False, indentLevel=0,
6390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               eventualEncoding=DEFAULT_OUTPUT_ENCODING):
6400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns a string or Unicode representation of this tag and
6410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        its contents. To get Unicode, pass None for encoding."""
6420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        attrs = []
6440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.attrs:
6450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            for key, val in self.attrs:
6460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                fmt = '%s="%s"'
6470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if isString(val):
6480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    if (self.containsSubstitutions
6490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        and eventualEncoding is not None
6500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        and '%SOUP-ENCODING%' in val):
6510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        val = self.substituteEncoding(val, eventualEncoding)
6520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # The attribute value either:
6540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    #
6550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # * Contains no embedded double quotes or single quotes.
6560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    #   No problem: we enclose it in double quotes.
6570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # * Contains embedded single quotes. No problem:
6580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    #   double quotes work here too.
6590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # * Contains embedded double quotes. No problem:
6600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    #   we enclose it in single quotes.
6610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # * Embeds both single _and_ double quotes. This
6620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    #   can't happen naturally, but it can happen if
6630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    #   you modify an attribute value after parsing
6640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    #   the document. Now we have a bit of a
6650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    #   problem. We solve it by enclosing the
6660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    #   attribute in single quotes, and escaping any
6670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    #   embedded single quotes to XML entities.
6680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    if '"' in val:
6690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        fmt = "%s='%s'"
6700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        if "'" in val:
6710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                            # TODO: replace with apos when
6720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                            # appropriate.
6730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                            val = val.replace("'", "&squot;")
6740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # Now we're okay w/r/t quotes. But the attribute
6760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # value might also contain angle brackets, or
6770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # ampersands that aren't part of entities. We need
6780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # to escape those to XML entities too.
6790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val)
6800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if val is None:
6810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # Handle boolean attributes.
6820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    decoded = key
6830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                else:
6840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    decoded = fmt % (key, val)
6850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                attrs.append(decoded)
6860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        close = ''
6870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        closeTag = ''
6880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.isSelfClosing:
6890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            close = ' /'
6900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
6910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            closeTag = '</%s>' % self.name
6920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
6930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        indentTag, indentContents = 0, 0
6940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if prettyPrint:
6950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            indentTag = indentLevel
6960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            space = (' ' * (indentTag-1))
6970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            indentContents = indentTag + 1
6980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        contents = self.decodeContents(prettyPrint, indentContents,
6990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                       eventualEncoding)
7000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.hidden:
7010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            s = contents
7020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
7030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            s = []
7040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            attributeString = ''
7050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if attrs:
7060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                attributeString = ' ' + ' '.join(attrs)
7070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if prettyPrint:
7080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s.append(space)
7090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            s.append('<%s%s%s>' % (self.name, attributeString, close))
7100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if prettyPrint:
7110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s.append("\n")
7120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            s.append(contents)
7130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if prettyPrint and contents and contents[-1] != "\n":
7140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s.append("\n")
7150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if prettyPrint and closeTag:
7160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s.append(space)
7170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            s.append(closeTag)
7180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if prettyPrint and closeTag and self.nextSibling:
7190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s.append("\n")
7200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            s = ''.join(s)
7210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return s
7220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
7230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def decompose(self):
7240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Recursively destroys the contents of this tree."""
7250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        contents = [i for i in self.contents]
7260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        for i in contents:
7270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if isinstance(i, Tag):
7280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                i.decompose()
7290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            else:
7300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                i.extract()
7310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.extract()
7320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
7330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING):
7340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.encode(encoding, True)
7350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
7360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def encodeContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
7370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                       prettyPrint=False, indentLevel=0):
7380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.decodeContents(prettyPrint, indentLevel).encode(encoding)
7390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
7400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def decodeContents(self, prettyPrint=False, indentLevel=0,
7410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                       eventualEncoding=DEFAULT_OUTPUT_ENCODING):
7420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Renders the contents of this tag as a string in the given
7430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        encoding. If encoding is None, returns a Unicode string.."""
7440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        s=[]
7450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        for c in self:
7460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            text = None
7470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if isinstance(c, NavigableString):
7480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                text = c.decodeGivenEventualEncoding(eventualEncoding)
7490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif isinstance(c, Tag):
7500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s.append(c.decode(prettyPrint, indentLevel, eventualEncoding))
7510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if text and prettyPrint:
7520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                text = text.strip()
7530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if text:
7540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if prettyPrint:
7550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    s.append(" " * (indentLevel-1))
7560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                s.append(text)
7570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if prettyPrint:
7580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    s.append("\n")
7590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return ''.join(s)
7600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
7610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #Soup methods
7620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
7630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def find(self, name=None, attrs={}, recursive=True, text=None,
7640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch             **kwargs):
7650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Return only the first child of this Tag matching the given
7660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        criteria."""
7670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        r = None
7680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        l = self.findAll(name, attrs, recursive, text, 1, **kwargs)
7690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if l:
7700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            r = l[0]
7710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return r
7720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    findChild = find
7730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
7740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def findAll(self, name=None, attrs={}, recursive=True, text=None,
7750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                limit=None, **kwargs):
7760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Extracts a list of Tag objects that match the given
7770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        criteria.  You can specify the name of the Tag and any
7780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        attributes you want the Tag to have.
7790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
7800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        The value of a key-value pair in the 'attrs' map can be a
7810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        string, a list of strings, a regular expression object, or a
7820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        callable that takes a string and returns whether or not the
7830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        string matches for some custom definition of 'matches'. The
7840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        same is true of the tag name."""
7850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        generator = self.recursiveChildGenerator
7860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not recursive:
7870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            generator = self.childGenerator
7880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._findAll(name, attrs, text, limit, generator, **kwargs)
7890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    findChildren = findAll
7900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
7910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # Pre-3.x compatibility methods. Will go away in 4.0.
7920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    first = find
7930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    fetch = findAll
7940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
7950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def fetchText(self, text=None, recursive=True, limit=None):
7960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.findAll(text=text, recursive=recursive, limit=limit)
7970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
7980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def firstText(self, text=None, recursive=True):
7990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.find(text=text, recursive=recursive)
8000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
8010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # 3.x compatibility methods. Will go away in 4.0.
8020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
8030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                       prettyPrint=False, indentLevel=0):
8040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if encoding is None:
8050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return self.decodeContents(prettyPrint, indentLevel, encoding)
8060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
8070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return self.encodeContents(encoding, prettyPrint, indentLevel)
8080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
8090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
8100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #Private methods
8110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
8120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _getAttrMap(self):
8130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Initializes a map representation of this tag's attributes,
8140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not already initialized."""
8150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not getattr(self, 'attrMap'):
8160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.attrMap = {}
8170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            for (key, value) in self.attrs:
8180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.attrMap[key] = value
8190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.attrMap
8200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
8210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #Generator methods
8220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def recursiveChildGenerator(self):
8230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not len(self.contents):
8240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            raise StopIteration
8250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        stopNode = self._lastRecursiveChild().next
8260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        current = self.contents[0]
8270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        while current is not stopNode:
8280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            yield current
8290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            current = current.next
8300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
8310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def childGenerator(self):
8320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not len(self.contents):
8330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            raise StopIteration
8340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        current = self.contents[0]
8350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        while current:
8360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            yield current
8370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            current = current.nextSibling
8380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        raise StopIteration
8390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
8400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# Next, a couple classes to represent queries and their results.
8410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass SoupStrainer:
8420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """Encapsulates a number of ways of matching a markup element (tag or
8430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    text)."""
8440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
8450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __init__(self, name=None, attrs={}, text=None, **kwargs):
8460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.name = name
8470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if isString(attrs):
8480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            kwargs['class'] = attrs
8490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            attrs = None
8500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if kwargs:
8510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if attrs:
8520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                attrs = attrs.copy()
8530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                attrs.update(kwargs)
8540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            else:
8550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                attrs = kwargs
8560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.attrs = attrs
8570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.text = text
8580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
8590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __str__(self):
8600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.text:
8610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return self.text
8620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
8630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return "%s|%s" % (self.name, self.attrs)
8640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
8650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def searchTag(self, markupName=None, markupAttrs={}):
8660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        found = None
8670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        markup = None
8680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if isinstance(markupName, Tag):
8690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            markup = markupName
8700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            markupAttrs = markup
8710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        callFunctionWithTagData = callable(self.name) \
8720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                and not isinstance(markupName, Tag)
8730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
8740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if (not self.name) \
8750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               or callFunctionWithTagData \
8760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               or (markup and self._matches(markup, self.name)) \
8770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               or (not markup and self._matches(markupName, self.name)):
8780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if callFunctionWithTagData:
8790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                match = self.name(markupName, markupAttrs)
8800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            else:
8810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                match = True
8820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                markupAttrMap = None
8830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                for attr, matchAgainst in self.attrs.items():
8840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    if not markupAttrMap:
8850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                         if hasattr(markupAttrs, 'get'):
8860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                            markupAttrMap = markupAttrs
8870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                         else:
8880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                            markupAttrMap = {}
8890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                            for k,v in markupAttrs:
8900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                markupAttrMap[k] = v
8910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    attrValue = markupAttrMap.get(attr)
8920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    if not self._matches(attrValue, matchAgainst):
8930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        match = False
8940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        break
8950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if match:
8960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if markup:
8970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    found = markup
8980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                else:
8990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    found = markupName
9000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return found
9010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
9020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def search(self, markup):
9030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #print 'looking for %s in %s' % (self, markup)
9040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        found = None
9050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # If given a list of items, scan it for a text element that
9060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # matches.
9070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if isList(markup) and not isinstance(markup, Tag):
9080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            for element in markup:
9090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if isinstance(element, NavigableString) \
9100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                       and self.search(element):
9110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    found = element
9120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    break
9130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # If it's a Tag, make sure its name or attributes match.
9140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # Don't bother with Tags if we're searching for text.
9150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif isinstance(markup, Tag):
9160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if not self.text:
9170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                found = self.searchTag(markup)
9180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # If it's text, make sure the text matches.
9190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif isinstance(markup, NavigableString) or \
9200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 isString(markup):
9210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if self._matches(markup, self.text):
9220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                found = markup
9230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
9240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            raise Exception, "I don't know how to match against a %s" \
9250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                  % markup.__class__
9260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return found
9270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
9280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _matches(self, markup, matchAgainst):
9290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #print "Matching %s against %s" % (markup, matchAgainst)
9300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        result = False
9310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if matchAgainst == True and type(matchAgainst) == types.BooleanType:
9320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            result = markup != None
9330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif callable(matchAgainst):
9340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            result = matchAgainst(markup)
9350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
9360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            #Custom match methods take the tag as an argument, but all
9370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            #other ways of matching match the tag name as a string.
9380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if isinstance(markup, Tag):
9390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                markup = markup.name
9400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if markup is not None and not isString(markup):
9410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                markup = unicode(markup)
9420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            #Now we know that chunk is either a string, or None.
9430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if hasattr(matchAgainst, 'match'):
9440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # It's a regexp object.
9450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                result = markup and matchAgainst.search(markup)
9460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif (isList(matchAgainst)
9470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                  and (markup is not None or not isString(matchAgainst))):
9480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                result = markup in matchAgainst
9490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif hasattr(matchAgainst, 'items'):
9500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                result = markup.has_key(matchAgainst)
9510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif matchAgainst and isString(markup):
9520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if isinstance(markup, unicode):
9530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    matchAgainst = unicode(matchAgainst)
9540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                else:
9550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    matchAgainst = str(matchAgainst)
9560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
9570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if not result:
9580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                result = matchAgainst == markup
9590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return result
9600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
9610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass ResultSet(list):
9620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """A ResultSet is just a list that keeps track of the SoupStrainer
9630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    that created it."""
9640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __init__(self, source):
9650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        list.__init__([])
9660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.source = source
9670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
9680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# Now, some helper functions.
9690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
9700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochdef isList(l):
9710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """Convenience method that works with all 2.x versions of Python
9720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    to determine whether or not something is listlike."""
9730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    return ((hasattr(l, '__iter__') and not isString(l))
9740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            or (type(l) in (types.ListType, types.TupleType)))
9750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
9760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochdef isString(s):
9770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """Convenience method that works with all 2.x versions of Python
9780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    to determine whether or not something is stringlike."""
9790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    try:
9800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return isinstance(s, unicode) or isinstance(s, basestring)
9810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    except NameError:
9820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return isinstance(s, str)
9830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
9840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochdef buildTagMap(default, *args):
9850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """Turns a list of maps, lists, or scalars into a single map.
9860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
9870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    NESTING_RESET_TAGS maps out of lists and partial maps."""
9880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    built = {}
9890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    for portion in args:
9900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if hasattr(portion, 'items'):
9910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            #It's a map. Merge it.
9920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            for k,v in portion.items():
9930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                built[k] = v
9940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif isList(portion) and not isString(portion):
9950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            #It's a list. Map each item to the default.
9960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            for k in portion:
9970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                built[k] = default
9980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
9990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            #It's a scalar. Map it to the default.
10000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            built[portion] = default
10010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    return built
10020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# Now, the parser classes.
10040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass HTMLParserBuilder(HTMLParser):
10060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __init__(self, soup):
10080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        HTMLParser.__init__(self)
10090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.soup = soup
10100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # We inherit feed() and reset().
10120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def handle_starttag(self, name, attrs):
10140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if name == 'meta':
10150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.soup.extractCharsetFromMeta(attrs)
10160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
10170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.soup.unknown_starttag(name, attrs)
10180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def handle_endtag(self, name):
10200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.soup.unknown_endtag(name)
10210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def handle_data(self, content):
10230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.soup.handle_data(content)
10240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _toStringSubclass(self, text, subclass):
10260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Adds a certain piece of text to the tree as a NavigableString
10270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        subclass."""
10280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.soup.endData()
10290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.handle_data(text)
10300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.soup.endData(subclass)
10310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def handle_pi(self, text):
10330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Handle a processing instruction as a ProcessingInstruction
10340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        object, possibly one with a %SOUP-ENCODING% slot into which an
10350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        encoding will be plugged later."""
10360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if text[:3] == "xml":
10370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
10380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self._toStringSubclass(text, ProcessingInstruction)
10390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def handle_comment(self, text):
10410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        "Handle comments as Comment objects."
10420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self._toStringSubclass(text, Comment)
10430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def handle_charref(self, ref):
10450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        "Handle character references as data."
10460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.soup.convertEntities:
10470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            data = unichr(int(ref))
10480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
10490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            data = '&#%s;' % ref
10500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.handle_data(data)
10510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def handle_entityref(self, ref):
10530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Handle entity references as data, possibly converting known
10540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        HTML and/or XML entity references to the corresponding Unicode
10550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        characters."""
10560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        data = None
10570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.soup.convertHTMLEntities:
10580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            try:
10590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                data = unichr(name2codepoint[ref])
10600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            except KeyError:
10610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                pass
10620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not data and self.soup.convertXMLEntities:
10640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                data = self.soup.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)
10650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not data and self.soup.convertHTMLEntities and \
10670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            not self.soup.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref):
10680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # TODO: We've got a problem here. We're told this is
10690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # an entity reference, but it's not an XML entity
10700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # reference or an HTML entity reference. Nonetheless,
10710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # the logical thing to do is to pass it through as an
10720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # unrecognized entity reference.
10730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                #
10740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # Except: when the input is "&carol;" this function
10750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # will be called with input "carol". When the input is
10760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # "AT&T", this function will be called with input
10770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # "T". We have no way of knowing whether a semicolon
10780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # was present originally, so we don't know whether
10790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # this is an unknown entity or just a misplaced
10800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # ampersand.
10810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                #
10820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # The more common case is a misplaced ampersand, so I
10830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # escape the ampersand and omit the trailing semicolon.
10840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                data = "&amp;%s" % ref
10850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not data:
10860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # This case is different from the one above, because we
10870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # haven't already gone through a supposedly comprehensive
10880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # mapping of entities to Unicode characters. We might not
10890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # have gone through any mapping at all. So the chances are
10900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # very high that this is a real entity, and not a
10910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # misplaced ampersand.
10920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            data = "&%s;" % ref
10930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.handle_data(data)
10940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def handle_decl(self, data):
10960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        "Handle DOCTYPEs and the like as Declaration objects."
10970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self._toStringSubclass(data, Declaration)
10980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
10990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def parse_declaration(self, i):
11000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Treat a bogus SGML declaration as raw data. Treat a CDATA
11010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        declaration as a CData object."""
11020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        j = None
11030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.rawdata[i:i+9] == '<![CDATA[':
11040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch             k = self.rawdata.find(']]>', i)
11050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch             if k == -1:
11060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 k = len(self.rawdata)
11070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch             data = self.rawdata[i+9:k]
11080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch             j = k+3
11090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch             self._toStringSubclass(data, CData)
11100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
11110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            try:
11120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                j = HTMLParser.parse_declaration(self, i)
11130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            except HTMLParseError:
11140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                toHandle = self.rawdata[i:]
11150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.handle_data(toHandle)
11160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                j = i + len(toHandle)
11170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return j
11180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass BeautifulStoneSoup(Tag):
11210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """This class contains the basic parser and search code. It defines
11230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    a parser that knows nothing about tag behavior except for the
11240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    following:
11250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      You can't close a tag without closing all the tags it encloses.
11270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      That is, "<foo><bar></foo>" actually means
11280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      "<foo><bar></bar></foo>".
11290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    [Another possible explanation is "<foo><bar /></foo>", but since
11310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    this class defines no SELF_CLOSING_TAGS, it will never use that
11320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    explanation.]
11330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    This class is useful for parsing XML or made-up markup languages,
11350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    or when BeautifulSoup makes an assumption counter to what you were
11360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    expecting."""
11370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    SELF_CLOSING_TAGS = {}
11390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    NESTABLE_TAGS = {}
11400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    RESET_NESTING_TAGS = {}
11410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    QUOTE_TAGS = {}
11420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    PRESERVE_WHITESPACE_TAGS = []
11430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'),
11450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                       lambda x: x.group(1) + ' />'),
11460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                      (re.compile('<!\s+([^<>]*)>'),
11470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                       lambda x: '<!' + x.group(1) + '>')
11480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                      ]
11490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    ROOT_TAG_NAME = u'[document]'
11510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    HTML_ENTITIES = "html"
11530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    XML_ENTITIES = "xml"
11540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    XHTML_ENTITIES = "xhtml"
11550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # TODO: This only exists for backwards-compatibility
11560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    ALL_ENTITIES = XHTML_ENTITIES
11570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # Used when determining whether a text node is all whitespace and
11590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # can be replaced with a single space. A text node that contains
11600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # fancy Unicode spaces (usually non-breaking) should be left
11610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # alone.
11620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }
11630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
11650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 markupMassage=True, smartQuotesTo=XML_ENTITIES,
11660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 convertEntities=None, selfClosingTags=None, isHTML=False,
11670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 builder=HTMLParserBuilder):
11680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """The Soup object is initialized as the 'root tag', and the
11690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        provided markup (which can be a string or a file-like object)
11700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        is fed into the underlying parser.
11710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        HTMLParser will process most bad HTML, and the BeautifulSoup
11730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        class has some tricks for dealing with some HTML that kills
11740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        HTMLParser, but Beautiful Soup can nonetheless choke or lose data
11750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if your data uses self-closing tags or declarations
11760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        incorrectly.
11770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        By default, Beautiful Soup uses regexes to sanitize input,
11790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        avoiding the vast majority of these problems. If the problems
11800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        don't apply to you, pass in False for markupMassage, and
11810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        you'll get better performance.
11820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        The default parser massage techniques fix the two most common
11840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        instances of invalid HTML that choke HTMLParser:
11850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch         <br/> (No space between name of closing tag and tag close)
11870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch         <! --Comment--> (Extraneous whitespace in declaration)
11880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        You can pass in a custom list of (RE object, replace method)
11900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        tuples to get Beautiful Soup to scrub your input the way you
11910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        want."""
11920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
11930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.parseOnlyThese = parseOnlyThese
11940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.fromEncoding = fromEncoding
11950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.smartQuotesTo = smartQuotesTo
11960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.convertEntities = convertEntities
11970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # Set the rules for how we'll deal with the entities we
11980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # encounter
11990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.convertEntities:
12000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # It doesn't make sense to convert encoded characters to
12010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # entities even while you're converting entities to Unicode.
12020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # Just convert it all to Unicode.
12030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.smartQuotesTo = None
12040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if convertEntities == self.HTML_ENTITIES:
12050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.convertXMLEntities = False
12060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.convertHTMLEntities = True
12070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.escapeUnrecognizedEntities = True
12080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif convertEntities == self.XHTML_ENTITIES:
12090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.convertXMLEntities = True
12100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.convertHTMLEntities = True
12110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.escapeUnrecognizedEntities = False
12120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif convertEntities == self.XML_ENTITIES:
12130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.convertXMLEntities = True
12140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.convertHTMLEntities = False
12150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.escapeUnrecognizedEntities = False
12160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
12170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.convertXMLEntities = False
12180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.convertHTMLEntities = False
12190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.escapeUnrecognizedEntities = False
12200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
12210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
12220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.builder = builder(self)
12230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.reset()
12240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
12250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if hasattr(markup, 'read'):        # It's a file-type object.
12260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            markup = markup.read()
12270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.markup = markup
12280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.markupMassage = markupMassage
12290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        try:
12300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self._feed(isHTML=isHTML)
12310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        except StopParsing:
12320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            pass
12330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.markup = None                 # The markup can now be GCed.
12340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.builder = None                # So can the builder.
12350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
12360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _feed(self, inDocumentEncoding=None, isHTML=False):
12370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # Convert the document to Unicode.
12380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        markup = self.markup
12390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if isinstance(markup, unicode):
12400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if not hasattr(self, 'originalEncoding'):
12410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.originalEncoding = None
12420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
12430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            dammit = UnicodeDammit\
12440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                     (markup, [self.fromEncoding, inDocumentEncoding],
12450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                      smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
12460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            markup = dammit.unicode
12470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.originalEncoding = dammit.originalEncoding
12480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
12490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if markup:
12500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if self.markupMassage:
12510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if not isList(self.markupMassage):
12520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    self.markupMassage = self.MARKUP_MASSAGE
12530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                for fix, m in self.markupMassage:
12540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    markup = fix.sub(m, markup)
12550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # TODO: We get rid of markupMassage so that the
12560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # soup object can be deepcopied later on. Some
12570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # Python installations can't copy regexes. If anyone
12580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # was relying on the existence of markupMassage, this
12590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # might cause problems.
12600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                del(self.markupMassage)
12610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.builder.reset()
12620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
12630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.builder.feed(markup)
12640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # Close out any unfinished strings and close all the open tags.
12650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.endData()
12660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        while self.currentTag.name != self.ROOT_TAG_NAME:
12670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.popTag()
12680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
12690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def isSelfClosingTag(self, name):
12700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Returns true iff the given string is the name of a
12710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self-closing tag according to this parser."""
12720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.SELF_CLOSING_TAGS.has_key(name) \
12730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               or self.instanceSelfClosingTags.has_key(name)
12740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
12750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def reset(self):
12760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        Tag.__init__(self, self, self.ROOT_TAG_NAME)
12770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.hidden = 1
12780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.builder.reset()
12790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.currentData = []
12800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.currentTag = None
12810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.tagStack = []
12820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.quoteStack = []
12830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.pushTag(self)
12840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
12850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def popTag(self):
12860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        tag = self.tagStack.pop()
12870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # Tags with just one string-owning child get the child as a
12880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # 'string' property, so that soup.tag.string is shorthand for
12890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # soup.tag.contents[0]
12900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if len(self.currentTag.contents) == 1 and \
12910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch           isinstance(self.currentTag.contents[0], NavigableString):
12920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.currentTag.string = self.currentTag.contents[0]
12930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
12940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #print "Pop", tag.name
12950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.tagStack:
12960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.currentTag = self.tagStack[-1]
12970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.currentTag
12980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
12990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def pushTag(self, tag):
13000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #print "Push", tag.name
13010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.currentTag:
13020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.currentTag.contents.append(tag)
13030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.tagStack.append(tag)
13040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.currentTag = self.tagStack[-1]
13050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def endData(self, containerClass=NavigableString):
13070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.currentData:
13080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            currentData = u''.join(self.currentData)
13090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
13100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                not set([tag.name for tag in self.tagStack]).intersection(
13110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    self.PRESERVE_WHITESPACE_TAGS)):
13120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if '\n' in currentData:
13130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    currentData = '\n'
13140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                else:
13150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    currentData = ' '
13160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.currentData = []
13170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if self.parseOnlyThese and len(self.tagStack) <= 1 and \
13180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                   (not self.parseOnlyThese.text or \
13190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    not self.parseOnlyThese.search(currentData)):
13200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                return
13210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            o = containerClass(currentData)
13220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            o.setup(self.currentTag, self.previous)
13230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if self.previous:
13240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.previous.next = o
13250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.previous = o
13260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.currentTag.contents.append(o)
13270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _popToTag(self, name, inclusivePop=True):
13300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Pops the tag stack up to and including the most recent
13310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        instance of the given tag. If inclusivePop is false, pops the tag
13320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        stack up to but *not* including the most recent instqance of
13330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        the given tag."""
13340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #print "Popping to %s" % name
13350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if name == self.ROOT_TAG_NAME:
13360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return
13370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        numPops = 0
13390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        mostRecentTag = None
13400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        for i in range(len(self.tagStack)-1, 0, -1):
13410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if name == self.tagStack[i].name:
13420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                numPops = len(self.tagStack)-i
13430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                break
13440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not inclusivePop:
13450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            numPops = numPops - 1
13460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        for i in range(0, numPops):
13480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            mostRecentTag = self.popTag()
13490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return mostRecentTag
13500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _smartPop(self, name):
13520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """We need to pop up to the previous tag of this type, unless
13540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        one of this tag's nesting reset triggers comes between this
13550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        tag and the previous tag of this type, OR unless this tag is a
13560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        generic nesting trigger and another generic nesting trigger
13570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        comes between this tag and the previous tag of this type.
13580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        Examples:
13600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch         <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
13610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch         <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
13620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch         <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.
13630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch         <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
13650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch         <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
13660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch         <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
13670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """
13680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        nestingResetTriggers = self.NESTABLE_TAGS.get(name)
13700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        isNestable = nestingResetTriggers != None
13710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
13720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        popTo = None
13730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        inclusive = True
13740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        for i in range(len(self.tagStack)-1, 0, -1):
13750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            p = self.tagStack[i]
13760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if (not p or p.name == name) and not isNestable:
13770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                #Non-nestable tags get popped to the top or to their
13780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                #last occurance.
13790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                popTo = name
13800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                break
13810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if (nestingResetTriggers != None
13820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                and p.name in nestingResetTriggers) \
13830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                or (nestingResetTriggers == None and isResetNesting
13840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    and self.RESET_NESTING_TAGS.has_key(p.name)):
13850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                #If we encounter one of the nesting reset triggers
13870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                #peculiar to this tag, or we encounter another tag
13880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                #that causes nesting to reset, pop up to but not
13890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                #including that tag.
13900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                popTo = p.name
13910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                inclusive = False
13920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                break
13930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            p = p.parent
13940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if popTo:
13950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self._popToTag(popTo, inclusive)
13960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
13970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def unknown_starttag(self, name, attrs, selfClosing=0):
13980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #print "Start tag %s: %s" % (name, attrs)
13990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.quoteStack:
14000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            #This is not a real tag.
14010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            #print "<%s> is not real!" % name
14020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            attrs = ''.join(map(lambda(x, y): ' %s="%s"' % (x, y), attrs))
14030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.handle_data('<%s%s>' % (name, attrs))
14040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return
14050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.endData()
14060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not self.isSelfClosingTag(name) and not selfClosing:
14080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self._smartPop(name)
14090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.parseOnlyThese and len(self.tagStack) <= 1 \
14110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
14120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return
14130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        tag = Tag(self, name, attrs, self.currentTag, self.previous)
14150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.previous:
14160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.previous.next = tag
14170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.previous = tag
14180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.pushTag(tag)
14190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if selfClosing or self.isSelfClosingTag(name):
14200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.popTag()
14210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if name in self.QUOTE_TAGS:
14220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            #print "Beginning quote (%s)" % name
14230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.quoteStack.append(name)
14240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.literal = 1
14250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return tag
14260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def unknown_endtag(self, name):
14280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #print "End tag %s" % name
14290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.quoteStack and self.quoteStack[-1] != name:
14300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            #This is not a real end tag.
14310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            #print "</%s> is not real!" % name
14320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.handle_data('</%s>' % name)
14330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return
14340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.endData()
14350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self._popToTag(name)
14360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.quoteStack and self.quoteStack[-1] == name:
14370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.quoteStack.pop()
14380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.literal = (len(self.quoteStack) > 0)
14390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def handle_data(self, data):
14410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.currentData.append(data)
14420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def extractCharsetFromMeta(self, attrs):
14440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.unknown_starttag('meta', attrs)
14450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass BeautifulSoup(BeautifulStoneSoup):
14480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """This parser knows the following facts about HTML:
14500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    * Some tags have no closing tag and should be interpreted as being
14520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      closed as soon as they are encountered.
14530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    * The text inside some tags (ie. 'script') may contain tags which
14550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      are not really part of the document and which should be parsed
14560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      as text, not tags. If you want to parse the text as tags, you can
14570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      always fetch it and parse it explicitly.
14580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    * Tag nesting rules:
14600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      Most tags can't be nested at all. For instance, the occurance of
14620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      a <p> tag should implicitly close the previous <p> tag.
14630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch       <p>Para1<p>Para2
14650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        should be transformed into:
14660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch       <p>Para1</p><p>Para2
14670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      Some tags can be nested arbitrarily. For instance, the occurance
14690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      of a <blockquote> tag should _not_ implicitly close the previous
14700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      <blockquote> tag.
14710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch       Alice said: <blockquote>Bob said: <blockquote>Blah
14730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        should NOT be transformed into:
14740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch       Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah
14750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      Some tags can be nested, but the nesting is reset by the
14770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      interposition of other tags. For instance, a <tr> tag should
14780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      implicitly close the previous <tr> tag within the same <table>,
14790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      but not close a <tr> tag in another table.
14800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch       <table><tr>Blah<tr>Blah
14820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        should be transformed into:
14830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch       <table><tr>Blah</tr><tr>Blah
14840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        but,
14850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch       <tr>Blah<table><tr>Blah
14860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        should NOT be transformed into
14870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch       <tr>Blah<table></tr><tr>Blah
14880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    Differing assumptions about tag nesting rules are a major source
14900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    of problems with the BeautifulSoup class. If BeautifulSoup is not
14910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    treating as nestable a tag your page author treats as nestable,
14920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    try ICantBelieveItsBeautifulSoup, MinimalSoup, or
14930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    BeautifulStoneSoup before writing your own subclass."""
14940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
14950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __init__(self, *args, **kwargs):
14960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not kwargs.has_key('smartQuotesTo'):
14970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            kwargs['smartQuotesTo'] = self.HTML_ENTITIES
14980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        kwargs['isHTML'] = True
14990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        BeautifulStoneSoup.__init__(self, *args, **kwargs)
15000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    SELF_CLOSING_TAGS = buildTagMap(None,
15020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                    ['br' , 'hr', 'input', 'img', 'meta',
15030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                    'spacer', 'link', 'frame', 'base'])
15040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
15060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    QUOTE_TAGS = {'script' : None, 'textarea' : None}
15080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #According to the HTML standard, each of these inline tags can
15100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #contain another tag of the same type. Furthermore, it's common
15110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #to actually use these tags this way.
15120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    NESTABLE_INLINE_TAGS = ['span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
15130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                            'center']
15140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #According to the HTML standard, these block tags can contain
15160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #another tag of the same type. Furthermore, it's common
15170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #to actually use these tags this way.
15180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
15190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #Lists can contain other lists, but there are restrictions.
15210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    NESTABLE_LIST_TAGS = { 'ol' : [],
15220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           'ul' : [],
15230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           'li' : ['ul', 'ol'],
15240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           'dl' : [],
15250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           'dd' : ['dl'],
15260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           'dt' : ['dl'] }
15270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #Tables can contain other tables, but there are restrictions.
15290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    NESTABLE_TABLE_TAGS = {'table' : [],
15300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           'tr' : ['table', 'tbody', 'tfoot', 'thead'],
15310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           'td' : ['tr'],
15320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           'th' : ['tr'],
15330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           'thead' : ['table'],
15340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           'tbody' : ['table'],
15350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           'tfoot' : ['table'],
15360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                           }
15370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre']
15390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #If one of these tags is encountered, all tags up to the next tag of
15410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    #this type are popped.
15420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
15430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                     NON_NESTABLE_BLOCK_TAGS,
15440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                     NESTABLE_LIST_TAGS,
15450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                     NESTABLE_TABLE_TAGS)
15460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
15480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)
15490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # Used to detect the charset in a META tag; see start_meta
15510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
15520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def extractCharsetFromMeta(self, attrs):
15540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Beautiful Soup can detect a charset included in a META tag,
15550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        try to convert the document to that charset, and re-parse the
15560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        document from the beginning."""
15570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        httpEquiv = None
15580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        contentType = None
15590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        contentTypeIndex = None
15600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        tagNeedsEncodingSubstitution = False
15610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        for i in range(0, len(attrs)):
15630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            key, value = attrs[i]
15640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            key = key.lower()
15650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if key == 'http-equiv':
15660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                httpEquiv = value
15670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif key == 'content':
15680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                contentType = value
15690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                contentTypeIndex = i
15700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
15710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if httpEquiv and contentType: # It's an interesting meta tag.
15720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            match = self.CHARSET_RE.search(contentType)
15730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if match:
15740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if (self.declaredHTMLEncoding is not None or
15750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    self.originalEncoding == self.fromEncoding):
15760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # An HTML encoding was sniffed while converting
15770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # the document to Unicode, or an HTML encoding was
15780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # sniffed during a previous pass through the
15790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # document, or an encoding was specified
15800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # explicitly and it worked. Rewrite the meta tag.
15810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    def rewrite(match):
15820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        return match.group(1) + "%SOUP-ENCODING%"
15830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    newAttr = self.CHARSET_RE.sub(rewrite, contentType)
15840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
15850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                               newAttr)
15860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    tagNeedsEncodingSubstitution = True
15870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                else:
15880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # This is our first pass through the document.
15890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    # Go through it again with the encoding information.
15900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    newCharset = match.group(3)
15910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    if newCharset and newCharset != self.originalEncoding:
15920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        self.declaredHTMLEncoding = newCharset
15930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        self._feed(self.declaredHTMLEncoding)
15940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        raise StopParsing
15950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    pass
15960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        tag = self.unknown_starttag("meta", attrs)
15970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if tag and tagNeedsEncodingSubstitution:
15980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            tag.containsSubstitutions = True
15990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass StopParsing(Exception):
16020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    pass
16030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass ICantBelieveItsBeautifulSoup(BeautifulSoup):
16050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """The BeautifulSoup class is oriented towards skipping over
16070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    common HTML errors like unclosed tags. However, sometimes it makes
16080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    errors of its own. For instance, consider this fragment:
16090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch     <b>Foo<b>Bar</b></b>
16110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    This is perfectly valid (if bizarre) HTML. However, the
16130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    BeautifulSoup class will implicitly close the first b tag when it
16140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    encounters the second 'b'. It will think the author wrote
16150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    "<b>Foo<b>Bar", and didn't close the first 'b' tag, because
16160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    there's no real-world reason to bold something that's already
16170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    bold. When it encounters '</b></b>' it will close two more 'b'
16180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    tags, for a grand total of three tags closed instead of two. This
16190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    can throw off the rest of your document structure. The same is
16200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    true of a number of other tags, listed below.
16210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    It's much more common for someone to forget to close a 'b' tag
16230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    than to actually use nested 'b' tags, and the BeautifulSoup class
16240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    handles the common case. This class handles the not-co-common
16250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    case: where you can't believe someone wrote what they did, but
16260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    it's valid HTML and BeautifulSoup screwed up by assuming it
16270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    wouldn't be."""
16280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \
16300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch     ['em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong',
16310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      'cite', 'code', 'dfn', 'kbd', 'samp', 'strong', 'var', 'b',
16320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch      'big']
16330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ['noscript']
16350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS,
16370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS,
16380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS)
16390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass MinimalSoup(BeautifulSoup):
16410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """The MinimalSoup class is for parsing HTML that contains
16420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    pathologically bad markup. It makes no assumptions about tag
16430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    nesting, but it does know which tags are self-closing, that
16440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    <script> tags contain Javascript and should not be parsed, that
16450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    META tags may contain encoding information, and so on.
16460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    This also makes it better for subclassing than BeautifulStoneSoup
16480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    or BeautifulSoup."""
16490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    RESET_NESTING_TAGS = buildTagMap('noscript')
16510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    NESTABLE_TAGS = {}
16520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass BeautifulSOAP(BeautifulStoneSoup):
16540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """This class will push a tag with only a single string child into
16550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    the tag's parent as an attribute. The attribute's name is the tag
16560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    name, and the value is the string child. An example should give
16570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    the flavor of the change:
16580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    <foo><bar>baz</bar></foo>
16600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch     =>
16610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    <foo bar="baz"><bar>baz</bar></foo>
16620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    You can then access fooTag['bar'] instead of fooTag.barTag.string.
16640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    This is, of course, useful for scraping structures that tend to
16660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    use subelements instead of attributes, such as SOAP messages. Note
16670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    that it modifies its input, so don't print the modified version
16680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    out.
16690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    I'm not sure how many people really want to use this class; let me
16710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    know if you do. Mainly I like the name."""
16720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def popTag(self):
16740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if len(self.tagStack) > 1:
16750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            tag = self.tagStack[-1]
16760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            parent = self.tagStack[-2]
16770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            parent._getAttrMap()
16780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if (isinstance(tag, Tag) and len(tag.contents) == 1 and
16790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                isinstance(tag.contents[0], NavigableString) and
16800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                not parent.attrMap.has_key(tag.name)):
16810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                parent[tag.name] = tag.contents[0]
16820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        BeautifulStoneSoup.popTag(self)
16830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
16840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#Enterprise class names! It has come to our attention that some people
16850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#think the names of the Beautiful Soup parser classes are too silly
16860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#and "unprofessional" for use in enterprise screen-scraping. We feel
16870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#your pain! For such-minded folk, the Beautiful Soup Consortium And
16880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#All-Night Kosher Bakery recommends renaming this file to
16890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#"RobustParser.py" (or, in cases of extreme enterprisiness,
16900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#"RobustParserBeanInterface.class") and using the following
16910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#enterprise-friendly class aliases:
16920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass RobustXMLParser(BeautifulStoneSoup):
16930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    pass
16940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass RobustHTMLParser(BeautifulSoup):
16950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    pass
16960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup):
16970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    pass
16980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass RobustInsanelyWackAssHTMLParser(MinimalSoup):
16990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    pass
17000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass SimplifyingSOAPParser(BeautifulSOAP):
17010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    pass
17020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch######################################################
17040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#
17050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# Bonus library: Unicode, Dammit
17060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#
17070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# This class forces XML data into a standard format (usually to UTF-8
17080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# or Unicode).  It is heavily based on code from Mark Pilgrim's
17090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# Universal Feed Parser. It does not rewrite the XML or HTML to
17100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi
17110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# (XML) and BeautifulSoup.start_meta (HTML).
17120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# Autodetects character encodings.
17140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# Download from http://chardet.feedparser.org/
17150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochtry:
17160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    import chardet
17170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#    import chardet.constants
17180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#    chardet.constants._debug = 1
17190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochexcept ImportError:
17200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    chardet = None
17210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# cjkcodecs and iconv_codec make Python know about more character encodings.
17230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# Both are available from http://cjkpython.i18n.org/
17240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch# They're built in if you use Python 2.4.
17250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochtry:
17260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    import cjkcodecs.aliases
17270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochexcept ImportError:
17280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    pass
17290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochtry:
17300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    import iconv_codec
17310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochexcept ImportError:
17320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    pass
17330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochclass UnicodeDammit:
17350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    """A class for detecting the encoding of a *ML document and
17360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    converting it to a Unicode string. If the source encoding is
17370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    windows-1252, can replace MS smart quotes with their HTML or XML
17380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    equivalents."""
17390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # This dictionary maps commonly seen values for "charset" in HTML
17410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # meta tags to the corresponding Python codec names. It only covers
17420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # values that aren't in Python's aliases and can't be determined
17430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    # by the heuristics in find_codec.
17440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    CHARSET_ALIASES = { "macintosh" : "mac-roman",
17450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                        "x-sjis" : "shift-jis" }
17460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def __init__(self, markup, overrideEncodings=[],
17480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 smartQuotesTo='xml', isHTML=False):
17490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.declaredHTMLEncoding = None
17500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.markup, documentEncoding, sniffedEncoding = \
17510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                     self._detectEncoding(markup, isHTML)
17520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.smartQuotesTo = smartQuotesTo
17530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.triedEncodings = []
17540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if markup == '' or isinstance(markup, unicode):
17550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.originalEncoding = None
17560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.unicode = unicode(markup)
17570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return
17580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        u = None
17600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        for proposedEncoding in overrideEncodings:
17610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            u = self._convertFrom(proposedEncoding)
17620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if u: break
17630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not u:
17640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            for proposedEncoding in (documentEncoding, sniffedEncoding):
17650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                u = self._convertFrom(proposedEncoding)
17660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if u: break
17670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # If no luck and we have auto-detection library, try that:
17690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not u and chardet and not isinstance(self.markup, unicode):
17700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            u = self._convertFrom(chardet.detect(self.markup)['encoding'])
17710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # As a last resort, try utf-8 and windows-1252:
17730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not u:
17740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            for proposed_encoding in ("utf-8", "windows-1252"):
17750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                u = self._convertFrom(proposed_encoding)
17760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                if u: break
17770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.unicode = u
17790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not u: self.originalEncoding = None
17800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _subMSChar(self, match):
17820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Changes a MS smart quote character to an XML or HTML
17830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        entity."""
17840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        orig = match.group(1)
17850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        sub = self.MS_CHARS.get(orig)
17860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if type(sub) == types.TupleType:
17870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if self.smartQuotesTo == 'xml':
17880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sub = '&#x'.encode() + sub[1].encode() + ';'.encode()
17890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            else:
17900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sub = '&'.encode() + sub[0].encode() + ';'.encode()
17910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        else:
17920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            sub = sub.encode()
17930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return sub
17940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
17950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _convertFrom(self, proposed):
17960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        proposed = self.find_codec(proposed)
17970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not proposed or proposed in self.triedEncodings:
17980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return None
17990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        self.triedEncodings.append(proposed)
18000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        markup = self.markup
18010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
18020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # Convert smart quotes to HTML if coming from an encoding
18030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # that might have them.
18040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if self.smartQuotesTo and proposed.lower() in("windows-1252",
18050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                                      "iso-8859-1",
18060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                                      "iso-8859-2"):
18070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            smart_quotes_re = "([\x80-\x9f])"
18080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            smart_quotes_compiled = re.compile(smart_quotes_re)
18090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            markup = smart_quotes_compiled.sub(self._subMSChar, markup)
18100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
18110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        try:
18120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # print "Trying to convert document to %s" % proposed
18130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            u = self._toUnicode(markup, proposed)
18140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.markup = u
18150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            self.originalEncoding = proposed
18160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        except Exception, e:
18170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # print "That didn't work!"
18180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            # print e
18190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            return None
18200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        #print "Correct encoding: %s" % proposed
18210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self.markup
18220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
18230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _toUnicode(self, data, encoding):
18240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        '''Given a string and its encoding, decodes the string into Unicode.
18250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        %encoding is a string recognized by encodings.aliases'''
18260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
18270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        # strip Byte Order Mark (if present)
18280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
18290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               and (data[2:4] != '\x00\x00'):
18300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            encoding = 'utf-16be'
18310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            data = data[2:]
18320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
18330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 and (data[2:4] != '\x00\x00'):
18340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            encoding = 'utf-16le'
18350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            data = data[2:]
18360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif data[:3] == '\xef\xbb\xbf':
18370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            encoding = 'utf-8'
18380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            data = data[3:]
18390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif data[:4] == '\x00\x00\xfe\xff':
18400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            encoding = 'utf-32be'
18410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            data = data[4:]
18420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        elif data[:4] == '\xff\xfe\x00\x00':
18430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            encoding = 'utf-32le'
18440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            data = data[4:]
18450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        newdata = unicode(data, encoding)
18460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return newdata
18470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
18480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _detectEncoding(self, xml_data, isHTML=False):
18490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        """Given a document, tries to detect its XML encoding."""
18500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        xml_encoding = sniffed_xml_encoding = None
18510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        try:
18520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if xml_data[:4] == '\x4c\x6f\xa7\x94':
18530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # EBCDIC
18540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                xml_data = self._ebcdic_to_ascii(xml_data)
18550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif xml_data[:4] == '\x00\x3c\x00\x3f':
18560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # UTF-16BE
18570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sniffed_xml_encoding = 'utf-16be'
18580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
18590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
18600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                     and (xml_data[2:4] != '\x00\x00'):
18610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # UTF-16BE with BOM
18620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sniffed_xml_encoding = 'utf-16be'
18630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
18640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif xml_data[:4] == '\x3c\x00\x3f\x00':
18650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # UTF-16LE
18660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sniffed_xml_encoding = 'utf-16le'
18670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
18680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
18690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                     (xml_data[2:4] != '\x00\x00'):
18700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # UTF-16LE with BOM
18710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sniffed_xml_encoding = 'utf-16le'
18720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
18730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif xml_data[:4] == '\x00\x00\x00\x3c':
18740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # UTF-32BE
18750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sniffed_xml_encoding = 'utf-32be'
18760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
18770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif xml_data[:4] == '\x3c\x00\x00\x00':
18780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # UTF-32LE
18790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sniffed_xml_encoding = 'utf-32le'
18800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
18810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif xml_data[:4] == '\x00\x00\xfe\xff':
18820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # UTF-32BE with BOM
18830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sniffed_xml_encoding = 'utf-32be'
18840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
18850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif xml_data[:4] == '\xff\xfe\x00\x00':
18860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # UTF-32LE with BOM
18870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sniffed_xml_encoding = 'utf-32le'
18880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
18890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            elif xml_data[:3] == '\xef\xbb\xbf':
18900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                # UTF-8 with BOM
18910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sniffed_xml_encoding = 'utf-8'
18920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
18930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            else:
18940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                sniffed_xml_encoding = 'ascii'
18950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                pass
18960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        except:
18970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            xml_encoding_match = None
18980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        xml_encoding_re = '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'.encode()
18990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        xml_encoding_match = re.compile(xml_encoding_re).match(xml_data)
19000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not xml_encoding_match and isHTML:
19010bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            meta_re = '<\s*meta[^>]+charset=([^>]*?)[;\'">]'.encode()
19020bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            regexp = re.compile(meta_re, re.I)
19030bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            xml_encoding_match = regexp.search(xml_data)
19040bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if xml_encoding_match is not None:
19050bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            xml_encoding = xml_encoding_match.groups()[0].decode(
19060bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                'ascii').lower()
19070bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if isHTML:
19080bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                self.declaredHTMLEncoding = xml_encoding
19090bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            if sniffed_xml_encoding and \
19100bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
19110bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                 'iso-10646-ucs-4', 'ucs-4', 'csucs4',
19120bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                 'utf-16', 'utf-32', 'utf_16', 'utf_32',
19130bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                                 'utf16', 'u16')):
19140bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                xml_encoding = sniffed_xml_encoding
19150bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return xml_data, xml_encoding, sniffed_xml_encoding
19160bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
19170bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
19180bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def find_codec(self, charset):
19190bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
19200bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               or (charset and self._codec(charset.replace("-", ""))) \
19210bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               or (charset and self._codec(charset.replace("-", "_"))) \
19220bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch               or charset
19230bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
19240bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _codec(self, charset):
19250bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not charset: return charset
19260bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        codec = None
19270bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        try:
19280bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            codecs.lookup(charset)
19290bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            codec = charset
19300bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        except (LookupError, ValueError):
19310bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            pass
19320bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return codec
19330bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
19340bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    EBCDIC_TO_ASCII_MAP = None
19350bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    def _ebcdic_to_ascii(self, s):
19360bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        c = self.__class__
19370bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        if not c.EBCDIC_TO_ASCII_MAP:
19380bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
19390bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
19400bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
19410bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
19420bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
19430bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
19440bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
19450bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
19460bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
19470bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    201,202,106,107,108,109,110,111,112,113,114,203,204,205,
19480bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    206,207,208,209,126,115,116,117,118,119,120,121,122,210,
19490bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    211,212,213,214,215,216,217,218,219,220,221,222,223,224,
19500bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
19510bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
19520bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
19530bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
19540bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                    250,251,252,253,254,255)
19550bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            import string
19560bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            c.EBCDIC_TO_ASCII_MAP = string.maketrans( \
19570bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch            ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
19580bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch        return s.translate(c.EBCDIC_TO_ASCII_MAP)
19590bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
19600bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    MS_CHARS = { '\x80' : ('euro', '20AC'),
19610bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x81' : ' ',
19620bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x82' : ('sbquo', '201A'),
19630bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x83' : ('fnof', '192'),
19640bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x84' : ('bdquo', '201E'),
19650bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x85' : ('hellip', '2026'),
19660bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x86' : ('dagger', '2020'),
19670bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x87' : ('Dagger', '2021'),
19680bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x88' : ('circ', '2C6'),
19690bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x89' : ('permil', '2030'),
19700bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x8A' : ('Scaron', '160'),
19710bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x8B' : ('lsaquo', '2039'),
19720bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x8C' : ('OElig', '152'),
19730bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x8D' : '?',
19740bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x8E' : ('#x17D', '17D'),
19750bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x8F' : '?',
19760bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x90' : '?',
19770bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x91' : ('lsquo', '2018'),
19780bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x92' : ('rsquo', '2019'),
19790bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x93' : ('ldquo', '201C'),
19800bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x94' : ('rdquo', '201D'),
19810bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x95' : ('bull', '2022'),
19820bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x96' : ('ndash', '2013'),
19830bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x97' : ('mdash', '2014'),
19840bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x98' : ('tilde', '2DC'),
19850bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x99' : ('trade', '2122'),
19860bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x9a' : ('scaron', '161'),
19870bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x9b' : ('rsaquo', '203A'),
19880bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x9c' : ('oelig', '153'),
19890bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x9d' : '?',
19900bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x9e' : ('#x17E', '17E'),
19910bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch                 '\x9f' : ('Yuml', ''),}
19920bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
19930bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#######################################################################
19940bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
19950bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch
19960bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch#By default, act as an HTML pretty-printer.
19970bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdochif __name__ == '__main__':
19980bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    import sys
19990bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    soup = BeautifulSoup(sys.stdin)
20000bf48ef3be53ddaa52bbead65dfd75bf90e7a2b5Ben Murdoch    print soup.prettify()
2001