1e64493206b76bce6e3af0598790ec076094e8c37Simran Basi"""Beautiful Soup 2e64493206b76bce6e3af0598790ec076094e8c37Simran BasiElixir and Tonic 3e64493206b76bce6e3af0598790ec076094e8c37Simran Basi"The Screen-Scraper's Friend" 4e64493206b76bce6e3af0598790ec076094e8c37Simran Basihttp://www.crummy.com/software/BeautifulSoup/ 5e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 6e64493206b76bce6e3af0598790ec076094e8c37Simran BasiBeautiful Soup parses a (possibly invalid) XML or HTML document into a 7e64493206b76bce6e3af0598790ec076094e8c37Simran Basitree representation. It provides methods and Pythonic idioms that make 8e64493206b76bce6e3af0598790ec076094e8c37Simran Basiit easy to navigate, search, and modify the tree. 9e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 10e64493206b76bce6e3af0598790ec076094e8c37Simran BasiA well-formed XML/HTML document yields a well-formed data 11e64493206b76bce6e3af0598790ec076094e8c37Simran Basistructure. An ill-formed XML/HTML document yields a correspondingly 12e64493206b76bce6e3af0598790ec076094e8c37Simran Basiill-formed data structure. If your document is only locally 13e64493206b76bce6e3af0598790ec076094e8c37Simran Basiwell-formed, you can use this library to find and process the 14e64493206b76bce6e3af0598790ec076094e8c37Simran Basiwell-formed part of it. 15e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 16e64493206b76bce6e3af0598790ec076094e8c37Simran BasiBeautiful Soup works with Python 2.2 and up. It has no external 17e64493206b76bce6e3af0598790ec076094e8c37Simran Basidependencies, but you'll have more success at converting data to UTF-8 18e64493206b76bce6e3af0598790ec076094e8c37Simran Basiif you also install these three packages: 19e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 20e64493206b76bce6e3af0598790ec076094e8c37Simran Basi* chardet, for auto-detecting character encodings 21e64493206b76bce6e3af0598790ec076094e8c37Simran Basi http://chardet.feedparser.org/ 22e64493206b76bce6e3af0598790ec076094e8c37Simran Basi* cjkcodecs and iconv_codec, which add more encodings to the ones supported 23e64493206b76bce6e3af0598790ec076094e8c37Simran Basi by stock Python. 24e64493206b76bce6e3af0598790ec076094e8c37Simran Basi http://cjkpython.i18n.org/ 25e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 26e64493206b76bce6e3af0598790ec076094e8c37Simran BasiBeautiful Soup defines classes for two main parsing strategies: 27e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 28e64493206b76bce6e3af0598790ec076094e8c37Simran Basi * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific 29e64493206b76bce6e3af0598790ec076094e8c37Simran Basi language that kind of looks like XML. 30e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 31e64493206b76bce6e3af0598790ec076094e8c37Simran Basi * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid 32e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or invalid. This class has web browser-like heuristics for 33e64493206b76bce6e3af0598790ec076094e8c37Simran Basi obtaining a sensible parse tree in the face of common HTML errors. 34e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 35e64493206b76bce6e3af0598790ec076094e8c37Simran BasiBeautiful Soup also defines a class (UnicodeDammit) for autodetecting 36e64493206b76bce6e3af0598790ec076094e8c37Simran Basithe encoding of an HTML or XML document, and converting it to 37e64493206b76bce6e3af0598790ec076094e8c37Simran BasiUnicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser. 38e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 39e64493206b76bce6e3af0598790ec076094e8c37Simran BasiFor more than you ever wanted to know about Beautiful Soup, see the 40e64493206b76bce6e3af0598790ec076094e8c37Simran Basidocumentation: 41e64493206b76bce6e3af0598790ec076094e8c37Simran Basihttp://www.crummy.com/software/BeautifulSoup/documentation.html 42e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 43e64493206b76bce6e3af0598790ec076094e8c37Simran BasiHere, have some legalese: 44e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 45e64493206b76bce6e3af0598790ec076094e8c37Simran BasiCopyright (c) 2004-2010, Leonard Richardson 46e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 47e64493206b76bce6e3af0598790ec076094e8c37Simran BasiAll rights reserved. 48e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 49e64493206b76bce6e3af0598790ec076094e8c37Simran BasiRedistribution and use in source and binary forms, with or without 50e64493206b76bce6e3af0598790ec076094e8c37Simran Basimodification, are permitted provided that the following conditions are 51e64493206b76bce6e3af0598790ec076094e8c37Simran Basimet: 52e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 53e64493206b76bce6e3af0598790ec076094e8c37Simran Basi * Redistributions of source code must retain the above copyright 54e64493206b76bce6e3af0598790ec076094e8c37Simran Basi notice, this list of conditions and the following disclaimer. 55e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 56e64493206b76bce6e3af0598790ec076094e8c37Simran Basi * Redistributions in binary form must reproduce the above 57e64493206b76bce6e3af0598790ec076094e8c37Simran Basi copyright notice, this list of conditions and the following 58e64493206b76bce6e3af0598790ec076094e8c37Simran Basi disclaimer in the documentation and/or other materials provided 59e64493206b76bce6e3af0598790ec076094e8c37Simran Basi with the distribution. 60e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 61e64493206b76bce6e3af0598790ec076094e8c37Simran Basi * Neither the name of the the Beautiful Soup Consortium and All 62e64493206b76bce6e3af0598790ec076094e8c37Simran Basi Night Kosher Bakery nor the names of its contributors may be 63e64493206b76bce6e3af0598790ec076094e8c37Simran Basi used to endorse or promote products derived from this software 64e64493206b76bce6e3af0598790ec076094e8c37Simran Basi without specific prior written permission. 65e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 66e64493206b76bce6e3af0598790ec076094e8c37Simran BasiTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 67e64493206b76bce6e3af0598790ec076094e8c37Simran Basi"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 68e64493206b76bce6e3af0598790ec076094e8c37Simran BasiLIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 69e64493206b76bce6e3af0598790ec076094e8c37Simran BasiA PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR 70e64493206b76bce6e3af0598790ec076094e8c37Simran BasiCONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 71e64493206b76bce6e3af0598790ec076094e8c37Simran BasiEXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 72e64493206b76bce6e3af0598790ec076094e8c37Simran BasiPROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 73e64493206b76bce6e3af0598790ec076094e8c37Simran BasiPROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 74e64493206b76bce6e3af0598790ec076094e8c37Simran BasiLIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 75e64493206b76bce6e3af0598790ec076094e8c37Simran BasiNEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 76e64493206b76bce6e3af0598790ec076094e8c37Simran BasiSOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT. 77e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 78e64493206b76bce6e3af0598790ec076094e8c37Simran Basi""" 79e64493206b76bce6e3af0598790ec076094e8c37Simran Basifrom __future__ import generators 80e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 81e64493206b76bce6e3af0598790ec076094e8c37Simran Basi__author__ = "Leonard Richardson (leonardr@segfault.org)" 82e64493206b76bce6e3af0598790ec076094e8c37Simran Basi__version__ = "3.2.1" 83e64493206b76bce6e3af0598790ec076094e8c37Simran Basi__copyright__ = "Copyright (c) 2004-2012 Leonard Richardson" 84e64493206b76bce6e3af0598790ec076094e8c37Simran Basi__license__ = "New-style BSD" 85e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 86e64493206b76bce6e3af0598790ec076094e8c37Simran Basifrom sgmllib import SGMLParser, SGMLParseError 87e64493206b76bce6e3af0598790ec076094e8c37Simran Basiimport codecs 88e64493206b76bce6e3af0598790ec076094e8c37Simran Basiimport markupbase 89e64493206b76bce6e3af0598790ec076094e8c37Simran Basiimport types 90e64493206b76bce6e3af0598790ec076094e8c37Simran Basiimport re 91e64493206b76bce6e3af0598790ec076094e8c37Simran Basiimport sgmllib 92e64493206b76bce6e3af0598790ec076094e8c37Simran Basitry: 93e64493206b76bce6e3af0598790ec076094e8c37Simran Basi from htmlentitydefs import name2codepoint 94e64493206b76bce6e3af0598790ec076094e8c37Simran Basiexcept ImportError: 95e64493206b76bce6e3af0598790ec076094e8c37Simran Basi name2codepoint = {} 96e64493206b76bce6e3af0598790ec076094e8c37Simran Basitry: 97e64493206b76bce6e3af0598790ec076094e8c37Simran Basi set 98e64493206b76bce6e3af0598790ec076094e8c37Simran Basiexcept NameError: 99e64493206b76bce6e3af0598790ec076094e8c37Simran Basi from sets import Set as set 100e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 101e64493206b76bce6e3af0598790ec076094e8c37Simran Basi#These hacks make Beautiful Soup able to parse XML with namespaces 102e64493206b76bce6e3af0598790ec076094e8c37Simran Basisgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*') 103e64493206b76bce6e3af0598790ec076094e8c37Simran Basimarkupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match 104e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 105e64493206b76bce6e3af0598790ec076094e8c37Simran BasiDEFAULT_OUTPUT_ENCODING = "utf-8" 106e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 107e64493206b76bce6e3af0598790ec076094e8c37Simran Basidef _match_css_class(str): 108e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Build a RE to match the given CSS class.""" 109e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return re.compile(r"(^|.*\s)%s($|\s)" % str) 110e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 111e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# First, the classes that represent markup elements. 112e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 113e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass PageElement(object): 114e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Contains the navigational information for some part of the page 115e64493206b76bce6e3af0598790ec076094e8c37Simran Basi (either a tag or a piece of text)""" 116e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 117e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _invert(h): 118e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "Cheap function to invert a hash." 119e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = {} 120e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for k,v in h.items(): 121e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i[v] = k 122e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return i 123e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 124e64493206b76bce6e3af0598790ec076094e8c37Simran Basi XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'", 125e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "quot" : '"', 126e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "amp" : "&", 127e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "lt" : "<", 128e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "gt" : ">" } 129e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 130e64493206b76bce6e3af0598790ec076094e8c37Simran Basi XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS) 131e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 132e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def setup(self, parent=None, previous=None): 133e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Sets up the initial relations between this element and 134e64493206b76bce6e3af0598790ec076094e8c37Simran Basi other elements.""" 135e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.parent = parent 136e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previous = previous 137e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.next = None 138e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previousSibling = None 139e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.nextSibling = None 140e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.parent and self.parent.contents: 141e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previousSibling = self.parent.contents[-1] 142e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previousSibling.nextSibling = self 143e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 144e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def replaceWith(self, replaceWith): 145e64493206b76bce6e3af0598790ec076094e8c37Simran Basi oldParent = self.parent 146e64493206b76bce6e3af0598790ec076094e8c37Simran Basi myIndex = self.parent.index(self) 147e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if hasattr(replaceWith, "parent")\ 148e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and replaceWith.parent is self.parent: 149e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # We're replacing this element with one of its siblings. 150e64493206b76bce6e3af0598790ec076094e8c37Simran Basi index = replaceWith.parent.index(replaceWith) 151e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if index and index < myIndex: 152e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Furthermore, it comes before this element. That 153e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # means that when we extract it, the index of this 154e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # element will change. 155e64493206b76bce6e3af0598790ec076094e8c37Simran Basi myIndex = myIndex - 1 156e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.extract() 157e64493206b76bce6e3af0598790ec076094e8c37Simran Basi oldParent.insert(myIndex, replaceWith) 158e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 159e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def replaceWithChildren(self): 160e64493206b76bce6e3af0598790ec076094e8c37Simran Basi myParent = self.parent 161e64493206b76bce6e3af0598790ec076094e8c37Simran Basi myIndex = self.parent.index(self) 162e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.extract() 163e64493206b76bce6e3af0598790ec076094e8c37Simran Basi reversedChildren = list(self.contents) 164e64493206b76bce6e3af0598790ec076094e8c37Simran Basi reversedChildren.reverse() 165e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for child in reversedChildren: 166e64493206b76bce6e3af0598790ec076094e8c37Simran Basi myParent.insert(myIndex, child) 167e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 168e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def extract(self): 169e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Destructively rips this element out of the tree.""" 170e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.parent: 171e64493206b76bce6e3af0598790ec076094e8c37Simran Basi try: 172e64493206b76bce6e3af0598790ec076094e8c37Simran Basi del self.parent.contents[self.parent.index(self)] 173e64493206b76bce6e3af0598790ec076094e8c37Simran Basi except ValueError: 174e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 175e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 176e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #Find the two elements that would be next to each other if 177e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #this element (and any children) hadn't been parsed. Connect 178e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #the two. 179e64493206b76bce6e3af0598790ec076094e8c37Simran Basi lastChild = self._lastRecursiveChild() 180e64493206b76bce6e3af0598790ec076094e8c37Simran Basi nextElement = lastChild.next 181e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 182e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.previous: 183e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previous.next = nextElement 184e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if nextElement: 185e64493206b76bce6e3af0598790ec076094e8c37Simran Basi nextElement.previous = self.previous 186e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previous = None 187e64493206b76bce6e3af0598790ec076094e8c37Simran Basi lastChild.next = None 188e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 189e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.parent = None 190e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.previousSibling: 191e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previousSibling.nextSibling = self.nextSibling 192e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.nextSibling: 193e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.nextSibling.previousSibling = self.previousSibling 194e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previousSibling = self.nextSibling = None 195e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self 196e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 197e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _lastRecursiveChild(self): 198e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "Finds the last element beneath this object to be parsed." 199e64493206b76bce6e3af0598790ec076094e8c37Simran Basi lastChild = self 200e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while hasattr(lastChild, 'contents') and lastChild.contents: 201e64493206b76bce6e3af0598790ec076094e8c37Simran Basi lastChild = lastChild.contents[-1] 202e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return lastChild 203e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 204e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def insert(self, position, newChild): 205e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(newChild, basestring) \ 206e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and not isinstance(newChild, NavigableString): 207e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild = NavigableString(newChild) 208e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 209e64493206b76bce6e3af0598790ec076094e8c37Simran Basi position = min(position, len(self.contents)) 210e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if hasattr(newChild, 'parent') and newChild.parent is not None: 211e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # We're 'inserting' an element that's already one 212e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # of this object's children. 213e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if newChild.parent is self: 214e64493206b76bce6e3af0598790ec076094e8c37Simran Basi index = self.index(newChild) 215e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if index > position: 216e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Furthermore we're moving it further down the 217e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # list of this object's children. That means that 218e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # when we extract this element, our target index 219e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # will jump down one. 220e64493206b76bce6e3af0598790ec076094e8c37Simran Basi position = position - 1 221e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild.extract() 222e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 223e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild.parent = self 224e64493206b76bce6e3af0598790ec076094e8c37Simran Basi previousChild = None 225e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if position == 0: 226e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild.previousSibling = None 227e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild.previous = self 228e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 229e64493206b76bce6e3af0598790ec076094e8c37Simran Basi previousChild = self.contents[position-1] 230e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild.previousSibling = previousChild 231e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild.previousSibling.nextSibling = newChild 232e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild.previous = previousChild._lastRecursiveChild() 233e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if newChild.previous: 234e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild.previous.next = newChild 235e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 236e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChildsLastElement = newChild._lastRecursiveChild() 237e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 238e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if position >= len(self.contents): 239e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild.nextSibling = None 240e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 241e64493206b76bce6e3af0598790ec076094e8c37Simran Basi parent = self 242e64493206b76bce6e3af0598790ec076094e8c37Simran Basi parentsNextSibling = None 243e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while not parentsNextSibling: 244e64493206b76bce6e3af0598790ec076094e8c37Simran Basi parentsNextSibling = parent.nextSibling 245e64493206b76bce6e3af0598790ec076094e8c37Simran Basi parent = parent.parent 246e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not parent: # This is the last element in the document. 247e64493206b76bce6e3af0598790ec076094e8c37Simran Basi break 248e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if parentsNextSibling: 249e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChildsLastElement.next = parentsNextSibling 250e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 251e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChildsLastElement.next = None 252e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 253e64493206b76bce6e3af0598790ec076094e8c37Simran Basi nextChild = self.contents[position] 254e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild.nextSibling = nextChild 255e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if newChild.nextSibling: 256e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChild.nextSibling.previousSibling = newChild 257e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChildsLastElement.next = nextChild 258e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 259e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if newChildsLastElement.next: 260e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newChildsLastElement.next.previous = newChildsLastElement 261e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.contents.insert(position, newChild) 262e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 263e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def append(self, tag): 264e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Appends the given tag to the contents of this tag.""" 265e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.insert(len(self.contents), tag) 266e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 267e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def findNext(self, name=None, attrs={}, text=None, **kwargs): 268e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns the first item that matches the given criteria and 269e64493206b76bce6e3af0598790ec076094e8c37Simran Basi appears after this Tag in the document.""" 270e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._findOne(self.findAllNext, name, attrs, text, **kwargs) 271e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 272e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def findAllNext(self, name=None, attrs={}, text=None, limit=None, 273e64493206b76bce6e3af0598790ec076094e8c37Simran Basi **kwargs): 274e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns all items that match the given criteria and appear 275e64493206b76bce6e3af0598790ec076094e8c37Simran Basi after this Tag in the document.""" 276e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._findAll(name, attrs, text, limit, self.nextGenerator, 277e64493206b76bce6e3af0598790ec076094e8c37Simran Basi **kwargs) 278e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 279e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def findNextSibling(self, name=None, attrs={}, text=None, **kwargs): 280e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns the closest sibling to this Tag that matches the 281e64493206b76bce6e3af0598790ec076094e8c37Simran Basi given criteria and appears after this Tag in the document.""" 282e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._findOne(self.findNextSiblings, name, attrs, text, 283e64493206b76bce6e3af0598790ec076094e8c37Simran Basi **kwargs) 284e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 285e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def findNextSiblings(self, name=None, attrs={}, text=None, limit=None, 286e64493206b76bce6e3af0598790ec076094e8c37Simran Basi **kwargs): 287e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns the siblings of this Tag that match the given 288e64493206b76bce6e3af0598790ec076094e8c37Simran Basi criteria and appear after this Tag in the document.""" 289e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._findAll(name, attrs, text, limit, 290e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.nextSiblingGenerator, **kwargs) 291e64493206b76bce6e3af0598790ec076094e8c37Simran Basi fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x 292e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 293e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def findPrevious(self, name=None, attrs={}, text=None, **kwargs): 294e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns the first item that matches the given criteria and 295e64493206b76bce6e3af0598790ec076094e8c37Simran Basi appears before this Tag in the document.""" 296e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs) 297e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 298e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def findAllPrevious(self, name=None, attrs={}, text=None, limit=None, 299e64493206b76bce6e3af0598790ec076094e8c37Simran Basi **kwargs): 300e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns all items that match the given criteria and appear 301e64493206b76bce6e3af0598790ec076094e8c37Simran Basi before this Tag in the document.""" 302e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._findAll(name, attrs, text, limit, self.previousGenerator, 303e64493206b76bce6e3af0598790ec076094e8c37Simran Basi **kwargs) 304e64493206b76bce6e3af0598790ec076094e8c37Simran Basi fetchPrevious = findAllPrevious # Compatibility with pre-3.x 305e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 306e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs): 307e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns the closest sibling to this Tag that matches the 308e64493206b76bce6e3af0598790ec076094e8c37Simran Basi given criteria and appears before this Tag in the document.""" 309e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._findOne(self.findPreviousSiblings, name, attrs, text, 310e64493206b76bce6e3af0598790ec076094e8c37Simran Basi **kwargs) 311e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 312e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def findPreviousSiblings(self, name=None, attrs={}, text=None, 313e64493206b76bce6e3af0598790ec076094e8c37Simran Basi limit=None, **kwargs): 314e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns the siblings of this Tag that match the given 315e64493206b76bce6e3af0598790ec076094e8c37Simran Basi criteria and appear before this Tag in the document.""" 316e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._findAll(name, attrs, text, limit, 317e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previousSiblingGenerator, **kwargs) 318e64493206b76bce6e3af0598790ec076094e8c37Simran Basi fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x 319e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 320e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def findParent(self, name=None, attrs={}, **kwargs): 321e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns the closest parent of this Tag that matches the given 322e64493206b76bce6e3af0598790ec076094e8c37Simran Basi criteria.""" 323e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # NOTE: We can't use _findOne because findParents takes a different 324e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # set of arguments. 325e64493206b76bce6e3af0598790ec076094e8c37Simran Basi r = None 326e64493206b76bce6e3af0598790ec076094e8c37Simran Basi l = self.findParents(name, attrs, 1) 327e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if l: 328e64493206b76bce6e3af0598790ec076094e8c37Simran Basi r = l[0] 329e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return r 330e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 331e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def findParents(self, name=None, attrs={}, limit=None, **kwargs): 332e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns the parents of this Tag that match the given 333e64493206b76bce6e3af0598790ec076094e8c37Simran Basi criteria.""" 334e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 335e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._findAll(name, attrs, None, limit, self.parentGenerator, 336e64493206b76bce6e3af0598790ec076094e8c37Simran Basi **kwargs) 337e64493206b76bce6e3af0598790ec076094e8c37Simran Basi fetchParents = findParents # Compatibility with pre-3.x 338e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 339e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #These methods do the real heavy lifting. 340e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 341e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _findOne(self, method, name, attrs, text, **kwargs): 342e64493206b76bce6e3af0598790ec076094e8c37Simran Basi r = None 343e64493206b76bce6e3af0598790ec076094e8c37Simran Basi l = method(name, attrs, text, 1, **kwargs) 344e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if l: 345e64493206b76bce6e3af0598790ec076094e8c37Simran Basi r = l[0] 346e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return r 347e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 348e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _findAll(self, name, attrs, text, limit, generator, **kwargs): 349e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "Iterates over a generator looking for things that match." 350e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 351e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(name, SoupStrainer): 352e64493206b76bce6e3af0598790ec076094e8c37Simran Basi strainer = name 353e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # (Possibly) special case some findAll*(...) searches 354e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif text is None and not limit and not attrs and not kwargs: 355e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # findAll*(True) 356e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if name is True: 357e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return [element for element in generator() 358e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(element, Tag)] 359e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # findAll*('tag-name') 360e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif isinstance(name, basestring): 361e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return [element for element in generator() 362e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(element, Tag) and 363e64493206b76bce6e3af0598790ec076094e8c37Simran Basi element.name == name] 364e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 365e64493206b76bce6e3af0598790ec076094e8c37Simran Basi strainer = SoupStrainer(name, attrs, text, **kwargs) 366e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Build a SoupStrainer 367e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 368e64493206b76bce6e3af0598790ec076094e8c37Simran Basi strainer = SoupStrainer(name, attrs, text, **kwargs) 369e64493206b76bce6e3af0598790ec076094e8c37Simran Basi results = ResultSet(strainer) 370e64493206b76bce6e3af0598790ec076094e8c37Simran Basi g = generator() 371e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while True: 372e64493206b76bce6e3af0598790ec076094e8c37Simran Basi try: 373e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = g.next() 374e64493206b76bce6e3af0598790ec076094e8c37Simran Basi except StopIteration: 375e64493206b76bce6e3af0598790ec076094e8c37Simran Basi break 376e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if i: 377e64493206b76bce6e3af0598790ec076094e8c37Simran Basi found = strainer.search(i) 378e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if found: 379e64493206b76bce6e3af0598790ec076094e8c37Simran Basi results.append(found) 380e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if limit and len(results) >= limit: 381e64493206b76bce6e3af0598790ec076094e8c37Simran Basi break 382e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return results 383e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 384e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #These Generators can be used to navigate starting from both 385e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #NavigableStrings and Tags. 386e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def nextGenerator(self): 387e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = self 388e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while i is not None: 389e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = i.next 390e64493206b76bce6e3af0598790ec076094e8c37Simran Basi yield i 391e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 392e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def nextSiblingGenerator(self): 393e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = self 394e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while i is not None: 395e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = i.nextSibling 396e64493206b76bce6e3af0598790ec076094e8c37Simran Basi yield i 397e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 398e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def previousGenerator(self): 399e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = self 400e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while i is not None: 401e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = i.previous 402e64493206b76bce6e3af0598790ec076094e8c37Simran Basi yield i 403e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 404e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def previousSiblingGenerator(self): 405e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = self 406e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while i is not None: 407e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = i.previousSibling 408e64493206b76bce6e3af0598790ec076094e8c37Simran Basi yield i 409e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 410e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def parentGenerator(self): 411e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = self 412e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while i is not None: 413e64493206b76bce6e3af0598790ec076094e8c37Simran Basi i = i.parent 414e64493206b76bce6e3af0598790ec076094e8c37Simran Basi yield i 415e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 416e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Utility methods 417e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def substituteEncoding(self, str, encoding=None): 418e64493206b76bce6e3af0598790ec076094e8c37Simran Basi encoding = encoding or "utf-8" 419e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return str.replace("%SOUP-ENCODING%", encoding) 420e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 421e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def toEncoding(self, s, encoding=None): 422e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Encodes an object to a string in some encoding, or to Unicode. 423e64493206b76bce6e3af0598790ec076094e8c37Simran Basi .""" 424e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(s, unicode): 425e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if encoding: 426e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s = s.encode(encoding) 427e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif isinstance(s, str): 428e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if encoding: 429e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s = s.encode(encoding) 430e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 431e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s = unicode(s) 432e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 433e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if encoding: 434e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s = self.toEncoding(str(s), encoding) 435e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 436e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s = unicode(s) 437e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return s 438e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 439e64493206b76bce6e3af0598790ec076094e8c37Simran Basi BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|" 440e64493206b76bce6e3af0598790ec076094e8c37Simran Basi + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)" 441e64493206b76bce6e3af0598790ec076094e8c37Simran Basi + ")") 442e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 443e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _sub_entity(self, x): 444e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Used with a regular expression to substitute the 445e64493206b76bce6e3af0598790ec076094e8c37Simran Basi appropriate XML entity for an XML special character.""" 446e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";" 447e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 448e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 449e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass NavigableString(unicode, PageElement): 450e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 451e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __new__(cls, value): 452e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Create a new NavigableString. 453e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 454e64493206b76bce6e3af0598790ec076094e8c37Simran Basi When unpickling a NavigableString, this method is called with 455e64493206b76bce6e3af0598790ec076094e8c37Simran Basi the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be 456e64493206b76bce6e3af0598790ec076094e8c37Simran Basi passed in to the superclass's __new__ or the superclass won't know 457e64493206b76bce6e3af0598790ec076094e8c37Simran Basi how to handle non-ASCII characters. 458e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """ 459e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(value, unicode): 460e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return unicode.__new__(cls, value) 461e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING) 462e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 463e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __getnewargs__(self): 464e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return (NavigableString.__str__(self),) 465e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 466e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __getattr__(self, attr): 467e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """text.string gives you text. This is for backwards 468e64493206b76bce6e3af0598790ec076094e8c37Simran Basi compatibility for Navigable*String, but for CData* it lets you 469e64493206b76bce6e3af0598790ec076094e8c37Simran Basi get the string without the CData wrapper.""" 470e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if attr == 'string': 471e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self 472e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 473e64493206b76bce6e3af0598790ec076094e8c37Simran Basi raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr) 474e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 475e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __unicode__(self): 476e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return str(self).decode(DEFAULT_OUTPUT_ENCODING) 477e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 478e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): 479e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Substitute outgoing XML entities. 480e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, self) 481e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if encoding: 482e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return data.encode(encoding) 483e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 484e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return data 485e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 486e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass CData(NavigableString): 487e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 488e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): 489e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return "<![CDATA[%s]]>" % NavigableString.__str__(self, encoding) 490e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 491e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass ProcessingInstruction(NavigableString): 492e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): 493e64493206b76bce6e3af0598790ec076094e8c37Simran Basi output = self 494e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if "%SOUP-ENCODING%" in output: 495e64493206b76bce6e3af0598790ec076094e8c37Simran Basi output = self.substituteEncoding(output, encoding) 496e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return "<?%s?>" % self.toEncoding(output, encoding) 497e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 498e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass Comment(NavigableString): 499e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): 500e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return "<!--%s-->" % NavigableString.__str__(self, encoding) 501e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 502e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass Declaration(NavigableString): 503e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): 504e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return "<!%s>" % NavigableString.__str__(self, encoding) 505e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 506e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass Tag(PageElement): 507e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 508e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Represents a found HTML tag with its attributes and contents.""" 509e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 510e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _convertEntities(self, match): 511e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Used in a call to re.sub to replace HTML, XML, and numeric 512e64493206b76bce6e3af0598790ec076094e8c37Simran Basi entities with the appropriate Unicode characters. If HTML 513e64493206b76bce6e3af0598790ec076094e8c37Simran Basi entities are being converted, any unrecognized entities are 514e64493206b76bce6e3af0598790ec076094e8c37Simran Basi escaped.""" 515e64493206b76bce6e3af0598790ec076094e8c37Simran Basi x = match.group(1) 516e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.convertHTMLEntities and x in name2codepoint: 517e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return unichr(name2codepoint[x]) 518e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS: 519e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.convertXMLEntities: 520e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.XML_ENTITIES_TO_SPECIAL_CHARS[x] 521e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 522e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return u'&%s;' % x 523e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif len(x) > 0 and x[0] == '#': 524e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Handle numeric entities 525e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if len(x) > 1 and x[1] == 'x': 526e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return unichr(int(x[2:], 16)) 527e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 528e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return unichr(int(x[1:])) 529e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 530e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif self.escapeUnrecognizedEntities: 531e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return u'&%s;' % x 532e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 533e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return u'&%s;' % x 534e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 535e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __init__(self, parser, name, attrs=None, parent=None, 536e64493206b76bce6e3af0598790ec076094e8c37Simran Basi previous=None): 537e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "Basic constructor." 538e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 539e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # We don't actually store the parser object: that lets extracted 540e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # chunks be garbage-collected 541e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.parserClass = parser.__class__ 542e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.isSelfClosing = parser.isSelfClosingTag(name) 543e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.name = name 544e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if attrs is None: 545e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attrs = [] 546e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif isinstance(attrs, dict): 547e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attrs = attrs.items() 548e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.attrs = attrs 549e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.contents = [] 550e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.setup(parent, previous) 551e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.hidden = False 552e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.containsSubstitutions = False 553e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.convertHTMLEntities = parser.convertHTMLEntities 554e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.convertXMLEntities = parser.convertXMLEntities 555e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities 556e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 557e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Convert any HTML, XML, or numeric entities in the attribute values. 558e64493206b76bce6e3af0598790ec076094e8c37Simran Basi convert = lambda(k, val): (k, 559e64493206b76bce6e3af0598790ec076094e8c37Simran Basi re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);", 560e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._convertEntities, 561e64493206b76bce6e3af0598790ec076094e8c37Simran Basi val)) 562e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.attrs = map(convert, self.attrs) 563e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 564e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def getString(self): 565e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if (len(self.contents) == 1 566e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and isinstance(self.contents[0], NavigableString)): 567e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.contents[0] 568e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 569e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def setString(self, string): 570e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Replace the contents of the tag with a string""" 571e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.clear() 572e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.append(string) 573e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 574e64493206b76bce6e3af0598790ec076094e8c37Simran Basi string = property(getString, setString) 575e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 576e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def getText(self, separator=u""): 577e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not len(self.contents): 578e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return u"" 579e64493206b76bce6e3af0598790ec076094e8c37Simran Basi stopNode = self._lastRecursiveChild().next 580e64493206b76bce6e3af0598790ec076094e8c37Simran Basi strings = [] 581e64493206b76bce6e3af0598790ec076094e8c37Simran Basi current = self.contents[0] 582e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while current is not stopNode: 583e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(current, NavigableString): 584e64493206b76bce6e3af0598790ec076094e8c37Simran Basi strings.append(current.strip()) 585e64493206b76bce6e3af0598790ec076094e8c37Simran Basi current = current.next 586e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return separator.join(strings) 587e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 588e64493206b76bce6e3af0598790ec076094e8c37Simran Basi text = property(getText) 589e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 590e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def get(self, key, default=None): 591e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns the value of the 'key' attribute for the tag, or 592e64493206b76bce6e3af0598790ec076094e8c37Simran Basi the value given for 'default' if it doesn't have that 593e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attribute.""" 594e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._getAttrMap().get(key, default) 595e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 596e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def clear(self): 597e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Extract all children.""" 598e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for child in self.contents[:]: 599e64493206b76bce6e3af0598790ec076094e8c37Simran Basi child.extract() 600e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 601e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def index(self, element): 602e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for i, child in enumerate(self.contents): 603e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if child is element: 604e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return i 605e64493206b76bce6e3af0598790ec076094e8c37Simran Basi raise ValueError("Tag.index: element not in tag") 606e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 607e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def has_key(self, key): 608e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._getAttrMap().has_key(key) 609e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 610e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __getitem__(self, key): 611e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """tag[key] returns the value of the 'key' attribute for the tag, 612e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and throws an exception if it's not there.""" 613e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._getAttrMap()[key] 614e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 615e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __iter__(self): 616e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "Iterating over a tag iterates over its contents." 617e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return iter(self.contents) 618e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 619e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __len__(self): 620e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "The length of a tag is the length of its list of contents." 621e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return len(self.contents) 622e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 623e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __contains__(self, x): 624e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return x in self.contents 625e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 626e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __nonzero__(self): 627e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "A tag is non-None even if it has no contents." 628e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return True 629e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 630e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __setitem__(self, key, value): 631e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Setting tag[key] sets the value of the 'key' attribute for the 632e64493206b76bce6e3af0598790ec076094e8c37Simran Basi tag.""" 633e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._getAttrMap() 634e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.attrMap[key] = value 635e64493206b76bce6e3af0598790ec076094e8c37Simran Basi found = False 636e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for i in range(0, len(self.attrs)): 637e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.attrs[i][0] == key: 638e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.attrs[i] = (key, value) 639e64493206b76bce6e3af0598790ec076094e8c37Simran Basi found = True 640e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not found: 641e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.attrs.append((key, value)) 642e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._getAttrMap()[key] = value 643e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 644e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __delitem__(self, key): 645e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "Deleting tag[key] deletes all 'key' attributes for the tag." 646e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for item in self.attrs: 647e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if item[0] == key: 648e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.attrs.remove(item) 649e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #We don't break because bad HTML can define the same 650e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #attribute multiple times. 651e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._getAttrMap() 652e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.attrMap.has_key(key): 653e64493206b76bce6e3af0598790ec076094e8c37Simran Basi del self.attrMap[key] 654e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 655e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __call__(self, *args, **kwargs): 656e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Calling a tag like a function is the same as calling its 657e64493206b76bce6e3af0598790ec076094e8c37Simran Basi findAll() method. Eg. tag('a') returns a list of all the A tags 658e64493206b76bce6e3af0598790ec076094e8c37Simran Basi found within this tag.""" 659e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return apply(self.findAll, args, kwargs) 660e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 661e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __getattr__(self, tag): 662e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "Getattr %s.%s" % (self.__class__, tag) 663e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3: 664e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.find(tag[:-3]) 665e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif tag.find('__') != 0: 666e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.find(tag) 667e64493206b76bce6e3af0598790ec076094e8c37Simran Basi raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag) 668e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 669e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __eq__(self, other): 670e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns true iff this tag has the same name, the same attributes, 671e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and the same contents (recursively) as the given tag. 672e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 673e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NOTE: right now this will return false if two tags have the 674e64493206b76bce6e3af0598790ec076094e8c37Simran Basi same attributes in a different order. Should this be fixed?""" 675e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if other is self: 676e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return True 677e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other): 678e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return False 679e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for i in range(0, len(self.contents)): 680e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.contents[i] != other.contents[i]: 681e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return False 682e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return True 683e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 684e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __ne__(self, other): 685e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns true iff this tag is not identical to the other tag, 686e64493206b76bce6e3af0598790ec076094e8c37Simran Basi as defined in __eq__.""" 687e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return not self == other 688e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 689e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING): 690e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Renders this tag as a string.""" 691e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.__str__(encoding) 692e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 693e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __unicode__(self): 694e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.__str__(None) 695e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 696e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING, 697e64493206b76bce6e3af0598790ec076094e8c37Simran Basi prettyPrint=False, indentLevel=0): 698e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns a string or Unicode representation of this tag and 699e64493206b76bce6e3af0598790ec076094e8c37Simran Basi its contents. To get Unicode, pass None for encoding. 700e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 701e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NOTE: since Python's HTML parser consumes whitespace, this 702e64493206b76bce6e3af0598790ec076094e8c37Simran Basi method is not certain to reproduce the whitespace present in 703e64493206b76bce6e3af0598790ec076094e8c37Simran Basi the original string.""" 704e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 705e64493206b76bce6e3af0598790ec076094e8c37Simran Basi encodedName = self.toEncoding(self.name, encoding) 706e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 707e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attrs = [] 708e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.attrs: 709e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for key, val in self.attrs: 710e64493206b76bce6e3af0598790ec076094e8c37Simran Basi fmt = '%s="%s"' 711e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(val, basestring): 712e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.containsSubstitutions and '%SOUP-ENCODING%' in val: 713e64493206b76bce6e3af0598790ec076094e8c37Simran Basi val = self.substituteEncoding(val, encoding) 714e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 715e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # The attribute value either: 716e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # 717e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # * Contains no embedded double quotes or single quotes. 718e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # No problem: we enclose it in double quotes. 719e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # * Contains embedded single quotes. No problem: 720e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # double quotes work here too. 721e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # * Contains embedded double quotes. No problem: 722e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # we enclose it in single quotes. 723e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # * Embeds both single _and_ double quotes. This 724e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # can't happen naturally, but it can happen if 725e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # you modify an attribute value after parsing 726e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # the document. Now we have a bit of a 727e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # problem. We solve it by enclosing the 728e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # attribute in single quotes, and escaping any 729e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # embedded single quotes to XML entities. 730e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if '"' in val: 731e64493206b76bce6e3af0598790ec076094e8c37Simran Basi fmt = "%s='%s'" 732e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if "'" in val: 733e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # TODO: replace with apos when 734e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # appropriate. 735e64493206b76bce6e3af0598790ec076094e8c37Simran Basi val = val.replace("'", "&squot;") 736e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 737e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Now we're okay w/r/t quotes. But the attribute 738e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # value might also contain angle brackets, or 739e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # ampersands that aren't part of entities. We need 740e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # to escape those to XML entities too. 741e64493206b76bce6e3af0598790ec076094e8c37Simran Basi val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val) 742e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 743e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attrs.append(fmt % (self.toEncoding(key, encoding), 744e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.toEncoding(val, encoding))) 745e64493206b76bce6e3af0598790ec076094e8c37Simran Basi close = '' 746e64493206b76bce6e3af0598790ec076094e8c37Simran Basi closeTag = '' 747e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.isSelfClosing: 748e64493206b76bce6e3af0598790ec076094e8c37Simran Basi close = ' /' 749e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 750e64493206b76bce6e3af0598790ec076094e8c37Simran Basi closeTag = '</%s>' % encodedName 751e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 752e64493206b76bce6e3af0598790ec076094e8c37Simran Basi indentTag, indentContents = 0, 0 753e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if prettyPrint: 754e64493206b76bce6e3af0598790ec076094e8c37Simran Basi indentTag = indentLevel 755e64493206b76bce6e3af0598790ec076094e8c37Simran Basi space = (' ' * (indentTag-1)) 756e64493206b76bce6e3af0598790ec076094e8c37Simran Basi indentContents = indentTag + 1 757e64493206b76bce6e3af0598790ec076094e8c37Simran Basi contents = self.renderContents(encoding, prettyPrint, indentContents) 758e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.hidden: 759e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s = contents 760e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 761e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s = [] 762e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attributeString = '' 763e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if attrs: 764e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attributeString = ' ' + ' '.join(attrs) 765e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if prettyPrint: 766e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append(space) 767e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append('<%s%s%s>' % (encodedName, attributeString, close)) 768e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if prettyPrint: 769e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append("\n") 770e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append(contents) 771e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if prettyPrint and contents and contents[-1] != "\n": 772e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append("\n") 773e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if prettyPrint and closeTag: 774e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append(space) 775e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append(closeTag) 776e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if prettyPrint and closeTag and self.nextSibling: 777e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append("\n") 778e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s = ''.join(s) 779e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return s 780e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 781e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def decompose(self): 782e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Recursively destroys the contents of this tree.""" 783e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.extract() 784e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if len(self.contents) == 0: 785e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return 786e64493206b76bce6e3af0598790ec076094e8c37Simran Basi current = self.contents[0] 787e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while current is not None: 788e64493206b76bce6e3af0598790ec076094e8c37Simran Basi next = current.next 789e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(current, Tag): 790e64493206b76bce6e3af0598790ec076094e8c37Simran Basi del current.contents[:] 791e64493206b76bce6e3af0598790ec076094e8c37Simran Basi current.parent = None 792e64493206b76bce6e3af0598790ec076094e8c37Simran Basi current.previous = None 793e64493206b76bce6e3af0598790ec076094e8c37Simran Basi current.previousSibling = None 794e64493206b76bce6e3af0598790ec076094e8c37Simran Basi current.next = None 795e64493206b76bce6e3af0598790ec076094e8c37Simran Basi current.nextSibling = None 796e64493206b76bce6e3af0598790ec076094e8c37Simran Basi current = next 797e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 798e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING): 799e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.__str__(encoding, True) 800e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 801e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING, 802e64493206b76bce6e3af0598790ec076094e8c37Simran Basi prettyPrint=False, indentLevel=0): 803e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Renders the contents of this tag as a string in the given 804e64493206b76bce6e3af0598790ec076094e8c37Simran Basi encoding. If encoding is None, returns a Unicode string..""" 805e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s=[] 806e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for c in self: 807e64493206b76bce6e3af0598790ec076094e8c37Simran Basi text = None 808e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(c, NavigableString): 809e64493206b76bce6e3af0598790ec076094e8c37Simran Basi text = c.__str__(encoding) 810e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif isinstance(c, Tag): 811e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append(c.__str__(encoding, prettyPrint, indentLevel)) 812e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if text and prettyPrint: 813e64493206b76bce6e3af0598790ec076094e8c37Simran Basi text = text.strip() 814e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if text: 815e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if prettyPrint: 816e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append(" " * (indentLevel-1)) 817e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append(text) 818e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if prettyPrint: 819e64493206b76bce6e3af0598790ec076094e8c37Simran Basi s.append("\n") 820e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return ''.join(s) 821e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 822e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #Soup methods 823e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 824e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def find(self, name=None, attrs={}, recursive=True, text=None, 825e64493206b76bce6e3af0598790ec076094e8c37Simran Basi **kwargs): 826e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Return only the first child of this Tag matching the given 827e64493206b76bce6e3af0598790ec076094e8c37Simran Basi criteria.""" 828e64493206b76bce6e3af0598790ec076094e8c37Simran Basi r = None 829e64493206b76bce6e3af0598790ec076094e8c37Simran Basi l = self.findAll(name, attrs, recursive, text, 1, **kwargs) 830e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if l: 831e64493206b76bce6e3af0598790ec076094e8c37Simran Basi r = l[0] 832e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return r 833e64493206b76bce6e3af0598790ec076094e8c37Simran Basi findChild = find 834e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 835e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def findAll(self, name=None, attrs={}, recursive=True, text=None, 836e64493206b76bce6e3af0598790ec076094e8c37Simran Basi limit=None, **kwargs): 837e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Extracts a list of Tag objects that match the given 838e64493206b76bce6e3af0598790ec076094e8c37Simran Basi criteria. You can specify the name of the Tag and any 839e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attributes you want the Tag to have. 840e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 841e64493206b76bce6e3af0598790ec076094e8c37Simran Basi The value of a key-value pair in the 'attrs' map can be a 842e64493206b76bce6e3af0598790ec076094e8c37Simran Basi string, a list of strings, a regular expression object, or a 843e64493206b76bce6e3af0598790ec076094e8c37Simran Basi callable that takes a string and returns whether or not the 844e64493206b76bce6e3af0598790ec076094e8c37Simran Basi string matches for some custom definition of 'matches'. The 845e64493206b76bce6e3af0598790ec076094e8c37Simran Basi same is true of the tag name.""" 846e64493206b76bce6e3af0598790ec076094e8c37Simran Basi generator = self.recursiveChildGenerator 847e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not recursive: 848e64493206b76bce6e3af0598790ec076094e8c37Simran Basi generator = self.childGenerator 849e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._findAll(name, attrs, text, limit, generator, **kwargs) 850e64493206b76bce6e3af0598790ec076094e8c37Simran Basi findChildren = findAll 851e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 852e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Pre-3.x compatibility methods 853e64493206b76bce6e3af0598790ec076094e8c37Simran Basi first = find 854e64493206b76bce6e3af0598790ec076094e8c37Simran Basi fetch = findAll 855e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 856e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def fetchText(self, text=None, recursive=True, limit=None): 857e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.findAll(text=text, recursive=recursive, limit=limit) 858e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 859e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def firstText(self, text=None, recursive=True): 860e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.find(text=text, recursive=recursive) 861e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 862e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #Private methods 863e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 864e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _getAttrMap(self): 865e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Initializes a map representation of this tag's attributes, 866e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not already initialized.""" 867e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not getattr(self, 'attrMap'): 868e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.attrMap = {} 869e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for (key, value) in self.attrs: 870e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.attrMap[key] = value 871e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.attrMap 872e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 873e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #Generator methods 874e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def childGenerator(self): 875e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Just use the iterator from the contents 876e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return iter(self.contents) 877e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 878e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def recursiveChildGenerator(self): 879e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not len(self.contents): 880e64493206b76bce6e3af0598790ec076094e8c37Simran Basi raise StopIteration 881e64493206b76bce6e3af0598790ec076094e8c37Simran Basi stopNode = self._lastRecursiveChild().next 882e64493206b76bce6e3af0598790ec076094e8c37Simran Basi current = self.contents[0] 883e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while current is not stopNode: 884e64493206b76bce6e3af0598790ec076094e8c37Simran Basi yield current 885e64493206b76bce6e3af0598790ec076094e8c37Simran Basi current = current.next 886e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 887e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 888e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# Next, a couple classes to represent queries and their results. 889e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass SoupStrainer: 890e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Encapsulates a number of ways of matching a markup element (tag or 891e64493206b76bce6e3af0598790ec076094e8c37Simran Basi text).""" 892e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 893e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __init__(self, name=None, attrs={}, text=None, **kwargs): 894e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.name = name 895e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(attrs, basestring): 896e64493206b76bce6e3af0598790ec076094e8c37Simran Basi kwargs['class'] = _match_css_class(attrs) 897e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attrs = None 898e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if kwargs: 899e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if attrs: 900e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attrs = attrs.copy() 901e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attrs.update(kwargs) 902e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 903e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attrs = kwargs 904e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.attrs = attrs 905e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.text = text 906e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 907e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __str__(self): 908e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.text: 909e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.text 910e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 911e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return "%s|%s" % (self.name, self.attrs) 912e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 913e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def searchTag(self, markupName=None, markupAttrs={}): 914e64493206b76bce6e3af0598790ec076094e8c37Simran Basi found = None 915e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markup = None 916e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(markupName, Tag): 917e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markup = markupName 918e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markupAttrs = markup 919e64493206b76bce6e3af0598790ec076094e8c37Simran Basi callFunctionWithTagData = callable(self.name) \ 920e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and not isinstance(markupName, Tag) 921e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 922e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if (not self.name) \ 923e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or callFunctionWithTagData \ 924e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or (markup and self._matches(markup, self.name)) \ 925e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or (not markup and self._matches(markupName, self.name)): 926e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if callFunctionWithTagData: 927e64493206b76bce6e3af0598790ec076094e8c37Simran Basi match = self.name(markupName, markupAttrs) 928e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 929e64493206b76bce6e3af0598790ec076094e8c37Simran Basi match = True 930e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markupAttrMap = None 931e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for attr, matchAgainst in self.attrs.items(): 932e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not markupAttrMap: 933e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if hasattr(markupAttrs, 'get'): 934e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markupAttrMap = markupAttrs 935e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 936e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markupAttrMap = {} 937e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for k,v in markupAttrs: 938e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markupAttrMap[k] = v 939e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attrValue = markupAttrMap.get(attr) 940e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not self._matches(attrValue, matchAgainst): 941e64493206b76bce6e3af0598790ec076094e8c37Simran Basi match = False 942e64493206b76bce6e3af0598790ec076094e8c37Simran Basi break 943e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if match: 944e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if markup: 945e64493206b76bce6e3af0598790ec076094e8c37Simran Basi found = markup 946e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 947e64493206b76bce6e3af0598790ec076094e8c37Simran Basi found = markupName 948e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return found 949e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 950e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def search(self, markup): 951e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print 'looking for %s in %s' % (self, markup) 952e64493206b76bce6e3af0598790ec076094e8c37Simran Basi found = None 953e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # If given a list of items, scan it for a text element that 954e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # matches. 955e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if hasattr(markup, "__iter__") \ 956e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and not isinstance(markup, Tag): 957e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for element in markup: 958e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(element, NavigableString) \ 959e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and self.search(element): 960e64493206b76bce6e3af0598790ec076094e8c37Simran Basi found = element 961e64493206b76bce6e3af0598790ec076094e8c37Simran Basi break 962e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # If it's a Tag, make sure its name or attributes match. 963e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Don't bother with Tags if we're searching for text. 964e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif isinstance(markup, Tag): 965e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not self.text: 966e64493206b76bce6e3af0598790ec076094e8c37Simran Basi found = self.searchTag(markup) 967e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # If it's text, make sure the text matches. 968e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif isinstance(markup, NavigableString) or \ 969e64493206b76bce6e3af0598790ec076094e8c37Simran Basi isinstance(markup, basestring): 970e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self._matches(markup, self.text): 971e64493206b76bce6e3af0598790ec076094e8c37Simran Basi found = markup 972e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 973e64493206b76bce6e3af0598790ec076094e8c37Simran Basi raise Exception, "I don't know how to match against a %s" \ 974e64493206b76bce6e3af0598790ec076094e8c37Simran Basi % markup.__class__ 975e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return found 976e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 977e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _matches(self, markup, matchAgainst): 978e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "Matching %s against %s" % (markup, matchAgainst) 979e64493206b76bce6e3af0598790ec076094e8c37Simran Basi result = False 980e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if matchAgainst is True: 981e64493206b76bce6e3af0598790ec076094e8c37Simran Basi result = markup is not None 982e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif callable(matchAgainst): 983e64493206b76bce6e3af0598790ec076094e8c37Simran Basi result = matchAgainst(markup) 984e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 985e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #Custom match methods take the tag as an argument, but all 986e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #other ways of matching match the tag name as a string. 987e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(markup, Tag): 988e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markup = markup.name 989e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if markup and not isinstance(markup, basestring): 990e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markup = unicode(markup) 991e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #Now we know that chunk is either a string, or None. 992e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if hasattr(matchAgainst, 'match'): 993e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # It's a regexp object. 994e64493206b76bce6e3af0598790ec076094e8c37Simran Basi result = markup and matchAgainst.search(markup) 995e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif hasattr(matchAgainst, '__iter__'): # list-like 996e64493206b76bce6e3af0598790ec076094e8c37Simran Basi result = markup in matchAgainst 997e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif hasattr(matchAgainst, 'items'): 998e64493206b76bce6e3af0598790ec076094e8c37Simran Basi result = markup.has_key(matchAgainst) 999e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif matchAgainst and isinstance(markup, basestring): 1000e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(markup, unicode): 1001e64493206b76bce6e3af0598790ec076094e8c37Simran Basi matchAgainst = unicode(matchAgainst) 1002e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 1003e64493206b76bce6e3af0598790ec076094e8c37Simran Basi matchAgainst = str(matchAgainst) 1004e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1005e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not result: 1006e64493206b76bce6e3af0598790ec076094e8c37Simran Basi result = matchAgainst == markup 1007e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return result 1008e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1009e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass ResultSet(list): 1010e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """A ResultSet is just a list that keeps track of the SoupStrainer 1011e64493206b76bce6e3af0598790ec076094e8c37Simran Basi that created it.""" 1012e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __init__(self, source): 1013e64493206b76bce6e3af0598790ec076094e8c37Simran Basi list.__init__([]) 1014e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.source = source 1015e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1016e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# Now, some helper functions. 1017e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1018e64493206b76bce6e3af0598790ec076094e8c37Simran Basidef buildTagMap(default, *args): 1019e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Turns a list of maps, lists, or scalars into a single map. 1020e64493206b76bce6e3af0598790ec076094e8c37Simran Basi Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and 1021e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTING_RESET_TAGS maps out of lists and partial maps.""" 1022e64493206b76bce6e3af0598790ec076094e8c37Simran Basi built = {} 1023e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for portion in args: 1024e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if hasattr(portion, 'items'): 1025e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #It's a map. Merge it. 1026e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for k,v in portion.items(): 1027e64493206b76bce6e3af0598790ec076094e8c37Simran Basi built[k] = v 1028e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif hasattr(portion, '__iter__'): # is a list 1029e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #It's a list. Map each item to the default. 1030e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for k in portion: 1031e64493206b76bce6e3af0598790ec076094e8c37Simran Basi built[k] = default 1032e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 1033e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #It's a scalar. Map it to the default. 1034e64493206b76bce6e3af0598790ec076094e8c37Simran Basi built[portion] = default 1035e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return built 1036e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1037e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# Now, the parser classes. 1038e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1039e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass BeautifulStoneSoup(Tag, SGMLParser): 1040e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1041e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """This class contains the basic parser and search code. It defines 1042e64493206b76bce6e3af0598790ec076094e8c37Simran Basi a parser that knows nothing about tag behavior except for the 1043e64493206b76bce6e3af0598790ec076094e8c37Simran Basi following: 1044e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1045e64493206b76bce6e3af0598790ec076094e8c37Simran Basi You can't close a tag without closing all the tags it encloses. 1046e64493206b76bce6e3af0598790ec076094e8c37Simran Basi That is, "<foo><bar></foo>" actually means 1047e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "<foo><bar></bar></foo>". 1048e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1049e64493206b76bce6e3af0598790ec076094e8c37Simran Basi [Another possible explanation is "<foo><bar /></foo>", but since 1050e64493206b76bce6e3af0598790ec076094e8c37Simran Basi this class defines no SELF_CLOSING_TAGS, it will never use that 1051e64493206b76bce6e3af0598790ec076094e8c37Simran Basi explanation.] 1052e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1053e64493206b76bce6e3af0598790ec076094e8c37Simran Basi This class is useful for parsing XML or made-up markup languages, 1054e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or when BeautifulSoup makes an assumption counter to what you were 1055e64493206b76bce6e3af0598790ec076094e8c37Simran Basi expecting.""" 1056e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1057e64493206b76bce6e3af0598790ec076094e8c37Simran Basi SELF_CLOSING_TAGS = {} 1058e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTABLE_TAGS = {} 1059e64493206b76bce6e3af0598790ec076094e8c37Simran Basi RESET_NESTING_TAGS = {} 1060e64493206b76bce6e3af0598790ec076094e8c37Simran Basi QUOTE_TAGS = {} 1061e64493206b76bce6e3af0598790ec076094e8c37Simran Basi PRESERVE_WHITESPACE_TAGS = [] 1062e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1063e64493206b76bce6e3af0598790ec076094e8c37Simran Basi MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'), 1064e64493206b76bce6e3af0598790ec076094e8c37Simran Basi lambda x: x.group(1) + ' />'), 1065e64493206b76bce6e3af0598790ec076094e8c37Simran Basi (re.compile('<!\s+([^<>]*)>'), 1066e64493206b76bce6e3af0598790ec076094e8c37Simran Basi lambda x: '<!' + x.group(1) + '>') 1067e64493206b76bce6e3af0598790ec076094e8c37Simran Basi ] 1068e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1069e64493206b76bce6e3af0598790ec076094e8c37Simran Basi ROOT_TAG_NAME = u'[document]' 1070e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1071e64493206b76bce6e3af0598790ec076094e8c37Simran Basi HTML_ENTITIES = "html" 1072e64493206b76bce6e3af0598790ec076094e8c37Simran Basi XML_ENTITIES = "xml" 1073e64493206b76bce6e3af0598790ec076094e8c37Simran Basi XHTML_ENTITIES = "xhtml" 1074e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # TODO: This only exists for backwards-compatibility 1075e64493206b76bce6e3af0598790ec076094e8c37Simran Basi ALL_ENTITIES = XHTML_ENTITIES 1076e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1077e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Used when determining whether a text node is all whitespace and 1078e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # can be replaced with a single space. A text node that contains 1079e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # fancy Unicode spaces (usually non-breaking) should be left 1080e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # alone. 1081e64493206b76bce6e3af0598790ec076094e8c37Simran Basi STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, } 1082e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1083e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None, 1084e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markupMassage=True, smartQuotesTo=XML_ENTITIES, 1085e64493206b76bce6e3af0598790ec076094e8c37Simran Basi convertEntities=None, selfClosingTags=None, isHTML=False): 1086e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """The Soup object is initialized as the 'root tag', and the 1087e64493206b76bce6e3af0598790ec076094e8c37Simran Basi provided markup (which can be a string or a file-like object) 1088e64493206b76bce6e3af0598790ec076094e8c37Simran Basi is fed into the underlying parser. 1089e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1090e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sgmllib will process most bad HTML, and the BeautifulSoup 1091e64493206b76bce6e3af0598790ec076094e8c37Simran Basi class has some tricks for dealing with some HTML that kills 1092e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sgmllib, but Beautiful Soup can nonetheless choke or lose data 1093e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if your data uses self-closing tags or declarations 1094e64493206b76bce6e3af0598790ec076094e8c37Simran Basi incorrectly. 1095e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1096e64493206b76bce6e3af0598790ec076094e8c37Simran Basi By default, Beautiful Soup uses regexes to sanitize input, 1097e64493206b76bce6e3af0598790ec076094e8c37Simran Basi avoiding the vast majority of these problems. If the problems 1098e64493206b76bce6e3af0598790ec076094e8c37Simran Basi don't apply to you, pass in False for markupMassage, and 1099e64493206b76bce6e3af0598790ec076094e8c37Simran Basi you'll get better performance. 1100e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1101e64493206b76bce6e3af0598790ec076094e8c37Simran Basi The default parser massage techniques fix the two most common 1102e64493206b76bce6e3af0598790ec076094e8c37Simran Basi instances of invalid HTML that choke sgmllib: 1103e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1104e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <br/> (No space between name of closing tag and tag close) 1105e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <! --Comment--> (Extraneous whitespace in declaration) 1106e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1107e64493206b76bce6e3af0598790ec076094e8c37Simran Basi You can pass in a custom list of (RE object, replace method) 1108e64493206b76bce6e3af0598790ec076094e8c37Simran Basi tuples to get Beautiful Soup to scrub your input the way you 1109e64493206b76bce6e3af0598790ec076094e8c37Simran Basi want.""" 1110e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1111e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.parseOnlyThese = parseOnlyThese 1112e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.fromEncoding = fromEncoding 1113e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.smartQuotesTo = smartQuotesTo 1114e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.convertEntities = convertEntities 1115e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Set the rules for how we'll deal with the entities we 1116e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # encounter 1117e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.convertEntities: 1118e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # It doesn't make sense to convert encoded characters to 1119e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # entities even while you're converting entities to Unicode. 1120e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Just convert it all to Unicode. 1121e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.smartQuotesTo = None 1122e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if convertEntities == self.HTML_ENTITIES: 1123e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.convertXMLEntities = False 1124e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.convertHTMLEntities = True 1125e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.escapeUnrecognizedEntities = True 1126e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif convertEntities == self.XHTML_ENTITIES: 1127e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.convertXMLEntities = True 1128e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.convertHTMLEntities = True 1129e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.escapeUnrecognizedEntities = False 1130e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif convertEntities == self.XML_ENTITIES: 1131e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.convertXMLEntities = True 1132e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.convertHTMLEntities = False 1133e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.escapeUnrecognizedEntities = False 1134e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 1135e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.convertXMLEntities = False 1136e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.convertHTMLEntities = False 1137e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.escapeUnrecognizedEntities = False 1138e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1139e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags) 1140e64493206b76bce6e3af0598790ec076094e8c37Simran Basi SGMLParser.__init__(self) 1141e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1142e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if hasattr(markup, 'read'): # It's a file-type object. 1143e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markup = markup.read() 1144e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.markup = markup 1145e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.markupMassage = markupMassage 1146e64493206b76bce6e3af0598790ec076094e8c37Simran Basi try: 1147e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._feed(isHTML=isHTML) 1148e64493206b76bce6e3af0598790ec076094e8c37Simran Basi except StopParsing: 1149e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1150e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.markup = None # The markup can now be GCed 1151e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1152e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def convert_charref(self, name): 1153e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """This method fixes a bug in Python's SGMLParser.""" 1154e64493206b76bce6e3af0598790ec076094e8c37Simran Basi try: 1155e64493206b76bce6e3af0598790ec076094e8c37Simran Basi n = int(name) 1156e64493206b76bce6e3af0598790ec076094e8c37Simran Basi except ValueError: 1157e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return 1158e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not 0 <= n <= 127 : # ASCII ends at 127, not 255 1159e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return 1160e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.convert_codepoint(n) 1161e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1162e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _feed(self, inDocumentEncoding=None, isHTML=False): 1163e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Convert the document to Unicode. 1164e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markup = self.markup 1165e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(markup, unicode): 1166e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not hasattr(self, 'originalEncoding'): 1167e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.originalEncoding = None 1168e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 1169e64493206b76bce6e3af0598790ec076094e8c37Simran Basi dammit = UnicodeDammit\ 1170e64493206b76bce6e3af0598790ec076094e8c37Simran Basi (markup, [self.fromEncoding, inDocumentEncoding], 1171e64493206b76bce6e3af0598790ec076094e8c37Simran Basi smartQuotesTo=self.smartQuotesTo, isHTML=isHTML) 1172e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markup = dammit.unicode 1173e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.originalEncoding = dammit.originalEncoding 1174e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.declaredHTMLEncoding = dammit.declaredHTMLEncoding 1175e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if markup: 1176e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.markupMassage: 1177e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not hasattr(self.markupMassage, "__iter__"): 1178e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.markupMassage = self.MARKUP_MASSAGE 1179e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for fix, m in self.markupMassage: 1180e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markup = fix.sub(m, markup) 1181e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # TODO: We get rid of markupMassage so that the 1182e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # soup object can be deepcopied later on. Some 1183e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Python installations can't copy regexes. If anyone 1184e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # was relying on the existence of markupMassage, this 1185e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # might cause problems. 1186e64493206b76bce6e3af0598790ec076094e8c37Simran Basi del(self.markupMassage) 1187e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.reset() 1188e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1189e64493206b76bce6e3af0598790ec076094e8c37Simran Basi SGMLParser.feed(self, markup) 1190e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Close out any unfinished strings and close all the open tags. 1191e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.endData() 1192e64493206b76bce6e3af0598790ec076094e8c37Simran Basi while self.currentTag.name != self.ROOT_TAG_NAME: 1193e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.popTag() 1194e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1195e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __getattr__(self, methodName): 1196e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """This method routes method call requests to either the SGMLParser 1197e64493206b76bce6e3af0598790ec076094e8c37Simran Basi superclass or the Tag superclass, depending on the method name.""" 1198e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "__getattr__ called on %s.%s" % (self.__class__, methodName) 1199e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1200e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if methodName.startswith('start_') or methodName.startswith('end_') \ 1201e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or methodName.startswith('do_'): 1202e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return SGMLParser.__getattr__(self, methodName) 1203e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif not methodName.startswith('__'): 1204e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return Tag.__getattr__(self, methodName) 1205e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 1206e64493206b76bce6e3af0598790ec076094e8c37Simran Basi raise AttributeError 1207e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1208e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def isSelfClosingTag(self, name): 1209e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Returns true iff the given string is the name of a 1210e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self-closing tag according to this parser.""" 1211e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.SELF_CLOSING_TAGS.has_key(name) \ 1212e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or self.instanceSelfClosingTags.has_key(name) 1213e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1214e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def reset(self): 1215e64493206b76bce6e3af0598790ec076094e8c37Simran Basi Tag.__init__(self, self, self.ROOT_TAG_NAME) 1216e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.hidden = 1 1217e64493206b76bce6e3af0598790ec076094e8c37Simran Basi SGMLParser.reset(self) 1218e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.currentData = [] 1219e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.currentTag = None 1220e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.tagStack = [] 1221e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.quoteStack = [] 1222e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.pushTag(self) 1223e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1224e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def popTag(self): 1225e64493206b76bce6e3af0598790ec076094e8c37Simran Basi tag = self.tagStack.pop() 1226e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1227e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "Pop", tag.name 1228e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.tagStack: 1229e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.currentTag = self.tagStack[-1] 1230e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.currentTag 1231e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1232e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def pushTag(self, tag): 1233e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "Push", tag.name 1234e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.currentTag: 1235e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.currentTag.contents.append(tag) 1236e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.tagStack.append(tag) 1237e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.currentTag = self.tagStack[-1] 1238e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1239e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def endData(self, containerClass=NavigableString): 1240e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.currentData: 1241e64493206b76bce6e3af0598790ec076094e8c37Simran Basi currentData = u''.join(self.currentData) 1242e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and 1243e64493206b76bce6e3af0598790ec076094e8c37Simran Basi not set([tag.name for tag in self.tagStack]).intersection( 1244e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.PRESERVE_WHITESPACE_TAGS)): 1245e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if '\n' in currentData: 1246e64493206b76bce6e3af0598790ec076094e8c37Simran Basi currentData = '\n' 1247e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 1248e64493206b76bce6e3af0598790ec076094e8c37Simran Basi currentData = ' ' 1249e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.currentData = [] 1250e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.parseOnlyThese and len(self.tagStack) <= 1 and \ 1251e64493206b76bce6e3af0598790ec076094e8c37Simran Basi (not self.parseOnlyThese.text or \ 1252e64493206b76bce6e3af0598790ec076094e8c37Simran Basi not self.parseOnlyThese.search(currentData)): 1253e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return 1254e64493206b76bce6e3af0598790ec076094e8c37Simran Basi o = containerClass(currentData) 1255e64493206b76bce6e3af0598790ec076094e8c37Simran Basi o.setup(self.currentTag, self.previous) 1256e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.previous: 1257e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previous.next = o 1258e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previous = o 1259e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.currentTag.contents.append(o) 1260e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1261e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1262e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _popToTag(self, name, inclusivePop=True): 1263e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Pops the tag stack up to and including the most recent 1264e64493206b76bce6e3af0598790ec076094e8c37Simran Basi instance of the given tag. If inclusivePop is false, pops the tag 1265e64493206b76bce6e3af0598790ec076094e8c37Simran Basi stack up to but *not* including the most recent instqance of 1266e64493206b76bce6e3af0598790ec076094e8c37Simran Basi the given tag.""" 1267e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "Popping to %s" % name 1268e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if name == self.ROOT_TAG_NAME: 1269e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return 1270e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1271e64493206b76bce6e3af0598790ec076094e8c37Simran Basi numPops = 0 1272e64493206b76bce6e3af0598790ec076094e8c37Simran Basi mostRecentTag = None 1273e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for i in range(len(self.tagStack)-1, 0, -1): 1274e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if name == self.tagStack[i].name: 1275e64493206b76bce6e3af0598790ec076094e8c37Simran Basi numPops = len(self.tagStack)-i 1276e64493206b76bce6e3af0598790ec076094e8c37Simran Basi break 1277e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not inclusivePop: 1278e64493206b76bce6e3af0598790ec076094e8c37Simran Basi numPops = numPops - 1 1279e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1280e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for i in range(0, numPops): 1281e64493206b76bce6e3af0598790ec076094e8c37Simran Basi mostRecentTag = self.popTag() 1282e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return mostRecentTag 1283e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1284e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _smartPop(self, name): 1285e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1286e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """We need to pop up to the previous tag of this type, unless 1287e64493206b76bce6e3af0598790ec076094e8c37Simran Basi one of this tag's nesting reset triggers comes between this 1288e64493206b76bce6e3af0598790ec076094e8c37Simran Basi tag and the previous tag of this type, OR unless this tag is a 1289e64493206b76bce6e3af0598790ec076094e8c37Simran Basi generic nesting trigger and another generic nesting trigger 1290e64493206b76bce6e3af0598790ec076094e8c37Simran Basi comes between this tag and the previous tag of this type. 1291e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1292e64493206b76bce6e3af0598790ec076094e8c37Simran Basi Examples: 1293e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'. 1294e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'. 1295e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'. 1296e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1297e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <li><ul><li> *<li>* should pop to 'ul', not the first 'li'. 1298e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr' 1299e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <td><tr><td> *<td>* should pop to 'tr', not the first 'td' 1300e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """ 1301e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1302e64493206b76bce6e3af0598790ec076094e8c37Simran Basi nestingResetTriggers = self.NESTABLE_TAGS.get(name) 1303e64493206b76bce6e3af0598790ec076094e8c37Simran Basi isNestable = nestingResetTriggers != None 1304e64493206b76bce6e3af0598790ec076094e8c37Simran Basi isResetNesting = self.RESET_NESTING_TAGS.has_key(name) 1305e64493206b76bce6e3af0598790ec076094e8c37Simran Basi popTo = None 1306e64493206b76bce6e3af0598790ec076094e8c37Simran Basi inclusive = True 1307e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for i in range(len(self.tagStack)-1, 0, -1): 1308e64493206b76bce6e3af0598790ec076094e8c37Simran Basi p = self.tagStack[i] 1309e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if (not p or p.name == name) and not isNestable: 1310e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #Non-nestable tags get popped to the top or to their 1311e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #last occurance. 1312e64493206b76bce6e3af0598790ec076094e8c37Simran Basi popTo = name 1313e64493206b76bce6e3af0598790ec076094e8c37Simran Basi break 1314e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if (nestingResetTriggers is not None 1315e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and p.name in nestingResetTriggers) \ 1316e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or (nestingResetTriggers is None and isResetNesting 1317e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and self.RESET_NESTING_TAGS.has_key(p.name)): 1318e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1319e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #If we encounter one of the nesting reset triggers 1320e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #peculiar to this tag, or we encounter another tag 1321e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #that causes nesting to reset, pop up to but not 1322e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #including that tag. 1323e64493206b76bce6e3af0598790ec076094e8c37Simran Basi popTo = p.name 1324e64493206b76bce6e3af0598790ec076094e8c37Simran Basi inclusive = False 1325e64493206b76bce6e3af0598790ec076094e8c37Simran Basi break 1326e64493206b76bce6e3af0598790ec076094e8c37Simran Basi p = p.parent 1327e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if popTo: 1328e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._popToTag(popTo, inclusive) 1329e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1330e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def unknown_starttag(self, name, attrs, selfClosing=0): 1331e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "Start tag %s: %s" % (name, attrs) 1332e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.quoteStack: 1333e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #This is not a real tag. 1334e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "<%s> is not real!" % name 1335e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs]) 1336e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.handle_data('<%s%s>' % (name, attrs)) 1337e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return 1338e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.endData() 1339e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1340e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not self.isSelfClosingTag(name) and not selfClosing: 1341e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._smartPop(name) 1342e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1343e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.parseOnlyThese and len(self.tagStack) <= 1 \ 1344e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)): 1345e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return 1346e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1347e64493206b76bce6e3af0598790ec076094e8c37Simran Basi tag = Tag(self, name, attrs, self.currentTag, self.previous) 1348e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.previous: 1349e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previous.next = tag 1350e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.previous = tag 1351e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.pushTag(tag) 1352e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if selfClosing or self.isSelfClosingTag(name): 1353e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.popTag() 1354e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if name in self.QUOTE_TAGS: 1355e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "Beginning quote (%s)" % name 1356e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.quoteStack.append(name) 1357e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.literal = 1 1358e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return tag 1359e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1360e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def unknown_endtag(self, name): 1361e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "End tag %s" % name 1362e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.quoteStack and self.quoteStack[-1] != name: 1363e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #This is not a real end tag. 1364e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "</%s> is not real!" % name 1365e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.handle_data('</%s>' % name) 1366e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return 1367e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.endData() 1368e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._popToTag(name) 1369e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.quoteStack and self.quoteStack[-1] == name: 1370e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.quoteStack.pop() 1371e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.literal = (len(self.quoteStack) > 0) 1372e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1373e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def handle_data(self, data): 1374e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.currentData.append(data) 1375e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1376e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _toStringSubclass(self, text, subclass): 1377e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Adds a certain piece of text to the tree as a NavigableString 1378e64493206b76bce6e3af0598790ec076094e8c37Simran Basi subclass.""" 1379e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.endData() 1380e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.handle_data(text) 1381e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.endData(subclass) 1382e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1383e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def handle_pi(self, text): 1384e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Handle a processing instruction as a ProcessingInstruction 1385e64493206b76bce6e3af0598790ec076094e8c37Simran Basi object, possibly one with a %SOUP-ENCODING% slot into which an 1386e64493206b76bce6e3af0598790ec076094e8c37Simran Basi encoding will be plugged later.""" 1387e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if text[:3] == "xml": 1388e64493206b76bce6e3af0598790ec076094e8c37Simran Basi text = u"xml version='1.0' encoding='%SOUP-ENCODING%'" 1389e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._toStringSubclass(text, ProcessingInstruction) 1390e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1391e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def handle_comment(self, text): 1392e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "Handle comments as Comment objects." 1393e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._toStringSubclass(text, Comment) 1394e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1395e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def handle_charref(self, ref): 1396e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "Handle character references as data." 1397e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.convertEntities: 1398e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = unichr(int(ref)) 1399e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 1400e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = '&#%s;' % ref 1401e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.handle_data(data) 1402e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1403e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def handle_entityref(self, ref): 1404e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Handle entity references as data, possibly converting known 1405e64493206b76bce6e3af0598790ec076094e8c37Simran Basi HTML and/or XML entity references to the corresponding Unicode 1406e64493206b76bce6e3af0598790ec076094e8c37Simran Basi characters.""" 1407e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = None 1408e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.convertHTMLEntities: 1409e64493206b76bce6e3af0598790ec076094e8c37Simran Basi try: 1410e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = unichr(name2codepoint[ref]) 1411e64493206b76bce6e3af0598790ec076094e8c37Simran Basi except KeyError: 1412e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1413e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1414e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not data and self.convertXMLEntities: 1415e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref) 1416e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1417e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not data and self.convertHTMLEntities and \ 1418e64493206b76bce6e3af0598790ec076094e8c37Simran Basi not self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref): 1419e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # TODO: We've got a problem here. We're told this is 1420e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # an entity reference, but it's not an XML entity 1421e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # reference or an HTML entity reference. Nonetheless, 1422e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # the logical thing to do is to pass it through as an 1423e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # unrecognized entity reference. 1424e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # 1425e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Except: when the input is "&carol;" this function 1426e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # will be called with input "carol". When the input is 1427e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # "AT&T", this function will be called with input 1428e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # "T". We have no way of knowing whether a semicolon 1429e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # was present originally, so we don't know whether 1430e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # this is an unknown entity or just a misplaced 1431e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # ampersand. 1432e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # 1433e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # The more common case is a misplaced ampersand, so I 1434e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # escape the ampersand and omit the trailing semicolon. 1435e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = "&%s" % ref 1436e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not data: 1437e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # This case is different from the one above, because we 1438e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # haven't already gone through a supposedly comprehensive 1439e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # mapping of entities to Unicode characters. We might not 1440e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # have gone through any mapping at all. So the chances are 1441e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # very high that this is a real entity, and not a 1442e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # misplaced ampersand. 1443e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = "&%s;" % ref 1444e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.handle_data(data) 1445e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1446e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def handle_decl(self, data): 1447e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "Handle DOCTYPEs and the like as Declaration objects." 1448e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._toStringSubclass(data, Declaration) 1449e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1450e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def parse_declaration(self, i): 1451e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Treat a bogus SGML declaration as raw data. Treat a CDATA 1452e64493206b76bce6e3af0598790ec076094e8c37Simran Basi declaration as a CData object.""" 1453e64493206b76bce6e3af0598790ec076094e8c37Simran Basi j = None 1454e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.rawdata[i:i+9] == '<![CDATA[': 1455e64493206b76bce6e3af0598790ec076094e8c37Simran Basi k = self.rawdata.find(']]>', i) 1456e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if k == -1: 1457e64493206b76bce6e3af0598790ec076094e8c37Simran Basi k = len(self.rawdata) 1458e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = self.rawdata[i+9:k] 1459e64493206b76bce6e3af0598790ec076094e8c37Simran Basi j = k+3 1460e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._toStringSubclass(data, CData) 1461e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 1462e64493206b76bce6e3af0598790ec076094e8c37Simran Basi try: 1463e64493206b76bce6e3af0598790ec076094e8c37Simran Basi j = SGMLParser.parse_declaration(self, i) 1464e64493206b76bce6e3af0598790ec076094e8c37Simran Basi except SGMLParseError: 1465e64493206b76bce6e3af0598790ec076094e8c37Simran Basi toHandle = self.rawdata[i:] 1466e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.handle_data(toHandle) 1467e64493206b76bce6e3af0598790ec076094e8c37Simran Basi j = i + len(toHandle) 1468e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return j 1469e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1470e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass BeautifulSoup(BeautifulStoneSoup): 1471e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1472e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """This parser knows the following facts about HTML: 1473e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1474e64493206b76bce6e3af0598790ec076094e8c37Simran Basi * Some tags have no closing tag and should be interpreted as being 1475e64493206b76bce6e3af0598790ec076094e8c37Simran Basi closed as soon as they are encountered. 1476e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1477e64493206b76bce6e3af0598790ec076094e8c37Simran Basi * The text inside some tags (ie. 'script') may contain tags which 1478e64493206b76bce6e3af0598790ec076094e8c37Simran Basi are not really part of the document and which should be parsed 1479e64493206b76bce6e3af0598790ec076094e8c37Simran Basi as text, not tags. If you want to parse the text as tags, you can 1480e64493206b76bce6e3af0598790ec076094e8c37Simran Basi always fetch it and parse it explicitly. 1481e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1482e64493206b76bce6e3af0598790ec076094e8c37Simran Basi * Tag nesting rules: 1483e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1484e64493206b76bce6e3af0598790ec076094e8c37Simran Basi Most tags can't be nested at all. For instance, the occurance of 1485e64493206b76bce6e3af0598790ec076094e8c37Simran Basi a <p> tag should implicitly close the previous <p> tag. 1486e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1487e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <p>Para1<p>Para2 1488e64493206b76bce6e3af0598790ec076094e8c37Simran Basi should be transformed into: 1489e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <p>Para1</p><p>Para2 1490e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1491e64493206b76bce6e3af0598790ec076094e8c37Simran Basi Some tags can be nested arbitrarily. For instance, the occurance 1492e64493206b76bce6e3af0598790ec076094e8c37Simran Basi of a <blockquote> tag should _not_ implicitly close the previous 1493e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <blockquote> tag. 1494e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1495e64493206b76bce6e3af0598790ec076094e8c37Simran Basi Alice said: <blockquote>Bob said: <blockquote>Blah 1496e64493206b76bce6e3af0598790ec076094e8c37Simran Basi should NOT be transformed into: 1497e64493206b76bce6e3af0598790ec076094e8c37Simran Basi Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah 1498e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1499e64493206b76bce6e3af0598790ec076094e8c37Simran Basi Some tags can be nested, but the nesting is reset by the 1500e64493206b76bce6e3af0598790ec076094e8c37Simran Basi interposition of other tags. For instance, a <tr> tag should 1501e64493206b76bce6e3af0598790ec076094e8c37Simran Basi implicitly close the previous <tr> tag within the same <table>, 1502e64493206b76bce6e3af0598790ec076094e8c37Simran Basi but not close a <tr> tag in another table. 1503e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1504e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <table><tr>Blah<tr>Blah 1505e64493206b76bce6e3af0598790ec076094e8c37Simran Basi should be transformed into: 1506e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <table><tr>Blah</tr><tr>Blah 1507e64493206b76bce6e3af0598790ec076094e8c37Simran Basi but, 1508e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <tr>Blah<table><tr>Blah 1509e64493206b76bce6e3af0598790ec076094e8c37Simran Basi should NOT be transformed into 1510e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <tr>Blah<table></tr><tr>Blah 1511e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1512e64493206b76bce6e3af0598790ec076094e8c37Simran Basi Differing assumptions about tag nesting rules are a major source 1513e64493206b76bce6e3af0598790ec076094e8c37Simran Basi of problems with the BeautifulSoup class. If BeautifulSoup is not 1514e64493206b76bce6e3af0598790ec076094e8c37Simran Basi treating as nestable a tag your page author treats as nestable, 1515e64493206b76bce6e3af0598790ec076094e8c37Simran Basi try ICantBelieveItsBeautifulSoup, MinimalSoup, or 1516e64493206b76bce6e3af0598790ec076094e8c37Simran Basi BeautifulStoneSoup before writing your own subclass.""" 1517e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1518e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __init__(self, *args, **kwargs): 1519e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not kwargs.has_key('smartQuotesTo'): 1520e64493206b76bce6e3af0598790ec076094e8c37Simran Basi kwargs['smartQuotesTo'] = self.HTML_ENTITIES 1521e64493206b76bce6e3af0598790ec076094e8c37Simran Basi kwargs['isHTML'] = True 1522e64493206b76bce6e3af0598790ec076094e8c37Simran Basi BeautifulStoneSoup.__init__(self, *args, **kwargs) 1523e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1524e64493206b76bce6e3af0598790ec076094e8c37Simran Basi SELF_CLOSING_TAGS = buildTagMap(None, 1525e64493206b76bce6e3af0598790ec076094e8c37Simran Basi ('br' , 'hr', 'input', 'img', 'meta', 1526e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'spacer', 'link', 'frame', 'base', 'col')) 1527e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1528e64493206b76bce6e3af0598790ec076094e8c37Simran Basi PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea']) 1529e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1530e64493206b76bce6e3af0598790ec076094e8c37Simran Basi QUOTE_TAGS = {'script' : None, 'textarea' : None} 1531e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1532e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #According to the HTML standard, each of these inline tags can 1533e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #contain another tag of the same type. Furthermore, it's common 1534e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #to actually use these tags this way. 1535e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup', 1536e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'center') 1537e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1538e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #According to the HTML standard, these block tags can contain 1539e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #another tag of the same type. Furthermore, it's common 1540e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #to actually use these tags this way. 1541e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del') 1542e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1543e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #Lists can contain other lists, but there are restrictions. 1544e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTABLE_LIST_TAGS = { 'ol' : [], 1545e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'ul' : [], 1546e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'li' : ['ul', 'ol'], 1547e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'dl' : [], 1548e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'dd' : ['dl'], 1549e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'dt' : ['dl'] } 1550e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1551e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #Tables can contain other tables, but there are restrictions. 1552e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTABLE_TABLE_TAGS = {'table' : [], 1553e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'tr' : ['table', 'tbody', 'tfoot', 'thead'], 1554e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'td' : ['tr'], 1555e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'th' : ['tr'], 1556e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'thead' : ['table'], 1557e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'tbody' : ['table'], 1558e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'tfoot' : ['table'], 1559e64493206b76bce6e3af0598790ec076094e8c37Simran Basi } 1560e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1561e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre') 1562e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1563e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #If one of these tags is encountered, all tags up to the next tag of 1564e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #this type are popped. 1565e64493206b76bce6e3af0598790ec076094e8c37Simran Basi RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript', 1566e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NON_NESTABLE_BLOCK_TAGS, 1567e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTABLE_LIST_TAGS, 1568e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTABLE_TABLE_TAGS) 1569e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1570e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS, 1571e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS) 1572e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1573e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Used to detect the charset in a META tag; see start_meta 1574e64493206b76bce6e3af0598790ec076094e8c37Simran Basi CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M) 1575e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1576e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def start_meta(self, attrs): 1577e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Beautiful Soup can detect a charset included in a META tag, 1578e64493206b76bce6e3af0598790ec076094e8c37Simran Basi try to convert the document to that charset, and re-parse the 1579e64493206b76bce6e3af0598790ec076094e8c37Simran Basi document from the beginning.""" 1580e64493206b76bce6e3af0598790ec076094e8c37Simran Basi httpEquiv = None 1581e64493206b76bce6e3af0598790ec076094e8c37Simran Basi contentType = None 1582e64493206b76bce6e3af0598790ec076094e8c37Simran Basi contentTypeIndex = None 1583e64493206b76bce6e3af0598790ec076094e8c37Simran Basi tagNeedsEncodingSubstitution = False 1584e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1585e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for i in range(0, len(attrs)): 1586e64493206b76bce6e3af0598790ec076094e8c37Simran Basi key, value = attrs[i] 1587e64493206b76bce6e3af0598790ec076094e8c37Simran Basi key = key.lower() 1588e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if key == 'http-equiv': 1589e64493206b76bce6e3af0598790ec076094e8c37Simran Basi httpEquiv = value 1590e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif key == 'content': 1591e64493206b76bce6e3af0598790ec076094e8c37Simran Basi contentType = value 1592e64493206b76bce6e3af0598790ec076094e8c37Simran Basi contentTypeIndex = i 1593e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1594e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if httpEquiv and contentType: # It's an interesting meta tag. 1595e64493206b76bce6e3af0598790ec076094e8c37Simran Basi match = self.CHARSET_RE.search(contentType) 1596e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if match: 1597e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if (self.declaredHTMLEncoding is not None or 1598e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.originalEncoding == self.fromEncoding): 1599e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # An HTML encoding was sniffed while converting 1600e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # the document to Unicode, or an HTML encoding was 1601e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # sniffed during a previous pass through the 1602e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # document, or an encoding was specified 1603e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # explicitly and it worked. Rewrite the meta tag. 1604e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def rewrite(match): 1605e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return match.group(1) + "%SOUP-ENCODING%" 1606e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newAttr = self.CHARSET_RE.sub(rewrite, contentType) 1607e64493206b76bce6e3af0598790ec076094e8c37Simran Basi attrs[contentTypeIndex] = (attrs[contentTypeIndex][0], 1608e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newAttr) 1609e64493206b76bce6e3af0598790ec076094e8c37Simran Basi tagNeedsEncodingSubstitution = True 1610e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 1611e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # This is our first pass through the document. 1612e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Go through it again with the encoding information. 1613e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newCharset = match.group(3) 1614e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if newCharset and newCharset != self.originalEncoding: 1615e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.declaredHTMLEncoding = newCharset 1616e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._feed(self.declaredHTMLEncoding) 1617e64493206b76bce6e3af0598790ec076094e8c37Simran Basi raise StopParsing 1618e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1619e64493206b76bce6e3af0598790ec076094e8c37Simran Basi tag = self.unknown_starttag("meta", attrs) 1620e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if tag and tagNeedsEncodingSubstitution: 1621e64493206b76bce6e3af0598790ec076094e8c37Simran Basi tag.containsSubstitutions = True 1622e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1623e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass StopParsing(Exception): 1624e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1625e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1626e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass ICantBelieveItsBeautifulSoup(BeautifulSoup): 1627e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1628e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """The BeautifulSoup class is oriented towards skipping over 1629e64493206b76bce6e3af0598790ec076094e8c37Simran Basi common HTML errors like unclosed tags. However, sometimes it makes 1630e64493206b76bce6e3af0598790ec076094e8c37Simran Basi errors of its own. For instance, consider this fragment: 1631e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1632e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <b>Foo<b>Bar</b></b> 1633e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1634e64493206b76bce6e3af0598790ec076094e8c37Simran Basi This is perfectly valid (if bizarre) HTML. However, the 1635e64493206b76bce6e3af0598790ec076094e8c37Simran Basi BeautifulSoup class will implicitly close the first b tag when it 1636e64493206b76bce6e3af0598790ec076094e8c37Simran Basi encounters the second 'b'. It will think the author wrote 1637e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "<b>Foo<b>Bar", and didn't close the first 'b' tag, because 1638e64493206b76bce6e3af0598790ec076094e8c37Simran Basi there's no real-world reason to bold something that's already 1639e64493206b76bce6e3af0598790ec076094e8c37Simran Basi bold. When it encounters '</b></b>' it will close two more 'b' 1640e64493206b76bce6e3af0598790ec076094e8c37Simran Basi tags, for a grand total of three tags closed instead of two. This 1641e64493206b76bce6e3af0598790ec076094e8c37Simran Basi can throw off the rest of your document structure. The same is 1642e64493206b76bce6e3af0598790ec076094e8c37Simran Basi true of a number of other tags, listed below. 1643e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1644e64493206b76bce6e3af0598790ec076094e8c37Simran Basi It's much more common for someone to forget to close a 'b' tag 1645e64493206b76bce6e3af0598790ec076094e8c37Simran Basi than to actually use nested 'b' tags, and the BeautifulSoup class 1646e64493206b76bce6e3af0598790ec076094e8c37Simran Basi handles the common case. This class handles the not-co-common 1647e64493206b76bce6e3af0598790ec076094e8c37Simran Basi case: where you can't believe someone wrote what they did, but 1648e64493206b76bce6e3af0598790ec076094e8c37Simran Basi it's valid HTML and BeautifulSoup screwed up by assuming it 1649e64493206b76bce6e3af0598790ec076094e8c37Simran Basi wouldn't be.""" 1650e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1651e64493206b76bce6e3af0598790ec076094e8c37Simran Basi I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \ 1652e64493206b76bce6e3af0598790ec076094e8c37Simran Basi ('em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong', 1653e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'cite', 'code', 'dfn', 'kbd', 'samp', 'strong', 'var', 'b', 1654e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'big') 1655e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1656e64493206b76bce6e3af0598790ec076094e8c37Simran Basi I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ('noscript',) 1657e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1658e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS, 1659e64493206b76bce6e3af0598790ec076094e8c37Simran Basi I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS, 1660e64493206b76bce6e3af0598790ec076094e8c37Simran Basi I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS) 1661e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1662e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass MinimalSoup(BeautifulSoup): 1663e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """The MinimalSoup class is for parsing HTML that contains 1664e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pathologically bad markup. It makes no assumptions about tag 1665e64493206b76bce6e3af0598790ec076094e8c37Simran Basi nesting, but it does know which tags are self-closing, that 1666e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <script> tags contain Javascript and should not be parsed, that 1667e64493206b76bce6e3af0598790ec076094e8c37Simran Basi META tags may contain encoding information, and so on. 1668e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1669e64493206b76bce6e3af0598790ec076094e8c37Simran Basi This also makes it better for subclassing than BeautifulStoneSoup 1670e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or BeautifulSoup.""" 1671e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1672e64493206b76bce6e3af0598790ec076094e8c37Simran Basi RESET_NESTING_TAGS = buildTagMap('noscript') 1673e64493206b76bce6e3af0598790ec076094e8c37Simran Basi NESTABLE_TAGS = {} 1674e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1675e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass BeautifulSOAP(BeautifulStoneSoup): 1676e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """This class will push a tag with only a single string child into 1677e64493206b76bce6e3af0598790ec076094e8c37Simran Basi the tag's parent as an attribute. The attribute's name is the tag 1678e64493206b76bce6e3af0598790ec076094e8c37Simran Basi name, and the value is the string child. An example should give 1679e64493206b76bce6e3af0598790ec076094e8c37Simran Basi the flavor of the change: 1680e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1681e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <foo><bar>baz</bar></foo> 1682e64493206b76bce6e3af0598790ec076094e8c37Simran Basi => 1683e64493206b76bce6e3af0598790ec076094e8c37Simran Basi <foo bar="baz"><bar>baz</bar></foo> 1684e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1685e64493206b76bce6e3af0598790ec076094e8c37Simran Basi You can then access fooTag['bar'] instead of fooTag.barTag.string. 1686e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1687e64493206b76bce6e3af0598790ec076094e8c37Simran Basi This is, of course, useful for scraping structures that tend to 1688e64493206b76bce6e3af0598790ec076094e8c37Simran Basi use subelements instead of attributes, such as SOAP messages. Note 1689e64493206b76bce6e3af0598790ec076094e8c37Simran Basi that it modifies its input, so don't print the modified version 1690e64493206b76bce6e3af0598790ec076094e8c37Simran Basi out. 1691e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1692e64493206b76bce6e3af0598790ec076094e8c37Simran Basi I'm not sure how many people really want to use this class; let me 1693e64493206b76bce6e3af0598790ec076094e8c37Simran Basi know if you do. Mainly I like the name.""" 1694e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1695e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def popTag(self): 1696e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if len(self.tagStack) > 1: 1697e64493206b76bce6e3af0598790ec076094e8c37Simran Basi tag = self.tagStack[-1] 1698e64493206b76bce6e3af0598790ec076094e8c37Simran Basi parent = self.tagStack[-2] 1699e64493206b76bce6e3af0598790ec076094e8c37Simran Basi parent._getAttrMap() 1700e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if (isinstance(tag, Tag) and len(tag.contents) == 1 and 1701e64493206b76bce6e3af0598790ec076094e8c37Simran Basi isinstance(tag.contents[0], NavigableString) and 1702e64493206b76bce6e3af0598790ec076094e8c37Simran Basi not parent.attrMap.has_key(tag.name)): 1703e64493206b76bce6e3af0598790ec076094e8c37Simran Basi parent[tag.name] = tag.contents[0] 1704e64493206b76bce6e3af0598790ec076094e8c37Simran Basi BeautifulStoneSoup.popTag(self) 1705e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1706e64493206b76bce6e3af0598790ec076094e8c37Simran Basi#Enterprise class names! It has come to our attention that some people 1707e64493206b76bce6e3af0598790ec076094e8c37Simran Basi#think the names of the Beautiful Soup parser classes are too silly 1708e64493206b76bce6e3af0598790ec076094e8c37Simran Basi#and "unprofessional" for use in enterprise screen-scraping. We feel 1709e64493206b76bce6e3af0598790ec076094e8c37Simran Basi#your pain! For such-minded folk, the Beautiful Soup Consortium And 1710e64493206b76bce6e3af0598790ec076094e8c37Simran Basi#All-Night Kosher Bakery recommends renaming this file to 1711e64493206b76bce6e3af0598790ec076094e8c37Simran Basi#"RobustParser.py" (or, in cases of extreme enterprisiness, 1712e64493206b76bce6e3af0598790ec076094e8c37Simran Basi#"RobustParserBeanInterface.class") and using the following 1713e64493206b76bce6e3af0598790ec076094e8c37Simran Basi#enterprise-friendly class aliases: 1714e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass RobustXMLParser(BeautifulStoneSoup): 1715e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1716e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass RobustHTMLParser(BeautifulSoup): 1717e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1718e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup): 1719e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1720e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass RobustInsanelyWackAssHTMLParser(MinimalSoup): 1721e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1722e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass SimplifyingSOAPParser(BeautifulSOAP): 1723e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1724e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1725e64493206b76bce6e3af0598790ec076094e8c37Simran Basi###################################################### 1726e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# 1727e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# Bonus library: Unicode, Dammit 1728e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# 1729e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# This class forces XML data into a standard format (usually to UTF-8 1730e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# or Unicode). It is heavily based on code from Mark Pilgrim's 1731e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# Universal Feed Parser. It does not rewrite the XML or HTML to 1732e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi 1733e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# (XML) and BeautifulSoup.start_meta (HTML). 1734e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1735e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# Autodetects character encodings. 1736e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# Download from http://chardet.feedparser.org/ 1737e64493206b76bce6e3af0598790ec076094e8c37Simran Basitry: 1738e64493206b76bce6e3af0598790ec076094e8c37Simran Basi import chardet 1739e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# import chardet.constants 1740e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# chardet.constants._debug = 1 1741e64493206b76bce6e3af0598790ec076094e8c37Simran Basiexcept ImportError: 1742e64493206b76bce6e3af0598790ec076094e8c37Simran Basi chardet = None 1743e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1744e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# cjkcodecs and iconv_codec make Python know about more character encodings. 1745e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# Both are available from http://cjkpython.i18n.org/ 1746e64493206b76bce6e3af0598790ec076094e8c37Simran Basi# They're built in if you use Python 2.4. 1747e64493206b76bce6e3af0598790ec076094e8c37Simran Basitry: 1748e64493206b76bce6e3af0598790ec076094e8c37Simran Basi import cjkcodecs.aliases 1749e64493206b76bce6e3af0598790ec076094e8c37Simran Basiexcept ImportError: 1750e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1751e64493206b76bce6e3af0598790ec076094e8c37Simran Basitry: 1752e64493206b76bce6e3af0598790ec076094e8c37Simran Basi import iconv_codec 1753e64493206b76bce6e3af0598790ec076094e8c37Simran Basiexcept ImportError: 1754e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1755e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1756e64493206b76bce6e3af0598790ec076094e8c37Simran Basiclass UnicodeDammit: 1757e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """A class for detecting the encoding of a *ML document and 1758e64493206b76bce6e3af0598790ec076094e8c37Simran Basi converting it to a Unicode string. If the source encoding is 1759e64493206b76bce6e3af0598790ec076094e8c37Simran Basi windows-1252, can replace MS smart quotes with their HTML or XML 1760e64493206b76bce6e3af0598790ec076094e8c37Simran Basi equivalents.""" 1761e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1762e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # This dictionary maps commonly seen values for "charset" in HTML 1763e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # meta tags to the corresponding Python codec names. It only covers 1764e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # values that aren't in Python's aliases and can't be determined 1765e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # by the heuristics in find_codec. 1766e64493206b76bce6e3af0598790ec076094e8c37Simran Basi CHARSET_ALIASES = { "macintosh" : "mac-roman", 1767e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "x-sjis" : "shift-jis" } 1768e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1769e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def __init__(self, markup, overrideEncodings=[], 1770e64493206b76bce6e3af0598790ec076094e8c37Simran Basi smartQuotesTo='xml', isHTML=False): 1771e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.declaredHTMLEncoding = None 1772e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.markup, documentEncoding, sniffedEncoding = \ 1773e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self._detectEncoding(markup, isHTML) 1774e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.smartQuotesTo = smartQuotesTo 1775e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.triedEncodings = [] 1776e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if markup == '' or isinstance(markup, unicode): 1777e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.originalEncoding = None 1778e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.unicode = unicode(markup) 1779e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return 1780e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1781e64493206b76bce6e3af0598790ec076094e8c37Simran Basi u = None 1782e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for proposedEncoding in overrideEncodings: 1783e64493206b76bce6e3af0598790ec076094e8c37Simran Basi u = self._convertFrom(proposedEncoding) 1784e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if u: break 1785e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not u: 1786e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for proposedEncoding in (documentEncoding, sniffedEncoding): 1787e64493206b76bce6e3af0598790ec076094e8c37Simran Basi u = self._convertFrom(proposedEncoding) 1788e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if u: break 1789e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1790e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # If no luck and we have auto-detection library, try that: 1791e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not u and chardet and not isinstance(self.markup, unicode): 1792e64493206b76bce6e3af0598790ec076094e8c37Simran Basi u = self._convertFrom(chardet.detect(self.markup)['encoding']) 1793e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1794e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # As a last resort, try utf-8 and windows-1252: 1795e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not u: 1796e64493206b76bce6e3af0598790ec076094e8c37Simran Basi for proposed_encoding in ("utf-8", "windows-1252"): 1797e64493206b76bce6e3af0598790ec076094e8c37Simran Basi u = self._convertFrom(proposed_encoding) 1798e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if u: break 1799e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1800e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.unicode = u 1801e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not u: self.originalEncoding = None 1802e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1803e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _subMSChar(self, orig): 1804e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Changes a MS smart quote character to an XML or HTML 1805e64493206b76bce6e3af0598790ec076094e8c37Simran Basi entity.""" 1806e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sub = self.MS_CHARS.get(orig) 1807e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isinstance(sub, tuple): 1808e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.smartQuotesTo == 'xml': 1809e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sub = '&#x%s;' % sub[1] 1810e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 1811e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sub = '&%s;' % sub[0] 1812e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return sub 1813e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1814e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _convertFrom(self, proposed): 1815e64493206b76bce6e3af0598790ec076094e8c37Simran Basi proposed = self.find_codec(proposed) 1816e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not proposed or proposed in self.triedEncodings: 1817e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return None 1818e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.triedEncodings.append(proposed) 1819e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markup = self.markup 1820e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1821e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # Convert smart quotes to HTML if coming from an encoding 1822e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # that might have them. 1823e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if self.smartQuotesTo and proposed.lower() in("windows-1252", 1824e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "iso-8859-1", 1825e64493206b76bce6e3af0598790ec076094e8c37Simran Basi "iso-8859-2"): 1826e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markup = re.compile("([\x80-\x9f])").sub \ 1827e64493206b76bce6e3af0598790ec076094e8c37Simran Basi (lambda(x): self._subMSChar(x.group(1)), 1828e64493206b76bce6e3af0598790ec076094e8c37Simran Basi markup) 1829e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1830e64493206b76bce6e3af0598790ec076094e8c37Simran Basi try: 1831e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # print "Trying to convert document to %s" % proposed 1832e64493206b76bce6e3af0598790ec076094e8c37Simran Basi u = self._toUnicode(markup, proposed) 1833e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.markup = u 1834e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.originalEncoding = proposed 1835e64493206b76bce6e3af0598790ec076094e8c37Simran Basi except Exception, e: 1836e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # print "That didn't work!" 1837e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # print e 1838e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return None 1839e64493206b76bce6e3af0598790ec076094e8c37Simran Basi #print "Correct encoding: %s" % proposed 1840e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self.markup 1841e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1842e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _toUnicode(self, data, encoding): 1843e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '''Given a string and its encoding, decodes the string into Unicode. 1844e64493206b76bce6e3af0598790ec076094e8c37Simran Basi %encoding is a string recognized by encodings.aliases''' 1845e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1846e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # strip Byte Order Mark (if present) 1847e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if (len(data) >= 4) and (data[:2] == '\xfe\xff') \ 1848e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and (data[2:4] != '\x00\x00'): 1849e64493206b76bce6e3af0598790ec076094e8c37Simran Basi encoding = 'utf-16be' 1850e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = data[2:] 1851e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \ 1852e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and (data[2:4] != '\x00\x00'): 1853e64493206b76bce6e3af0598790ec076094e8c37Simran Basi encoding = 'utf-16le' 1854e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = data[2:] 1855e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif data[:3] == '\xef\xbb\xbf': 1856e64493206b76bce6e3af0598790ec076094e8c37Simran Basi encoding = 'utf-8' 1857e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = data[3:] 1858e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif data[:4] == '\x00\x00\xfe\xff': 1859e64493206b76bce6e3af0598790ec076094e8c37Simran Basi encoding = 'utf-32be' 1860e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = data[4:] 1861e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif data[:4] == '\xff\xfe\x00\x00': 1862e64493206b76bce6e3af0598790ec076094e8c37Simran Basi encoding = 'utf-32le' 1863e64493206b76bce6e3af0598790ec076094e8c37Simran Basi data = data[4:] 1864e64493206b76bce6e3af0598790ec076094e8c37Simran Basi newdata = unicode(data, encoding) 1865e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return newdata 1866e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1867e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _detectEncoding(self, xml_data, isHTML=False): 1868e64493206b76bce6e3af0598790ec076094e8c37Simran Basi """Given a document, tries to detect its XML encoding.""" 1869e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_encoding = sniffed_xml_encoding = None 1870e64493206b76bce6e3af0598790ec076094e8c37Simran Basi try: 1871e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if xml_data[:4] == '\x4c\x6f\xa7\x94': 1872e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # EBCDIC 1873e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_data = self._ebcdic_to_ascii(xml_data) 1874e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif xml_data[:4] == '\x00\x3c\x00\x3f': 1875e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # UTF-16BE 1876e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sniffed_xml_encoding = 'utf-16be' 1877e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_data = unicode(xml_data, 'utf-16be').encode('utf-8') 1878e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \ 1879e64493206b76bce6e3af0598790ec076094e8c37Simran Basi and (xml_data[2:4] != '\x00\x00'): 1880e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # UTF-16BE with BOM 1881e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sniffed_xml_encoding = 'utf-16be' 1882e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8') 1883e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif xml_data[:4] == '\x3c\x00\x3f\x00': 1884e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # UTF-16LE 1885e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sniffed_xml_encoding = 'utf-16le' 1886e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_data = unicode(xml_data, 'utf-16le').encode('utf-8') 1887e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \ 1888e64493206b76bce6e3af0598790ec076094e8c37Simran Basi (xml_data[2:4] != '\x00\x00'): 1889e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # UTF-16LE with BOM 1890e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sniffed_xml_encoding = 'utf-16le' 1891e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8') 1892e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif xml_data[:4] == '\x00\x00\x00\x3c': 1893e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # UTF-32BE 1894e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sniffed_xml_encoding = 'utf-32be' 1895e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_data = unicode(xml_data, 'utf-32be').encode('utf-8') 1896e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif xml_data[:4] == '\x3c\x00\x00\x00': 1897e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # UTF-32LE 1898e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sniffed_xml_encoding = 'utf-32le' 1899e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_data = unicode(xml_data, 'utf-32le').encode('utf-8') 1900e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif xml_data[:4] == '\x00\x00\xfe\xff': 1901e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # UTF-32BE with BOM 1902e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sniffed_xml_encoding = 'utf-32be' 1903e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8') 1904e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif xml_data[:4] == '\xff\xfe\x00\x00': 1905e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # UTF-32LE with BOM 1906e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sniffed_xml_encoding = 'utf-32le' 1907e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8') 1908e64493206b76bce6e3af0598790ec076094e8c37Simran Basi elif xml_data[:3] == '\xef\xbb\xbf': 1909e64493206b76bce6e3af0598790ec076094e8c37Simran Basi # UTF-8 with BOM 1910e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sniffed_xml_encoding = 'utf-8' 1911e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8') 1912e64493206b76bce6e3af0598790ec076094e8c37Simran Basi else: 1913e64493206b76bce6e3af0598790ec076094e8c37Simran Basi sniffed_xml_encoding = 'ascii' 1914e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1915e64493206b76bce6e3af0598790ec076094e8c37Simran Basi except: 1916e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_encoding_match = None 1917e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_encoding_match = re.compile( 1918e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data) 1919e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not xml_encoding_match and isHTML: 1920e64493206b76bce6e3af0598790ec076094e8c37Simran Basi regexp = re.compile('<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I) 1921e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_encoding_match = regexp.search(xml_data) 1922e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if xml_encoding_match is not None: 1923e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_encoding = xml_encoding_match.groups()[0].lower() 1924e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if isHTML: 1925e64493206b76bce6e3af0598790ec076094e8c37Simran Basi self.declaredHTMLEncoding = xml_encoding 1926e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if sniffed_xml_encoding and \ 1927e64493206b76bce6e3af0598790ec076094e8c37Simran Basi (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode', 1928e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'iso-10646-ucs-4', 'ucs-4', 'csucs4', 1929e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'utf-16', 'utf-32', 'utf_16', 'utf_32', 1930e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 'utf16', 'u16')): 1931e64493206b76bce6e3af0598790ec076094e8c37Simran Basi xml_encoding = sniffed_xml_encoding 1932e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return xml_data, xml_encoding, sniffed_xml_encoding 1933e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1934e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1935e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def find_codec(self, charset): 1936e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \ 1937e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or (charset and self._codec(charset.replace("-", ""))) \ 1938e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or (charset and self._codec(charset.replace("-", "_"))) \ 1939e64493206b76bce6e3af0598790ec076094e8c37Simran Basi or charset 1940e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1941e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _codec(self, charset): 1942e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not charset: return charset 1943e64493206b76bce6e3af0598790ec076094e8c37Simran Basi codec = None 1944e64493206b76bce6e3af0598790ec076094e8c37Simran Basi try: 1945e64493206b76bce6e3af0598790ec076094e8c37Simran Basi codecs.lookup(charset) 1946e64493206b76bce6e3af0598790ec076094e8c37Simran Basi codec = charset 1947e64493206b76bce6e3af0598790ec076094e8c37Simran Basi except (LookupError, ValueError): 1948e64493206b76bce6e3af0598790ec076094e8c37Simran Basi pass 1949e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return codec 1950e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1951e64493206b76bce6e3af0598790ec076094e8c37Simran Basi EBCDIC_TO_ASCII_MAP = None 1952e64493206b76bce6e3af0598790ec076094e8c37Simran Basi def _ebcdic_to_ascii(self, s): 1953e64493206b76bce6e3af0598790ec076094e8c37Simran Basi c = self.__class__ 1954e64493206b76bce6e3af0598790ec076094e8c37Simran Basi if not c.EBCDIC_TO_ASCII_MAP: 1955e64493206b76bce6e3af0598790ec076094e8c37Simran Basi emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15, 1956e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31, 1957e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7, 1958e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26, 1959e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33, 1960e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94, 1961e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63, 1962e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34, 1963e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 195,97,98,99,100,101,102,103,104,105,196,197,198,199,200, 1964e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 201,202,106,107,108,109,110,111,112,113,114,203,204,205, 1965e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 206,207,208,209,126,115,116,117,118,119,120,121,122,210, 1966e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 211,212,213,214,215,216,217,218,219,220,221,222,223,224, 1967e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72, 1968e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81, 1969e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89, 1970e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57, 1971e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 250,251,252,253,254,255) 1972e64493206b76bce6e3af0598790ec076094e8c37Simran Basi import string 1973e64493206b76bce6e3af0598790ec076094e8c37Simran Basi c.EBCDIC_TO_ASCII_MAP = string.maketrans( \ 1974e64493206b76bce6e3af0598790ec076094e8c37Simran Basi ''.join(map(chr, range(256))), ''.join(map(chr, emap))) 1975e64493206b76bce6e3af0598790ec076094e8c37Simran Basi return s.translate(c.EBCDIC_TO_ASCII_MAP) 1976e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 1977e64493206b76bce6e3af0598790ec076094e8c37Simran Basi MS_CHARS = { '\x80' : ('euro', '20AC'), 1978e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x81' : ' ', 1979e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x82' : ('sbquo', '201A'), 1980e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x83' : ('fnof', '192'), 1981e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x84' : ('bdquo', '201E'), 1982e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x85' : ('hellip', '2026'), 1983e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x86' : ('dagger', '2020'), 1984e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x87' : ('Dagger', '2021'), 1985e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x88' : ('circ', '2C6'), 1986e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x89' : ('permil', '2030'), 1987e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x8A' : ('Scaron', '160'), 1988e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x8B' : ('lsaquo', '2039'), 1989e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x8C' : ('OElig', '152'), 1990e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x8D' : '?', 1991e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x8E' : ('#x17D', '17D'), 1992e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x8F' : '?', 1993e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x90' : '?', 1994e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x91' : ('lsquo', '2018'), 1995e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x92' : ('rsquo', '2019'), 1996e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x93' : ('ldquo', '201C'), 1997e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x94' : ('rdquo', '201D'), 1998e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x95' : ('bull', '2022'), 1999e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x96' : ('ndash', '2013'), 2000e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x97' : ('mdash', '2014'), 2001e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x98' : ('tilde', '2DC'), 2002e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x99' : ('trade', '2122'), 2003e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x9a' : ('scaron', '161'), 2004e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x9b' : ('rsaquo', '203A'), 2005e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x9c' : ('oelig', '153'), 2006e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x9d' : '?', 2007e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x9e' : ('#x17E', '17E'), 2008e64493206b76bce6e3af0598790ec076094e8c37Simran Basi '\x9f' : ('Yuml', ''),} 2009e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 2010e64493206b76bce6e3af0598790ec076094e8c37Simran Basi####################################################################### 2011e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 2012e64493206b76bce6e3af0598790ec076094e8c37Simran Basi 2013e64493206b76bce6e3af0598790ec076094e8c37Simran Basi#By default, act as an HTML pretty-printer. 2014e64493206b76bce6e3af0598790ec076094e8c37Simran Basiif __name__ == '__main__': 2015e64493206b76bce6e3af0598790ec076094e8c37Simran Basi import sys 2016e64493206b76bce6e3af0598790ec076094e8c37Simran Basi soup = BeautifulSoup(sys.stdin) 2017e64493206b76bce6e3af0598790ec076094e8c37Simran Basi print soup.prettify() 2018