The moving parts
================

html5lib consists of a number of components, which are responsible for
handling its features.


Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. There are three main types of trees that
html5lib can build:

* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
  which can be found in the standard library. Whenever possible, the
  accelerated ``ElementTree`` implementation (i.e.
  ``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on ``xml.dom.minidom``.

* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
  API. The performance gains are relatively small compared to using the
  accelerated ``ElementTree`` module.

You can specify the builder by name when using the shorthand API:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When instantiating a parser object, you have to pass a tree builder
class in the ``tree`` keyword argument:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
  document = parser.parse("<p>Hello World!")

To get a builder class by name, use the ``getTreeBuilder`` function:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in `html5lib/treebuilders/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treebuilders>`_.


Tree walkers
------------

Once a tree is ready, you can work on it either manually, or using
a tree walker, which provides a streaming view of the tree.
html5lib provides walkers for all three supported types of trees
(``etree``, ``dom`` and ``lxml``).

The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.

Walkers make consuming HTML easier. html5lib uses them to provide you
with a couple of handy tools.


HTMLSerializer
~~~~~~~~~~~~~~

The serializer lets you write HTML back as a stream of bytes.

.. code-block:: pycon

  >>> import html5lib
  >>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
  >>> walker = html5lib.getTreeWalker("etree")
  >>> stream = walker(element)
  >>> s = html5lib.serializer.HTMLSerializer()
  >>> output = s.serialize(stream)
  >>> for item in output:
  ...     print("%r" % item)
  '<p'
  ' '
  'xml:lang'
  '='
  'pl'
  '>'
  'Witam wszystkich'

You can customize the serializer behaviour in a variety of ways; consult
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
documentation for details.


Filters
~~~~~~~

You can alter the stream content with filters provided by html5lib:

* :class:`alphabeticalattributes.Filter
  <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
  tags to be in alphabetical order

* :class:`inject_meta_charset.Filter
  <html5lib.filters.inject_meta_charset.Filter>` sets a user-specified
  encoding in the correct ``<meta>`` tag in the ``<head>`` section of
  the document

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
  ``LintError`` exceptions on invalid tag and attribute names, invalid
  PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
  removes tags from the stream which are not necessary to produce valid
  HTML

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
  unsafe markup and CSS.
  Elements that are known to be safe are passed
  through and the rest is converted to visible text. The default
  configuration of the sanitizer follows the `WHATWG Sanitization Rules
  <http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
  collapses all whitespace characters to single spaces unless they're in
  ``<pre>`` or ``<textarea>`` elements.

To use a filter, simply wrap it around a stream:

.. code-block:: pycon

  >>> import html5lib
  >>> from html5lib.filters import sanitizer
  >>> dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
  >>> walker = html5lib.getTreeWalker("dom")
  >>> stream = walker(dom)
  >>> clean_stream = sanitizer.Filter(stream)


Tree adapters
-------------

Tree adapters translate one type of tree to another. More documentation
is pending.


Encoding discovery
------------------

Parsed trees are always Unicode. However, a large variety of input
encodings are supported. The encoding of the document is determined in
the following way:

* The encoding may be explicitly specified by passing the name of the
  encoding as the ``encoding`` parameter to the
  :meth:`~html5lib.html5parser.HTMLParser.parse` method on
  ``HTMLParser`` objects.

* If no encoding is specified, the parser will attempt to detect the
  encoding from a ``<meta>`` element in the first 512 bytes of the
  document (this is only a partial implementation of the current HTML
  5 specification).

* If no encoding can be found and the chardet library is available, an
  attempt will be made to sniff the encoding from the byte pattern.

* If all else fails, the default encoding will be used. This is usually
  `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
  a common fallback used by Web browsers.
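
The discovery order above can be pictured as a simple cascade. The
following is a simplified sketch, not html5lib's actual implementation:
``guess_encoding`` and ``META_CHARSET`` are hypothetical names, and the
regular expression is only a crude stand-in for the real ``<meta>``
prescan. It uses the standard library plus an optional ``chardet``
import.

.. code-block:: python

  import re

  # Hypothetical, simplified pattern; a real prescan is far more careful.
  META_CHARSET = re.compile(
      rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
  )


  def guess_encoding(data, explicit=None, default="windows-1252"):
      """Return (encoding, source) following the order described above."""
      # 1. An explicitly supplied encoding always wins.
      if explicit is not None:
          return explicit, "explicit"

      # 2. Look for a <meta> charset declaration in the first 512 bytes.
      match = META_CHARSET.search(data[:512])
      if match:
          return match.group(1).decode("ascii").lower(), "meta"

      # 3. Sniff the byte pattern with chardet, but only if it is installed.
      try:
          import chardet
      except ImportError:
          chardet = None
      if chardet is not None:
          detected = chardet.detect(data).get("encoding")
          if detected:
              return detected.lower(), "chardet"

      # 4. Last resort: the default encoding.
      return default, "default"

Each step only runs when the previous one found nothing, so a declaration
that appears after the first 512 bytes is never consulted in this sketch.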


Tokenizers
----------

The part of the parser responsible for translating a raw input stream
into meaningful tokens is the tokenizer. Currently html5lib provides
two.

To set up a tokenizer, simply pass it when instantiating
a :class:`~html5lib.html5parser.HTMLParser`:

.. code-block:: python

  import html5lib
  from html5lib import sanitizer

  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
  p.parse("<p>Surprise!<script>alert('Boo!');</script>")

HTMLTokenizer
~~~~~~~~~~~~~

This is the default tokenizer, the heart of html5lib. The implementation
can be found in `html5lib/tokenizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.

HTMLSanitizer
~~~~~~~~~~~~~

This is a tokenizer that removes unsafe markup and CSS styles from the
input. Elements that are known to be safe are passed through and the
rest is converted to visible text. The default configuration of the
sanitizer follows the `WHATWG Sanitization Rules
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

The implementation can be found in `html5lib/sanitizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
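
The "pass safe elements through, turn the rest into visible text"
behaviour can be illustrated with a toy token filter. This is a rough
sketch under stated assumptions, not the sanitizer's real code: the
dictionary token shape, the ``sanitize`` function and the tiny
``SAFE_ELEMENTS`` allowlist are all hypothetical and chosen only for
illustration.

.. code-block:: python

  # Hypothetical allowlist, far smaller than any real sanitizer's.
  SAFE_ELEMENTS = {"p", "em", "strong", "a"}


  def sanitize(tokens):
      """Yield tokens, turning tags outside SAFE_ELEMENTS into plain text."""
      for token in tokens:
          is_tag = token["type"] in ("StartTag", "EndTag")
          if is_tag and token["name"] not in SAFE_ELEMENTS:
              slash = "/" if token["type"] == "EndTag" else ""
              # The tag is not dropped: it becomes character data, which a
              # serializer would later escape and show as visible text.
              yield {
                  "type": "Characters",
                  "data": "<%s%s>" % (slash, token["name"]),
              }
          else:
              yield token


  tokens = [
      {"type": "StartTag", "name": "p"},
      {"type": "Characters", "data": "Surprise!"},
      {"type": "StartTag", "name": "script"},
      {"type": "Characters", "data": "alert('Boo!');"},
      {"type": "EndTag", "name": "script"},
      {"type": "EndTag", "name": "p"},
  ]
  result = list(sanitize(tokens))

In this sketch the ``<p>`` tokens survive unchanged, while the two
``script`` tag tokens are replaced by character tokens carrying the
literal tag text.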