The moving parts
================

html5lib is made up of several components, each responsible for one
part of its functionality.


Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. There are three main types of trees that
html5lib can build:

* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
  which can be found in the standard library. Whenever possible, the
  accelerated ``ElementTree`` implementation (i.e.
  ``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on ``xml.dom.minidom``.

* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
  API (selected with the builder name ``lxml``). The performance gains
  are relatively small compared to using the accelerated ``ElementTree``
  module.

You can specify the builder by name when using the shorthand API:

.. code-block:: python

  import html5lib

  # Open in binary mode so html5lib can detect the encoding itself.
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When instantiating a parser object, you have to pass a tree builder
class in the ``tree`` keyword argument:

.. code-block:: python

  import html5lib

  # SomeTreeBuilder stands for any tree builder class.
  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
  document = parser.parse("<p>Hello World!")

To get a builder class by name, use the ``getTreeBuilder`` function:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in `html5lib/treebuilders/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treebuilders>`_.


Tree walkers
------------

Once a tree is ready, you can work on it either manually, or using
a tree walker, which provides a streaming view of the tree. html5lib
provides walkers for all three supported types of trees (``etree``,
``dom`` and ``lxml``).
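
What "a streaming view" means can be pictured with a short,
self-contained sketch. It is not html5lib's implementation: the ``walk``
function is hypothetical, it works on a plain ``xml.etree.ElementTree``
element, and its token dictionaries are only loosely modelled on the
ones real walkers emit (namespaces, comments, doctypes and void elements
are left out).

.. code-block:: python

  import xml.etree.ElementTree as ET


  def walk(element):
      """Yield a flat stream of token dicts for an ElementTree element."""
      yield {"type": "StartTag", "name": element.tag,
             "data": dict(element.attrib)}
      if element.text:
          yield {"type": "Characters", "data": element.text}
      for child in element:
          # Emit the child's own tokens, then the text that follows it.
          yield from walk(child)
          if child.tail:
              yield {"type": "Characters", "data": child.tail}
      yield {"type": "EndTag", "name": element.tag}


  root = ET.fromstring('<p class="greeting">Hello <b>World</b>!</p>')
  tokens = list(walk(root))

Because the result is a flat sequence, a consumer can handle it one
token at a time without knowing which kind of tree produced it.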

The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.

Walkers make consuming HTML easier. html5lib uses them to provide you
with a couple of handy tools.


HTMLSerializer
~~~~~~~~~~~~~~

The serializer lets you write HTML back as a stream of string chunks:

.. code-block:: pycon

  >>> import html5lib
  >>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
  >>> walker = html5lib.getTreeWalker("etree")
  >>> stream = walker(element)
  >>> s = html5lib.serializer.HTMLSerializer()
  >>> output = s.serialize(stream)
  >>> for item in output:
  ...   print("%r" % item)
  '<p'
  ' '
  'xml:lang'
  '='
  'pl'
  '>'
  'Witam wszystkich'

You can customize the serializer behaviour in a variety of ways; consult
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
documentation for details.


Filters
~~~~~~~

You can alter the stream content with filters provided by html5lib:

* :class:`alphabeticalattributes.Filter
  <html5lib.filters.alphabeticalattributes.Filter>` sorts the attributes
  on each tag alphabetically.

* :class:`inject_meta_charset.Filter
  <html5lib.filters.inject_meta_charset.Filter>` sets a user-specified
  encoding in the correct ``<meta>`` tag in the ``<head>`` section of
  the document.

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
  ``LintError`` exceptions on invalid tag and attribute names, invalid
  PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
  removes tags from the stream which are not necessary to produce valid
  HTML.

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
  unsafe markup and CSS. Elements that are known to be safe are passed
  through and the rest is converted to visible text. The default
  configuration of the sanitizer follows the `WHATWG Sanitization Rules
  <http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
  collapses all whitespace characters to single spaces unless they're in
  ``<pre>`` or ``<textarea>`` elements.

To use a filter, simply wrap it around a stream:

.. code-block:: pycon

  >>> import html5lib
  >>> from html5lib.filters import sanitizer
  >>> dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
  >>> walker = html5lib.getTreeWalker("dom")
  >>> stream = walker(dom)
  >>> clean_stream = sanitizer.Filter(stream)
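
Since a filter is wrapped around a stream of tokens and is itself
iterated like one, filters can be chained and custom ones are short to
write. The sketch below illustrates that wrapping pattern only; it is
not part of html5lib. ``DropComments`` and the hand-written token list
are hypothetical, and the token shape is an assumption.

.. code-block:: python

  class DropComments:
      """Hypothetical filter: wraps a token stream, skips Comment tokens."""

      def __init__(self, source):
          self.source = source

      def __iter__(self):
          for token in self.source:
              if token["type"] != "Comment":
                  yield token


  stream = [
      {"type": "StartTag", "name": "p", "data": {}},
      {"type": "Comment", "data": " internal note "},
      {"type": "Characters", "data": "Hello"},
      {"type": "EndTag", "name": "p"},
  ]
  filtered = list(DropComments(stream))

A second filter would be applied the same way, by passing the first
filter object as its source.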


Tree adapters
-------------

Tree adapters translate one type of tree to another. More documentation
is pending.


Encoding discovery
------------------

Parsed trees are always Unicode. However, a wide variety of input
encodings is supported. The encoding of the document is determined in
the following way:

* The encoding may be explicitly specified by passing the name of the
  encoding as the ``encoding`` parameter to the
  :meth:`~html5lib.html5parser.HTMLParser.parse` method on
  ``HTMLParser`` objects.

* If no encoding is specified, the parser will attempt to detect the
  encoding from a ``<meta>`` element in the first 512 bytes of the
  document (this is only a partial implementation of the current HTML
  5 specification).

* If no encoding can be found and the chardet library is available, an
  attempt will be made to sniff the encoding from the byte pattern.

* If all else fails, the default encoding will be used. This is usually
  `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
  a common fallback used by Web browsers.
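
The order of these steps can be sketched in plain Python. This is a
simplified illustration, not html5lib's code: ``discover_encoding``,
``is_known_codec`` and ``META_CHARSET`` are hypothetical names, the
regular expression is a crude stand-in for the real ``<meta>`` scan, and
``sniff`` is an assumed optional callable that plays the role of
chardet.

.. code-block:: python

  import codecs
  import re

  # Crude stand-in for the <meta> scan; not the full HTML algorithm.
  META_CHARSET = re.compile(
      rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
  )


  def is_known_codec(name):
      """Return True if Python has a codec registered under this name."""
      try:
          codecs.lookup(name)
          return True
      except LookupError:
          return False


  def discover_encoding(data, explicit=None, sniff=None,
                        default="windows-1252"):
      # 1. An explicitly specified encoding always wins.
      if explicit is not None:
          return explicit
      # 2. Look for a <meta> declaration in the first 512 bytes only.
      match = META_CHARSET.search(data[:512])
      if match:
          name = match.group(1).decode("ascii")
          if is_known_codec(name):
              return name
      # 3. Fall back to byte-pattern sniffing when a sniffer is supplied.
      if sniff is not None:
          guess = sniff(data)
          if guess:
              return guess
      # 4. If all else fails, use the default.
      return default

With chardet installed, ``sniff`` could be something like
``lambda data: chardet.detect(data)["encoding"]``; leaving it as
``None`` skips that step entirely.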


Tokenizers
----------

The part of the parser responsible for translating a raw input stream
into meaningful tokens is the tokenizer. html5lib currently provides
two: ``HTMLTokenizer`` and ``HTMLSanitizer``.

To use a particular tokenizer, pass it as the ``tokenizer`` argument
when instantiating a :class:`~html5lib.html5parser.HTMLParser`:

.. code-block:: python

  import html5lib
  from html5lib import sanitizer

  # Use the sanitizing tokenizer instead of the default HTMLTokenizer.
  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
  p.parse("<p>Surprise!<script>alert('Boo!');</script>")

HTMLTokenizer
~~~~~~~~~~~~~

This is the default tokenizer, the heart of html5lib. The implementation
can be found in `html5lib/tokenizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.

HTMLSanitizer
~~~~~~~~~~~~~

This is a tokenizer that removes unsafe markup and CSS styles from the
input. Elements that are known to be safe are passed through and the
rest is converted to visible text. The default configuration of the
sanitizer follows the `WHATWG Sanitization Rules
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

The implementation can be found in `html5lib/sanitizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
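
The rule that safe elements pass through while everything else becomes
visible text can be sketched over a simplified token stream. This is an
illustration under stated assumptions, not the sanitizer's code:
``SAFE_ELEMENTS``, ``render_tag`` and ``sanitize`` are hypothetical
names, the allowlist is made up, and the attribute and CSS checks that
the real sanitizer performs are omitted.

.. code-block:: python

  SAFE_ELEMENTS = {"p", "b", "i", "em", "strong"}  # made-up allowlist


  def render_tag(token):
      """Rebuild the markup of a tag token so it can be shown as text."""
      if token["type"] == "EndTag":
          return "</%s>" % token["name"]
      attrs = "".join(' %s="%s"' % item for item in token["data"].items())
      return "<%s%s>" % (token["name"], attrs)


  def sanitize(tokens, safe=SAFE_ELEMENTS):
      """Pass safe tags through; turn every other tag into a text token."""
      for token in tokens:
          is_tag = token["type"] in ("StartTag", "EndTag")
          if is_tag and token["name"] not in safe:
              yield {"type": "Characters", "data": render_tag(token)}
          else:
              yield token


  tokens = [
      {"type": "StartTag", "name": "p", "data": {}},
      {"type": "StartTag", "name": "script", "data": {}},
      {"type": "Characters", "data": "alert('Boo!')"},
      {"type": "EndTag", "name": "script"},
      {"type": "EndTag", "name": "p"},
  ]
  cleaned = list(sanitize(tokens))

In this sketch the ``<script>`` tags survive only as text tokens, so a
serializer that escapes text would be expected to show them to the
reader instead of emitting them as markup.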