#!/usr/bin/env python
# Copyright (c) 2012 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.

'''A gatherer for the TotalRecall brand of HTML templates with replaceable
portions.  We wanted to reuse extern.tclib.api.handlers.html.TCHTMLParser
but this proved impossible because the TotalRecall HTML templates are in
general quite far from parseable HTML, and TCHTMLParser derives from
HTMLParser.HTMLParser, which requires relatively well-formed HTML.  Some
examples of "HTML" from the TotalRecall HTML templates that wouldn't be
parseable include things like:

  <a [PARAMS]>blabla</a>  (not parseable because attributes are invalid)

  <table><tr><td>[LOTSOFSTUFF]</tr></table> (not parseable because closing
                                            </td> is in the HTML [LOTSOFSTUFF]
                                            is replaced by)

The other problem with using general parsers (such as TCHTMLParser) is that
we want to make sure we output the TotalRecall template with as few changes
as possible in terms of whitespace characters, layout etc.  With any parser
that generates a parse tree, and generates output by dumping the parse tree,
we would always have little inconsistencies which could cause bugs (the
TotalRecall template stuff is quite brittle and can break if e.g. a tab
character is replaced with spaces).

The solution, which may be applicable to some other HTML-like template
languages floating around Google, is to create a parser with a simple state
machine that keeps track of what kind of tag it's inside, and whether it's in
a translateable section or not.  Translateable sections are:

a) text (including [BINGO] replaceables) inside of tags that
   can contain translateable text (which is all tags except
   for a few)

b) text inside the 'alt' attribute of an <img> element, the
   'title' attribute of a <table> element, or the 'value'
   attribute of an <input> element of type button, reset, text
   or submit.

The parser does not build up a parse tree but rather a "skeleton" which
is a list of nontranslateable strings intermingled with
grit.clique.MessageClique objects.  This simplifies the parser considerably
compared to a regular HTML parser.  To output a translated document, each
item in the skeleton is printed out, with the relevant Translation from each
MessageClique being used for the requested language.

This implementation borrows some code, constants and ideas from
extern.tclib.api.handlers.html.TCHTMLParser.
'''


import re
import types

from grit import clique
from grit import exception
from grit import lazy_re
from grit import util
from grit import tclib

from grit.gather import interface


# HTML tags which break (separate) chunks.
_BLOCK_TAGS = ['script', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'br',
              'body', 'style', 'head', 'title', 'table', 'tr', 'td', 'th',
              'ul', 'ol', 'dl', 'nl', 'li', 'div', 'object', 'center',
              'html', 'link', 'form', 'select', 'textarea',
              'button', 'option', 'map', 'area', 'blockquote', 'pre',
              'meta', 'xmp', 'noscript', 'label', 'tbody', 'thead',
              'script', 'style', 'pre', 'iframe', 'img', 'input', 'nowrap',
              'fieldset', 'legend']

# HTML tags which may appear within a chunk.
_INLINE_TAGS = ['b', 'i', 'u', 'tt', 'code', 'font', 'a', 'span', 'small',
               'key', 'nobr', 'url', 'em', 's', 'sup', 'strike',
               'strong']

# HTML tags within which linebreaks are significant.
_PREFORMATTED_TAGS = ['textarea', 'xmp', 'pre']

# A dict mapping some of the HTML tags to more meaningful names for those
# tags.  This will be used when generating placeholders representing these
# tags.
_HTML_PLACEHOLDER_NAMES = { 'a' : 'link', 'br' : 'break', 'b' : 'bold',
  'i' : 'italic', 'li' : 'item', 'ol' : 'ordered_list', 'p' : 'paragraph',
  'ul' : 'unordered_list', 'img' : 'image', 'em' : 'emphasis' }

# We append each of these characters in sequence to distinguish between
# different placeholders with basically the same name (e.g. BOLD1, BOLD2).
# Keep in mind that a placeholder name must not be a substring of any other
# placeholder name in the same message, so we can't simply count (BOLD_1
# would be a substring of BOLD_10).
_SUFFIXES = '123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
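
# Illustrative sketch, not part of the original gatherer: the helper name
# _ExamplePlaceholderNames and its arguments are assumptions made for this
# example only, and the real placeholder naming code may differ.  It shows
# why a single character from the suffix string is appended rather than a
# running count: names of equal length cannot be substrings of each other,
# whereas with plain counting 'BOLD_1' would be a substring of 'BOLD_10'.
def _ExamplePlaceholderNames(base, count,
                             suffixes='123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
  '''Returns 'count' distinct names made of 'base' plus one suffix character.'''
  if count > len(suffixes):
    raise ValueError('only %d distinct suffixes are available' % len(suffixes))
  return [base + suffix for suffix in suffixes[:count]]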

# Matches whitespace in an HTML document.  Also matches description comments
# (<!-- desc=... -->), which are treated as whitespace.
_WHITESPACE = lazy_re.compile(r'(\s|&nbsp;|\\n|\\r|<!--\s*desc\s*=.*?-->)+',
                              re.DOTALL)

# Matches whitespace sequences which can be folded into a single whitespace
# character.  This also matches single characters so that a lone non-space
# whitespace character (e.g. a tab) is replaced with a space.
_FOLD_WHITESPACE = lazy_re.compile(r'\s+')
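
# Illustrative sketch, not part of the original gatherer (both helper names
# are hypothetical).  It contrasts regex-based folding with the chained
# str.replace() calls that HtmlChunks.AddChunk uses when folding is disabled.
# The chained version can leave a double space behind on longer runs, e.g.
# five spaces collapse to two rather than one.
def _ExampleFoldWithRegex(text):
  import re  # Local import so the sketch runs standalone.
  return re.sub(r'\s+', ' ', text)


def _ExampleFoldWithReplace(text):
  # Same replacement order as the non-folding branch of AddChunk.
  for old in ('\n', '\r', '   ', '  '):
    text = text.replace(old, ' ')
  return text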

# Finds a non-whitespace character
_NON_WHITESPACE = lazy_re.compile(r'\S')

# Matches two or more &nbsp; in a row (a single &nbsp; is not changed into a
# placeholder because different languages require different numbers of spaces
# and placeholders must match exactly; more than one is probably a "special"
# whitespace sequence and should be turned into a placeholder).
_NBSP = lazy_re.compile(r'&nbsp;(&nbsp;)+')

# Matches nontranslateable chunks of the document
_NONTRANSLATEABLES = lazy_re.compile(r'''
  <\s*script.+?<\s*/\s*script\s*>
  |
  <\s*style.+?<\s*/\s*style\s*>
  |
  <!--.+?-->
  |
  <\?IMPORT\s.+?>           # import tag
  |
  <\s*[a-zA-Z_]+:.+?>       # custom tag (open)
  |
  <\s*/\s*[a-zA-Z_]+:.+?>   # custom tag (close)
  |
  <!\s*[A-Z]+\s*([^>]+|"[^"]+"|'[^']+')*?>
  ''', re.MULTILINE | re.DOTALL | re.VERBOSE | re.IGNORECASE)

# Matches a tag and its attributes
_ELEMENT = lazy_re.compile(r'''
  # Optional closing /, element name
  <\s*(?P<closing>/)?\s*(?P<element>[a-zA-Z0-9]+)\s*
  # Attributes and/or replaceables inside the tag, if any
  (?P<atts>(
    \s*([a-zA-Z_][-:.a-zA-Z_0-9]*) # Attribute name
    (\s*=\s*(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?
    |
    \s*\[(\$?\~)?([A-Z0-9-_]+?)(\~\$?)?\]
  )*)
  \s*(?P<empty>/)?\s*> # Optional empty-tag closing /, and tag close
  ''',
  re.MULTILINE | re.DOTALL | re.VERBOSE)

# Matches elements that may have translateable attributes.  The value of these
# special attributes is given by one of the groups 'value1' through 'value4'.
# Note that this regexp demands that the attribute value be quoted; this is
# necessary because the non-tree-building nature of the parser means we don't
# know when we're writing out attributes, so we wouldn't know to escape spaces.
_SPECIAL_ELEMENT = lazy_re.compile(r'''
  <\s*(
    input[^>]+?value\s*=\s*(\'(?P<value3>[^\']*)\'|"(?P<value4>[^"]*)")
    [^>]+type\s*=\s*"?'?(button|reset|text|submit)'?"?
    |
    (
      table[^>]+?title\s*=
      |
      img[^>]+?alt\s*=
      |
      input[^>]+?type\s*=\s*"?'?(button|reset|text|submit)'?"?[^>]+?value\s*=
    )
    \s*(\'(?P<value1>[^\']*)\'|"(?P<value2>[^"]*)")
  )[^>]*?>
  ''', re.MULTILINE | re.DOTALL | re.VERBOSE | re.IGNORECASE)

# Matches stuff that is translateable if it occurs in the right context
# (between tags).  This includes all characters and character entities.
# Note that this also matches &nbsp; which needs to be handled as whitespace
# before this regexp is applied.
_CHARACTERS = lazy_re.compile(r'''
  (
    \w
    |
    [\!\@\#\$\%\^\*\(\)\-\=\_\+\[\]\{\}\\\|\;\:\'\"\,\.\/\?\`\~]
    |
    &(\#[0-9]+|\#x[0-9a-fA-F]+|[A-Za-z0-9]+);
  )+
  ''', re.MULTILINE | re.DOTALL | re.VERBOSE)

# Matches Total Recall's "replaceable" tags, which are just any text
# in capitals enclosed by delimiters like [] or [~~] or [$~~$] (e.g. [HELLO],
# [~HELLO~] and [$~HELLO~$]).
_REPLACEABLE = lazy_re.compile(r'\[(\$?\~)?(?P<name>[A-Z0-9-_]+?)(\~\$?)?\]',
                               re.MULTILINE)
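
# Illustrative sketch, not part of the original gatherer (the helper name is
# hypothetical).  It applies the same pattern string as _REPLACEABLE through
# the standard re module to list the names of the replaceables in a string,
# covering the three delimiter styles [NAME], [~NAME~] and [$~NAME~$].
def _ExampleReplaceableNames(text):
  import re  # Local import so the sketch runs standalone.
  pattern = re.compile(r'\[(\$?\~)?(?P<name>[A-Z0-9-_]+?)(\~\$?)?\]',
                       re.MULTILINE)
  return [m.group('name') for m in pattern.finditer(text)]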


# Matches the silly [!]-prefixed "header" that is used in some TotalRecall
# templates.
_SILLY_HEADER = lazy_re.compile(r'\[!\]\ntitle\t(?P<title>[^\n]+?)\n.+?\n\n',
                                re.MULTILINE | re.DOTALL)


# Matches a comment that provides a description for the message it occurs in.
_DESCRIPTION_COMMENT = lazy_re.compile(
  r'<!--\s*desc\s*=\s*(?P<description>.+?)\s*-->', re.DOTALL)
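
# Illustrative sketch, not part of the original gatherer (the helper name is
# hypothetical).  It mirrors what HtmlChunks.AddChunk does with a description
# comment: capture the description, then remove the comment from the text.
# The pattern string is the same as _DESCRIPTION_COMMENT, compiled with the
# standard re module so the sketch is standalone.
def _ExampleSplitDescription(text):
  import re  # Local import so the sketch runs standalone.
  pattern = re.compile(r'<!--\s*desc\s*=\s*(?P<description>.+?)\s*-->',
                       re.DOTALL)
  m = pattern.search(text)
  description = m.group('description') if m else ''
  return pattern.sub('', text), description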

# Matches a comment which is used to break apart multiple messages.
_MESSAGE_BREAK_COMMENT = lazy_re.compile(r'<!--\s*message-break\s*-->',
                                         re.DOTALL)

# Matches a comment which is used to prevent block tags from splitting a
# message.
_MESSAGE_NO_BREAK_COMMENT = re.compile(r'<!--\s*message-no-break\s*-->',
                                       re.DOTALL)


_DEBUG = 0
def _DebugPrint(text):
  if _DEBUG:
    print text.encode('utf-8')


class HtmlChunks(object):
  '''A parser that knows how to break an HTML-like document into a list of
  chunks, where each chunk is either translateable or non-translateable.
  The chunks are unmodified sections of the original document, so concatenating
  the text of all chunks would result in the original document.'''

  def InTranslateable(self):
    return self.last_translateable != -1

  def Rest(self):
    return self.text_[self.current:]

  def StartTranslateable(self):
    assert not self.InTranslateable()
    if self.current != 0:
      # Append a nontranslateable chunk
      chunk_text = self.text_[self.chunk_start : self.last_nontranslateable + 1]
      # Needed in the case where document starts with a translateable.
      if len(chunk_text) > 0:
        self.AddChunk(False, chunk_text)
    self.chunk_start = self.last_nontranslateable + 1
    self.last_translateable = self.current
    self.last_nontranslateable = -1

  def EndTranslateable(self):
    assert self.InTranslateable()
    # Append a translateable chunk
    self.AddChunk(True,
                  self.text_[self.chunk_start : self.last_translateable + 1])
    self.chunk_start = self.last_translateable + 1
    self.last_translateable = -1
    self.last_nontranslateable = self.current

  def AdvancePast(self, match):
    self.current += match.end()

  def AddChunk(self, translateable, text):
    '''Adds a chunk to self, removing linebreaks and duplicate whitespace
    if appropriate.
    '''
    m = _DESCRIPTION_COMMENT.search(text)
    if m:
      self.last_description = m.group('description')
      # Remove the description from the output text
      text = _DESCRIPTION_COMMENT.sub('', text)

    m = _MESSAGE_BREAK_COMMENT.search(text)
    if m:
      # Remove the comment from the output text.  It should already effectively
      # break apart messages.
      text = _MESSAGE_BREAK_COMMENT.sub('', text)

    if translateable and not self.last_element_ in _PREFORMATTED_TAGS:
      if self.fold_whitespace_:
        # Fold whitespace sequences if appropriate.  This is optional because it
        # alters the output strings.
        text = _FOLD_WHITESPACE.sub(' ', text)
      else:
        text = text.replace('\n', ' ')
        text = text.replace('\r', ' ')
        # This whitespace folding doesn't work in all cases, thus the
        # fold_whitespace flag to support backwards compatibility.
        text = text.replace('   ', ' ')
        text = text.replace('  ', ' ')

    if translateable:
      description = self.last_description
      self.last_description = ''
    else:
      description = ''

    if text != '':
      self.chunks_.append((translateable, text, description))
28801b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org
2898a303dc2cabedfd395b49f21310ea21c77056fe7joi@chromium.org  def Parse(self, text, fold_whitespace):
29001b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    '''Parses self.text_ into an intermediate format stored in self.chunks_
29101b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    which is translateable and nontranslateable chunks.  Also returns
29201b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    self.chunks_
29301b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org
2948a303dc2cabedfd395b49f21310ea21c77056fe7joi@chromium.org    Args:
2958a303dc2cabedfd395b49f21310ea21c77056fe7joi@chromium.org      text: The HTML for parsing.
2968a303dc2cabedfd395b49f21310ea21c77056fe7joi@chromium.org      fold_whitespace: Whether whitespace sequences should be folded into a
2978a303dc2cabedfd395b49f21310ea21c77056fe7joi@chromium.org        single space.
2988a303dc2cabedfd395b49f21310ea21c77056fe7joi@chromium.org
29901b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    Return:
30001b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org      [chunk1, chunk2, chunk3, ...]  (instances of class Chunk)
30101b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    '''
30201b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    #
30301b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    # Chunker state
30401b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    #
30501b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org
30601b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    self.text_ = text
3078a303dc2cabedfd395b49f21310ea21c77056fe7joi@chromium.org    self.fold_whitespace_ = fold_whitespace
30801b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org
    # A list of tuples (is_translateable, text) which represents the document
    # after chunking.
    self.chunks_ = []

    # Start index of the last chunk, whether translateable or not
    self.chunk_start = 0

    # Index of the last for-sure translateable character if we are parsing
    # a translateable chunk, -1 to indicate we are not in a translateable chunk.
    # This is needed so that we don't include trailing whitespace in the
    # translateable chunk (whitespace is neutral).
    self.last_translateable = -1

    # Index of the last for-sure nontranslateable character if we are parsing
    # a nontranslateable chunk, -1 if we are not in a nontranslateable chunk.
    # This is needed to make sure we can group e.g. "<b>Hello</b> there"
    # together instead of just "Hello</b> there" which would be much worse
    # for translation.
    self.last_nontranslateable = -1

    # Index of the character we're currently looking at.
    self.current = 0

    # The name of the last block element parsed.
    self.last_element_ = ''

    # The last explicit description we found.
    self.last_description = ''

    # Whether no-break was the last chunk seen
    self.last_nobreak = False

    while self.current < len(self.text_):
      _DebugPrint('REST: %s' % self.text_[self.current:self.current+60])

      # First check for a no-break comment, which suppresses the chunk break
      # at the next block tag.
      m = _MESSAGE_NO_BREAK_COMMENT.match(self.Rest())
      if m:
        self.AdvancePast(m)
        self.last_nobreak = True
        continue

      # Try to match whitespace
      m = _WHITESPACE.match(self.Rest())
      if m:
        # Whitespace is neutral, it just advances 'current' and does not switch
        # between translateable/nontranslateable.  If we are in a
        # nontranslateable section that extends to the current point, we extend
        # it to include the whitespace.  If we are in a translateable section,
        # we do not extend it until we find more translateable parts, because
        # we never want a translateable chunk to end with whitespace.
        if (not self.InTranslateable() and
            self.last_nontranslateable == self.current - 1):
          self.last_nontranslateable = self.current + m.end() - 1
        self.AdvancePast(m)
        continue

      # Then we try to match nontranslateables
      m = _NONTRANSLATEABLES.match(self.Rest())
      if m:
        if self.InTranslateable():
          self.EndTranslateable()
        self.last_nontranslateable = self.current + m.end() - 1
        self.AdvancePast(m)
        continue

      # Now match all other HTML element tags (opening, closing, or empty, we
      # don't care).
      m = _ELEMENT.match(self.Rest())
      if m:
        element_name = m.group('element').lower()
        if element_name in _BLOCK_TAGS:
          self.last_element_ = element_name
          if self.InTranslateable():
            if self.last_nobreak:
              self.last_nobreak = False
            else:
              self.EndTranslateable()

          # Check for "special" elements, i.e. ones that have a translateable
          # attribute, and handle them correctly.  Note that all of the
          # "special" elements are block tags, so no need to check for this
          # if the tag is not a block tag.
          sm = _SPECIAL_ELEMENT.match(self.Rest())
          if sm:
            # Get the appropriate group name
            for group in sm.groupdict().keys():
              if sm.groupdict()[group]:
                break

            # First make a nontranslateable chunk up to and including the
            # quote before the translateable attribute value
            self.AddChunk(False, self.text_[
              self.chunk_start : self.current + sm.start(group)])
            # Then a translateable for the translateable bit
            self.AddChunk(True, self.Rest()[sm.start(group) : sm.end(group)])
            # Finally correct the data invariant for the parser
            self.chunk_start = self.current + sm.end(group)

          self.last_nontranslateable = self.current + m.end() - 1
        elif self.InTranslateable():
          # We're in a translateable and the tag is an inline tag, so we
          # need to include it in the translateable.
          self.last_translateable = self.current + m.end() - 1
        self.AdvancePast(m)
        continue

      # Anything else we find must be translateable, so we advance one character
      # at a time until one of the above matches.
      if not self.InTranslateable():
        self.StartTranslateable()
      else:
        self.last_translateable = self.current
      self.current += 1

    # Close the final chunk
    if self.InTranslateable():
      self.AddChunk(True, self.text_[self.chunk_start : ])
    else:
      self.AddChunk(False, self.text_[self.chunk_start : ])

    return self.chunks_


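# Illustrative sketch only, not part of the gatherer: a much-reduced version
# of the chunking idea used above.  It assumes a hypothetical, hard-coded
# handful of block tags instead of _BLOCK_TAGS and ignores replaceables,
# no-break comments and special elements.  It shows two of the rules: block
# tags end a translateable chunk while inline tags stay inside it, and
# whitespace next to a block tag goes to the nontranslateable side so that a
# translateable chunk never starts or ends with whitespace.
import re

# Hypothetical stand-ins for the real patterns; names are made up here.
_SKETCH_BLOCK_TAG = re.compile(
    r'(</?(?:p|div|br|td|tr|table)\b[^>]*>)', re.IGNORECASE)
_SKETCH_TRIM = re.compile(r'(\s*)(.*?)(\s*)\Z', re.DOTALL)


def _SketchChunk(text):
  '''Returns a list of (is_translateable, text) tuples covering all of text.'''
  chunks = []

  def Add(is_translateable, piece):
    if not piece:
      return
    if chunks and not is_translateable and not chunks[-1][0]:
      # Merge adjacent nontranslateable pieces into one chunk.
      chunks[-1] = (False, chunks[-1][1] + piece)
    else:
      chunks.append((is_translateable, piece))

  # Splitting on a capturing group keeps the tags: odd indices are block
  # tags, even indices are the text between them.
  for index, piece in enumerate(_SKETCH_BLOCK_TAG.split(text)):
    if index % 2:
      Add(False, piece)
    else:
      lead, body, trail = _SKETCH_TRIM.match(piece).groups()
      Add(False, lead)
      Add(True, body)
      Add(False, trail)
  return chunks

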
def HtmlToMessage(html, include_block_tags=False, description=''):
  '''Takes a bit of HTML, which must contain only "inline" HTML elements,
  and changes it into a tclib.Message.  This involves unescaping any entities
  and replacing any HTML code with placeholders.

  If include_block_tags is true, no error will be given if block tags (e.g.
  <p> or <br>) are included in the HTML.

  Args:
    html: 'Hello <b>[USERNAME]</b>, how&nbsp;<i>are</i> you?'
    include_block_tags: False
    description: ''

  Return:
    tclib.Message('Hello BEGIN_BOLDX_USERNAME_XEND_BOLD, '
                  'howSPACEBEGIN_ITALICareEND_ITALIC you?',
                  [ Placeholder('BEGIN_BOLD', '<b>', '(HTML code)'),
                    Placeholder('X_USERNAME_X', '[USERNAME]', '(HTML code)'),
                    Placeholder('END_BOLD', '</b>', '(HTML code)'),
                    Placeholder('SPACE', '&nbsp;', '(HTML code)'),
                    Placeholder('BEGIN_ITALIC', '<i>', '(HTML code)'),
                    Placeholder('END_ITALIC', '</i>', '(HTML code)'), ])
  '''
  # Approach is:
  # - first placeholderize, finding <elements>, [REPLACEABLES] and &nbsp;
  # - then unescape all character entities in text in-between placeholders

  parts = []  # List of strings (for text chunks) and tuples
              # (name closure, original) for placeholders

  count_names = {}  # Map of base names to number of times used
  end_names = {}  # Map of base names to stack of end tags (for correct nesting)

  def MakeNameClosure(base, type = ''):
    '''Returns a closure that can be called once all names have been allocated
    to return the final name of the placeholder.  This allows us to minimally
    number placeholders for non-overlap.

    Also ensures that END_XXX_Y placeholders have the same Y as the
    corresponding BEGIN_XXX_Y placeholder when we have nested tags of the same
    type.

    Args:
      base: 'phname'
      type: '' | 'begin' | 'end'

    Return:
      Closure()
    '''
    name = base.upper()
    if type != '':
      name = ('%s_%s' % (type, base)).upper()

    if name in count_names:
      count_names[name] += 1
    else:
      count_names[name] = 1

    def MakeFinalName(name_ = name, index = count_names[name] - 1):
      if (type.lower() == 'end' and
          base in end_names and len(end_names[base])):
        return end_names[base].pop(-1)  # For correct nesting
      if count_names[name_] != 1:
        name_ = '%s_%s' % (name_, _SUFFIXES[index])
        # We need to use a stack to ensure that the end-tag suffixes match
        # the begin-tag suffixes.  Only needed when more than one tag of the
        # same type.
        if type == 'begin':
          end_name = ('END_%s_%s' % (base, _SUFFIXES[index])).upper()
          if base in end_names:
            end_names[base].append(end_name)
          else:
            end_names[base] = [end_name]

      return name_

    return MakeFinalName

  current = 0
  last_nobreak = False

  while current < len(html):
    m = _MESSAGE_NO_BREAK_COMMENT.match(html[current:])
    if m:
      last_nobreak = True
      current += m.end()
      continue

    m = _NBSP.match(html[current:])
    if m:
      parts.append((MakeNameClosure('SPACE'), m.group()))
      current += m.end()
      continue

    m = _REPLACEABLE.match(html[current:])
    if m:
      # Replaceables allow - but placeholders don't, so replace - with _
      ph_name = MakeNameClosure('X_%s_X' % m.group('name').replace('-', '_'))
      parts.append((ph_name, m.group()))
      current += m.end()
      continue

    m = _SPECIAL_ELEMENT.match(html[current:])
    if m:
      if not include_block_tags:
        if last_nobreak:
          last_nobreak = False
        else:
          raise exception.BlockTagInTranslateableChunk(html)
      element_name = 'block'  # for simplification
      # Get the appropriate group name
      for group in m.groupdict().keys():
        if m.groupdict()[group]:
          break
      parts.append((MakeNameClosure(element_name, 'begin'),
                    html[current : current + m.start(group)]))
      parts.append(m.group(group))
      parts.append((MakeNameClosure(element_name, 'end'),
                    html[current + m.end(group) : current + m.end()]))
      current += m.end()
      continue

    m = _ELEMENT.match(html[current:])
    if m:
      element_name = m.group('element').lower()
      if not include_block_tags and element_name not in _INLINE_TAGS:
        if last_nobreak:
          last_nobreak = False
        else:
          raise exception.BlockTagInTranslateableChunk(html[current:])
      if element_name in _HTML_PLACEHOLDER_NAMES:  # use meaningful names
        element_name = _HTML_PLACEHOLDER_NAMES[element_name]

      # Make a name for the placeholder
      type = ''
      if not m.group('empty'):
        if m.group('closing'):
          type = 'end'
        else:
          type = 'begin'
      parts.append((MakeNameClosure(element_name, type), m.group()))
      current += m.end()
      continue

    if len(parts) and isinstance(parts[-1], types.StringTypes):
      parts[-1] += html[current]
    else:
      parts.append(html[current])
    current += 1

  msg_text = ''
  placeholders = []
  for part in parts:
    if isinstance(part, types.TupleType):
      final_name = part[0]()
      original = part[1]
      msg_text += final_name
      placeholders.append(tclib.Placeholder(final_name, original, '(HTML code)'))
    else:
      msg_text += part

  msg = tclib.Message(text=msg_text, placeholders=placeholders,
                      description=description)
  content = msg.GetContent()
  for ix in range(len(content)):
    if isinstance(content[ix], types.StringTypes):
      content[ix] = util.UnescapeHtml(content[ix], replace_nbsp=False)

  return msg


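# Illustrative sketch only, not part of the gatherer: the deferred-naming
# trick behind MakeNameClosure in HtmlToMessage, reduced to its core.  Each
# name is handed out as a closure and only resolved after every name has been
# counted, so a suffix is added only where a base name repeats.  The numeric
# '_1', '_2' suffixes are an assumption made for this sketch (the real code
# indexes into _SUFFIXES), and the begin/end stack used for nested tags is
# left out.
def _SketchNumberNames(bases):
  '''Returns final names for bases, suffixed only where a base repeats.'''
  counts = {}

  def MakeClosure(base):
    counts[base] = counts.get(base, 0) + 1
    index = counts[base]  # 1-based position among uses of this base

    def Final():
      # By the time this runs, counts[base] holds the total number of uses.
      if counts[base] == 1:
        return base
      return '%s_%d' % (base, index)

    return Final

  # Allocate every closure first, then resolve them all.
  closures = [MakeClosure(base) for base in bases]
  return [closure() for closure in closures]

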
class TrHtml(interface.GathererBase):
  '''Represents a document or message in the template format used by
  Total Recall for HTML documents.'''

  def __init__(self, *args, **kwargs):
    super(TrHtml, self).__init__(*args, **kwargs)
    self.have_parsed_ = False
    self.skeleton_ = []  # list of strings and MessageClique objects
    self.fold_whitespace_ = False

  def SetAttributes(self, attrs):
    '''Sets node attributes used by the gatherer.

    This checks the fold_whitespace attribute.

    Args:
      attrs: The mapping of node attributes.
    '''
    self.fold_whitespace_ = ('fold_whitespace' in attrs and
                             attrs['fold_whitespace'] == 'true')

  def GetText(self):
    '''Returns the original text of the HTML document'''
    return self.text_

  def GetTextualIds(self):
    return [self.extkey]

  def GetCliques(self):
    '''Returns the message cliques for each translateable message in the
    document.'''
    return [x for x in self.skeleton_ if isinstance(x, clique.MessageClique)]

  def Translate(self, lang, pseudo_if_not_available=True,
                skeleton_gatherer=None, fallback_to_english=False):
    '''Returns this document with translateable messages filled with
    the translation for language 'lang'.

    Args:
      lang: 'en'
      pseudo_if_not_available: True
      skeleton_gatherer: None (not yet supported, see the TODO below)
      fallback_to_english: False

    Return:
      The text of the document with each translateable message replaced by
      its translation, e.g. '<p>Translated message</p>'

    Raises:
      grit.exception.NotReady() if used before Parse() has been successfully
      called.
      grit.exception.NoSuchTranslation() if 'pseudo_if_not_available' is false
      and there is no translation for the requested language.
    '''
    if len(self.skeleton_) == 0:
      raise exception.NotReady()

    # TODO(joi) Implement support for skeleton gatherers here.

    out = []
    for item in self.skeleton_:
      if isinstance(item, types.StringTypes):
        out.append(item)
      else:
        msg = item.MessageForLanguage(lang,
                                      pseudo_if_not_available,
                                      fallback_to_english)
        for content in msg.GetContent():
          if isinstance(content, tclib.Placeholder):
            out.append(content.GetOriginal())
          else:
            # We escape " characters to increase the chance that attributes
            # will be properly escaped.
            out.append(util.EscapeHtml(content, True))

    return ''.join(out)

67601b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org  def Parse(self):
67701b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    if self.have_parsed_:
67801b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org      return
67901b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    self.have_parsed_ = True
68001b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org
681b9161407f737461b5db16a29782f8a31d19e602dbenrg@chromium.org    text = self._LoadInputFile()
682ec8016c73b3b945b6284746230913d88653f35e7benrg@chromium.org
683ec8016c73b3b945b6284746230913d88653f35e7benrg@chromium.org    # Ignore the BOM character if the document starts with one.
684ec8016c73b3b945b6284746230913d88653f35e7benrg@chromium.org    if text.startswith(u'\ufeff'):
685ec8016c73b3b945b6284746230913d88653f35e7benrg@chromium.org      text = text[1:]
686ec8016c73b3b945b6284746230913d88653f35e7benrg@chromium.org
687ec8016c73b3b945b6284746230913d88653f35e7benrg@chromium.org    self.text_ = text
688ec8016c73b3b945b6284746230913d88653f35e7benrg@chromium.org
689ec8016c73b3b945b6284746230913d88653f35e7benrg@chromium.org    # Parsing is done in two phases:  First, we break the document into
690ec8016c73b3b945b6284746230913d88653f35e7benrg@chromium.org    # translateable and nontranslateable chunks.  Second, we run through each
691ec8016c73b3b945b6284746230913d88653f35e7benrg@chromium.org    # translateable chunk and insert placeholders for any HTML elements,
692ec8016c73b3b945b6284746230913d88653f35e7benrg@chromium.org    # unescape escaped characters, etc.
69301b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org
69401b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    # First handle the silly little [!]-prefixed header because it's not
69501b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    # handled by our HTML parsers.
69601b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    m = _SILLY_HEADER.match(text)
69701b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    if m:
69801b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org      self.skeleton_.append(text[:m.start('title')])
69901b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org      self.skeleton_.append(self.uberclique.MakeClique(
70001b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org        tclib.Message(text=text[m.start('title'):m.end('title')])))
70101b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org      self.skeleton_.append(text[m.end('title') : m.end()])
70201b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org      text = text[m.end():]
70301b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org
7048a303dc2cabedfd395b49f21310ea21c77056fe7joi@chromium.org    chunks = HtmlChunks().Parse(text, self.fold_whitespace_)
70501b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org
70601b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    for chunk in chunks:
70701b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org      if chunk[0]:  # Chunk is translateable
70801b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org        self.skeleton_.append(self.uberclique.MakeClique(
70901b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org          HtmlToMessage(chunk[1], description=chunk[2])))
71001b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org      else:
71101b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org        self.skeleton_.append(chunk[1])
71201b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org
71301b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    # Go through the skeleton and change any messages that consist solely of
71401b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    # placeholders and whitespace into nontranslateable strings.
71501b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org    for ix in range(len(self.skeleton_)):
71601b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org      got_text = False
71701b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org      if isinstance(self.skeleton_[ix], clique.MessageClique):
71801b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org        msg = self.skeleton_[ix].GetMessage()
71901b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org        for item in msg.GetContent():
72001b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org          if (isinstance(item, types.StringTypes) and _NON_WHITESPACE.search(item)
72101b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org              and item != '&nbsp;'):
72201b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org            got_text = True
72301b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org            break
72401b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org        if not got_text:
72501b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org          self.skeleton_[ix] = msg.GetRealContent()
72601b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org
72777cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org  def SubstituteMessages(self, substituter):
72877cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org    '''Applies substitutions to all messages in the tree.
72977cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org
73077cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org    Goes through the skeleton and finds all MessageCliques.
73177cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org
73277cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org    Args:
73377cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org      substituter: a grit.util.Substituter object.
73477cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org    '''
73577cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org    new_skel = []
73677cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org    for chunk in self.skeleton_:
73777cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org      if isinstance(chunk, clique.MessageClique):
73877cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org        old_message = chunk.GetMessage()
73977cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org        new_message = substituter.SubstituteMessage(old_message)
74077cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org        if new_message is not old_message:
74177cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org          new_skel.append(self.uberclique.MakeClique(new_message))
74277cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org          continue
74377cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org      new_skel.append(chunk)
74477cbaa8b1f1af05d8ba2c2a951c74e7909318830joi@chromium.org    self.skeleton_ = new_skel
74501b3bc768461bd303bff39f8cd1663682254e407joi@chromium.org
746