RuleBasedTransliterator.java revision 7935b1839a081ed19ae0d33029ad3c09632a2caa
/*
 *******************************************************************************
 * Copyright (C) 1996-2014, International Business Machines Corporation and   *
 * others. All Rights Reserved.                                               *
 *******************************************************************************
 */
package com.ibm.icu.text;

import java.util.HashMap;
import java.util.Map;

/**
 * <p><code>RuleBasedTransliterator</code> is a transliterator
 * that reads a set of rules in order to determine how to perform
 * translations. Rule sets are stored in resource bundles indexed by
 * name. Rules within a rule set are separated by semicolons (';').
 * To include a literal semicolon, prefix it with a backslash ('\').
 * Unicode Pattern_White_Space is ignored.
 * If the first non-blank character on a line is '#',
 * the entire line is ignored as a comment.
 * </p>
 *
 * <p>Each set of rules consists of two groups, one forward, and one
 * reverse. This is a convention that is not enforced; rules for one
 * direction may be omitted, with the result that translations in
 * that direction will not modify the source text. In addition,
 * bidirectional forward-reverse rules may be specified for
 * symmetrical transformations.</p>
 *
 * <p><b>Rule syntax</b> </p>
 *
 * <p>Rule statements take one of the following forms: </p>
 *
 * <dl>
 *     <dt><code>$alefmadda=\u0622;</code></dt>
 *     <dd><strong>Variable definition.</strong> The name on the
 *         left is assigned the text on the right. In this example,
 *         after this statement, instances of the left hand name,
 *         "<code>$alefmadda</code>", will be replaced by
 *         the Unicode character U+0622.
 *         Variable names must begin
 *         with a letter and consist only of letters, digits, and
 *         underscores. Case is significant. Duplicate names cause
 *         an exception to be thrown, that is, variables cannot be
 *         redefined. The right hand side may contain well-formed
 *         text of any length, including no text at all ("<code>$empty=;</code>").
 *         The right hand side may contain embedded <code>UnicodeSet</code>
 *         patterns, for example, "<code>$softvowel=[eiyEIY]</code>".</dd>
 *     <dd>&nbsp;</dd>
 *     <dt><code>ai&gt;$alefmadda;</code></dt>
 *     <dd><strong>Forward translation rule.</strong> This rule
 *         states that the string on the left will be changed to the
 *         string on the right when performing forward
 *         transliteration.</dd>
 *     <dt>&nbsp;</dt>
 *     <dt><code>ai&lt;$alefmadda;</code></dt>
 *     <dd><strong>Reverse translation rule.</strong> This rule
 *         states that the string on the right will be changed to
 *         the string on the left when performing reverse
 *         transliteration.</dd>
 * </dl>
 *
 * <dl>
 *     <dt><code>ai&lt;&gt;$alefmadda;</code></dt>
 *     <dd><strong>Bidirectional translation rule.</strong> This
 *         rule states that the string on the left will be changed
 *         to the string on the right when performing forward
 *         transliteration, and vice versa when performing reverse
 *         transliteration.</dd>
 * </dl>
 *
 * <p>Translation rules consist of a <em>match pattern</em> and an <em>output
 * string</em>. The match pattern consists of literal characters,
 * optionally preceded by context, and optionally followed by
 * context. Context characters, like literal pattern characters,
 * must be matched in the text being transliterated. However, unlike
 * literal pattern characters, they are not replaced by the output
 * text.
 * For example, the pattern "<code>abc{def}</code>"
 * indicates the characters "<code>def</code>" must be
 * preceded by "<code>abc</code>" for a successful match.
 * If there is a successful match, "<code>def</code>" will
 * be replaced, but not "<code>abc</code>". The final '<code>}</code>'
 * is optional, so "<code>abc{def</code>" is equivalent to
 * "<code>abc{def}</code>". Another example is "<code>{123}456</code>"
 * (or "<code>123}456</code>") in which the literal
 * pattern "<code>123</code>" must be followed by "<code>456</code>".
 * </p>
 *
 * <p>The output string of a forward or reverse rule consists of
 * characters to replace the literal pattern characters. If the
 * output string contains the character '<code>|</code>', this is
 * taken to indicate the location of the <em>cursor</em> after
 * replacement. The cursor is the point in the text at which the
 * next replacement, if any, will be applied.
 * The cursor is usually
 * placed within the replacement text; however, it can actually be
 * placed into the preceding or following context by using the
 * special character '<code>@</code>'. Examples:</p>
 *
 * <blockquote>
 *   <p><code>a {foo} z &gt; | @ bar; # foo -&gt; bar, move cursor
 *   before a<br>
 *   {foo} xyz &gt; bar @@|; # foo -&gt; bar, cursor between
 *   y and z</code></p>
 * </blockquote>
 *
 * <p><b>UnicodeSet</b></p>
 *
 * <p><code>UnicodeSet</code> patterns may appear anywhere that
 * makes sense. They may appear in variable definitions.
 * Contrariwise, <code>UnicodeSet</code> patterns may themselves
 * contain variable references, such as "<code>$a=[a-z];$not_a=[^$a]</code>",
 * or "<code>$range=a-z;$ll=[$range]</code>".</p>
 *
 * <p><code>UnicodeSet</code> patterns may also be embedded directly
 * into rule strings. Thus, the following two rules are equivalent:</p>
 *
 * <blockquote>
 *   <p><code>$vowel=[aeiou]; $vowel&gt;'*'; # One way to do this<br>
 *   [aeiou]&gt;'*'; # Another way</code></p>
 * </blockquote>
 *
 * <p>See {@link UnicodeSet} for more documentation and examples.</p>
 *
 * <p><b>Segments</b></p>
 *
 * <p>Segments of the input string can be matched and copied to the
 * output string. This makes certain sets of rules simpler and more
 * general, and makes reordering possible.
 * For example:</p>
 *
 * <blockquote>
 *   <p><code>([a-z]) &gt; $1 $1; # double lowercase letters<br>
 *   ([:Lu:]) ([:Ll:]) &gt; $2 $1; # reverse order of Lu-Ll pairs</code></p>
 * </blockquote>
 *
 * <p>The segment of the input string to be copied is delimited by
 * "<code>(</code>" and "<code>)</code>". Up to
 * nine segments may be defined. Segments may not overlap. In the
 * output string, "<code>$1</code>" through "<code>$9</code>"
 * represent the input string segments, in left-to-right order of
 * definition.</p>
 *
 * <p><b>Anchors</b></p>
 *
 * <p>Patterns can be anchored to the beginning or the end of the text. This is done with the
 * special characters '<code>^</code>' and '<code>$</code>'.
 * For example:</p>
 *
 * <blockquote>
 *   <p><code>^ a &gt; 'BEG_A'; # match 'a' at start of text<br>
 *   a &gt; 'A'; # match other instances of 'a'<br>
 *   z $ &gt; 'END_Z'; # match 'z' at end of text<br>
 *   z &gt; 'Z'; # match other instances of 'z'</code></p>
 * </blockquote>
 *
 * <p>It is also possible to match the beginning or the end of the text using a <code>UnicodeSet</code>.
 * This is done by including a virtual anchor character '<code>$</code>' at the end of the
 * set pattern. Although this is usually the match character for the end anchor, the set will
 * match either the beginning or the end of the text, depending on its placement.
 * For example:</p>
 *
 * <blockquote>
 *   <p><code>$x = [a-z$]; # match 'a' through 'z' OR anchor<br>
 *   $x 1 &gt; 2; # match '1' after a-z or at the start<br>
 *   3 $x &gt; 4; # match '3' before a-z or at the end</code></p>
 * </blockquote>
 *
 * <p><b>Example</b> </p>
 *
 * <p>The following example rules illustrate many of the features of
 * the rule language.
 * </p>
 *
 * <table border="0" cellpadding="4">
 *     <tr>
 *         <td valign="top">Rule 1.</td>
 *         <td valign="top" nowrap><code>abc{def}&gt;x|y</code></td>
 *     </tr>
 *     <tr>
 *         <td valign="top">Rule 2.</td>
 *         <td valign="top" nowrap><code>xyz&gt;r</code></td>
 *     </tr>
 *     <tr>
 *         <td valign="top">Rule 3.</td>
 *         <td valign="top" nowrap><code>yz&gt;q</code></td>
 *     </tr>
 * </table>
 *
 * <p>Applying these rules to the string "<code>adefabcdefz</code>"
 * yields the following results: </p>
 *
 * <table border="0" cellpadding="4">
 *     <tr>
 *         <td valign="top" nowrap><code>|adefabcdefz</code></td>
 *         <td valign="top">Initial state, no rules match.
 *         Advance cursor.</td>
 *     </tr>
 *     <tr>
 *         <td valign="top" nowrap><code>a|defabcdefz</code></td>
 *         <td valign="top">Still no match. Rule 1 does not match
 *         because the preceding context is not present.</td>
 *     </tr>
 *     <tr>
 *         <td valign="top" nowrap><code>ad|efabcdefz</code></td>
 *         <td valign="top">Still no match. Keep advancing until
 *         there is a match...</td>
 *     </tr>
 *     <tr>
 *         <td valign="top" nowrap><code>ade|fabcdefz</code></td>
 *         <td valign="top">...</td>
 *     </tr>
 *     <tr>
 *         <td valign="top" nowrap><code>adef|abcdefz</code></td>
 *         <td valign="top">...</td>
 *     </tr>
 *     <tr>
 *         <td valign="top" nowrap><code>adefa|bcdefz</code></td>
 *         <td valign="top">...</td>
 *     </tr>
 *     <tr>
 *         <td valign="top" nowrap><code>adefab|cdefz</code></td>
 *         <td valign="top">...</td>
 *     </tr>
 *     <tr>
 *         <td valign="top" nowrap><code>adefabc|defz</code></td>
 *         <td valign="top">Rule 1 matches; replace "<code>def</code>"
 *         with "<code>xy</code>" and back up the cursor
 *         to before the '<code>y</code>'.</td>
 *     </tr>
 *     <tr>
 *         <td valign="top" nowrap><code>adefabcx|yz</code></td>
 *         <td valign="top">Although "<code>xyz</code>" is
 *         present, rule 2 does not match because the cursor is
 *         before the '<code>y</code>', not before the '<code>x</code>'.
 *         Rule 3 does match.
 *         Replace "<code>yz</code>"
 *         with "<code>q</code>".</td>
 *     </tr>
 *     <tr>
 *         <td valign="top" nowrap><code>adefabcxq|</code></td>
 *         <td valign="top">The cursor is at the end;
 *         transliteration is complete.</td>
 *     </tr>
 * </table>
 *
 * <p>The order of rules is significant. If multiple rules may match
 * at some point, the first matching rule is applied. </p>
 *
 * <p>Forward and reverse rules may have an empty output string.
 * Otherwise, an empty left or right hand side of any statement is a
 * syntax error. </p>
 *
 * <p>Single quotes are used to quote any character other than a
 * digit or letter. To specify a single quote itself, inside or
 * outside of quotes, use two single quotes in a row.
 * For example,
 * the rule "<code>'&gt;'&gt;o''clock</code>" changes the
 * string "<code>&gt;</code>" to the string "<code>o'clock</code>".
 * </p>
 *
 * <p><b>Notes</b> </p>
 *
 * <p>While a RuleBasedTransliterator is being built, it checks that
 * the rules are added in proper order. For example, if the rule
 * "a&gt;x" is followed by the rule "ab&gt;y",
 * then adding the second rule causes an exception to be thrown. The
 * reason is that the second rule can never be triggered, since the
 * first rule always matches anything the second rule matches. In other
 * words, the first rule <em>masks</em> the second rule. </p>
 *
 * <p>Copyright (c) IBM Corporation 1999-2000. All rights reserved.</p>
 *
 * @author Alan Liu
 * @internal
 * @deprecated This API is ICU internal only.
 */
@Deprecated
public class RuleBasedTransliterator extends Transliterator {

    private Data data;

//    /**
//     * Constructs a new transliterator from the given rules.
//     * @param rules rules, separated by ';'
//     * @param direction either FORWARD or REVERSE.
//     * @exception IllegalArgumentException if rules are malformed
//     * or direction is invalid.
//     */
//    public RuleBasedTransliterator(String ID, String rules, int direction,
//                                   UnicodeFilter filter) {
//        super(ID, filter);
//        if (direction != FORWARD && direction != REVERSE) {
//            throw new IllegalArgumentException("Invalid direction");
//        }
//
//        TransliteratorParser parser = new TransliteratorParser();
//        parser.parse(rules, direction);
//        if (parser.idBlockVector.size() != 0 ||
//            parser.compoundFilter != null) {
//            throw new IllegalArgumentException("::ID blocks illegal in RuleBasedTransliterator constructor");
//        }
//
//        data = (Data)parser.dataVector.get(0);
//        setMaximumContextLength(data.ruleSet.getMaximumContextLength());
//    }

//    /**
//     * Constructs a new transliterator from the given rules in the
//     * <code>FORWARD</code> direction.
//     * @param rules rules, separated by ';'
//     * @exception IllegalArgumentException if rules are malformed
//     * or direction is invalid.
//     */
//    public RuleBasedTransliterator(String ID, String rules) {
//        this(ID, rules, FORWARD, null);
//    }

    RuleBasedTransliterator(String ID, Data data, UnicodeFilter filter) {
        super(ID, filter);
        this.data = data;
        setMaximumContextLength(data.ruleSet.getMaximumContextLength());
    }

    /**
     * Implements {@link Transliterator#handleTransliterate}.
     * @internal
     * @deprecated This API is ICU internal only.
     */
    @Deprecated
    protected void handleTransliterate(Replaceable text,
                                       Position index, boolean incremental) {
        /* We keep start and limit fixed the entire time,
         * relative to the text -- limit may move numerically if text is
         * inserted or removed. The cursor moves from start to limit, with
         * replacements happening under it.
         *
         * Example: rules 1. ab>x|y
         *                2. yc>z
         *
         * |eabcd   start - no match, advance cursor
         * e|abcd   match rule 1 - change text & adjust cursor
         * ex|ycd   match rule 2 - change text & adjust cursor
         * exz|d    no match, advance cursor
         * exzd|    done
         */

        /* A rule like
         *   a>b|a
         * creates an infinite loop.
To prevent that, we put an arbitrary 3497935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert * limit on the number of iterations that we take, one that is 3507935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert * high enough that any reasonable rules are ok, but low enough to 3517935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert * prevent a server from hanging. The limit is 16 times the 3527935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert * number of characters n, unless n is so large that 16n exceeds a 3537935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert * uint32_t. 3547935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert */ 3557935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert synchronized(data) { 3567935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert int loopCount = 0; 3577935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert int loopLimit = (index.limit - index.start) << 4; 3587935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert if (loopLimit < 0) { 3597935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert loopLimit = 0x7FFFFFFF; 3607935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert } 3617935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert 3627935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert while (index.start < index.limit && 3637935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert loopCount <= loopLimit && 3647935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert data.ruleSet.transliterate(text, index, incremental)) { 3657935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert ++loopCount; 3667935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert } 3677935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert } 3687935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert } 3697935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert 3707935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert 3717935b1839a081ed19ae0d33029ad3c09632a2caaFredrik Roubert static class Data { 
        public Data() {
            variableNames = new HashMap<String, char[]>();
            ruleSet = new TransliterationRuleSet();
        }

        /**
         * Rule table.  May be empty.
         */
        public TransliterationRuleSet ruleSet;

        /**
         * Map variable name (String) to variable (char[]).  A variable name
         * corresponds to zero or more characters, stored in a char[] array in
         * this hash.  One or more of these chars may also correspond to a
         * UnicodeSet, in which case the character in the char[] in this hash is
         * a stand-in: it is an index for a secondary lookup in
         * data.variables.  The stand-in also represents the UnicodeSet in
         * the stored rules.
         */
        Map<String, char[]> variableNames;

        /**
         * Map category variable (Character) to UnicodeMatcher or UnicodeReplacer.
         * Variables that correspond to a set of characters are mapped
         * from variable name to a stand-in character in data.variableNames.
         * The stand-in then serves as a key in this hash to look up the
         * actual UnicodeSet object.  In addition, the stand-in is
         * stored in the rule text to represent the set of characters.
         * variables[i] represents character (variablesBase + i).
         */
        Object[] variables;

        /**
         * The character that represents variables[0].  Characters
         * variablesBase through variablesBase +
         * variables.length - 1 represent UnicodeSet objects.
         */
        char variablesBase;

        /**
         * Return the UnicodeMatcher represented by the given character, or
         * null if none.
         */
        public UnicodeMatcher lookupMatcher(int standIn) {
            int i = standIn - variablesBase;
            return (i >= 0 && i < variables.length)
                ? (UnicodeMatcher) variables[i] : null;
        }

        /**
         * Return the UnicodeReplacer represented by the given character, or
         * null if none.
         */
        public UnicodeReplacer lookupReplacer(int standIn) {
            int i = standIn - variablesBase;
            return (i >= 0 && i < variables.length)
                ? (UnicodeReplacer) variables[i] : null;
        }
    }


    /**
     * Return a representation of this transliterator as source rules.
     * These rules will produce an equivalent transliterator if used
     * to construct a new transliterator.
     * @param escapeUnprintable if true, convert unprintable
     * characters to their hex escape representations, \\uxxxx or
     * \\Uxxxxxxxx.  Unprintable characters are those other than
     * U+000A, U+0020..U+007E.
     * @return rules string
     * @internal
     * @deprecated This API is ICU internal only.
     */
    @Deprecated
    public String toRules(boolean escapeUnprintable) {
        return data.ruleSet.toRules(escapeUnprintable);
    }

//    /**
//     * Return the set of all characters that may be modified by this
//     * Transliterator, ignoring the effect of our filter.
//     */
//    protected UnicodeSet handleGetSourceSet() {
//        return data.ruleSet.getSourceTargetSet(false, unicodeFilter);
//    }
//
//    /**
//     * Returns the set of all characters that may be generated as
//     * replacement text by this transliterator.
//     */
//    public UnicodeSet getTargetSet() {
//        return data.ruleSet.getSourceTargetSet(true, unicodeFilter);
//    }

    /**
     * @internal
     * @deprecated This API is ICU internal only.
     */
    @Deprecated
    @Override
    public void addSourceTargetSet(UnicodeSet filter, UnicodeSet sourceSet, UnicodeSet targetSet) {
        data.ruleSet.addSourceTargetSet(filter, sourceSet, targetSet);
    }

    /**
     * Temporary hack for registry problem. Needs to be replaced by better architecture.
     * @internal
     * @deprecated This API is ICU internal only.
     */
    @Deprecated
    public Transliterator safeClone() {
        UnicodeFilter filter = getFilter();
        if (filter != null && filter instanceof UnicodeSet) {
            filter = new UnicodeSet((UnicodeSet)filter);
        }
        return new RuleBasedTransliterator(getID(), data, filter);
    }
}