/* GENERATED SOURCE. DO NOT MODIFY. */
// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html#License
/*
 *******************************************************************************
 * Copyright (C) 1996-2016, International Business Machines Corporation and    *
 * others. All Rights Reserved.                                                *
 *******************************************************************************
 */
package android.icu.text;

import java.util.HashMap;
import java.util.Map;

/**
 * <code>RuleBasedTransliterator</code> is a transliterator
 * that reads a set of rules in order to determine how to perform
 * translations. Rule sets are stored in resource bundles indexed by
 * name. Rules within a rule set are separated by semicolons (';').
 * To include a literal semicolon, prefix it with a backslash ('\').
 * Unicode Pattern_White_Space is ignored.
 * If the first non-blank character on a line is '#',
 * the entire line is ignored as a comment.
 *
 * <p>Each set of rules consists of two groups, one forward, and one
 * reverse. This is a convention that is not enforced; rules for one
 * direction may be omitted, with the result that translations in
 * that direction will not modify the source text. In addition,
 * bidirectional forward-reverse rules may be specified for
 * symmetrical transformations.
 *
 * <p><b>Rule syntax</b>
 *
 * <p>Rule statements take one of the following forms:
 *
 * <dl>
 *     <dt><code>$alefmadda=\u0622;</code></dt>
 *     <dd><strong>Variable definition.</strong> The name on the
 *         left is assigned the text on the right. In this example,
 *         after this statement, instances of the left hand name,
 *         "<code>$alefmadda</code>", will be replaced by
 *         the Unicode character U+0622. Variable names must begin
 *         with a letter and consist only of letters, digits, and
 *         underscores. Case is significant.
 *         Duplicate names cause
 *         an exception to be thrown, that is, variables cannot be
 *         redefined. The right hand side may contain well-formed
 *         text of any length, including no text at all ("<code>$empty=;</code>").
 *         The right hand side may contain embedded <code>UnicodeSet</code>
 *         patterns, for example, "<code>$softvowel=[eiyEIY]</code>".</dd>
 *     <dd>&nbsp;</dd>
 *     <dt><code>ai&gt;$alefmadda;</code></dt>
 *     <dd><strong>Forward translation rule.</strong> This rule
 *         states that the string on the left will be changed to the
 *         string on the right when performing forward
 *         transliteration.</dd>
 *     <dt>&nbsp;</dt>
 *     <dt><code>ai&lt;$alefmadda;</code></dt>
 *     <dd><strong>Reverse translation rule.</strong> This rule
 *         states that the string on the right will be changed to
 *         the string on the left when performing reverse
 *         transliteration.</dd>
 * </dl>
 *
 * <dl>
 *     <dt><code>ai&lt;&gt;$alefmadda;</code></dt>
 *     <dd><strong>Bidirectional translation rule.</strong> This
 *         rule states that the string on the left will be changed
 *         to the string on the right when performing forward
 *         transliteration, and vice versa when performing reverse
 *         transliteration.</dd>
 * </dl>
 *
 * <p>Translation rules consist of a <em>match pattern</em> and an <em>output
 * string</em>. The match pattern consists of literal characters,
 * optionally preceded by context, and optionally followed by
 * context. Context characters, like literal pattern characters,
 * must be matched in the text being transliterated. However, unlike
 * literal pattern characters, they are not replaced by the output
 * text. For example, the pattern "<code>abc{def}</code>"
 * indicates the characters "<code>def</code>" must be
 * preceded by "<code>abc</code>" for a successful match.
 * If there is a successful match, "<code>def</code>" will
 * be replaced, but not "<code>abc</code>".
 * The final '<code>}</code>'
 * is optional, so "<code>abc{def</code>" is equivalent to
 * "<code>abc{def}</code>". Another example is "<code>{123}456</code>"
 * (or "<code>123}456</code>") in which the literal
 * pattern "<code>123</code>" must be followed by "<code>456</code>".
 *
 * <p>The output string of a forward or reverse rule consists of
 * characters to replace the literal pattern characters. If the
 * output string contains the character '<code>|</code>', this is
 * taken to indicate the location of the <em>cursor</em> after
 * replacement. The cursor is the point in the text at which the
 * next replacement, if any, will be applied. The cursor is usually
 * placed within the replacement text; however, it can actually be
 * placed into the preceding or following context by using the
 * special character '<code>@</code>'. Examples:
 *
 * <blockquote>
 *     <p><code>a {foo} z &gt; | @ bar; # foo -&gt; bar, move cursor
 *     before a<br>
 *     {foo} xyz &gt; bar @@|; # foo -&gt; bar, cursor between
 *     y and z</code>
 * </blockquote>
 *
 * <p><b>UnicodeSet</b>
 *
 * <p><code>UnicodeSet</code> patterns may appear anywhere that
 * makes sense. They may appear in variable definitions.
 * Contrariwise, <code>UnicodeSet</code> patterns may themselves
 * contain variable references, such as "<code>$a=[a-z];$not_a=[^$a]</code>",
 * or "<code>$range=a-z;$ll=[$range]</code>".
 *
 * <p><code>UnicodeSet</code> patterns may also be embedded directly
 * into rule strings. Thus, the following two rules are equivalent:
 *
 * <blockquote>
 *     <p><code>$vowel=[aeiou]; $vowel&gt;'*'; # One way to do this<br>
 *     [aeiou]&gt;'*'; # Another way</code>
 * </blockquote>
 *
 * <p>See {@link UnicodeSet} for more documentation and examples.
 *
 * <p><b>Segments</b>
 *
 * <p>Segments of the input string can be matched and copied to the
 * output string.
 * This makes certain sets of rules simpler and more
 * general, and makes reordering possible. For example:
 *
 * <blockquote>
 *     <p><code>([a-z]) &gt; $1 $1; # double lowercase letters<br>
 *     ([:Lu:]) ([:Ll:]) &gt; $2 $1; # reverse order of Lu-Ll pairs</code>
 * </blockquote>
 *
 * <p>The segment of the input string to be copied is delimited by
 * "<code>(</code>" and "<code>)</code>". Up to
 * nine segments may be defined. Segments may not overlap. In the
 * output string, "<code>$1</code>" through "<code>$9</code>"
 * represent the input string segments, in left-to-right order of
 * definition.
 *
 * <p><b>Anchors</b>
 *
 * <p>Patterns can be anchored to the beginning or the end of the text. This is done with the
 * special characters '<code>^</code>' and '<code>$</code>'. For example:
 *
 * <blockquote>
 *   <p><code>^ a &gt; 'BEG_A'; # match 'a' at start of text<br>
 *   a &gt; 'A'; # match other instances of 'a'<br>
 *   z $ &gt; 'END_Z'; # match 'z' at end of text<br>
 *   z &gt; 'Z'; # match other instances of 'z'</code>
 * </blockquote>
 *
 * <p>It is also possible to match the beginning or the end of the text using a <code>UnicodeSet</code>.
 * This is done by including a virtual anchor character '<code>$</code>' at the end of the
 * set pattern. Although this is usually the match character for the end anchor, the set will
 * match either the beginning or the end of the text, depending on its placement. For
 * example:
 *
 * <blockquote>
 *   <p><code>$x = [a-z$]; # match 'a' through 'z' OR anchor<br>
 *   $x 1 &gt; 2; # match '1' after a-z or at the start<br>
 *   3 $x &gt; 4; # match '3' before a-z or at the end</code>
 * </blockquote>
 *
 * <p><b>Example</b>
 *
 * <p>The following example rules illustrate many of the features of
 * the rule language.
 *
 * <table border="0" cellpadding="4">
 *     <tr>
 *         <td style="vertical-align: top;">Rule 1.</td>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>abc{def}&gt;x|y</code></td>
 *     </tr>
 *     <tr>
 *         <td style="vertical-align: top;">Rule 2.</td>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>xyz&gt;r</code></td>
 *     </tr>
 *     <tr>
 *         <td style="vertical-align: top;">Rule 3.</td>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>yz&gt;q</code></td>
 *     </tr>
 * </table>
 *
 * <p>Applying these rules to the string "<code>adefabcdefz</code>"
 * yields the following results:
 *
 * <table border="0" cellpadding="4">
 *     <tr>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>|adefabcdefz</code></td>
 *         <td style="vertical-align: top;">Initial state, no rules match. Advance
 *         cursor.</td>
 *     </tr>
 *     <tr>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>a|defabcdefz</code></td>
 *         <td style="vertical-align: top;">Still no match. Rule 1 does not match
 *         because the preceding context is not present.</td>
 *     </tr>
 *     <tr>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>ad|efabcdefz</code></td>
 *         <td style="vertical-align: top;">Still no match.
 *         Keep advancing until
 *         there is a match...</td>
 *     </tr>
 *     <tr>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>ade|fabcdefz</code></td>
 *         <td style="vertical-align: top;">...</td>
 *     </tr>
 *     <tr>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>adef|abcdefz</code></td>
 *         <td style="vertical-align: top;">...</td>
 *     </tr>
 *     <tr>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>adefa|bcdefz</code></td>
 *         <td style="vertical-align: top;">...</td>
 *     </tr>
 *     <tr>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>adefab|cdefz</code></td>
 *         <td style="vertical-align: top;">...</td>
 *     </tr>
 *     <tr>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>adefabc|defz</code></td>
 *         <td style="vertical-align: top;">Rule 1 matches; replace "<code>def</code>"
 *         with "<code>xy</code>" and back up the cursor
 *         to before the '<code>y</code>'.</td>
 *     </tr>
 *     <tr>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>adefabcx|yz</code></td>
 *         <td style="vertical-align: top;">Although "<code>xyz</code>" is
 *         present, rule 2 does not match because the cursor is
 *         before the '<code>y</code>', not before the '<code>x</code>'.
 *         Rule 3 does match. Replace "<code>yz</code>"
 *         with "<code>q</code>".</td>
 *     </tr>
 *     <tr>
 *         <td style="vertical-align: top; white-space: nowrap;"><code>adefabcxq|</code></td>
 *         <td style="vertical-align: top;">The cursor is at the end;
 *         transliteration is complete.</td>
 *     </tr>
 * </table>
 *
 * <p>The order of rules is significant. If multiple rules may match
 * at some point, the first matching rule is applied.
 *
 * <p>Forward and reverse rules may have an empty output string.
 * Otherwise, an empty left or right hand side of any statement is a
 * syntax error.
 *
 * <p>Single quotes are used to quote any character other than a
 * digit or letter. To specify a single quote itself, inside or
 * outside of quotes, use two single quotes in a row. For example,
 * the rule "<code>'&gt;'&gt;o''clock</code>" changes the
 * string "<code>&gt;</code>" to the string "<code>o'clock</code>".
 *
 * <p><b>Notes</b>
 *
 * <p>While a RuleBasedTransliterator is being built, it checks that
 * the rules are added in proper order. For example, if the rule
 * "a&gt;x" is followed by the rule "ab&gt;y",
 * then adding the second rule causes an exception to be thrown. The
 * reason is that the second rule can never be triggered, since the
 * first rule always matches anything the second rule would match.
 * In other words, the first rule <em>masks</em> the second rule.
 *
 * <p>Copyright (c) IBM Corporation 1999-2000. All rights reserved.
 *
 * @author Alan Liu
 * @deprecated This API is ICU internal only.
 * @hide Only a subset of ICU is exposed in Android
 * @hide draft / provisional / internal are hidden on Android
 */
@Deprecated
public class RuleBasedTransliterator extends Transliterator {

    private final Data data;

//    /**
//     * Constructs a new transliterator from the given rules.
//     * @param rules rules, separated by ';'
//     * @param direction either FORWARD or REVERSE.
//     * @exception IllegalArgumentException if rules are malformed
//     * or direction is invalid.
//     */
//    public RuleBasedTransliterator(String ID, String rules, int direction,
//                                   UnicodeFilter filter) {
//        super(ID, filter);
//        if (direction != FORWARD && direction != REVERSE) {
//            throw new IllegalArgumentException("Invalid direction");
//        }
//
//        TransliteratorParser parser = new TransliteratorParser();
//        parser.parse(rules, direction);
//        if (parser.idBlockVector.size() != 0 ||
//            parser.compoundFilter != null) {
//            throw new IllegalArgumentException("::ID blocks illegal in RuleBasedTransliterator constructor");
//        }
//
//        data = (Data)parser.dataVector.get(0);
//        setMaximumContextLength(data.ruleSet.getMaximumContextLength());
//    }

//    /**
//     * Constructs a new transliterator from the given rules in the
//     * <code>FORWARD</code> direction.
//     * @param rules rules, separated by ';'
//     * @exception IllegalArgumentException if rules are malformed
//     * or direction is invalid.
//     */
//    public RuleBasedTransliterator(String ID, String rules) {
//        this(ID, rules, FORWARD, null);
//    }

    RuleBasedTransliterator(String ID, Data data, UnicodeFilter filter) {
        super(ID, filter);
        this.data = data;
        setMaximumContextLength(data.ruleSet.getMaximumContextLength());
    }

    /**
     * Implements {@link Transliterator#handleTransliterate}.
     * @deprecated This API is ICU internal only.
     * @hide draft / provisional / internal are hidden on Android
     */
    @Override
    @Deprecated
    protected void handleTransliterate(Replaceable text,
                                       Position index, boolean incremental) {
        /* We keep start and limit fixed the entire time,
         * relative to the text -- limit may move numerically if text is
         * inserted or removed.  The cursor moves from start to limit, with
         * replacements happening under it.
         *
         * Example: rules 1. ab>x|y
         *                2.
         *                   yc>z
         *
         * |eabcd   start - no match, advance cursor
         * e|abcd   match rule 1 - change text & adjust cursor
         * ex|ycd   match rule 2 - change text & adjust cursor
         * exz|d    no match, advance cursor
         * exzd|    done
         */

        /* A rule like
         *   a>b|a
         * creates an infinite loop. To prevent that, we put an arbitrary
         * limit on the number of iterations that we take, one that is
         * high enough that any reasonable rules are ok, but low enough to
         * prevent a server from hanging.  The limit is 16 times the
         * number of characters n, unless n is so large that 16n overflows
         * to a negative int, in which case the limit is 0x7FFFFFFF.
         */
        synchronized(data)  {
            int loopCount = 0;
            int loopLimit = (index.limit - index.start) << 4;
            if (loopLimit < 0) {
                loopLimit = 0x7FFFFFFF;
            }

            while (index.start < index.limit &&
                   loopCount <= loopLimit &&
                   data.ruleSet.transliterate(text, index, incremental)) {
                ++loopCount;
            }
        }
    }


    static class Data {
        public Data() {
            variableNames = new HashMap<String, char[]>();
            ruleSet = new TransliterationRuleSet();
        }

        /**
         * Rule table.  May be empty.
         */
        public TransliterationRuleSet ruleSet;

        /**
         * Map variable name (String) to variable (char[]).  A variable name
         * corresponds to zero or more characters, stored in a char[] array in
         * this hash.  One or more of these chars may also correspond to a
         * UnicodeSet, in which case the character in the char[] in this hash is
         * a stand-in: it is an index for a secondary lookup in
         * data.variables.  The stand-in also represents the UnicodeSet in
         * the stored rules.
         */
        Map<String, char[]> variableNames;

        /**
         * Map category variable (Character) to UnicodeMatcher or UnicodeReplacer.
         * Variables that correspond to a set of characters are mapped
         * from variable name to a stand-in character in data.variableNames.
         * The stand-in then serves as a key in this hash to look up the
         * actual UnicodeSet object.  In addition, the stand-in is
         * stored in the rule text to represent the set of characters.
         * variables[i] represents character (variablesBase + i).
         */
        Object[] variables;

        /**
         * The character that represents variables[0].  Characters
         * variablesBase through variablesBase +
         * variables.length - 1 represent UnicodeSet objects.
         */
        char variablesBase;

        /**
         * Return the UnicodeMatcher represented by the given character, or
         * null if none.
         */
        public UnicodeMatcher lookupMatcher(int standIn) {
            int i = standIn - variablesBase;
            return (i >= 0 && i < variables.length)
                ? (UnicodeMatcher) variables[i] : null;
        }

        /**
         * Return the UnicodeReplacer represented by the given character, or
         * null if none.
         */
        public UnicodeReplacer lookupReplacer(int standIn) {
            int i = standIn - variablesBase;
            return (i >= 0 && i < variables.length)
                ? (UnicodeReplacer) variables[i] : null;
        }
    }


    /**
     * Return a representation of this transliterator as source rules.
     * These rules will produce an equivalent transliterator if used
     * to construct a new transliterator.
     * @param escapeUnprintable if TRUE then convert unprintable
     * characters to their hex escape representations, \\uxxxx or
     * \\Uxxxxxxxx.  Unprintable characters are those other than
     * U+000A, U+0020..U+007E.
     * @return rules string
     * @deprecated This API is ICU internal only.
     * @hide draft / provisional / internal are hidden on Android
     */
    @Override
    @Deprecated
    public String toRules(boolean escapeUnprintable) {
        return data.ruleSet.toRules(escapeUnprintable);
    }

//    /**
//     * Return the set of all characters that may be modified by this
//     * Transliterator, ignoring the effect of our filter.
//     */
//    protected UnicodeSet handleGetSourceSet() {
//        return data.ruleSet.getSourceTargetSet(false, unicodeFilter);
//    }
//
//    /**
//     * Returns the set of all characters that may be generated as
//     * replacement text by this transliterator.
//     */
//    public UnicodeSet getTargetSet() {
//        return data.ruleSet.getSourceTargetSet(true, unicodeFilter);
//    }

    /**
     * @deprecated This API is ICU internal only.
     * @hide draft / provisional / internal are hidden on Android
     */
    @Deprecated
    @Override
    public void addSourceTargetSet(UnicodeSet filter, UnicodeSet sourceSet, UnicodeSet targetSet) {
        data.ruleSet.addSourceTargetSet(filter, sourceSet, targetSet);
    }

    /**
     * Temporary hack for registry problem. Needs to be replaced by better architecture.
     * @deprecated This API is ICU internal only.
     * @hide draft / provisional / internal are hidden on Android
     */
    @Deprecated
    public Transliterator safeClone() {
        UnicodeFilter filter = getFilter();
        if (filter != null && filter instanceof UnicodeSet) {
            filter = new UnicodeSet((UnicodeSet)filter);
        }
        return new RuleBasedTransliterator(getID(), data, filter);
    }
}