Copyright (c) 2002-2010, International Business Machines Corporation and others. All Rights Reserved.

IMPORTANT:

This sample was originally intended as an exercise for the ICU Workshop (September 2000).
The code currently provided in the solution file is the answer to the exercises; each step can still be found in the 'answers' subdirectory.

  http://www.icu-project.org/docs/workshop_2000/agenda.html

  Day 2: September 12th 2000
  Pre-requisite:
  1. All the hardware and software requirements from Day 1.
  2. Attended or fully understand Day 1 material.
  3. Read through the ICU user's guide at
     http://www.icu-project.org/userguide/.

  #Transformation Support
  10:45am - 12:00pm
  Alan Liu

  Topics:
  1. What is Unicode normalization?
  2. What kind of case mapping support is available in ICU?
  3. What is Transliteration and how do I use a Transliterator on a document?
  4. How do I add my own Transliterator?


INSTRUCTIONS
------------

This exercise was developed and tested on ICU release 1.6.0, Win32,
Microsoft Visual C++ 6.0.  It should work on other ICU releases and
other platforms as well.

 MSVC:
   Open the file "translit.sln" in Microsoft Visual C++.

 Unix:
   - Build and install ICU with a prefix, for example '--prefix=/home/srl/ICU'
   - Set the variable ICU_PREFIX=/home/srl/ICU and use GNU make in
     this directory.
   - You may use 'make check' to invoke this sample.


PROBLEMS
--------

Problem 0:

  To start with, the program prints out a series of dates formatted in
  Greek.  Set up the program, build it, and run it.

Problem 1: Basic Transliterator (Easy)

  The Greek text shows up almost entirely as Unicode escapes.  These
  are unreadable on a US machine.  Use an existing system
  transliterator to transliterate the Greek text to Latin so it can be
  phonetically read on a US machine.
  If you don't know the names of
  the system transliterators, use Transliterator::getAvailableID() and
  Transliterator::countAvailableIDs(), or look directly in the index
  table icu/data/translit_index.txt.

Problem 2: RuleBasedTransliterator (Medium)

  Some of the text is still unreadable and shows up as Unicode escape
  sequences.  Create a RuleBasedTransliterator to change the
  unreadable characters to close ASCII equivalents.  For example, the
  rule "\u00C0 > A;" will change an 'A' with a grave accent to a plain
  'A'.

  To save typing, use UnicodeSets to handle ranges of characters.

  See the included file "U0080.pdf" for a table of the U+00C0 to U+00FF
  Unicode block.

Problem 3: Transliterator subclassing; Normalizer (Difficult)

  The rule-based approach is flexible and, in most cases, the best
  choice for creating a new transliterator.  Sometimes, however, a
  more elegant algorithmic solution is available.  Instead of typing
  in a list of rules, you can write C++ code to accomplish the desired
  transliteration.

  Use a Normalizer to remove accents from characters.  You will need
  to convert each character to a sequence of base and combining
  characters by applying canonical decomposition (normalization form
  NFD).  Then discard the combining characters (the accents etc.),
  leaving the base character.  Wrap this all up in a subclass of the
  Transliterator class that overrides the pure virtual
  handleTransliterate() method.


ANSWERS
-------

The exercise includes answers.  These are in the "answers" directory,
and are numbered 1, 2, etc.  In some cases new files that the user
needs to create are included in the answers directory.

If you get stuck and you want to move to the next step, copy the
corresponding answer file into the main directory in order to
proceed.  For example, "main_1.cpp" contains the original "main.cpp"
file, "main_2.cpp" contains the "main.cpp" file after problem 1, and
so on.


Have fun!