// icu4jni/text/RuleBasedCollator.java

/**
*******************************************************************************
* Copyright (C) 1996-2005, International Business Machines Corporation and    *
* others. All Rights Reserved.                                                *
*******************************************************************************
*
*
*******************************************************************************
*/

package com.ibm.icu4jni.text;

import java.util.Locale;
import java.text.CharacterIterator;
import java.text.ParseException;
import com.ibm.icu4jni.common.ErrorCode;

/**
* Concrete implementation class for Collation.
* <p>
* The collation table is composed of a list of collation rules, where each
* rule takes one of three forms:
* <pre>
*    < modifier >
*    < relation > < text-argument >
*    < reset > < text-argument >
* </pre>
* <p>
* <code>RuleBasedCollator</code> has the following restrictions for efficiency
* (other subclasses may be used for more complex languages):
* <ol>
* <li> If a French secondary ordering is specified it applies to the whole
*      collator object.
* <li> All non-mentioned Unicode characters are at the end of the collation
*      order.
* <li> If a character is not located in the RuleBasedCollator, the default
*      Unicode Collation Algorithm (UCA) rulebased table is automatically
*      searched as a backup.
* </ol>
*
* The following demonstrates how to create your own collation rules:
* <UL Type=disc>
*    <LI><strong>Text-Argument</strong>: A text-argument is any sequence of
*        characters, excluding special characters (that is, common whitespace
*        characters [0009-000D, 0020] and rule syntax characters [0021-002F,
*        003A-0040, 005B-0060, 007B-007E]). If those characters are desired,
*        you can put them in single quotes (e.g. ampersand => '&'). Note that
*        unquoted white space characters are ignored; e.g. <code>b c</code> is
*        treated as <code>bc</code>.
*    <LI><strong>Modifier</strong>: There is a single modifier which is used
*        to specify that all accents (secondary differences) are backwards.
*        <p>'@' : Indicates that accents are sorted backwards, as in French.
*    <LI><strong>Relation</strong>: The relations are the following:
*        <UL Type=square>
*            <LI>'<' : Greater, as a letter difference (primary)
*            <LI>';' : Greater, as an accent difference (secondary)
*            <LI>',' : Greater, as a case difference (tertiary)
*            <LI>'=' : Equal
*        </UL>
*    <LI><strong>Reset</strong>: There is a single reset which is used
*        primarily for contractions and expansions, but which can also be used
*        to add a modification at the end of a set of rules.
*        <p>'&' : Indicates that the next rule follows the position to where
*            the reset text-argument would be sorted.
* </UL>
*
* <p>
* This sounds more complicated than it is in practice. For example, the
* following are equivalent ways of expressing the same thing:
* <blockquote>
* <pre>
* a < b < c
* a < b & b < c
* a < c & a < b
* </pre>
* </blockquote>
* Notice that the order is important, as the subsequent item goes immediately
* after the text-argument. The following are not equivalent:
* <blockquote>
* <pre>
* a < b & a < c
* a < c & a < b
* </pre>
* </blockquote>
* Either the text-argument must already be present in the sequence, or some
* initial substring of the text-argument must be present. (e.g. "a < b & ae <
* e" is valid since "a" is present in the sequence before "ae" is reset). In
* this latter case, "ae" is not entered and treated as a single character;
* instead, "e" is sorted as if it were expanded to two characters: "a"
* followed by an "e". This difference appears in natural languages: in
* traditional Spanish "ch" is treated as though it contracts to a single
* character (expressed as "c < ch < d"), while in traditional German a-umlaut
* is treated as though it expanded to two characters (expressed as "a,A < b,B
* ... & ae;\u00E4 & AE;\u00C4"). [\u00E4 and \u00C4 are, of course, the escape
* sequences for lowercase and uppercase a-umlaut.]
* <p>
* <strong>Ignorable Characters</strong>
* <p>
* For ignorable characters, the first rule must start with a relation (the
* examples we have used above are really fragments; "a < b" really should be
* "< a < b"). If, however, the first relation is not "<", then all
* text-arguments up to the first "<" are ignorable. For example, ", - < a < b"
* makes "-" an ignorable character, as in the word "black-birds". In the
* samples for different languages, you see that most accents are ignorable.
*
* <p><strong>Normalization and Accents</strong>
* <p>
* <code>RuleBasedCollator</code> automatically processes its rule table to
* include both pre-composed and combining-character versions of accented
* characters. Even if the provided rule string contains only base characters
* and separate combining accent characters, the pre-composed accented
* characters matching all canonical combinations of characters from the rule
* string will be entered in the table.
* <p>
* This allows you to use a RuleBasedCollator to compare accented strings even
* when the collator is set to NO_DECOMPOSITION. However, if the strings to be
* collated contain combining sequences that may not be in canonical order, you
* should set the collator to CANONICAL_DECOMPOSITION to enable sorting of
* combining sequences.
* For more information, see
* <A HREF="http://www.aw.com/devpress">The Unicode Standard, Version 3.0</A>.
*
* <p><strong>Errors</strong>
* <p>
* The following are errors:
* <UL Type=disc>
*     <LI>A text-argument contains unquoted punctuation symbols
*        (e.g. "a < b-c < d").
*     <LI>A relation or reset character not followed by a text-argument
*        (e.g. "a < , b").
*     <LI>A reset where the text-argument (or an initial substring of the
*         text-argument) is not already in the sequence or allocated in the
*         default UCA table.
*         (e.g. "a < b & e < f")
* </UL>
* If you produce one of these errors, a <code>RuleBasedCollator</code> throws
* a <code>ParseException</code>.
*
* <p><strong>Examples</strong>
* <p>Simple:     "< a < b < c < d"
* <p>Norwegian:  "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J
*                 < k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T
*                 < u,U< v,V< w,W< x,X< y,Y< z,Z
*                 < \u00E5=a\u030A,\u00C5=A\u030A
*                 ;aa,AA< \u00E6,\u00C6< \u00F8,\u00D8"
*
* <p>
* Normally, to create a rule-based Collator object, you will use
* <code>Collator</code>'s factory method <code>getInstance</code>.
* However, to create a rule-based Collator object with specialized rules
* tailored to your needs, you construct the <code>RuleBasedCollator</code>
* with the rules contained in a <code>String</code> object. For example:
* <blockquote>
* <pre>
* String Simple = "< a < b < c < d";
* RuleBasedCollator mySimple = new RuleBasedCollator(Simple);
* </pre>
* </blockquote>
* Or:
* <blockquote>
* <pre>
* String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J" +
*                 "< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T" +
*                 "< u,U< v,V< w,W< x,X< y,Y< z,Z" +
*                 "< \u00E5=a\u030A,\u00C5=A\u030A" +
*                 ";aa,AA< \u00E6,\u00C6< \u00F8,\u00D8";
* RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);
* </pre>
* </blockquote>
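* <p>
* As a rough, self-contained illustration of how the order of relations in a
* rule string drives comparison, the sketch below uses the JDK's
* <code>java.text.RuleBasedCollator</code> as a stand-in. That is an
* assumption made so the example runs without the native library: the JDK
* class accepts the same basic rule syntax but is not this class and may
* differ in detail. The class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical demo class; java.text.RuleBasedCollator stands in for the
// icu4jni class so the sketch runs without native libraries.
public class RuleOrderSketch {
    // Builds a collator from the given rules and compares two strings.
    public static int compareWithRules(String rules, String left, String right) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            return collator.compare(left, right);
        } catch (ParseException e) {
            // Malformed rules are reported as an unchecked exception here.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // With "< a < b < c", "a" sorts before "c".
        System.out.println(compareWithRules("< a < b < c", "a", "c") < 0);
        // Reversing the rule order reverses the result.
        System.out.println(compareWithRules("< c < b < a", "a", "c") > 0);
    }
}
```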
*
* <p>
* Combining <code>Collator</code>s is as simple as concatenating strings.
* Here's an example that combines two <code>Collator</code>s from two
* different locales:
* <blockquote>
* <pre>
* // Create an en_US Collator object
* RuleBasedCollator en_USCollator = (RuleBasedCollator)
*     Collator.getInstance(new Locale("en", "US", ""));
* // Create a da_DK Collator object
* RuleBasedCollator da_DKCollator = (RuleBasedCollator)
*     Collator.getInstance(new Locale("da", "DK", ""));
* // Combine the two
* // First, get the collation rules from en_USCollator
* String en_USRules = en_USCollator.getRules();
* // Second, get the collation rules from da_DKCollator
* String da_DKRules = da_DKCollator.getRules();
* RuleBasedCollator newCollator =
*     new RuleBasedCollator(en_USRules + da_DKRules);
* // newCollator has the combined rules
* </pre>
* </blockquote>
*
* <p>
* Another more interesting example would be to make changes on an existing
* table to create a new <code>Collator</code> object.  For example, add
* "& C < ch, cH, Ch, CH" to the <code>en_USCollator</code> object to create
* your own:
* <blockquote>
* <pre>
* // Create a new Collator object with additional rules
* String addRules = "& C < ch, cH, Ch, CH";
* RuleBasedCollator myCollator =
*     new RuleBasedCollator(en_USCollator.getRules() + addRules);
* // myCollator contains the new rules
* </pre>
* </blockquote>
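* <p>
* The effect of appending a reset rule to an existing rule string can be
* sketched as follows. This is a sketch under stated assumptions: it uses
* <code>java.text.Collator</code> and <code>java.text.RuleBasedCollator</code>
* from the JDK as stand-ins for the classes in this package, and the class
* and method names are hypothetical.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical demo class; the JDK collator stands in for the icu4jni one.
public class TailoringSketch {
    // Returns a US English collator extended with a "ch" contraction that
    // sorts as a single unit after "c".
    public static RuleBasedCollator withChContraction() {
        RuleBasedCollator base =
            (RuleBasedCollator) Collator.getInstance(Locale.US);
        try {
            // Append the tailoring to the rule string, not to the collator.
            return new RuleBasedCollator(
                base.getRules() + "& C < ch, cH, Ch, CH");
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Collator plain = Collator.getInstance(Locale.US);
        RuleBasedCollator tailored = withChContraction();
        // Untailored: "ch" precedes "cz" because 'h' precedes 'z'.
        System.out.println(plain.compare("ch", "cz") < 0);
        // Tailored: "ch" is one unit placed after "c", so "cz" comes first.
        System.out.println(tailored.compare("cz", "ch") < 0);
    }
}
```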
*
* <p>
* The following example demonstrates how to change the order of
* non-spacing accents:
* <blockquote>
* <pre>
* // old rule
* String oldRules = "=\u00A8;\u00AF;\u00BF" // main accents Diaeresis 00A8, Macron 00AF
*                                         // Acute 00BF
*                 + "< a , A ; ae, AE ; \u00E6 , \u00C6"
*                 + "< b , B < c, C < e, E & C < d, D";
* // change the order of accent characters
* String addOn = "& \u00BF;\u00AF;\u00A8;"; // Acute 00BF, Macron 00AF, Diaeresis 00A8
* RuleBasedCollator myCollator = new RuleBasedCollator(oldRules + addOn);
* </pre>
* </blockquote>
*
* <p>
* The last example shows how to insert a new primary ordering ahead of the
* default rules. For example, with a Japanese <code>Collator</code> you can
* sort English characters either before or after Japanese characters:
* <blockquote>
* <pre>
* // get en_US Collator rules
* RuleBasedCollator en_USCollator =
*                      (RuleBasedCollator)Collator.getInstance(Locale.US);
* // add a few Japanese characters to sort before English characters
* // suppose the last character before the first base letter 'a' in
* // the English collation rule is \\u30A2
* String jaString = "& \\u30A2 , \\u30FC < \\u30C8";
* RuleBasedCollator myJapaneseCollator = new
*     RuleBasedCollator(en_USCollator.getRules() + jaString);
* </pre>
* </blockquote>
* <P>
* @author syn wee quek
* @stable ICU 2.4
*/

public final class RuleBasedCollator extends Collator
{
  // public constructors ------------------------------------------

  /**
  * RuleBasedCollator constructor. This takes the table rules and builds a
  * collation table out of them. Please see RuleBasedCollator class
  * description for more details on the collation rule syntax.
  * @param rules the collation rules to build the collation table from.
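  * @exception ParseException declared for rule errors; note that the
  *            empty-rules check is disabled in this version. A runtime
  *            exception is thrown if the collator cannot be created.
  * @exception NullPointerException if rules is null.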
  * @stable ICU 2.4
  */
  public RuleBasedCollator(String rules) throws ParseException
  {
    // BEGIN android-changed
    if (rules == null) {
      throw new NullPointerException();
    }
    // if (rules.length() == 0)
    //   throw new ParseException("Build rules empty.", 0);
    // END android-changed
    m_collator_ = NativeCollation.openCollatorFromRules(rules,
                              CollationAttribute.VALUE_OFF,
                              CollationAttribute.VALUE_DEFAULT_STRENGTH);
  }

  /**
  * RuleBasedCollator constructor. This takes the table rules and builds a
  * collation table out of them. Please see RuleBasedCollator class
  * description for more details on the collation rule syntax.
  * @param rules the collation rules to build the collation table from.
  * @param strength collation strength
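  * @exception ParseException declared for rule errors; note that the
  *            empty-rules check is disabled in this version. A runtime
  *            exception is thrown if the collator cannot be created.
  * @exception NullPointerException if rules is null.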
  * @see #PRIMARY
  * @see #SECONDARY
  * @see #TERTIARY
  * @see #QUATERNARY
  * @see #IDENTICAL
  * @stable ICU 2.4
  */
  public RuleBasedCollator(String rules, int strength) throws ParseException
  {
    // BEGIN android-changed
    if (rules == null) {
      throw new NullPointerException();
    }
    // if (rules.length() == 0)
    //   throw new ParseException("Build rules empty.", 0);
    // END android-changed
    if (!CollationAttribute.checkStrength(strength))
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);

    m_collator_ = NativeCollation.openCollatorFromRules(rules,
                                CollationAttribute.VALUE_OFF,
                                strength);
  }

  /**
  * RuleBasedCollator constructor. This takes the table rules and builds a
  * collation table out of them. Please see RuleBasedCollator class
  * description for more details on the collation rule syntax.
  * <p>Note the API change starting from release 2.4. Prior to release 2.4,
  * the normalizationmode argument values were taken from the class
  * com.ibm.icu4jni.text.Normalization. As of 2.4,
  * the valid normalizationmode arguments for this API are
  * CollationAttribute.VALUE_ON and CollationAttribute.VALUE_OFF.
  * </p>
  * @param rules the collation rules to build the collation table from.
  * @param strength collation strength
  * @param normalizationmode normalization mode
  * @exception IllegalArgumentException thrown if strength or
  *            normalizationmode is not a valid value
  * @exception NullPointerException if rules is null.
  * @see #PRIMARY
  * @see #SECONDARY
  * @see #TERTIARY
  * @see #QUATERNARY
  * @see #IDENTICAL
  * @see #CANONICAL_DECOMPOSITION
  * @see #NO_DECOMPOSITION
  * @stable ICU 2.4
  */
  public RuleBasedCollator(String rules, int normalizationmode, int strength)
  {
    // BEGIN android-added
    if (rules == null) {
      throw new NullPointerException();
    }
    // END android-added
    if (!CollationAttribute.checkStrength(strength) ||
        !CollationAttribute.checkNormalization(normalizationmode)) {
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);
    }

    m_collator_ = NativeCollation.openCollatorFromRules(rules,
                                          normalizationmode, strength);
  }

  // public methods -----------------------------------------------

  /**
  * Makes a complete copy of the current object by cloning the underlying
  * C collator.
  * @return a copy of this object
  * @stable ICU 2.4
  */
  public Object clone()
  {
    int collatoraddress = NativeCollation.safeClone(m_collator_);
    return new RuleBasedCollator(collatoraddress);
  }

  /**
  * The comparison function compares the character data stored in two
  * different strings. Returns information about whether a string is less
  * than, greater than or equal to another string.
  * <p>Example of use:
  * <pre>
  *   Collator myCollation = Collator.getInstance(Locale.US);
  *   myCollation.setStrength(CollationAttribute.VALUE_PRIMARY);
  *   // result would be Collator.RESULT_EQUAL ("abc" == "ABC")
  *   // (no primary difference between "abc" and "ABC")
  *   int result = myCollation.compare("abc", "ABC");
  *   myCollation.setStrength(CollationAttribute.VALUE_TERTIARY);
  *   // result would be Collator.RESULT_LESS ("abc" &lt;&lt;&lt; "ABC")
  *   // (with tertiary difference between "abc" and "ABC")
  *   result = myCollation.compare("abc", "ABC");
  * </pre>
  * @param source The source string.
  * @param target The target string.
  * @return result of the comparison, Collator.RESULT_EQUAL,
  *         Collator.RESULT_GREATER or Collator.RESULT_LESS
  * @stable ICU 2.4
  */
  public int compare(String source, String target)
  {
    return NativeCollation.compare(m_collator_, source, target);
  }

  /**
  * Gets the normalization mode for this object.
  * The normalization mode influences how strings are compared.
  * @return the current normalization mode
  * @see #CANONICAL_DECOMPOSITION
  * @see #NO_DECOMPOSITION
  * @stable ICU 2.4
  */
  public int getDecomposition()
  {
    return NativeCollation.getNormalization(m_collator_);
  }

  /**
  * <p>Sets the decomposition mode of the Collator object on or off.
  * If the decomposition mode is set to on, strings are decomposed into
  * NFD form where necessary before sorting.</p>
  * @param decompositionmode the new decomposition mode
  * @see #CANONICAL_DECOMPOSITION
  * @see #NO_DECOMPOSITION
  * @stable ICU 2.4
  */
  public void setDecomposition(int decompositionmode)
  {
    if (!CollationAttribute.checkNormalization(decompositionmode))
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);
    NativeCollation.setAttribute(m_collator_,
                                 CollationAttribute.NORMALIZATION_MODE,
                                 decompositionmode);
  }

  /**
  * Determines the minimum strength that will be used in comparison or
  * transformation.
  * <p>
  * E.g. with strength == CollationAttribute.VALUE_SECONDARY, tertiary
  * differences are ignored.
  * </p>
  * <p>
  * E.g. with strength == PRIMARY, secondary and tertiary differences are
  * ignored.
  * </p>
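  * <p>
  * The way strength masks lower-level differences can be sketched with the
  * JDK's <code>java.text.Collator</code>, used here as a stand-in so that the
  * example runs without the native library (an assumption; the constants in
  * this package are named differently). The class and method names are
  * hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; java.text.Collator stands in for the icu4jni
// collator so the sketch runs without native libraries.
public class StrengthSketch {
    // Compares two strings with a US English collator at the given strength.
    public static int compareAt(int strength, String left, String right) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(left, right);
    }

    public static void main(String[] args) {
        // Case is a tertiary difference, so PRIMARY and SECONDARY ignore it.
        System.out.println(compareAt(Collator.PRIMARY, "abc", "ABC") == 0);
        System.out.println(compareAt(Collator.SECONDARY, "abc", "ABC") == 0);
        // At TERTIARY strength the case difference becomes visible.
        System.out.println(compareAt(Collator.TERTIARY, "abc", "ABC") != 0);
    }
}
```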
  * @return the current comparison level.
  * @see #PRIMARY
  * @see #SECONDARY
  * @see #TERTIARY
  * @see #QUATERNARY
  * @see #IDENTICAL
  * @stable ICU 2.4
  */
  public int getStrength()
  {
    return NativeCollation.getAttribute(m_collator_,
                                        CollationAttribute.STRENGTH);
  }

  /**
  * Sets the minimum strength to be used in comparison or transformation.
  * <p>Example of use:
  * <pre>
  * Collator myCollation = Collator.getInstance(Locale.US);
  * myCollation.setStrength(PRIMARY);
  * // result will be "abc" == "ABC"
  * // tertiary differences will be ignored
  * int result = myCollation.compare("abc", "ABC");
  * </pre>
  * @param strength the new comparison level.
  * @exception IllegalArgumentException when argument does not belong to any collation strength
  *            mode or error occurs while setting data.
  * @see #PRIMARY
  * @see #SECONDARY
  * @see #TERTIARY
  * @see #QUATERNARY
  * @see #IDENTICAL
  * @stable ICU 2.4
  */
  public void setStrength(int strength)
  {
    if (!CollationAttribute.checkStrength(strength))
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);
    NativeCollation.setAttribute(m_collator_, CollationAttribute.STRENGTH,
                                 strength);
  }

  /**
  * Sets the attribute to be used in comparison or transformation.
  * <p>Example of use:
  * <pre>
  *  Collator myCollation = Collator.getInstance(Locale.US);
  *  myCollation.setAttribute(CollationAttribute.CASE_LEVEL,
  *                           CollationAttribute.VALUE_ON);
  *  int result = myCollation.compare("\\u30C3\\u30CF",
  *                                   "\\u30C4\\u30CF");
  * // result will be Collator.RESULT_LESS.
  * </pre>
  * @param type the attribute to be set from CollationAttribute
  * @param value attribute value from CollationAttribute
  * @stable ICU 2.4
  */
  public void setAttribute(int type, int value)
  {
    if (!CollationAttribute.checkAttribute(type, value))
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);
    NativeCollation.setAttribute(m_collator_, type, value);
  }

  /**
  * Gets the value of an attribute used in comparison or transformation.
  * @param type the attribute to be retrieved, from CollationAttribute
  * @return the attribute value, from CollationAttribute
  * @stable ICU 2.4
  */
  public int getAttribute(int type)
  {
    if (!CollationAttribute.checkType(type))
      throw ErrorCode.getException(ErrorCode.U_ILLEGAL_ARGUMENT_ERROR);
    return NativeCollation.getAttribute(m_collator_, type);
  }

  /**
  * Gets the sort key for the argument string as a CollationKey object.
  * To retrieve the sort key as a byte array instead, use the method below<br>
  * <br>
  * <code>
  * Collator collator = Collator.getInstance();
  * byte[] array = collator.getSortKey(source);
  * </code><br>
  * Byte array results are zero-terminated and can be compared using
  * java.util.Arrays.equals().
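  * <p>
  * The relationship between sort keys and direct comparison can be sketched
  * with the JDK's <code>java.text.CollationKey</code>, used as a stand-in for
  * the classes in this package (an assumption made so the example runs
  * without the native library). The class and method names are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; java.text classes stand in for the icu4jni ones.
public class SortKeySketch {
    // Returns true when comparing the two sort keys gives the same ordering
    // as comparing the two strings directly with the same collator.
    public static boolean keysAgreeWithCompare(String left, String right) {
        Collator collator = Collator.getInstance(Locale.US);
        CollationKey leftKey = collator.getCollationKey(left);
        CollationKey rightKey = collator.getCollationKey(right);
        int byKey = Integer.signum(leftKey.compareTo(rightKey));
        int direct = Integer.signum(collator.compare(left, right));
        return byKey == direct;
    }

    public static void main(String[] args) {
        System.out.println(keysAgreeWithCompare("apple", "banana"));
        System.out.println(keysAgreeWithCompare("abc", "ABC"));
    }
}
```

  * Precomputing keys tends to pay off when the same strings are compared many
  * times, for example while sorting a long list.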
  * @param source string to be processed.
  * @return the sort key, or null if source is null or no key is generated
  * @stable ICU 2.4
  */
  public CollationKey getCollationKey(String source)
  {
    // BEGIN android-removed
    // return new CollationKey(NativeCollation.getSortKey(m_collator_, source));
    // END android-removed
    // BEGIN android-added
    if(source == null) {
        return null;
    }
    byte[] key = NativeCollation.getSortKey(m_collator_, source);
    if(key == null) {
      return null;
    }
    return new CollationKey(key);
    // END android-added
  }

  /**
  * Gets a sort key for the argument string.
  * Sort keys may be compared using java.util.Arrays.equals.
  * @param source string for key to be generated
  * @return sort key
  * @stable ICU 2.4
  */
  public byte[] getSortKey(String source)
  {
    return NativeCollation.getSortKey(m_collator_, source);
  }

  /**
  * Gets the collation rules of this Collation object.
  * The rules will follow the rule syntax.
  * @return collation rules.
  * @stable ICU 2.4
  */
  public String getRules()
  {
    return NativeCollation.getRules(m_collator_);
  }

  /**
  * Creates a CollationElementIterator object that will iterate over the
  * elements in a string, using the collation rules defined in this
  * RuleBasedCollator.
  * @param source string to iterate over
  * @return a CollationElementIterator for the source string
  * @exception IllegalArgumentException thrown when error occurs
  * @stable ICU 2.4
  */
  public CollationElementIterator getCollationElementIterator(String source)
  {
    CollationElementIterator result = new CollationElementIterator(
         NativeCollation.getCollationElementIterator(m_collator_, source));
    // result.setOwnCollationElementIterator(true);
    return result;
  }

  // BEGIN android-added
  /**
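  * Creates a CollationElementIterator object that will iterate over the
  * characters supplied by a CharacterIterator, using the collation rules
  * defined in this RuleBasedCollator.
  * @param source iterator over the text to process
  * @return a CollationElementIterator for the source text
  * @exception IllegalArgumentException thrown when error occurs
  * @stable ICU 2.4
  */
  public CollationElementIterator getCollationElementIterator(
          CharacterIterator source)
  {
    // CharacterIterator.toString() is not specified to return the iterated
    // text, so copy the characters out of a clone; this leaves the caller's
    // iterator position untouched.
    CharacterIterator it = (CharacterIterator) source.clone();
    StringBuilder text = new StringBuilder();
    for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
      text.append(c);
    }
    CollationElementIterator result = new CollationElementIterator(
         NativeCollation.getCollationElementIterator(m_collator_,
                 text.toString()));
    // result.setOwnCollationElementIterator(true);
    return result;
  }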
  // END android-added

  /**
  * Returns a hash of this collation object.
  * The native hash is computed once and cached; a computed value of 0 is
  * stored as 1 so that 0 can mark the cache as empty.
  * @return hash of this collation object
  * @stable ICU 2.4
  */
  public int hashCode()
  {
    // since rules do not change once it is created, we can cache the hash
    if (m_hashcode_ == 0) {
      m_hashcode_ = NativeCollation.hashCode(m_collator_);
      if (m_hashcode_ == 0)
        m_hashcode_ = 1;
    }
    return m_hashcode_;
  }

  /**
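  * Checks if the argument object is equal to this object. Two collators are
  * equal when their rules, strength and decomposition mode all match.
  * @param target the object to compare with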
  * @return true if source is equivalent to target, false otherwise
  * @stable ICU 2.4
  */
  public boolean equals(Object target)
  {
    if (this == target)
      return true;
    if (target == null)
      return false;
    if (getClass() != target.getClass())
      return false;

    RuleBasedCollator tgtcoll = (RuleBasedCollator)target;
    return getRules().equals(tgtcoll.getRules()) &&
           getStrength() == tgtcoll.getStrength() &&
           getDecomposition() == tgtcoll.getDecomposition();
  }

  // package constructor ----------------------------------------

  /**
  * RuleBasedCollator default constructor. This constructor takes the default
  * locale. The only caller of this constructor should be Collator.getInstance().
  * The current implementation of getInstance() returns a
  * RuleBasedCollator(Locale) instance. The RuleBasedCollator will be created
  * in the following order,
  * <ul>
  * <li> Data from argument locale resource bundle if found, otherwise
  * <li> Data from parent locale resource bundle of argument locale if found,
  *      otherwise
  * <li> Data from built-in default collation rules if found, otherwise
  * <li> null is returned
  * </ul>
  */
  RuleBasedCollator()
  {
    m_collator_ = NativeCollation.openCollator();
  }

  /**
  * RuleBasedCollator constructor. This constructor takes a locale; a null
  * locale selects the default collator. The only caller of this constructor
  * should be Collator.getInstance().
  * The current implementation of getInstance() returns a
  * RuleBasedCollator(Locale) instance. The RuleBasedCollator will be created
  * in the following order,
  * <ul>
  * <li> Data from argument locale resource bundle if found, otherwise
  * <li> Data from parent locale resource bundle of argument locale if found,
  *      otherwise
  * <li> Data from built-in default collation rules if found, otherwise
  * <li> null is returned
  * </ul>
  * @param locale locale used
  */
  RuleBasedCollator(Locale locale)
  {
    if (locale == null) {
      m_collator_ = NativeCollation.openCollator();
    }
    else {
      m_collator_ = NativeCollation.openCollator(locale.toString());
    }
  }

  // protected methods --------------------------------------------

  /**
  * Garbage collection.
  * Close C collator and reclaim memory.
  */
  protected void finalize()
  {
    NativeCollation.closeCollator(m_collator_);
  }

  // private data members -----------------------------------------

  /**
  * C collator
  */
  private int m_collator_;

  /**
  * Hash code for rules
  */
  private int m_hashcode_ = 0;

  // private constructor -----------------------------------------

  /**
  * Private use constructor.
  * Does not create any instance of the C collator. Accepts argument as the
  * C collator for new instance.
  * @param collatoraddress address of C collator
  */
  private RuleBasedCollator(int collatoraddress)
  {
    m_collator_ = collatoraddress;
  }
}