ucsdet.h revision b13da9df870a61b11249bf741347908dbea0edd8
/*
 **********************************************************************
 *   Copyright (C) 2005-2007, International Business Machines
 *   Corporation and others.  All Rights Reserved.
 **********************************************************************
 *   file name:  ucsdet.h
 *   encoding:   US-ASCII
 *   indentation:4
 *
 *   created on: 2005Aug04
 *   created by: Andy Heninger
 *
 *   ICU Character Set Detection, API for C
 *
 *   Draft version 18 Oct 2005
 *
 */

#ifndef __UCSDET_H
#define __UCSDET_H

#include "unicode/utypes.h"

#if !UCONFIG_NO_CONVERSION
#include "unicode/uenum.h"

/**
 * \file
 * \brief C API: Charset Detection API
 *
 * This API provides a facility for detecting the
 * charset or encoding of character data in an unknown text format.
 * The input data can be from an array of bytes.
 * <p>
 * Character set detection is at best an imprecise operation.  The detection
 * process will attempt to identify the charset that best matches the characteristics
 * of the byte data, but the process is partly statistical in nature, and
 * the results can not be guaranteed to always be correct.
 * <p>
 * For best accuracy in charset detection, the input data should be primarily
 * in a single language, and a minimum of a few hundred bytes worth of plain text
 * in the language are needed.
 * The detection process will attempt to
 * ignore html or xml style markup that could otherwise obscure the content.
 */


struct UCharsetDetector;
/**
 * Structure representing a charset detector
 * @stable ICU 3.6
 */
typedef struct UCharsetDetector UCharsetDetector;

struct UCharsetMatch;
/**
 * Opaque structure representing a match that was identified
 * from a charset detection operation.
 * @stable ICU 3.6
 */
typedef struct UCharsetMatch UCharsetMatch;

/**
 * Open a charset detector.
 *
 * @param status Any error conditions occurring during the open
 *               operation are reported back in this variable.
 * @return the newly opened charset detector.
 * @stable ICU 3.6
 */
U_STABLE UCharsetDetector * U_EXPORT2
ucsdet_open(UErrorCode *status);

/**
 * Close a charset detector.  All storage and any other resources
 * owned by this charset detector will be released.  Failure to
 * close a charset detector when finished with it can result in
 * memory leaks in the application.
 *
 * @param ucsd  The charset detector to be closed.
 * @stable ICU 3.6
 */
U_STABLE void U_EXPORT2
ucsdet_close(UCharsetDetector *ucsd);

/**
 * Set the input byte data whose charset is to be detected.
 *
 * Ownership of the input text byte array remains with the caller.
 * The input string must not be altered or deleted until the charset
 * detector is either closed or reset to refer to different input text.
 *
 * @param ucsd   the charset detector to be used.
 * @param textIn the input text of unknown encoding.
 * @param len    the length of the input text, or -1 if the text
 *               is NUL terminated.
 * @param status any error conditions are reported back in this variable.
 *
 * @stable ICU 3.6
 */
U_STABLE void U_EXPORT2
ucsdet_setText(UCharsetDetector *ucsd, const char *textIn, int32_t len, UErrorCode *status);


/** Set the declared encoding for charset detection.
 *  The declared encoding of an input text is an encoding obtained
 *  by the user from an http header or xml declaration or similar source that
 *  can be provided as an additional hint to the charset detector.
 *
 *  How and whether the declared encoding will be used during the
 *  detection process is TBD.
 *
 * @param ucsd      the charset detector to be used.
 * @param encoding  an encoding for the current data obtained from
 *                  a header or declaration or other source outside
 *                  of the byte data itself.
 * @param length    the length of the encoding name, or -1 if the name string
 *                  is NUL terminated.
 * @param status    any error conditions are reported back in this variable.
 *
 * @stable ICU 3.6
 */
U_STABLE void U_EXPORT2
ucsdet_setDeclaredEncoding(UCharsetDetector *ucsd, const char *encoding, int32_t length, UErrorCode *status);


/**
 * Return the charset that best matches the supplied input data.
 *
 * Note though, that because the detection
 * only looks at the start of the input data,
 * there is a possibility that the returned charset will fail to handle
 * the full set of input data.
 * <p>
 * The returned UCharsetMatch object is owned by the UCharsetDetector.
 * It will remain valid until the detector input is reset, or until
 * the detector is closed.
 * <p>
 * The function will fail if
 *  <ul>
 *    <li>no charset appears to match the data.</li>
 *    <li>no input text has been provided</li>
 *  </ul>
 *
 * @param ucsd      the charset detector to be used.
 * @param status    any error conditions are reported back in this variable.
 * @return          a UCharsetMatch representing the best matching charset,
 *                  or NULL if no charset matches the byte data.
 *
 * @stable ICU 3.6
 */
U_STABLE const UCharsetMatch * U_EXPORT2
ucsdet_detect(UCharsetDetector *ucsd, UErrorCode *status);


/**
 * Find all charset matches that appear to be consistent with the input,
 * returning an array of results.  The results are ordered with the
 * best quality match first.
 *
 * Because the detection only looks at a limited amount of the
 * input byte data, some of the returned charsets may fail to handle
 * all of the input data.
 * <p>
 * The returned UCharsetMatch objects are owned by the UCharsetDetector.
 * They will remain valid until the detector is closed or modified.
 *
 * <p>
 * Return an error if
 *  <ul>
 *    <li>no charsets appear to match the input data.</li>
 *    <li>no input text has been provided</li>
 *  </ul>
 *
 * @param ucsd          the charset detector to be used.
 * @param matchesFound  pointer to a variable that will be set to the
 *                      number of charsets identified that are consistent with
 *                      the input data.  Output only.
 * @param status        any error conditions are reported back in this variable.
 * @return              A pointer to an array of pointers to UCharsetMatch objects.
 *                      This array, and the UCharsetMatch instances to which it refers,
 *                      are owned by the UCharsetDetector, and will remain valid until
 *                      the detector is closed or modified.
 * @stable ICU 3.6
 */
U_STABLE const UCharsetMatch ** U_EXPORT2
ucsdet_detectAll(UCharsetDetector *ucsd, int32_t *matchesFound, UErrorCode *status);



/**
 * Get the name of the charset represented by a UCharsetMatch.
 *
 * The storage for the returned name string is owned by the
 * UCharsetMatch, and will remain valid while the UCharsetMatch
 * is valid.
 *
 * The name returned is suitable for use with the ICU conversion APIs.
 *
 * @param ucsm    The charset match object.
 * @param status  Any error conditions are reported back in this variable.
 * @return        The name of the matching charset.
 *
 * @stable ICU 3.6
 */
U_STABLE const char * U_EXPORT2
ucsdet_getName(const UCharsetMatch *ucsm, UErrorCode *status);

/**
 * Get a confidence number for the quality of the match of the byte
 * data with the charset.  Confidence numbers range from zero to 100,
 * with 100 representing complete confidence and zero representing
 * no confidence.
 *
 * The confidence values are somewhat arbitrary.  They define an
 * ordering within the results for any single detection operation
 * but are not generally comparable between the results for different input.
 *
 * A confidence value of ten does have a general meaning - it is used
 * for charsets that can represent the input data, but for which there
 * is no other indication that suggests that the charset is the correct one.
 * Pure 7 bit ASCII data, for example, is compatible with a
 * great many charsets, most of which will appear as possible matches
 * with a confidence of 10.
 *
 * @param ucsm    The charset match object.
 * @param status  Any error conditions are reported back in this variable.
 * @return        A confidence number for the charset match.
 *
 * @stable ICU 3.6
 */
U_STABLE int32_t U_EXPORT2
ucsdet_getConfidence(const UCharsetMatch *ucsm, UErrorCode *status);

/**
 * Get the RFC 3066 code for the language of the input data.
 *
 * The Charset Detection service is intended primarily for detecting
 * charsets, not language.
 * For some, but not all, charsets, a language is
 * identified as a byproduct of the detection process, and that is what
 * is returned by this function.
 *
 * CAUTION:
 *    1.  Language information is not available for input data encoded in
 *        all charsets. In particular, no language is identified
 *        for UTF-8 input data.
 *
 *    2.  Closely related languages may sometimes be confused.
 *
 * If more accurate language detection is required, a linguistic
 * analysis package should be used.
 *
 * The storage for the returned name string is owned by the
 * UCharsetMatch, and will remain valid while the UCharsetMatch
 * is valid.
 *
 * @param ucsm    The charset match object.
 * @param status  Any error conditions are reported back in this variable.
 * @return        The RFC 3066 code for the language of the input data, or
 *                an empty string if the language could not be determined.
 *
 * @stable ICU 3.6
 */
U_STABLE const char * U_EXPORT2
ucsdet_getLanguage(const UCharsetMatch *ucsm, UErrorCode *status);


/**
 * Get the entire input text as a UChar string, placing it into
 * a caller-supplied buffer.  A terminating
 * NUL character will be appended to the buffer if space is available.
 *
 * The number of UChars in the output string, not including the terminating
 * NUL, is returned.
 *
 * If the supplied buffer is smaller than required to hold the output,
 * the contents of the buffer are undefined.
 * The full output string length
 * (in UChars) is returned as always, and can be used to allocate a buffer
 * of the correct size.
 *
 *
 * @param ucsm    The charset match object.
 * @param buf     A UChar buffer to be filled with the converted text data.
 * @param cap     The capacity of the buffer in UChars.
 * @param status  Any error conditions are reported back in this variable.
 * @return        The number of UChars in the output string.
 *
 * @stable ICU 3.6
 */
U_STABLE int32_t U_EXPORT2
ucsdet_getUChars(const UCharsetMatch *ucsm,
                 UChar *buf, int32_t cap, UErrorCode *status);



/**
 * Get an iterator over the set of all detectable charsets -
 * over the charsets that are known to the charset
 * detection service.
 *
 * The returned UEnumeration provides access to the names of
 * the charsets.
 *
 * The state of the Charset detector that is passed in does not
 * affect the result of this function, but requiring a valid, open
 * charset detector as a parameter ensures that the charset detection
 * service has been safely initialized and that the required detection
 * data is available.
 *
 * @param ucsd    a Charset detector.
 * @param status  Any error conditions are reported back in this variable.
 * @return an iterator providing access to the detectable charset names.
 * @stable ICU 3.6
 */
U_STABLE UEnumeration * U_EXPORT2
ucsdet_getAllDetectableCharsets(const UCharsetDetector *ucsd, UErrorCode *status);


/**
 * Test whether input filtering is enabled for this charset detector.
 * Input filtering removes text that appears to be HTML or xml
 * markup from the input before applying the code page detection
 * heuristics.
 *
 * @param ucsd  The charset detector to check.
 * @return TRUE if filtering is enabled.
 * @stable ICU 3.6
 */
U_STABLE UBool U_EXPORT2
ucsdet_isInputFilterEnabled(const UCharsetDetector *ucsd);


/**
 * Enable filtering of input text.
 * If filtering is enabled,
 * text within angle brackets ("<" and ">") will be removed
 * before detection, which will remove most HTML or xml markup.
 *
 * @param ucsd    the charset detector to be modified.
 * @param filter  <code>true</code> to enable input text filtering.
 * @return The previous setting.
 *
 * @stable ICU 3.6
 */
U_STABLE UBool U_EXPORT2
ucsdet_enableInputFilter(UCharsetDetector *ucsd, UBool filter);

#endif
#endif   /* __UCSDET_H */
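
The comment on ucsdet_enableInputFilter says that text within angle brackets is removed before detection. The sketch below illustrates that idea in isolation. It is not ICU's implementation: the helper name strip_angle_brackets and its signature are hypothetical, it uses only the C standard library, and the real detector may apply further heuristics that are not described in this header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, NOT part of ICU: copy `len` bytes from `in` to `out`,
 * dropping everything from a '<' up to and including the next '>'.
 * `out` must have room for at least `len` bytes.
 * Returns the number of bytes written; no NUL terminator is appended.
 */
size_t strip_angle_brackets(const char *in, size_t len, char *out)
{
    size_t written = 0;
    int in_markup = 0;              /* nonzero while between '<' and '>' */

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < len; i++) {
        char c = in[i];
        if (c == '<') {
            in_markup = 1;          /* start of a tag: drop it */
        } else if (c == '>' && in_markup) {
            in_markup = 0;          /* end of a tag: drop the '>' as well */
        } else if (!in_markup) {
            out[written++] = c;     /* ordinary text: keep it */
        }
    }
    return written;
}
```

Called on the 22 bytes of `<p>Hi <b>there</b></p>`, this helper writes the 8 bytes `Hi there`. A stray '>' outside of any tag is kept, and an unterminated '<' causes the rest of the input to be dropped; both choices are assumptions of this sketch rather than documented ICU behavior.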