2259a71ba68c8b2d0d74396277fd25cdc0692350 |
|
22-Dec-2017 |
Victor Chang <vichang@google.com> |
Make the Android fast-path UTF-8 decoder follow the Unicode Standard and the W3C Encoding standard.

The UTF-8 decoder in the RI has strictly followed the Unicode Standard since OpenJDK 8 (JDK-7096080). Essentially, it
1. rejects a 3-byte surrogate / 6-byte surrogate pair (CESU-8 sequence)
2. treats an ill-formed sequence, e.g. a surrogate, as individual ill-formed bytes.

This change updates Android's fast-path UTF-8 decoder to
- follow the Unicode Standard
- behave more like the RI (OpenJDK 8)
- behave consistently with java.nio.charset.CharsetDecoder

It implements the W3C recommended UTF-8 decoder.
https://www.w3.org/TR/encoding/#utf-8-decoder

Behavior change of the fast-path UTF-8 decoder
- No longer behaves like a decoder for Modified UTF-8 and CESU-8 sequences
-- If an app needs to decode a Modified UTF-8 / CESU-8 sequence, it can use the public API DataInputStream.readUTF or the JNI function NewStringUTF. See the example at StringTest.decodeModifiedUTF8
- Treats an overlong sequence as ill-formed. For example, the byte sequence "c0 b1" is an overlong form of the character '1' (U+0031).
- Treats a surrogate (U+D800..U+DFFF) as ill-formed
- Replaces each maximal subpart of an ill-formed sequence with a single U+FFFD. For example, in the byte sequence "41 C0 AF 41 F4 80 80 41", the maximal subparts are "C0", "AF", and "F4 80 80". "F4 80 80" can be the initial subsequence of "F4 80 80 80", but "C0 AF" can't be the initial subsequence of any well-formed code unit sequence. Thus, the output should be "A\ufffd\ufffdA\ufffdA".

Test change:
- CharsetEncoder2Test.testUtf8Encoding: A UTF-8 encoded surrogate is now treated as invalid.
- X500PrincipalTest.testValidDN: An overlong sequence is now treated as invalid. According to my test, Android Conscrypt (and BoringSSL) has rejected certificates with such an overlong sequence in the CN since OC MR1. Thus, there is little use case for creating an X500Principal with an overlong UTF-8 sequence. Also, the RI doesn't pass this test either.
Context: From my understanding, certificates and X500 principals are stored in ASN.1 format. The RFC standards quoted in X500Principal don't prohibit overlong UTF-8 sequences, but the newer standards, RFC5280 for X.509 and RFC3629 for UTF-8, explicitly prohibit any overlong UTF-8 sequence.

Performance change:
The performance of the fast-path decoder is similar before and after the change.

=== Before the change ===
CharsetBenchmark
Experiment {instrument=runtime, benchmarkMethod=time_new_String_BString, vm=default, parameters={length=10000, name=UTF-8}}
  Results: runtime(ns): min=574795.84, 1st qu.=574795.84, median=574795.84, mean=574795.84, 3rd qu.=574795.84, max=574795.84

CharsetUtf8Benchmark
Trial Report (1 of 4):
  Experiment {instrument=runtime, benchmarkMethod=time_ascii, vm=default, parameters={}}
  Results: runtime(ns): min=58290943.00, 1st qu.=58290943.00, median=58290943.00, mean=58290943.00, 3rd qu.=58290943.00, max=58290943.00
Trial Report (2 of 4):
  Experiment {instrument=runtime, benchmarkMethod=time_bmp2, vm=default, parameters={}}
  Results: runtime(ns): min=77581414.00, 1st qu.=77581414.00, median=77581414.00, mean=77581414.00, 3rd qu.=77581414.00, max=77581414.00
Trial Report (3 of 4):
  Experiment {instrument=runtime, benchmarkMethod=time_bmp3, vm=default, parameters={}}
  Results: runtime(ns): min=57457297.00, 1st qu.=57457297.00, median=57457297.00, mean=57457297.00, 3rd qu.=57457297.00, max=57457297.00
Trial Report (4 of 4):
  Experiment {instrument=runtime, benchmarkMethod=time_supplementary, vm=default, parameters={}}
  Results: runtime(ns): min=60723183.00, 1st qu.=60723183.00, median=60723183.00, mean=60723183.00, 3rd qu.=60723183.00, max=60723183.00

=== After the change ===
CharsetBenchmark
Experiment {instrument=runtime, benchmarkMethod=time_new_String_BString, vm=default, parameters={length=10000, name=UTF-8}}
  Results: runtime(ns): min=523638.25, 1st qu.=523638.25, median=523638.25, mean=523638.25, 3rd qu.=523638.25, max=523638.25

CharsetUtf8Benchmark
Trial Report (1 of 4):
  Experiment {instrument=runtime, benchmarkMethod=time_ascii, vm=default, parameters={}}
  Results: runtime(ns): min=57101725.00, 1st qu.=57101725.00, median=57101725.00, mean=57101725.00, 3rd qu.=57101725.00, max=57101725.00
Trial Report (2 of 4):
  Experiment {instrument=runtime, benchmarkMethod=time_bmp2, vm=default, parameters={}}
  Results: runtime(ns): min=76573080.00, 1st qu.=76573080.00, median=76573080.00, mean=76573080.00, 3rd qu.=76573080.00, max=76573080.00
Trial Report (3 of 4):
  Experiment {instrument=runtime, benchmarkMethod=time_bmp3, vm=default, parameters={}}
  Results: runtime(ns): min=59655214.00, 1st qu.=59655214.00, median=59655214.00, mean=59655214.00, 3rd qu.=59655214.00, max=59655214.00
Trial Report (4 of 4):
  Experiment {instrument=runtime, benchmarkMethod=time_supplementary, vm=default, parameters={}}
  Results: runtime(ns): min=67283548.00, 1st qu.=67283548.00, median=67283548.00, mean=67283548.00, 3rd qu.=67283548.00, max=67283548.00

Test: cts-tradefed run cts-dev -m CtsLibcoreTestCases
Test: cts-tradefed run cts-dev -m CtsLibcoreOjTestCases
Bug: 69599767
Bug: 70511691
Change-Id: I2c3e84808b19c969905813f6654ba552b6745354
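
Illustration of the decoding rules described in this message: the following is a minimal sketch of a decoder that follows the steps of the W3C "UTF-8 decoder" algorithm and replaces each maximal subpart of an ill-formed sequence with U+FFFD. It is not the libcore fast-path code from this change; the class name Utf8DecoderSketch and the method decode are hypothetical, chosen only for this example.

```java
final class Utf8DecoderSketch {
    private static final char REPLACEMENT = '\ufffd';

    /** Decodes UTF-8, emitting U+FFFD for every ill-formed maximal subpart. */
    static String decode(byte[] input) {
        StringBuilder out = new StringBuilder(input.length);
        int codePoint = 0;
        int needed = 0;    // continuation bytes still expected
        int lower = 0x80;  // allowed range for the next continuation byte
        int upper = 0xBF;
        int i = 0;
        while (i < input.length) {
            int b = input[i] & 0xFF;
            if (needed == 0) {
                if (b <= 0x7F) {
                    out.append((char) b);
                } else if (b >= 0xC2 && b <= 0xDF) {
                    // C0 and C1 are excluded: they can only start overlong forms.
                    needed = 1;
                    codePoint = b & 0x1F;
                } else if (b >= 0xE0 && b <= 0xEF) {
                    if (b == 0xE0) lower = 0xA0; // rejects overlong 3-byte forms
                    if (b == 0xED) upper = 0x9F; // rejects U+D800..U+DFFF
                    needed = 2;
                    codePoint = b & 0x0F;
                } else if (b >= 0xF0 && b <= 0xF4) {
                    if (b == 0xF0) lower = 0x90; // rejects overlong 4-byte forms
                    if (b == 0xF4) upper = 0x8F; // rejects values above U+10FFFF
                    needed = 3;
                    codePoint = b & 0x07;
                } else {
                    out.append(REPLACEMENT);     // stray continuation or invalid lead byte
                }
                i++;
                continue;
            }
            if (b < lower || b > upper) {
                // The bytes consumed so far form one maximal subpart: emit a single
                // U+FFFD, reset the state, and reprocess this byte as a lead byte.
                out.append(REPLACEMENT);
                codePoint = 0;
                needed = 0;
                lower = 0x80;
                upper = 0xBF;
                continue; // i is intentionally not advanced
            }
            lower = 0x80;
            upper = 0xBF;
            codePoint = (codePoint << 6) | (b & 0x3F);
            needed--;
            if (needed == 0) {
                out.appendCodePoint(codePoint);
                codePoint = 0;
            }
            i++;
        }
        if (needed != 0) {
            out.append(REPLACEMENT); // input ended inside a multi-byte sequence
        }
        return out.toString();
    }
}
```

With this sketch, the byte sequence "41 C0 AF 41 F4 80 80 41" decodes to "A\ufffd\ufffdA\ufffdA", and the overlong sequence "c0 b1" decodes to two U+FFFD characters rather than '1'.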
|
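
For apps that relied on the old lenient behavior, the commit message above points to DataInputStream.readUTF as one way to decode Modified UTF-8 / CESU-8 input. The sketch below shows one possible way to do that, assuming the input fits in the 65535-byte limit of readUTF. It is not necessarily the approach used in StringTest.decodeModifiedUTF8; the class name ModifiedUtf8Sketch and the method decode are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class ModifiedUtf8Sketch {
    /**
     * Decodes Modified UTF-8 bytes with DataInputStream.readUTF.
     * readUTF expects a 2-byte big-endian length prefix, so one is prepended here.
     */
    static String decode(byte[] modifiedUtf8) {
        if (modifiedUtf8.length > 0xFFFF) {
            throw new IllegalArgumentException("readUTF handles at most 65535 bytes");
        }
        byte[] framed = new byte[modifiedUtf8.length + 2];
        framed[0] = (byte) (modifiedUtf8.length >>> 8);
        framed[1] = (byte) modifiedUtf8.length;
        System.arraycopy(modifiedUtf8, 0, framed, 2, modifiedUtf8.length);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            return in.readUTF();
        } catch (IOException e) {
            // readUTF reports malformed input as UTFDataFormatException (an IOException).
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, the 6-byte surrogate pair "ED A0 BD ED B8 80" decodes to the single code point U+1F600 this way, and "C0 80" decodes to U+0000, whereas a strict UTF-8 decoder treats both inputs as ill-formed.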