History log of /external/skia/src/opts/SkColor_opts_SSE2.h
Revision Date Author Comments
52fe583a7b21ee1cb04e95b05db9946be899b26d 13-Mar-2017 Florin Malita <fmalita@chromium.org> Remove SK_SUPPORT_LEGACY_BROKEN_LERP support

Chromium change landed.

BUG=chromium:696216

CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD

Change-Id: I3e67392b0fdad8c5a3ad256e4f190123dff6c846
Reviewed-on: https://skia-review.googlesource.com/9551
Reviewed-by: Mike Reed <reed@google.com>
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Florin Malita <fmalita@chromium.org>
/external/skia/src/opts/SkColor_opts_SSE2.h
40254c2c2dc28a34f96294d5a1ad94a99b0be8a6 05-Aug-2016 lsalzman <lsalzman@mozilla.com> SkBlendARGB32 and S32[A]_Blend_BlitRow32 are currently formulated as: SkAlphaMulQ(src, src_scale) + SkAlphaMulQ(dst, dst_scale), which boils down to ((src*src_scale)>>8) + ((dst*dst_scale)>>8). In particular, note that the intermediate precision is discarded before the two parts are added together, causing the final result to possibly be inaccurate.

In Firefox, we use SkCanvas::saveLayer in combination with a backdrop that initializes the layer to the background. When this is blended back onto the background using transparency, where the source and destination pixel colors are the same, the resulting color after the blend is not preserved due to the lost precision mentioned above. In cases where this operation is performed repeatedly, this causes clearly noticeable differences in color, as evidenced in this downstream Firefox bug report: https://bugzilla.mozilla.org/show_bug.cgi?id=1200684

In the test-case in the downstream report, essentially it does blend(src=0xFF2E3338, dst=0xFF2E3338, scale=217), which gives the result 0xFF2E3237, while we would expect to get back 0xFF2E3338.

This problem goes away if the blend is instead reformulated to effectively do (src*src_scale + dst*dst_scale)>>8, which keeps the intermediate precision during the addition before shifting it off.

This modifies the blending operations thusly. The performance should remain mostly unchanged, or possibly improve slightly, so there should be no real downside to doing this, with the benefit of making the results more accurate. Without this, it is currently unsafe for Firefox to blend a layer back onto itself that was initialized with a copy of its background.
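
The difference between the two formulations can be sketched per channel as follows. This is a minimal illustration, not Skia's code: the function names are hypothetical, and the scale pair (218, 39) used in the note below is an assumption, chosen because it reproduces the values quoted in the test case above.

```cpp
#include <cstdint>

// Hypothetical helper (not Skia API): the old formulation.
// Each product is shifted down on its own, so both fractional parts
// are thrown away before the two halves are added.
inline uint32_t blend_shift_then_add(uint32_t src, uint32_t dst,
                                     unsigned src_scale, unsigned dst_scale) {
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xFF;
        uint32_t d = (dst >> shift) & 0xFF;
        uint32_t c = ((s * src_scale) >> 8) + ((d * dst_scale) >> 8);
        out |= (c & 0xFF) << shift;
    }
    return out;
}

// Hypothetical helper (not Skia API): the reformulated blend.
// The products are summed at full precision and shifted once.
inline uint32_t blend_add_then_shift(uint32_t src, uint32_t dst,
                                     unsigned src_scale, unsigned dst_scale) {
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xFF;
        uint32_t d = (dst >> shift) & 0xFF;
        uint32_t c = (s * src_scale + d * dst_scale) >> 8;
        out |= (c & 0xFF) << shift;
    }
    return out;
}
```

With src = dst = 0xFF2E3338 and the assumed scales 218 and 39, blend_shift_then_add returns 0xFF2E3237 while blend_add_then_shift returns 0xFF2E3338. The real code works on two channels at a time rather than looping per channel; the loop here only keeps the arithmetic easy to follow.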

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2097883002
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

[mtklein adds...]
No public API changes.
TBR=reed@google.com

Review-Url: https://codereview.chromium.org/2097883002
/external/skia/src/opts/SkColor_opts_SSE2.h
4bf1ce2709f5f4a4850d9b04b3213be732cbdf89 02-Feb-2015 stephana <stephana@google.com> Revert of Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #1 id:1 of https://codereview.chromium.org/873553003/)

Reason for revert:
Reverted the wrong CL.

Original issue's description:
> Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #16 id:300001 of https://codereview.chromium.org/874863002/)
>
> Reason for revert:
> This causes a bug on the 'hittestpath' GM on MacMini 4,1
>
> See:
>
> https://gold.skia.org/#/triage/hittestpath?head=0
>
> for details.
>
> Original issue's description:
> > SSE4 opaque blend using intrinsics instead of assembly.
> >
> > Since we had such a hard time with the assembly versions of this blit (to the
> > point that we have them completely disabled everywhere), I thought I'd take
> > a shot at writing a version of the blit using intrinsics.
> >
> > The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
> > to skip the blend when the 16 src pixels we consider each loop are all opaque
> > or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
> > all those alphas.
> >
> > It's worth looking to see if we can backport this type of logic to SSE2 using
> > _mm_movemask_epi8, or up to 32 pixels at a time using AVX.
> >
> > My local performance testing doesn't show this to be an unambiguous win
> > (there are probably microbenchmarks and SKPs where we'd be better off just
> > powering through the blend rather than looking at alphas), but the potential
> > does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.)
> >
> > DM says it draws pixel-perfect compared to the old code.
> >
> > Microbenchmarks:
> > bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x
> > bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x
> > bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x
> > bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x
> > bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x
> > bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x
> > bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x
> > bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x
> > bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x
> > bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x
> > bitmap_RGBA_8888 5.73us -> 5.58us 0.97x
> > bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x
> > bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x
> > bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x
> > bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x
> > bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x
> > bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x
> >
> > Running over our ~70 SKP web page captures, this looks like we spend 0.7x
> > the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
> > be a decent predictor of real-world impact.
> >
> > BUG=chromium:399842
> >
> > Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785
> >
> > CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot
> >
> > Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493
>
> TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
> NOPRESUBMIT=true
> NOTREECHECKS=true
> NOTRY=true
> BUG=chromium:399842
>
> Committed: https://skia.googlesource.com/skia/+/4988891a1173cd405bf1c1dd3a3668c451f45e4c

TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:399842

Review URL: https://codereview.chromium.org/894083002
/external/skia/src/opts/SkColor_opts_SSE2.h
4988891a1173cd405bf1c1dd3a3668c451f45e4c 02-Feb-2015 stephana <stephana@google.com> Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #16 id:300001 of https://codereview.chromium.org/874863002/)

Reason for revert:
This causes a bug on the 'hittestpath' GM on MacMini 4,1

See:

https://gold.skia.org/#/triage/hittestpath?head=0

for details.

Original issue's description:
> SSE4 opaque blend using intrinsics instead of assembly.
>
> Since we had such a hard time with the assembly versions of this blit (to the
> point that we have them completely disabled everywhere), I thought I'd take
> a shot at writing a version of the blit using intrinsics.
>
> The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
> to skip the blend when the 16 src pixels we consider each loop are all opaque
> or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
> all those alphas.
>
> It's worth looking to see if we can backport this type of logic to SSE2 using
> _mm_movemask_epi8, or up to 32 pixels at a time using AVX.
>
> My local performance testing doesn't show this to be an unambiguous win
> (there are probably microbenchmarks and SKPs where we'd be better off just
> powering through the blend rather than looking at alphas), but the potential
> does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.)
>
> DM says it draws pixel-perfect compared to the old code.
>
> Microbenchmarks:
> bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x
> bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x
> bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x
> bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x
> bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x
> bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x
> bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x
> bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x
> bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x
> bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x
> bitmap_RGBA_8888 5.73us -> 5.58us 0.97x
> bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x
> bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x
> bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x
> bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x
> bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x
> bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x
>
> Running over our ~70 SKP web page captures, this looks like we spend 0.7x
> the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
> be a decent predictor of real-world impact.
>
> BUG=chromium:399842
>
> Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785
>
> CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot
>
> Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493

TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:399842

Review URL: https://codereview.chromium.org/873553003
/external/skia/src/opts/SkColor_opts_SSE2.h
6dbfb21a6c88af6d94e8c823c3ad559f1a41b493 27-Jan-2015 mtklein <mtklein@chromium.org> SSE4 opaque blend using intrinsics instead of assembly.

Since we had such a hard time with the assembly versions of this blit (to the
point that we have them completely disabled everywhere), I thought I'd take
a shot at writing a version of the blit using intrinsics.

The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
to skip the blend when the 16 src pixels we consider each loop are all opaque
or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
all those alphas.

It's worth looking to see if we can backport this type of logic to SSE2 using
_mm_movemask_epi8, or up to 32 pixels at a time using AVX.
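
One possible shape of the SSE2 backport mentioned above, as a hedged sketch rather than the code in this CL. It classifies 4 pixels (one 128-bit register) instead of 16, the names AlphaClass and classify_alpha4 are hypothetical, and alpha is assumed to live in the top byte of each 32-bit pixel. A scalar fallback is included so the sketch also builds where SSE2 is unavailable.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical result type for the sketch.
enum class AlphaClass { kAllTransparent, kAllOpaque, kMixed };

// Hypothetical helper: inspect only the alpha bytes of 4 packed pixels
// (alpha assumed in bits 24..31) and report whether the blend can be skipped.
inline AlphaClass classify_alpha4(const uint32_t px[4]) {
#if defined(__SSE2__)
    const __m128i alphaMask = _mm_set1_epi32(static_cast<int>(0xFF000000u));
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(px));
    const __m128i a = _mm_and_si128(v, alphaMask);
    // cmpeq sets a lane to all ones on equality; movemask collects the top
    // bit of each of the 16 bytes, so 0xFFFF means all four lanes matched.
    if (_mm_movemask_epi8(_mm_cmpeq_epi32(a, _mm_setzero_si128())) == 0xFFFF) {
        return AlphaClass::kAllTransparent;
    }
    if (_mm_movemask_epi8(_mm_cmpeq_epi32(a, alphaMask)) == 0xFFFF) {
        return AlphaClass::kAllOpaque;
    }
    return AlphaClass::kMixed;
#else
    // Scalar fallback: OR finds any non-zero alpha, AND finds any non-0xFF alpha.
    uint32_t allOr = 0, allAnd = 0xFFFFFFFFu;
    for (int i = 0; i < 4; ++i) {
        allOr |= px[i];
        allAnd &= px[i];
    }
    if ((allOr >> 24) == 0) return AlphaClass::kAllTransparent;
    if ((allAnd >> 24) == 0xFF) return AlphaClass::kAllOpaque;
    return AlphaClass::kMixed;
#endif
}
```

A caller would leave dst untouched for kAllTransparent, copy src over dst for kAllOpaque, and fall through to the full blend for kMixed. The SSE4 version described above gets the same answers from ptest with fewer instructions.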

My local performance testing doesn't show this to be an unambiguous win
(there are probably microbenchmarks and SKPs where we'd be better off just
powering through the blend rather than looking at alphas), but the potential
does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.)

DM says it draws pixel-perfect compared to the old code.

Microbenchmarks:
bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x
bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x
bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x
bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x
bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x
bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x
bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x
bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x
bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x
bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x
bitmap_RGBA_8888 5.73us -> 5.58us 0.97x
bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x
bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x
bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x
bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x
bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x
bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x

Running over our ~70 SKP web page captures, this looks like we spend 0.7x
the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
be a decent predictor of real-world impact.

BUG=chromium:399842

Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785

CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot

Review URL: https://codereview.chromium.org/874863002
/external/skia/src/opts/SkColor_opts_SSE2.h
2d80dd264704984bca4c2915f31ef427033b0b9b 26-Jan-2015 bungeman <bungeman@google.com> Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #14 id:260001 of https://codereview.chromium.org/874863002/)

Reason for revert:
This kills Mac 10.6 bots.

FAILED: c++ -MMD -MF obj/src/opts/opts_sse4.SkBlitRow_opts_SSE4.o.d -DSK_INTERNAL -DSK_GAMMA_SRGB -DSK_GAMMA_APPLY_TO_A8 -DSK_SCALAR_TO_FLOAT_EXCLUDED -DSK_ALLOW_STATIC_GLOBAL_INITIALIZERS=1 -DSK_SUPPORT_GPU=1 -DSK_SUPPORT_OPENCL=0 -DSK_FORCE_DISTANCE_FIELD_TEXT=0 -DSK_BUILD_FOR_MAC -DSK_CRASH_HANDLER -DSK_DEVELOPER=1 -I../../src/core -I../../src/utils -I../../include/c -I../../include/config -I../../include/core -I../../include/pathops -I../../include/pipe -I../../include/utils/mac -I../../include/effects -O0 -gdwarf-2 -mmacosx-version-min=10.6 -arch x86_64 -mssse3 -Wall -Wextra -Winit-self -Wpointer-arith -Wsign-compare -Wno-unused-parameter -Wno-invalid-offsetof -msse4.1 -c ../../src/opts/SkBlitRow_opts_SSE4.cpp -o obj/src/opts/opts_sse4.SkBlitRow_opts_SSE4.o
../../src/opts/SkBlitRow_opts_SSE4.cpp:15:27: warning: x86intrin.h: No such file or directory
../../src/opts/SkBlitRow_opts_SSE4.cpp: In function 'void S32A_Opaque_BlitRow32_SSE4(SkPMColor*, const SkPMColor*, int, U8CPU)':
../../src/opts/SkBlitRow_opts_SSE4.cpp:40: error: '_mm_testz_si128' was not declared in this scope
../../src/opts/SkBlitRow_opts_SSE4.cpp:45: error: '_mm_testc_si128' was not declared in this scope

Original issue's description:
> SSE4 opaque blend using intrinsics instead of assembly.
>
> Since we had such a hard time with the assembly versions of this blit (to the
> point that we have them completely disabled everywhere), I thought I'd take
> a shot at writing a version of the blit using intrinsics.
>
> The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
> to skip the blend when the 16 src pixels we consider each loop are all opaque
> or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
> all those alphas.
>
> It's worth looking to see if we can backport this type of logic to SSE2 using
> _mm_movemask_epi8, or up to 32 pixels at a time using AVX.
>
> My local performance testing doesn't show this to be an unambiguous win
> (there are probably microbenchmarks and SKPs where we'd be better off just
> powering through the blend rather than looking at alphas), but the potential
> does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.)
>
> DM says it draws pixel-perfect compared to the old code.
>
> Microbenchmarks:
> bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x
> bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x
> bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x
> bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x
> bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x
> bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x
> bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x
> bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x
> bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x
> bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x
> bitmap_RGBA_8888 5.73us -> 5.58us 0.97x
> bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x
> bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x
> bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x
> bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x
> bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x
> bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x
>
> Running over our ~70 SKP web page captures, this looks like we spend 0.7x
> the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
> be a decent predictor of real-world impact.
>
> BUG=chromium:399842
>
> Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785

TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:399842

Review URL: https://codereview.chromium.org/874033004
/external/skia/src/opts/SkColor_opts_SSE2.h
04bc91b972417038fecfa87c484771eac2b9b785 26-Jan-2015 mtklein <mtklein@chromium.org> SSE4 opaque blend using intrinsics instead of assembly.

Since we had such a hard time with the assembly versions of this blit (to the
point that we have them completely disabled everywhere), I thought I'd take
a shot at writing a version of the blit using intrinsics.

The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
to skip the blend when the 16 src pixels we consider each loop are all opaque
or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
all those alphas.

It's worth looking to see if we can backport this type of logic to SSE2 using
_mm_movemask_epi8, or up to 32 pixels at a time using AVX.

My local performance testing doesn't show this to be an unambiguous win
(there are probably microbenchmarks and SKPs where we'd be better off just
powering through the blend rather than looking at alphas), but the potential
does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.)

DM says it draws pixel-perfect compared to the old code.

Microbenchmarks:
bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x
bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x
bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x
bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x
bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x
bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x
bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x
bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x
bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x
bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x
bitmap_RGBA_8888 5.73us -> 5.58us 0.97x
bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x
bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x
bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x
bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x
bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x
bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x

Running over our ~70 SKP web page captures, this looks like we spend 0.7x
the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
be a decent predictor of real-world impact.

BUG=chromium:399842

Review URL: https://codereview.chromium.org/874863002
/external/skia/src/opts/SkColor_opts_SSE2.h
8d029a4ebeda9d99c40b864e9c93c55be58b05af 26-Jan-2015 mtklein <mtklein@chromium.org> Don't do a pointless << 0.

It's very common (universal?) that alpha is the top byte.
You'd hope the compiler would remove the left shift then,
but I've seen Clang just do a dumb left shift of zero. :(

BUG=skia:

Review URL: https://codereview.chromium.org/872243003
/external/skia/src/opts/SkColor_opts_SSE2.h
785982ed80ce43653f64640d04dae7eaf5e2f809 25-Nov-2014 mtklein <mtklein@chromium.org> Eliminate static initializers in SkColor_SSE2.h.

Chrome hates static initializers.

Two global masks can become a single local mask instead. Perf looks like a no-op:

$ c --match bitmaprect_80 bitmap_RGBA --config 8888
bitmap_RGBA_8888_scale 13.7us -> 14.1us 1.03x
bitmap_RGBA_8888_update_volatile 4.53us -> 4.6us 1.02x
bitmap_RGBA_8888 4.55us -> 4.61us 1.01x
bitmap_RGBA_8888_update 4.64us -> 4.67us 1.01x
bitmap_RGBA_8888_A_source_stripes_three 9.66us -> 9.71us 1.01x
bitmaprect_80_filter_identity 10.6us -> 10.5us 0.99x
bitmaprect_80_nofilter_identity 10.5us -> 10.4us 0.99x

TBR=reed@google.com
BUG=skia:

Review URL: https://codereview.chromium.org/762453002
/external/skia/src/opts/SkColor_opts_SSE2.h
2253aa93930cdc5d0615098ce5473065427bcff6 25-Nov-2014 qiankun.miao <qiankun.miao@intel.com> Add SkBlendARGB32_SSE2() to clean up code

Related nanobench results:
before:
maxrss loops min median mean max stddev samples config bench
10M 2 31.9µs 32.4µs 33.3µs 38.7µs 6% █▄▂▂▂▁▂▁▁▁ 8888 bitmap_BGRA_8888_A_scale_bicubic
10M 13 43.8µs 51.8µs 49.6µs 57.9µs 11% ▁▁▁▁▂▆▇▆▅█ 8888 bitmap_BGRA_8888_A_scale_bilerp
10M 13 23.7µs 24.3µs 26µs 32.7µs 13% ▅█▆▁▁▁▁▂▁▁ 8888 bitmap_Index_8_A
10M 4 1.68µs 1.7µs 4.09µs 25.4µs 183% █▁▁▁▁▁▁▁▁▁ 8888 text_16_AA_88
10M 144 1.76µs 1.77µs 1.78µs 1.81µs 1% █▂▇▂▅▁▁▁▁▁ 8888 text_16_AA_FF
10M 10 4.7µs 5.34µs 5.61µs 8.63µs 21% █▂▂▃▂▁▁▁▁▄ 8888 rotated_rects_aa_alternating_transparent_and_opaque_src
10M 50 4.44µs 4.47µs 4.5µs 4.71µs 2% █▅▃▂▂▂▁▁▁▁ 8888 rotated_rects_aa_changing_opaque_src
10M 51 4.39µs 4.78µs 5.21µs 6.62µs 17% ▁▆▆▇▁▁█▁▂▂ 8888 rotated_rects_aa_same_opaque_src
10M 50 4.47µs 5.79µs 5.43µs 6.14µs 11% ▄▂▁▃▇▇▆▇▇█ 8888 rotated_rects_aa_alternating_transparent_and_opaque_srcover
10M 30 4.35µs 6.06µs 5.84µs 7.63µs 16% ▅▅▅▄▅▅▄█▁▁ 8888 rotated_rects_aa_changing_transparent_srcover
10M 44 4.31µs 4.51µs 4.76µs 6.25µs 13% ▄▂▂▁█▃▁▃▁▁ 8888 rotated_rects_aa_changing_opaque_srcover
10M 46 4.36µs 4.42µs 4.75µs 6.19µs 14% ▆█▃▁▁▁▁▁▁▁ 8888 rotated_rects_aa_same_transparent_srcover
10M 47 4.29µs 4.35µs 4.44µs 5.15µs 6% ▃▂▂▁▁█▁▁▁▁ 8888 rotated_rects_aa_same_opaque_srcover
10M 3 39.1µs 39.2µs 50.7µs 153µs 71% █▁▁▁▁▁▁▁▁▁ 8888 rectori
10M 1 2.3ms 2.31ms 2.35ms 2.74ms 6% ▁▁▁▁▁▁▁▁█▂ 8888 maskcolor
10M 1 2.33ms 2.34ms 2.53ms 3.14ms 11% ▁▁▁▁▁▁▅█▄▄ 8888 maskopaque
10M 11 15µs 15.3µs 15.7µs 18.3µs 7% ▅▃▂▂▁▁▁▁█▁ 8888 rrects_3_stroke_4
10M 46 3.99µs 4.07µs 4.14µs 4.54µs 4% █▅▅▃▂▂▁▁▁▁ 8888 rrects_3
10M 16 15.6µs 15.9µs 16.1µs 17.5µs 4% █▄▃▂▂▂▁▂▁▁ 8888 ovals_3_stroke_4
10M 40 5.09µs 5.18µs 5.23µs 5.67µs 3% █▅▃▂▂▁▃▁▁▁ 8888 ovals_3
10M 231 1.92µs 1.93µs 1.94µs 2µs 1% █▃▂▁▃▁▁▁▁▁ 8888 zeroradroundrect
10M 924 3.88µs 3.93µs 4.11µs 4.95µs 9% ▁█▆▃▁▁▁▁▁▁ 8888 arbroundrect
10M 8 8.11µs 8.47µs 8.48µs 8.85µs 3% █▅▇▄▄▂▁▄▄▆ 8888 merge_large
10M 14 6.71µs 6.92µs 6.96µs 7.46µs 3% ▃▆▁█▃▃▃▂▂▁ 8888 merge_small
11M 2 225µs 227µs 229µs 233µs 1% ███▃▇▂▃▁▃▂ 8888 displacement_full_large
16M 1 381µs 401µs 401µs 421µs 3% ▅▅▅█▆▄▄▃▃▁ 8888 displacement_alpha_large
19M 1 507µs 508µs 509µs 512µs 0% █▃▂▆▂▂▃▂▃▁ 8888 displacement_zero_large
19M 19 9µs 9.11µs 9.15µs 9.67µs 2% ▄▂▂▂█▂▁▁▁▂ 8888 displacement_full_small
19M 5 54.2µs 54.5µs 54.9µs 58µs 2% █▃▂▂▁▁▃▁▁▁ 8888 blurroundrect_WH[100x100]_cr[90]
20M 1 229µs 230µs 231µs 240µs 2% █▄▃▂▂▁▁▁▁▂ 8888 GM_varied_text_clipped_no_lcd
20M 1 267µs 269µs 270µs 279µs 1% █▄▃▂▂▂▂▂▁▁ 8888 GM_varied_text_ignorable_clip_no_lcd
22M 1 1.95ms 1.97ms 2.03ms 2.46ms 8% ▁▁▁▁▁▁▁▂█▃ 8888 GM_convex_poly_clip

after:
maxrss loops min median mean max stddev samples config bench
10M 2 31.5µs 32.3µs 32.8µs 37.2µs 5% █▄▃▂▂▂▁▁▁▁ 8888 bitmap_BGRA_8888_A_scale_bicubic
10M 13 43.9µs 44µs 44.1µs 44.9µs 1% █▂▁▁▁▆▁▁▁▂ 8888 bitmap_BGRA_8888_A_scale_bilerp
10M 19 22.7µs 23.3µs 25.6µs 32.4µs 14% ▁▁▁▁▁▅▆▁▅█ 8888 bitmap_Index_8_A
10M 5 1.79µs 1.97µs 3.85µs 21.1µs 158% █▁▁▁▁▁▁▁▁▁ 8888 text_16_AA_88
10M 141 1.83µs 1.83µs 1.85µs 1.93µs 2% ▅▁▁█▁▁▁▁▁▁ 8888 text_16_AA_FF
10M 10 4.65µs 4.92µs 5.06µs 6.56µs 11% █▃▃▂▂▂▁▁▁▁ 8888 rotated_rects_aa_alternating_transparent_and_opaque_src
10M 51 4.35µs 4.48µs 4.83µs 6.68µs 17% ▂▁▁▁▁▁▁▂▆█ 8888 rotated_rects_aa_changing_opaque_src
10M 51 4.38µs 4.79µs 4.85µs 5.84µs 11% ▁█▁▃▃▁▄▁▄▇ 8888 rotated_rects_aa_same_opaque_src
10M 32 5.58µs 6.24µs 6.1µs 6.39µs 5% █▂█▆▁▇▄▅▇▇ 8888 rotated_rects_aa_alternating_transparent_and_opaque_srcover
10M 42 4.28µs 5.59µs 5.11µs 6.01µs 15% ▂▂█▇█▂▁▆▁▇ 8888 rotated_rects_aa_changing_transparent_srcover
10M 48 4.24µs 4.33µs 4.58µs 6.46µs 15% ▁▁▁▁▁█▃▂▁▁ 8888 rotated_rects_aa_changing_opaque_srcover
10M 48 4.28µs 4.3µs 4.4µs 5.12µs 6% ▂▂▁▁▁▁▁▁▁█ 8888 rotated_rects_aa_same_transparent_srcover
10M 46 4.24µs 4.29µs 4.66µs 7.11µs 20% ▁▁▁▁▁▁▁▁▃█ 8888 rotated_rects_aa_same_opaque_srcover
10M 3 39.3µs 39.4µs 51.4µs 154µs 70% █▁▁▁▁▁▁▁▁▁ 8888 rectori
10M 1 2.32ms 2.43ms 2.53ms 3.14ms 11% ▁▁▁▁▂▄█▃▅▁ 8888 maskcolor
10M 1 2.33ms 2.37ms 2.54ms 3.21ms 12% ▁▁▁▁▁▂█▅▆▁ 8888 maskopaque
10M 10 15.3µs 15.6µs 15.8µs 17.2µs 4% █▅▃▂▂▂▁▁▁▁ 8888 rrects_3_stroke_4
10M 46 4.03µs 4.09µs 4.15µs 4.47µs 4% █▄▆▂▂▂▁▁▁▁ 8888 rrects_3
10M 15 15.9µs 16.2µs 16.3µs 17.8µs 4% █▄▃▂▂▂▁▁▁▁ 8888 ovals_3_stroke_4
10M 40 5.14µs 5.26µs 5.29µs 5.72µs 3% █▅▃▂▂▁▂▂▁▁ 8888 ovals_3
10M 222 1.91µs 1.99µs 2.21µs 2.91µs 19% ▂▁▁▁▁▁▂▇▇█ 8888 zeroradroundrect
10M 462 3.9µs 3.96µs 4.23µs 5.22µs 12% ▆▄█▁▂▁▁▁▁▁ 8888 arbroundrect
10M 8 8.2µs 8.59µs 8.62µs 8.97µs 3% ▆▄█▄▅▃▁▆▄█ 8888 merge_large
10M 14 6.73µs 6.88µs 6.86µs 7.08µs 2% ▄█▁▂▄▂▅▄▂▅ 8888 merge_small
11M 2 221µs 234µs 237µs 263µs 5% ▄▃▃▃▄▃▂▁▇█ 8888 displacement_full_large
16M 1 387µs 416µs 427µs 471µs 7% ▇█▁▃▃▁▃▃▇▆ 8888 displacement_alpha_large
19M 1 512µs 521µs 528µs 594µs 5% █▂▂▂▁▁▂▃▁▁ 8888 displacement_zero_large
19M 18 9.06µs 9.12µs 9.13µs 9.23µs 1% █▃▃▃▄▃▆▁▅▅ 8888 displacement_full_small
19M 5 55.6µs 55.9µs 56.5µs 59.5µs 2% █▃▂▁▁▁▁▁▅▁ 8888 blurroundrect_WH[100x100]_cr[90]
20M 1 229µs 233µs 235µs 254µs 3% █▄▃▂▂▁▁▂▁▁ 8888 GM_varied_text_clipped_no_lcd
20M 1 270µs 271µs 272µs 278µs 1% █▄▃▂▂▂▁▂▁▇ 8888 GM_varied_text_ignorable_clip_no_lcd
22M 1 1.96ms 2ms 2.06ms 2.45ms 7% ▂▂▁▁▁▁▁▃█▄ 8888 GM_convex_poly_clip

BUG=skia:

Review URL: https://codereview.chromium.org/754733002
/external/skia/src/opts/SkColor_opts_SSE2.h
533a32782f9817bb307484b36323040470575da4 25-Nov-2014 qiankun.miao <qiankun.miao@intel.com> Cleanup with SkAlphaMulQ_SSE2()

Related nanobench results:
before:
10M 18 7.03µs 7.31µs 7.38µs 8.46µs 6% ▂▁▂▂▂▃▄▁█▁ 8888 bitmaprect_80_filter_identity
10M 43 6.96µs 6.97µs 6.99µs 7.19µs 1% ▁▂▁▁▁▁▁█▁▁ 8888 bitmaprect_80_nofilter_identity
10M 14 35.7µs 35.8µs 35.9µs 36.3µs 1% ▃▂▁▂▁█▂▁▁▁ 8888 bitmap_BGRA_8888_update_scale_bilerp
10M 16 35.5µs 35.6µs 35.7µs 36.3µs 1% █▅▂▁▁▁▃▂▁▁ 8888 bitmap_BGRA_8888_update_volatile_scale_bilerp
10M 16 35.4µs 35.4µs 35.5µs 36.8µs 1% ▂▁█▁▁▁▁▂▁▁ 8888 bitmap_BGRA_8888_scale_bilerp
10M 25 16.4µs 16.6µs 16.7µs 17.4µs 2% ▂▁▁▂▁▁▁▅▅█ 8888 bitmap_Index_8
10M 15 37.9µs 38µs 38µs 38.4µs 0% ▄▆▂▁▁▁█▂▁▁ 8888 bitmap_RGB_565
10M 33 11.1µs 11.1µs 11.1µs 11.2µs 0% ▆▂█▂▂▂▁▁▂▁ 8888 bitmap_BGRA_8888_scale
after:
10M 9 7.04µs 7.06µs 7.1µs 7.32µs 1% █▅▂▁▁▂▁▁▁▁ 8888 bitmaprect_80_filter_identity
10M 18 7.01µs 7.02µs 7.05µs 7.25µs 1% █▂▁▁▁▁▁▁▁▁ 8888 bitmaprect_80_nofilter_identity
10M 5 33.9µs 34µs 34.1µs 34.5µs 1% █▃▂▂▁▁▁▅▃▂ 8888 bitmap_BGRA_8888_update_scale_bilerp
10M 7 35.5µs 35.5µs 35.6µs 36.3µs 1% ▃▂▂▁▂▁▂▁█▂ 8888 bitmap_BGRA_8888_update_volatile_scale_bilerp
10M 7 35.5µs 35.5µs 35.7µs 36.8µs 1% ▂▁▁▁▁▁▁▁▁█ 8888 bitmap_BGRA_8888_scale_bilerp
10M 11 16.4µs 16.4µs 16.4µs 16.6µs 0% █▂▁▁▂▁▁▁▂▁ 8888 bitmap_Index_8
10M 7 37.3µs 37.4µs 38.4µs 47.8µs 9% ▁▁▁▁▁▁▁▁▁█ 8888 bitmap_RGB_565
10M 33 11µs 11µs 11.1µs 11.2µs 1% ▄█▅▃▂▁▁▁▁▁ 8888 bitmap_BGRA_8888_scale

BUG=skia:

Review URL: https://codereview.chromium.org/755573002
/external/skia/src/opts/SkColor_opts_SSE2.h
f04713d9c8f2af15f97984b47587358488e2594e 14-Nov-2014 qiankun.miao <qiankun.miao@intel.com> Optimize SkAlphaMulQ_SSE2

These two mask clears are useless, because _mm_srli_epi16 fills the high byte
of each word with 0.

BUG=skia:

Review URL: https://codereview.chromium.org/724333003
/external/skia/src/opts/SkColor_opts_SSE2.h
f31fa24914c683abcc2c860093b142725c43fbe6 12-May-2014 commit-bot@chromium.org <commit-bot@chromium.org@2bbb7eff-a529-9590-31e7-b0007b416f81> Make gMask_00FF00FF a constant

This is to optimize SkAlphaMulQ() in PIC mode. With the visibility=default
symbol the constant is not known at compile time (and is not a constant), but
instead is fetched through a double indirection through GOT. The function is
quite hot on one of the chromium benchmarks:
rasterize_and_record_micro.key_silk_cases.

This change replaces the symbol with a compile-time constant. As a bonus, the
variable is no longer exported from the dynamic library, i.e. the library
interface is cleaner.
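
For context, the masking trick that this constant serves can be sketched in scalar form. This is an illustration under assumptions, not the file's SSE2 code: the names kMask00FF00FF and alpha_mul_q are hypothetical stand-ins, scale is expected to lie in [0, 256], and constexpr is used here only to show a value the compiler can fold into the instruction stream instead of loading it through the GOT.

```cpp
#include <cstdint>

// Hypothetical stand-in for the mask: a compile-time constant with
// internal linkage, so no exported symbol and no runtime indirection.
constexpr uint32_t kMask00FF00FF = 0x00FF00FFu;

// Hypothetical scalar sketch: multiply all four 8-bit channels of a packed
// 32-bit color by scale/256, handling two channels per multiplication.
inline uint32_t alpha_mul_q(uint32_t c, unsigned scale) {
    // Bytes 0 and 2: multiply in place, then shift the results back down.
    uint32_t rb = ((c & kMask00FF00FF) * scale) >> 8;
    // Bytes 1 and 3: shift down first, so the products land in position.
    uint32_t ag = ((c >> 8) & kMask00FF00FF) * scale;
    return (rb & kMask00FF00FF) | (ag & ~kMask00FF00FF);
}
```

For example, alpha_mul_q(0xFF2E3338, 256) leaves the color unchanged, and alpha_mul_q(0xFF2E3338, 218) yields 0xD9272B2F, with each channel scaled by 218/256 and truncated.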

See specific performance improvements on Android here:
http://goo.gl/iMuTDt

R=skyostil@chromium.org, tomhudson@chromium.org, mtklein@google.com, reed@google.com, tomhudson@google.com

Author: pasko@chromium.org

Review URL: https://codereview.chromium.org/270473003

git-svn-id: http://skia.googlecode.com/svn/trunk@14696 2bbb7eff-a529-9590-31e7-b0007b416f81
/external/skia/src/opts/SkColor_opts_SSE2.h
7bf10152b129e3b0cad76f2abd5136ccbc74a393 25-Apr-2014 commit-bot@chromium.org <commit-bot@chromium.org@2bbb7eff-a529-9590-31e7-b0007b416f81> Xfermode: SSE2 implementation of overlay_modeproc

With SSE2 optimization, performance of Xfermode_Overlay improves by
about 35% on a desktop i7-3770. Here are the data:
before:
Xfermode_Overlay 8888: cmsecs = 44.17 565: cmsecs = 59.27
after:
Xfermode_Overlay 8888: cmsecs = 28.30 565: cmsecs = 35.84

BUG=skia:
R=mtklein@google.com

Author: qiankun.miao@intel.com

Review URL: https://codereview.chromium.org/232783002

git-svn-id: http://skia.googlecode.com/svn/trunk@14370 2bbb7eff-a529-9590-31e7-b0007b416f81
/external/skia/src/opts/SkColor_opts_SSE2.h
54299654e964ab53189f774e30bce2adebbdc857 14-Apr-2014 commit-bot@chromium.org <commit-bot@chromium.org@2bbb7eff-a529-9590-31e7-b0007b416f81> Xfermode: SSE2 implementation of a number of simple transfer modes

These modes share some common code and are not very complex, so group them
together. This CL yields about a 50% performance improvement on a desktop
i7-3770. Here are the data:
before:
Xfermode_Screen 8888: cmsecs = 30.25 565: cmsecs = 46.81
Xfermode_Modulate 8888: cmsecs = 22.48 565: cmsecs = 40.06
Xfermode_Plus 8888: cmsecs = 21.04 565: cmsecs = 37.51
Xfermode_Xor 8888: cmsecs = 37.18 565: cmsecs = 52.53
Xfermode_DstATop 8888: cmsecs = 28.97 565: cmsecs = 46.42
Xfermode_SrcATop 8888: cmsecs = 29.74 565: cmsecs = 46.25
Xfermode_DstOut 8888: cmsecs = 5.34 565: cmsecs = 24.53
Xfermode_SrcOut 8888: cmsecs = 12.25 565: cmsecs = 24.39
Xfermode_DstIn 8888: cmsecs = 5.30 565: cmsecs = 24.50
Xfermode_SrcIn 8888: cmsecs = 12.05 565: cmsecs = 25.40
Xfermode_DstOver 8888: cmsecs = 12.45 565: cmsecs = 0.15
Xfermode_SrcOver 8888: cmsecs = 2.68 565: cmsecs = 4.42
after:
Xfermode_Screen 8888: cmsecs = 13.68 565: cmsecs = 21.73
Xfermode_Modulate 8888: cmsecs = 13.25 565: cmsecs = 20.97
Xfermode_Plus 8888: cmsecs = 9.77 565: cmsecs = 16.71
Xfermode_Xor 8888: cmsecs = 17.64 565: cmsecs = 25.62
Xfermode_DstATop 8888: cmsecs = 15.99 565: cmsecs = 23.74
Xfermode_SrcATop 8888: cmsecs = 15.69 565: cmsecs = 23.40
Xfermode_DstOut 8888: cmsecs = 4.77 565: cmsecs = 11.85
Xfermode_SrcOut 8888: cmsecs = 4.98 565: cmsecs = 11.84
Xfermode_DstIn 8888: cmsecs = 4.68 565: cmsecs = 11.72
Xfermode_SrcIn 8888: cmsecs = 4.93 565: cmsecs = 11.79
Xfermode_DstOver 8888: cmsecs = 5.04 565: cmsecs = 0.15
Xfermode_SrcOver 8888: cmsecs = 2.69 565: cmsecs = 4.42

BUG=skia:
R=mtklein@google.com

Author: qiankun.miao@intel.com

Review URL: https://codereview.chromium.org/232793002

git-svn-id: http://skia.googlecode.com/svn/trunk@14176 2bbb7eff-a529-9590-31e7-b0007b416f81
/external/skia/src/opts/SkColor_opts_SSE2.h
c524e98f1edf06b53e65543f5f28217fa13b7aa9 09-Apr-2014 commit-bot@chromium.org <commit-bot@chromium.org@2bbb7eff-a529-9590-31e7-b0007b416f81> Xfermode: SSE2 implementation of multiply_modeproc

This patch implements the basics for Xfermode SSE optimization. Based on
these basics, an SSE2 implementation of multiply_modeproc is provided. SSE2
implementations for other modes will come in the future. With this patch,
performance of Xfermode_Multiply improves by about 45%. Here are the
data on a desktop i7-3770.
before:
Xfermode_Multiply 8888: cmsecs = 33.30 565: cmsecs = 45.65
after:
Xfermode_Multiply 8888: cmsecs = 17.18 565: cmsecs = 24.87

BUG=

Committed: http://code.google.com/p/skia/source/detail?r=14006

Committed: http://code.google.com/p/skia/source/detail?r=14050

R=mtklein@google.com, robertphillips@google.com

Author: qiankun.miao@intel.com

Review URL: https://codereview.chromium.org/202903004

git-svn-id: http://skia.googlecode.com/svn/trunk@14107 2bbb7eff-a529-9590-31e7-b0007b416f81
/external/skia/src/opts/SkColor_opts_SSE2.h
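
The "about 45%" figure in the message above can be checked against the quoted timings. The following is only a small arithmetic sketch; `percent_time_saved` is a hypothetical helper name and is not part of the patch.

```cpp
#include <cassert>

// Percentage by which a benchmark time dropped: 100 * (before - after) / before.
// Hypothetical helper for checking the commit's numbers; not Skia code.
double percent_time_saved(double before_cmsecs, double after_cmsecs) {
    assert(before_cmsecs > 0.0);
    return 100.0 * (before_cmsecs - after_cmsecs) / before_cmsecs;
}
```

With the Xfermode_Multiply timings quoted above, `percent_time_saved(33.30, 17.18)` is roughly 48.4 for 8888 and `percent_time_saved(45.65, 24.87)` is roughly 45.5 for 565.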
77815fd74df355b8d6eff8a91fd10cc65033a79f 03-Apr-2014 commit-bot@chromium.org <commit-bot@chromium.org@2bbb7eff-a529-9590-31e7-b0007b416f81> Revert of Xfermode: SSE2 implementation of multiply_modeproc (https://codereview.chromium.org/202903004/)

Reason for revert:
It looks like serialization is broken. The serialize and pipe-cross-process tests are failing and turning the bots red (at least Ubuntu12 and Win7).

Original issue's description:
> Xfermode: SSE2 implementation of multiply_modeproc
>
> This patch implements the basics for Xfermode SSE optimization. Based on
> these basics, an SSE2 implementation of multiply_modeproc is provided. SSE2
> implementations of other modes will come in the future. With this patch,
> the Xfermode_Multiply benchmark time drops by about 45%. Here is the
> data on a desktop i7-3770.
> before:
> Xfermode_Multiply 8888: cmsecs = 33.30 565: cmsecs = 45.65
> after:
> Xfermode_Multiply 8888: cmsecs = 17.18 565: cmsecs = 24.87
>
> BUG=
>
> Committed: http://code.google.com/p/skia/source/detail?r=14006
>
> Committed: http://code.google.com/p/skia/source/detail?r=14050

R=mtklein@google.com, qiankun.miao@intel.com
TBR=mtklein@google.com, qiankun.miao@intel.com
NOTREECHECKS=true
NOTRY=true
BUG=

Author: robertphillips@google.com

Review URL: https://codereview.chromium.org/224253003

git-svn-id: http://skia.googlecode.com/svn/trunk@14053 2bbb7eff-a529-9590-31e7-b0007b416f81
/external/skia/src/opts/SkColor_opts_SSE2.h
c3118739277beb7678973d47e3d71bb863929bce 03-Apr-2014 commit-bot@chromium.org <commit-bot@chromium.org@2bbb7eff-a529-9590-31e7-b0007b416f81> Xfermode: SSE2 implementation of multiply_modeproc

This patch implements the basics for Xfermode SSE optimization. Based on
these basics, an SSE2 implementation of multiply_modeproc is provided. SSE2
implementations of other modes will come in the future. With this patch,
the Xfermode_Multiply benchmark time drops by about 45%. Here is the
data on a desktop i7-3770.
before:
Xfermode_Multiply 8888: cmsecs = 33.30 565: cmsecs = 45.65
after:
Xfermode_Multiply 8888: cmsecs = 17.18 565: cmsecs = 24.87

BUG=

Committed: http://code.google.com/p/skia/source/detail?r=14006

R=mtklein@google.com, robertphillips@google.com

Author: qiankun.miao@intel.com

Review URL: https://codereview.chromium.org/202903004

git-svn-id: http://skia.googlecode.com/svn/trunk@14050 2bbb7eff-a529-9590-31e7-b0007b416f81
/external/skia/src/opts/SkColor_opts_SSE2.h
079d2986002a6eb407ba4699bae58920fdc355a0 01-Apr-2014 commit-bot@chromium.org <commit-bot@chromium.org@2bbb7eff-a529-9590-31e7-b0007b416f81> Revert of Xfermode: SSE2 implementation of multiply_modeproc (https://codereview.chromium.org/202903004/)

Reason for revert:
Breaking builds

Original issue's description:
> Xfermode: SSE2 implementation of multiply_modeproc
>
> This patch implements the basics for Xfermode SSE optimization. Based on
> these basics, an SSE2 implementation of multiply_modeproc is provided. SSE2
> implementations of other modes will come in the future. With this patch,
> the Xfermode_Multiply benchmark time drops by about 45%. Here is the
> data on a desktop i7-3770.
> before:
> Xfermode_Multiply 8888: cmsecs = 33.30 565: cmsecs = 45.65
> after:
> Xfermode_Multiply 8888: cmsecs = 17.18 565: cmsecs = 24.87
>
> BUG=
>
> Committed: http://code.google.com/p/skia/source/detail?r=14006

R=mtklein@google.com, qiankun.miao@intel.com
TBR=mtklein@google.com, qiankun.miao@intel.com
NOTREECHECKS=true
NOTRY=true
BUG=

Author: robertphillips@google.com

Review URL: https://codereview.chromium.org/219243009

git-svn-id: http://skia.googlecode.com/svn/trunk@14007 2bbb7eff-a529-9590-31e7-b0007b416f81
/external/skia/src/opts/SkColor_opts_SSE2.h
25f7455f3a7cf2c440509bead85486079f1e4b31 01-Apr-2014 commit-bot@chromium.org <commit-bot@chromium.org@2bbb7eff-a529-9590-31e7-b0007b416f81> Xfermode: SSE2 implementation of multiply_modeproc

This patch implements the basics for Xfermode SSE optimization. Based on
these basics, an SSE2 implementation of multiply_modeproc is provided. SSE2
implementations of other modes will come in the future. With this patch,
the Xfermode_Multiply benchmark time drops by about 45%. Here is the
data on a desktop i7-3770.
before:
Xfermode_Multiply 8888: cmsecs = 33.30 565: cmsecs = 45.65
after:
Xfermode_Multiply 8888: cmsecs = 17.18 565: cmsecs = 24.87

BUG=
R=mtklein@google.com

Author: qiankun.miao@intel.com

Review URL: https://codereview.chromium.org/202903004

git-svn-id: http://skia.googlecode.com/svn/trunk@14006 2bbb7eff-a529-9590-31e7-b0007b416f81
/external/skia/src/opts/SkColor_opts_SSE2.h
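
To illustrate the kind of building block an SSE2 xfermode needs, here is a hedged sketch of a per-channel "multiply two 8-bit values and divide by 255 with rounding" step applied to four 32-bit pixels at once. It is not the code from the patch, and it covers only the channel-product term rather than a complete multiply mode. The names `mul_div255_round` and `mul_channels_x4` and the `SKETCH_HAS_SSE2` macro are assumptions made for this example. A scalar fallback is included so the sketch also builds where SSE2 is unavailable.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Scalar reference: round(a * b / 255) for 8-bit a and b, without a division.
static inline uint8_t mul_div255_round(uint8_t a, uint8_t b) {
    unsigned prod = a * b + 128u;
    return static_cast<uint8_t>((prod + (prod >> 8)) >> 8);
}

// Multiplies every 8-bit channel of four src pixels with the matching
// channel of four dst pixels and writes round(s * d / 255) per channel.
static inline void mul_channels_x4(const uint32_t src[4], const uint32_t dst[4],
                                   uint32_t out[4]) {
#if SKETCH_HAS_SSE2
    const __m128i zero = _mm_setzero_si128();
    const __m128i bias = _mm_set1_epi16(128);
    __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst));
    // Zero-extend bytes to 16-bit lanes; 255 * 255 + 128 still fits in 16 bits.
    __m128i lo = _mm_mullo_epi16(_mm_unpacklo_epi8(s, zero), _mm_unpacklo_epi8(d, zero));
    __m128i hi = _mm_mullo_epi16(_mm_unpackhi_epi8(s, zero), _mm_unpackhi_epi8(d, zero));
    lo = _mm_add_epi16(lo, bias);
    hi = _mm_add_epi16(hi, bias);
    // (x + (x >> 8)) >> 8 completes the rounded divide by 255.
    lo = _mm_srli_epi16(_mm_add_epi16(lo, _mm_srli_epi16(lo, 8)), 8);
    hi = _mm_srli_epi16(_mm_add_epi16(hi, _mm_srli_epi16(hi, 8)), 8);
    // Narrow the 16-bit lanes back to bytes in their original order.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_packus_epi16(lo, hi));
#else
    for (int i = 0; i < 4; ++i) {
        uint32_t px = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint8_t sc = static_cast<uint8_t>((src[i] >> shift) & 0xFF);
            uint8_t dc = static_cast<uint8_t>((dst[i] >> shift) & 0xFF);
            px |= static_cast<uint32_t>(mul_div255_round(sc, dc)) << shift;
        }
        out[i] = px;
    }
#endif
}
```

Working in 16-bit lanes is what makes this fit SSE2: each 128-bit register holds eight channel products, so four pixels take two multiplies instead of sixteen scalar ones.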
475910750cdc7d14da3071d4052ba9ab98383be9 19-Feb-2014 commit-bot@chromium.org <commit-bot@chromium.org@2bbb7eff-a529-9590-31e7-b0007b416f81> SSE2 implementation of S32A_D565_Opaque

A microbenchmark of S32A_D565_Opaque() shows a 3x speedup after SSE2 optimization for various values of count on an i7-3770.

BUG=
R=mtklein@google.com, reed@google.com

Author: qiankun.miao@intel.com

Review URL: https://codereview.chromium.org/138163013

git-svn-id: http://skia.googlecode.com/svn/trunk@13495 2bbb7eff-a529-9590-31e7-b0007b416f81
/external/skia/src/opts/SkColor_opts_SSE2.h
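
For readers unfamiliar with what a blit such as S32A_D565_Opaque computes per pixel, the following scalar sketch blends one premultiplied 32-bit source pixel over an RGB565 destination pixel. It is an illustration under stated assumptions, not Skia's implementation: the channel order (alpha in the top byte, then R, G, B), the rounding, and the names `div255_round_product` and `srcover_8888_to_565` are all hypothetical, and Skia's own rounding may differ.

```cpp
#include <cassert>
#include <cstdint>

// round(a * b / 255) for a, b in [0, 255].
static inline unsigned div255_round_product(unsigned a, unsigned b) {
    unsigned prod = a * b + 128u;
    return (prod + (prod >> 8)) >> 8;
}

// Source-over blend of one premultiplied ARGB pixel onto one RGB565 pixel.
// Assumed layout (for this sketch only): src = A<<24 | R<<16 | G<<8 | B.
static inline uint16_t srcover_8888_to_565(uint32_t src, uint16_t dst) {
    unsigned sa = src >> 24;
    unsigned sr = (src >> 16) & 0xFF;
    unsigned sg = (src >> 8) & 0xFF;
    unsigned sb = src & 0xFF;
    // Premultiplied input: no color channel may exceed alpha.
    assert(sr <= sa && sg <= sa && sb <= sa);

    // Expand the 5/6/5-bit destination channels to 8 bits by bit replication.
    unsigned dr = (dst >> 11) & 0x1F;
    unsigned dg = (dst >> 5) & 0x3F;
    unsigned db = dst & 0x1F;
    dr = (dr << 3) | (dr >> 2);
    dg = (dg << 2) | (dg >> 4);
    db = (db << 3) | (db >> 2);

    // result = src + dst * (255 - src_alpha) / 255, per channel.
    unsigned inv = 255 - sa;
    unsigned r = sr + div255_round_product(dr, inv);
    unsigned g = sg + div255_round_product(dg, inv);
    unsigned b = sb + div255_round_product(db, inv);

    // Truncate back to 5/6/5 bits and repack.
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

An opaque source replaces the destination, and a fully transparent source leaves it unchanged, which gives two quick sanity checks for a vectorized version.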