41ab8faaf0d90238d42d8e2bbb7177467c10b4f6 |
|
08-Sep-2016 |
Miao Wang <miaowang@google.com> |
Implement multi-thread CPU GEMM for BLAS Intrinsics - Multi-thread GEMM utilizes existing RS thread pool on top of Eigen. - Large matrix-matrix multiplication is decomposed into multiple tiled matrix-matrix multiplications. Each thread iterates on the unfinished works. - The tiling applies to ONLY ONE dimension of each input matrix, and whether to tile X or Y depends on the transpose of the matrix. - The performance increase is proportional to the number of available CPU cores, for sufficiently large matrices. Test: CTS test (rsblas) pass on Angler, Fugu and new devices. Performance test with RsBlasBenchmark and RsNeuralNet demo on Anger, Ryu, Seed, Shamu, Volantis, Fugu and new devices, showing roughly 70%(Volantix 2 core) ~ 400+%(Angler 8 core) perf gain. Change-Id: If96f4119fd34d5d9d98a2542801495e7ffe577ae
/frameworks/rs/cpu_ref/rsCpuBLASDispatch.h
|