History log of /external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
Revision Date Author Comments
f83bb6db5bd17df215994bde7adacef50ece0192 16-Feb-2018 Sanjoy Das <sanjoy@google.com> [XLA:CPU] Minor cleanup to simple_orc_jit

SimpleResolver became unused after an LLVM upstream merge, and we never needed
the name mangling logic in what is now FindCompiledSymbol.

PiperOrigin-RevId: 186039307
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
ffa63e57bdd703ae051ae849af5b5a272fca2223 25-Jan-2018 Sanjoy Das <sanjoy@google.com> [TF:XLA] Replace most of HloProfilePrinter by a protocol buffer

This change replaces the meat of HloProfilePrinter with a protobuf
HloProfilePrinterData. The original plan was to serialize HloProfilePrinter
into C++ source code and put that in a .cc file along with the string for the
xla::ProgramShape. However, since we now directly serialize xla::ProgramShape
into a .o file, for consistency I think we should do the same thing for
HloProfilePrinter (instead of adding yet another output file to tfcompile).

The change itself is fairly simple; it is large mostly due to the mass renaming
I had to do.
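
A minimal sketch of the round trip this enables, assuming the generated header path for the new proto; SerializeAsString/ParseFromArray are the standard protobuf MessageLite calls:

```c++
#include <string>
// Assumed generated header for the new proto in the XLA tree.
#include "tensorflow/compiler/xla/service/hlo_profile_printer_data.pb.h"

// tfcompile side: flatten the message to bytes that can be linked into a .o
// as a constant blob, next to the serialized xla::ProgramShape.
std::string SerializeProfileData(const xla::HloProfilePrinterData& data) {
  return data.SerializeAsString();
}

// AOT binary side: recover the message from the embedded bytes at runtime.
bool DeserializeProfileData(const void* bytes, int size,
                            xla::HloProfilePrinterData* out) {
  return out->ParseFromArray(bytes, size);
}
```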

PiperOrigin-RevId: 183158192
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
fc2526a8c1cf0bc2a93c8cc819ff7209eb4628c9 16-Dec-2017 A. Unique TensorFlower <gardener@tensorflow.org> Merged commit includes the following changes:
179277894 by gunan:

Run buildifier on build file.

--
179275101 by meheff:

Replace DeviceMemoryBase with ShapedBuffer in XLA interfaces.
Executable, TransferManager, and AllocationTracker now use ShapedBuffer to hold device memory addresses holding XLA data. Most of the change is straightforward, with the exception of AllocationTracker, which was mostly rewritten (and simplified), and some refactoring in the CPU executable.

Also, have ShapedBuffer hold on-host and on-device Shapes which are the shapes of the representation of the data on the host and device, respectively. This is necessary because with cl/178624364 the on-host and on-device shape may no longer be equal.

--
179265385 by A. Unique TensorFlower:

Return error rather than CHECK fail in Executable::ExecuteOnStreamWrapper

--
179264551 by dandelion:

Internal fixes.

--

PiperOrigin-RevId: 179277894
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
4b636957604faa3361a799dd9d8749a6b85afff7 22-Nov-2017 Sanjoy Das <sanjoy@google.com> Place HloProfilePrinter and HloProfileIndexMap in Executable

This refactoring will later allow XlaCompiledCpuFunction to pull out the
HloProfilePrinter from Executable and use that to display the hlo execution
profile. A de/serialized HloProfilePrinter will let AOT compiled binaries
display their Hlo execution profile.

PiperOrigin-RevId: 176689528
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
96c415ad77c20e1cf2da5e61f85e24fd6c36eb28 03-Nov-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA] Use maps with a deterministic iteration order for HloInstruction*.

Convert a bunch of std::maps with HloInstruction* and const HloInstruction* keys to use a comparator that is based on the unique_id of the instruction rather than the pointer value.
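
A minimal sketch of the pattern, assuming HloInstruction's unique_id() accessor; UniqueIdCompare is a hypothetical name used for illustration:

```c++
#include <cstdint>
#include <map>
#include "tensorflow/compiler/xla/service/hlo_instruction.h"

// Order HloInstruction* keys by unique_id() rather than by pointer value,
// so std::map iteration order is deterministic across runs.
struct UniqueIdCompare {
  bool operator()(const HloInstruction* a, const HloInstruction* b) const {
    return a->unique_id() < b->unique_id();
  }
};

using InstructionMap =
    std::map<const HloInstruction*, int64_t, UniqueIdCompare>;
```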

PiperOrigin-RevId: 174474868
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
0a7be5a2f58fe5470fa7526c9de1404cb16fe3dc 31-Oct-2017 Sanjoy Das <sanjoy@google.com> Rename (Add|Get)ProfileResult to something more specific; NFC

PiperOrigin-RevId: 174084570
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
8c748bdb7cbf435925675d6b7a3d75ecbefa3351 27-Sep-2017 A. Unique TensorFlower <gardener@tensorflow.org> Add more `const`s to xla::Executable. No functional change.

PiperOrigin-RevId: 170252047
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
06deeea373c93ea36547648481c5daf4dc56126f 27-Sep-2017 Mark Heffernan <meheff@google.com> For tuple-shaped data, change ShapedBuffer (an abstraction holding on-device data of a given shape) to also hold an array of pointers representing the tuple structure in the device memory. Previously ShapedBuffer only held array-shaped data at the leaves of the tuple shape. Construction of these array-of-pointers is handled by TransferManager which has to construct array-of-pointers anyway to transfer literals to the device. This change makes ShapedBuffer match the native representative of tuple-shaped data passed into XLA computations. This is the first step to migrating XLA interfaces away from using naked device memory pointers (DeviceMemoryBase) to using more expressive ShapedBuffers instead.
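
To make the new layout concrete, here is a simplified stand-in (illustrative types only, not the real ShapedBuffer API): a tuple-shaped buffer now carries a device pointer for every node of the tuple tree, with the root holding the array-of-pointers table.

```c++
#include <cstdint>
#include <map>
#include <vector>

// Illustrative stand-ins for DeviceMemoryBase and a shape index.
struct DeviceMemory { void* opaque = nullptr; uint64_t size = 0; };
using ShapeIndex = std::vector<int64_t>;  // path into the tuple tree

struct TupleShapedBufferSketch {
  // {}    -> root table of element pointers (the array-of-pointers)
  // {0}   -> first element's buffer, {1} -> second,
  // {0,1} -> buffer nested inside a tuple element, and so on.
  std::map<ShapeIndex, DeviceMemory> buffers;
};
```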

This change enables tuple-shaped parameters in computations run through the LocalClient interface.

Also, change LocalClient interfaces to return ScopedShapedBuffers, as these are generally easier to deal with ownership-wise than ShapedBuffers. They are analogous to std::unique_ptr, while ShapedBuffers are analogous to bare pointers.

This change includes a couple other cleanups found along the way:

* move cpu/gpu/interpreter transfer managers into their respective directories under xla/service.

* Make the generic transfer manager take a pointer size. Previously it would just use sizeof(void*) which might not be exactly what is needed.

PiperOrigin-RevId: 170133015
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
1196fbc824b45679cba9fae8daa35cc6c02d3599 01-Sep-2017 A. Unique TensorFlower <gardener@tensorflow.org> Use boolean literals where appropriate instead of narrowing ints

PiperOrigin-RevId: 167314054
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
5ead76420dee762a5f710fda6893075f1292d5d3 19-Aug-2017 A. Unique TensorFlower <gardener@tensorflow.org> Reduce XLA compile time by ~7% for a convolutional image model:

* Added CompactPointerSet<T>, which is optimized for set size <= 1 (see the sketch after this list).
* Changed expensive CHECKs to DCHECKS in buffer_assignment.cc
* Reserve space in DFS state array before starting DFS.
* Use unsigned arithmetic in DFS state maintenance.
* HloInstruction:
- Moved frequently used fields to start for better cache locality.
- Use InlinedVector instead of vector for operand array.
- Use InlinedVector instead of vector for DFS stack.
* Pre-compute "is array" and "is tuple" for LogicalBuffer.
* PointsToSet:
- Combine two ShapeTrees into one.
- Use CompactPointerSet instead of std::set to hold sources.
- Use CompactPointerSet instead of std::set to hold flattened buffers.
* ShapeTree: use unique_ptr instead of optional for shape storage
(reduces size and destruction overhead).
* Add proper const qualifiers to some FlatSet iterator methods.
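
The size <= 1 optimization might look roughly like this; TinyPointerSet is a hypothetical illustration, not the actual CompactPointerSet code:

```c++
#include <cstddef>
#include <set>

// Small-set optimization: the common 0-or-1 element case is stored inline;
// only sets that grow past one element pay for a heap-backed std::set.
template <typename T>
class TinyPointerSet {
 public:
  void insert(T* p) {
    if (overflow_.empty()) {
      if (single_ == nullptr || single_ == p) { single_ = p; return; }
      overflow_.insert(single_);  // spill the inline element
      single_ = nullptr;
    }
    overflow_.insert(p);
  }
  bool contains(T* p) const { return p == single_ || overflow_.count(p) > 0; }
  size_t size() const { return (single_ != nullptr ? 1 : 0) + overflow_.size(); }

 private:
  T* single_ = nullptr;    // inline storage for the common case
  std::set<T*> overflow_;  // fallback for size > 1
};
```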

Co-author=jeff
PiperOrigin-RevId: 165759117
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
b882d686ff00f73425a846c47e29a7c336435f25 01-Aug-2017 Bjarke Hammersholt Roune <broune@google.com> Allow cost estimates to differ per backend and include the estimates into the HLO profile. Add a summary table for what categories have the most opportunity for optimization left in them.

PiperOrigin-RevId: 163780413
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
34cbf161d7b1191ad5c1b3bc02fc52d338e8b175 27-Jul-2017 Jiri Simsa <jsimsa@google.com> Update Dataset API documentation.

PiperOrigin-RevId: 163349457
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
90d6421c5e0898fb840197d9533c2f8ba1a7c651 11-Jul-2017 Shanqing Cai <cais@google.com> Merge changes from github.
END_PUBLIC

---
Commit d0f53f77f authored by Penghao Cen<scorpiocph@gmail.com>
Committed by Shanqing Cai<cais@google.com>:
Fix minor typo (#11323)

---
Commit 02fcf564e authored by Chris Song<sjhshy@gmail.com>
Committed by Chris Song<sjhshy@gmail.com>:
Fix misspellings.

---
Commit 764c9b6b4 authored by Louis Tiao<ltiao@users.noreply.github.com>
Committed by GitHub<noreply@github.com>:
Fixed typo in docstring
---
Commit f8cd1283e authored by Shanqing Cai<cais@google.com>
Committed by Shanqing Cai<cais@google.com>:
Chaser

---
Commit 01383b946 authored by Shanqing Cai<cais@google.com>
Committed by Shanqing Cai<cais@google.com>:
Adapt TensorFlowTestCase.setUp() to new reset_default_graph() semantics

Avoid calling reset_default_graph() directly to prevent exceptions in
cases where test methods error out from within nested graph contexts,
which can leave _default_graph_stack non-empty in certain Python
versions.

---
Commit 0ffc37890 authored by Amit Patankar<amitpatankar@google.com>
Committed by Amit Patankar<amitpatankar@google.com>:
Removing second declaration of functions.

---
Commit f9c9cacb0 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Refactor ElementalIrEmitter's slice index finding code into
IrArray::Index::SourceIndexOfSlice().

PiperOrigin-RevId: 161140653

---
Commit ba297aec9 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Update ops-related pbtxt files.

PiperOrigin-RevId: 161138258

---
Commit 68d666737 authored by Alexandre Passos<apassos@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fixes a reentrant lock issue with tensors using ndarray memory which uses tensor memory.

PiperOrigin-RevId: 161137788

---
Commit a2ee8bca3 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Add support for int8 x int8 -> int32 matrix multiplication via cublasGemmEx to stream_executor.
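
The call shape for this path is roughly as follows; cublasGemmEx is the real cuBLAS entry point, but the leading-dimension choices and default algorithm here are illustrative and error handling is elided:

```c++
#include <cstdint>
#include <cublas_v2.h>

// C = A * B with int8 inputs accumulated into int32 (column-major, no
// transposes). alpha/beta are int32 because the compute type is CUDA_R_32I;
// note the int8 path imposes alignment constraints on m, n, k, and the
// leading dimensions.
void Int8GemmSketch(cublasHandle_t handle, int m, int n, int k,
                    const int8_t* a, const int8_t* b, int32_t* c) {
  const int32_t alpha = 1;
  const int32_t beta = 0;
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
               &alpha, a, CUDA_R_8I, /*lda=*/m,
               b, CUDA_R_8I, /*ldb=*/k,
               &beta, c, CUDA_R_32I, /*ldc=*/m,
               CUDA_R_32I, CUBLAS_GEMM_DFALT);
}
```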

PiperOrigin-RevId: 161137741

---
Commit 755fa7b50 authored by Mark Daoust<markdaoust@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Block generate_test and docs generation from running in python3.

- Doc generation is currently unsupported in python3

- These both end in errors in python 3.5.1+

PiperOrigin-RevId: 161137467

---
Commit 97cbcac45 authored by Peter Hawkins<phawkins@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[TF:XLA] Fix failure in functionalize_control_flow rewrite for Enter nodes that are unused. Make sure we ignore such nodes without producing an error.

PiperOrigin-RevId: 161136545

---
Commit dabcb60bc authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[XLA] Add reasonable error messages to Builder::Build for bad parameter numbers.

PiperOrigin-RevId: 161136262

---
Commit 0cbd249e8 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Add complex tensors support to `matrix_determinant`.

PiperOrigin-RevId: 161132422

---
Commit 335f1f14d authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Extend static shape inference for SparseTensors with dense_shapes constructed using slicing.

PiperOrigin-RevId: 161132391

---
Commit 53604916e authored by Jianwei Xie<xiejw@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fixed the missing labels test in TPUEstimator.

PiperOrigin-RevId: 161131282

---
Commit 9f57dc8dd authored by Bruno Rosa<bruno.rosa@eldorado.org.br>
Committed by Bruno Rosa<bruno.rosa@eldorado.org.br>:
Use mcpu instead of march for ppc64le

march is not supported by gcc on ppc64le

---
Commit 7d5c74a9c authored by Skye Wanderman-Milne<skyewm@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Move duplicate detection logic from Graph to FunctionLibraryDefinition

Turns out this is more useful, since there are many function libraries
that don't belong to a graph. This will be used in a future
change. Note that this maintains the current behavior of Graph.

In addition, updates FunctionDefsEqual() to handle unset attr entries
(I ran into this when using this in said future change).

PiperOrigin-RevId: 161126628

---
Commit 2caec3af1 authored by Shanqing Cai<cais@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Disable more timeseries py tests failing in OSS PIP GPU builds

PiperOrigin-RevId: 161124799

---
Commit 0b5cce367 authored by Eugene Brevdo<ebrevdo@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Get TopK op working on GPU again. Extend using cub's radix sort.

1. Undo rollback of Andreas Kirsch's initial implementation.
2. Use cub segmented radix sort instead of Andreas' heap-based impl
for large k and small num_cols (thresholds of k=100, n=1000
determined empirically); a sketch of the segmented sort follows this list.
3. Use cub segmented radix sort if k == num_cols (this case is always faster).
4. Added benchmarks.
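
The segmented-sort step might be sketched as follows; cub::DeviceSegmentedRadixSort::SortPairsDescending is the real CUB API, while buffer setup and the final top-k slice per row are elided:

```c++
#include <cstddef>
#include <cub/cub.cuh>

// Sorts each row (segment) of a [num_rows, row_len] matrix in descending
// key order, carrying column indices along as values; the first k outputs
// of each segment are then that row's top-k.
void SegmentedSortDescending(const float* keys_in, float* keys_out,
                             const int* vals_in, int* vals_out,
                             int num_rows, int row_len,
                             const int* row_offsets,  // num_rows + 1 entries
                             cudaStream_t stream) {
  void* temp = nullptr;
  size_t temp_bytes = 0;
  // First call only computes the temp-storage size; second call sorts.
  cub::DeviceSegmentedRadixSort::SortPairsDescending(
      temp, temp_bytes, keys_in, keys_out, vals_in, vals_out,
      num_rows * row_len, num_rows, row_offsets, row_offsets + 1,
      /*begin_bit=*/0, /*end_bit=*/sizeof(float) * 8, stream);
  cudaMalloc(&temp, temp_bytes);
  cub::DeviceSegmentedRadixSort::SortPairsDescending(
      temp, temp_bytes, keys_in, keys_out, vals_in, vals_out,
      num_rows * row_len, num_rows, row_offsets, row_offsets + 1,
      /*begin_bit=*/0, /*end_bit=*/sizeof(float) * 8, stream);
  cudaFree(temp);
}
```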

Benchmarks show that the GPU implementation is up to 3x slower for small k but
can be 10x faster for large num_cols and k.

Benchmarks:

Benchmark: m_128_n_10_k_5_use_gpu_False wall_time: 0.000166 s Throughput: 0.0077 GB/s
Benchmark: m_128_n_10_k_5_use_gpu_True wall_time: 0.000796 s Throughput: 0.00161 GB/s
Benchmark: m_128_n_10_k_9_use_gpu_False wall_time: 0.00017 s Throughput: 0.00751 GB/s
Benchmark: m_128_n_10_k_9_use_gpu_True wall_time: 0.000796 s Throughput: 0.00161 GB/s
Benchmark: m_128_n_10_k_10_use_gpu_False wall_time: 0.00017 s Throughput: 0.00753 GB/s
Benchmark: m_128_n_10_k_10_use_gpu_True wall_time: 0.000775 s Throughput: 0.00165 GB/s
Benchmark: m_128_n_100_k_1_use_gpu_False wall_time: 0.000155 s Throughput: 0.0826 GB/s
Benchmark: m_128_n_100_k_1_use_gpu_True wall_time: 0.000796 s Throughput: 0.0161 GB/s
Benchmark: m_128_n_100_k_50_use_gpu_False wall_time: 0.000247 s Throughput: 0.0519 GB/s
Benchmark: m_128_n_100_k_50_use_gpu_True wall_time: 0.0008 s Throughput: 0.016 GB/s
Benchmark: m_128_n_100_k_99_use_gpu_False wall_time: 0.000261 s Throughput: 0.049 GB/s
Benchmark: m_128_n_100_k_99_use_gpu_True wall_time: 0.000794 s Throughput: 0.0161 GB/s
Benchmark: m_128_n_100_k_100_use_gpu_False wall_time: 0.000239 s Throughput: 0.0536 GB/s
Benchmark: m_128_n_100_k_100_use_gpu_True wall_time: 0.000777 s Throughput: 0.0165 GB/s
Benchmark: m_128_n_1000_k_1_use_gpu_False wall_time: 0.000324 s Throughput: 0.395 GB/s
Benchmark: m_128_n_1000_k_1_use_gpu_True wall_time: 0.000916 s Throughput: 0.14 GB/s
Benchmark: m_128_n_1000_k_10_use_gpu_False wall_time: 0.00042 s Throughput: 0.305 GB/s
Benchmark: m_128_n_1000_k_10_use_gpu_True wall_time: 0.000902 s Throughput: 0.142 GB/s
Benchmark: m_128_n_1000_k_500_use_gpu_False wall_time: 0.0011 s Throughput: 0.116 GB/s
Benchmark: m_128_n_1000_k_500_use_gpu_True wall_time: 0.00097 s Throughput: 0.132 GB/s
Benchmark: m_128_n_1000_k_990_use_gpu_False wall_time: 0.00133 s Throughput: 0.0962 GB/s
Benchmark: m_128_n_1000_k_990_use_gpu_True wall_time: 0.000993 s Throughput: 0.129 GB/s
Benchmark: m_128_n_1000_k_1000_use_gpu_False wall_time: 0.00102 s Throughput: 0.126 GB/s
Benchmark: m_128_n_1000_k_1000_use_gpu_True wall_time: 0.000964 s Throughput: 0.133 GB/s
Benchmark: m_128_n_10000_k_10_use_gpu_False wall_time: 0.002 s Throughput: 0.64 GB/s
Benchmark: m_128_n_10000_k_10_use_gpu_True wall_time: 0.00288 s Throughput: 0.445 GB/s
Benchmark: m_128_n_10000_k_100_use_gpu_False wall_time: 0.00233 s Throughput: 0.549 GB/s
Benchmark: m_128_n_10000_k_100_use_gpu_True wall_time: 0.00325 s Throughput: 0.394 GB/s
Benchmark: m_128_n_10000_k_5000_use_gpu_False wall_time: 0.0127 s Throughput: 0.101 GB/s
Benchmark: m_128_n_10000_k_5000_use_gpu_True wall_time: 0.00381 s Throughput: 0.336 GB/s
Benchmark: m_128_n_10000_k_9900_use_gpu_False wall_time: 0.015 s Throughput: 0.0853 GB/s
Benchmark: m_128_n_10000_k_9900_use_gpu_True wall_time: 0.00438 s Throughput: 0.292 GB/s
Benchmark: m_128_n_10000_k_10000_use_gpu_False wall_time: 0.0104 s Throughput: 0.123 GB/s
Benchmark: m_128_n_10000_k_10000_use_gpu_True wall_time: 0.00427 s Throughput: 0.3 GB/s
Benchmark: m_128_n_100000_k_100_use_gpu_False wall_time: 0.0148 s Throughput: 0.865 GB/s
Benchmark: m_128_n_100000_k_100_use_gpu_True wall_time: 0.0262 s Throughput: 0.488 GB/s
Benchmark: m_128_n_100000_k_1000_use_gpu_False wall_time: 0.0201 s Throughput: 0.636 GB/s
Benchmark: m_128_n_100000_k_1000_use_gpu_True wall_time: 0.0263 s Throughput: 0.486 GB/s
Benchmark: m_128_n_100000_k_50000_use_gpu_False wall_time: 0.214 s Throughput: 0.0599 GB/s
Benchmark: m_128_n_100000_k_50000_use_gpu_True wall_time: 0.0322 s Throughput: 0.398 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_False wall_time: 0.262 s Throughput: 0.0489 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_True wall_time: 0.0377 s Throughput: 0.34 GB/s
Benchmark: m_128_n_100000_k_100000_use_gpu_False wall_time: 0.118 s Throughput: 0.108 GB/s
Benchmark: m_128_n_100000_k_100000_use_gpu_True wall_time: 0.0365 s Throughput: 0.351 GB/s

END_PUBLIC

BEGIN_PUBLIC
Automated g4 rollback of changelist 157169178

PiperOrigin-RevId: 161476569
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
b760c0cade03186a8f194390f6cba46fb363bfca 07-Jul-2017 A. Unique TensorFlower <gardener@tensorflow.org> Update xla compiler after upstream Orc API change r307350:
https://reviews.llvm.org/rL307350

PiperOrigin-RevId: 161195744
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
3b41352a3177c2fe8a1329e8981b285bb6aacf8b 19-Jun-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA:CPU] Thread-parallel CPU backend (work in progress).
*) Partitions HLO instructions along outer dimensions, based on simple cost model.
*) Emits loop nests with dynamic outer loop bounds (for partitions), leaves inner loop bounds static (for optimizations).
*) Dispatches parallel tasks on thread pool for execution (simplified sketch below).
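
A simplified illustration of the dispatch, using plain std::thread; the real backend emits LLVM IR for the loop nests and schedules the tasks on a runtime thread pool:

```c++
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Splits [0, outer_dim) into contiguous chunks, one task per thread. Each
// task runs the loop-nest body with dynamic outer bounds [start, limit);
// inner loop bounds stay static so they can still be optimized.
void ParallelForOuterDim(int64_t outer_dim, int num_threads,
                         const std::function<void(int64_t, int64_t)>& body) {
  std::vector<std::thread> workers;
  const int64_t chunk = (outer_dim + num_threads - 1) / num_threads;
  for (int t = 0; t < num_threads; ++t) {
    const int64_t start = t * chunk;
    const int64_t limit = std::min(start + chunk, outer_dim);
    if (start >= limit) break;
    workers.emplace_back([&body, start, limit] { body(start, limit); });
  }
  for (std::thread& w : workers) w.join();
}
```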

Simple element-wise fusion benchmark:

CPU: Intel Sandybridge with HyperThreading (16 cores) dL1:32KB dL2:256KB dL3:20MB
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------
BM_ParallelFusion/T1 16821490 16740939 100 237.791MB/s
BM_ParallelFusion/T2 9175467 17826232 100 435.945MB/s
BM_ParallelFusion/T4 5106019 18875761 100 783.389MB/s
BM_ParallelFusion/T8 2833598 19624622 233 1.379GB/s
BM_ParallelFusion/T16 1995259 26541594 344 1.958GB/s

Performance on some select model benchmarks (more work is needed here, but we wanted to get this CL in and iterate).
Benchmark runs with 16 threads and wall time reported in seconds.

InceptionResnetV2.inception_resnet_v2_200x200x20x1000_inference_xla_cpu
wall_time(old): 7.97818803787
wall_time(new): 4.328297019

InceptionV3.inception_v3_200x200x20x1000_inference_xla_cpu
wall_time(old): 2.96792650223
wall_time(new): 1.21296644211

InceptionResnetV2.inception_resnet_v2_200x200x20x1000_training_xla_cpu
wall_time(old): 42.0342495441
wall_time(new): 17.9182584286

InceptionV3.inception_v3_200x200x20x1000_training_xla_cpu
wall_time(old): 6.99778497219
wall_time(new): 3.95318603516

BenchmarkRNN.rnn_basic_lstm_64x512_4x20_xla_cpu_forward
wall_time(old): 11.869822979
wall_time(new): 7.89778208733

BenchmarkRNN.rnn_basic_lstm_64x512_4x20_xla_cpu_forward_backward
wall_time(old): 38.1911079884
wall_time(new): 29.8181960583

PiperOrigin-RevId: 159474444
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
05412bd367198ec491ca034b4bc634784c03125c 07-Jun-2017 Mark Heffernan <meheff@google.com> [XLA] Simplify Shape traversal visitors.
Simplify shape traversal visitors in ShapeUtil and ShapeTree. Add a non-Status form because most uses of the traversal methods do not need Status, and remove the is_leaf parameter from ShapeTree.ForEach* as it is not frequently used.

PiperOrigin-RevId: 158201574
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
95719e869c61c78a4b0ac0407e1fb04e60daca35 23-May-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA] Teach Executable to do its own profiling (patch 1/4).

Presently, ExecuteOnStreamWrapper is a method on xla::Service, where it doesn't really conceptually belong -- note that it doesn't use anything from the containing Service object, but it does have an Executable object as its first parameter that it could easily be a method on instead. The only reason that it needs to be on Service is that it needs to access a Backend object in order to call backend->compiler()->shape_size_function(), and simply moving that into Executable would introduce a dependency cycle.

Thus, this patch (the first part of a sequence to address this) teaches Executable and its derivatives to compute shape_size_function. In the CPU cases, this is simply a static function. However, in the GPU case, we need to pass in the shape_size_function to the constructor, since it depends on a pointer size computed in the GpuCompiler.
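
A hedged sketch of the hook: xla::Shape and xla::ShapeUtil::ByteSizeOf are real XLA names, while the factory function below is illustrative of how the GPU case closes over the compiler-chosen pointer size.

```c++
#include <cstdint>
#include <functional>
#include "tensorflow/compiler/xla/shape_util.h"  // xla::Shape, xla::ShapeUtil

using ShapeSizeFunction = std::function<int64_t(const xla::Shape&)>;

// CPU backends can bind a static function; the GPU backend must construct
// the function with the pointer size computed by the GpuCompiler.
ShapeSizeFunction MakeShapeSizeFunction(int64_t pointer_size) {
  return [pointer_size](const xla::Shape& shape) {
    return xla::ShapeUtil::ByteSizeOf(shape, pointer_size);
  };
}
```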

PiperOrigin-RevId: 156807318
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
e84588a3e9e67d91d6ae32e64469f890d217c5dd 18-May-2017 Eli Bendersky <eliben@google.com> [XLA] Attach an HloModuleConfig to HloModule, obviating the need to pass them around as a pair.

This cuts through a bunch of critical XLA APIs, but it's time... The background for this change is to make flags/options more easily pipe-able from the TF/XLA boundary deep into the XLA compiler and other components.

The situation after this CL is still not perfect; there are a number of places with chicken-and-egg scenarios when a module has to be constructed before a config (to register the result shape), but the situation is strictly better than before. Future CLs will clean things up even more.

PiperOrigin-RevId: 156469639
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
a20ebced22db1be959cdc9875f1a797fd3367712 12-May-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA:CPU] Prep work for thread-parallel XLA CPU backend.
*) Plumbs intra op thread parallelism value through to XLA backend.
*) Service execution uses inter/intra op pools from backend.
*) LocalService execution uses intra op pool from backend for XLA parallel ops,
and intra op pool passed in ExecutableRunOptions for eigen ops.

PiperOrigin-RevId: 155891730
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
3088d3664a99e7cb81ee190f4d65f4bd10407f42 29-Mar-2017 David Majnemer <majnemer@google.com> [XLA] Move kPad from GpuElementalIrEmitter::MakeElementGenerator to ElementalIrEmitter::MakeElementGenerator

There is nothing GPU-specific in GpuElementalIrEmitter::MakeElementGenerator
for kPad. Move it into the base implementation so that all subclasses have it
as an implementation.
Change: 151564674
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
fce9ecb9cb44e03989c98367aa5a3b73c644a606 28-Mar-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA:CPU] Implements LocalClient support for parallel CPU backend.
Change: 151446054
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
738143e6cd7f9eba0a0e77b44c6cc5ae4e1781ad 07-Mar-2017 Peter Hawkins <phawkins@google.com> [TF:XLA] Remove support for client-allocated result buffers.

This code path is unused; TensorFlow ended up settling on having XLA allocate result buffers using TensorFlow's allocator. Remove it to reduce the proliferation of ExecuteXYZ() methods.
Change: 149423775
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
112a534b50c0a23dec95382941ac0556f2866b29 03-Mar-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA:GPU] Cache GPU substreams across executions
Change: 149063035
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
af2c7253bb1f9d135ad9b0c6a271741205ab57fd 02-Mar-2017 David Majnemer <majnemer@google.com> [XLA] Add support for profiling multiple computations

While we are here, add support for getting the cost analysis for call HLOs.
Change: 148952748
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
8ff1c465c87fc3967c9d480646fac6d6205f856c 07-Feb-2017 A. Unique TensorFlower <gardener@tensorflow.org> [TF:XLA] Change buffer assignment to combine temp buffers into one allocation.

This lays the groundwork for future CLs to reduce overall memory usage, but
doesn't accomplish that goal yet. I.e. this is step 1.

The main change is in the semantics of BufferAllocation. Previously we'd only
assign non-interfering (i.e. disjoint in liveness) LogicalBuffers to a single
BufferAllocation. This meant that each BufferAllocation represented a unique
address range in the working memory of the compiled program.

Now we allow assignment of LogicalBuffers that overlap in liveness to the same
BufferAllocation, by ensuring they occupy disjoint address ranges within the
allocation. Bookkeeping of each address range is accomplished by associating
each LogicalBuffer with an offset and size.

We take advantage of these new semantics to combine all temp buffers into a
single BufferAllocation, by laying them end-to-end in a postprocessing step -
see BufferAssigner::CombineTempAllocations. This is the same logic that
TempBufferOffsets used on the GPU side; that class has been removed.
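
The end-to-end layout step can be illustrated with a hypothetical helper (not the BufferAssigner code): each temp buffer receives a disjoint (offset, size) range within the single combined allocation.

```c++
#include <cstdint>
#include <vector>

struct Slice { int64_t offset; int64_t size; };

// Lays buffers out back-to-back, rounding each offset up to `alignment`;
// the final `offset` is the total size of the combined allocation.
std::vector<Slice> CombineEndToEnd(const std::vector<int64_t>& sizes,
                                   int64_t alignment) {
  std::vector<Slice> slices;
  int64_t offset = 0;
  for (int64_t size : sizes) {
    offset = (offset + alignment - 1) / alignment * alignment;  // align up
    slices.push_back({offset, size});
    offset += size;
  }
  return slices;
}
```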

Entry parameters (inputs) and maybe_live_out (outputs) are unchanged, and may
still occupy multiple BufferAllocations.

The rest of the CL deals with the consequences of these changes.
Change: 146800348
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
8a8c9a185d268074fc6abb8a56466a57a295e98e 23-Jan-2017 Eli Bendersky <eliben@google.com> [XLA] Replace TODO(name) by TODO(b/...)
Change: 145314878
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
1e67c90e2caceeff82d09793d1ef5fa0300d219b 09-Jan-2017 Peter Hawkins <phawkins@google.com> Initial open-source release of XLA: Accelerated Linear Algebra.

XLA is a compiler-based linear algebra execution engine that targets CPUs, GPUs and custom accelerators.

XLA is still experimental; we are releasing it early to get the community involved.
Change: 143990941
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc