History log of /external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
Revision Date Author Comments
f83bb6db5bd17df215994bde7adacef50ece0192 16-Feb-2018 Sanjoy Das <sanjoy@google.com> [XLA:CPU] Minor cleanup to simple_orc_jit

SimpleResolver became unused after an LLVM upstream merge, and we never needed
the name mangling logic in what is now FindCompiledSymbol.

PiperOrigin-RevId: 186039307
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
ffa63e57bdd703ae051ae849af5b5a272fca2223 25-Jan-2018 Sanjoy Das <sanjoy@google.com> [TF:XLA] Replace most of HloProfilePrinter by a protocol buffer

This change replaces the meat of HloProfilePrinter with a protobuf
HloProfilePrinterData. The original plan was to serialize HloProfilePrinter
into C++ source code and put that in a .cc file along with the string for the
xla::ProgramShape. However, since we now directly serialize xla::ProgramShape
into a .o file, for consistency I think we should do the same thing for
HloProfilePrinter (instead of adding yet another output file to tfcompile).

The change itself is fairly simple; it is large mostly due to the mass renaming
I had to do.
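
A minimal sketch of the round trip this enables, assuming the generated header path for the new proto; SerializeAsString/ParseFromArray are the standard protobuf MessageLite calls:

```c++
#include <string>
// Assumed generated header for the new proto in the XLA tree.
#include "tensorflow/compiler/xla/service/hlo_profile_printer_data.pb.h"

// tfcompile side: flatten the message to bytes that can be linked into a .o
// as a constant blob, next to the serialized xla::ProgramShape.
std::string SerializeProfileData(const xla::HloProfilePrinterData& data) {
  return data.SerializeAsString();
}

// AOT binary side: recover the message from the embedded bytes at runtime.
bool DeserializeProfileData(const void* bytes, int size,
                            xla::HloProfilePrinterData* out) {
  return out->ParseFromArray(bytes, size);
}
```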

PiperOrigin-RevId: 183158192
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
fc2526a8c1cf0bc2a93c8cc819ff7209eb4628c9 16-Dec-2017 A. Unique TensorFlower <gardener@tensorflow.org> Merged commit includes the following changes:
179277894 by gunan:

Run buildifier on build file.

--
179275101 by meheff:

Replace DeviceMemoryBase with ShapedBuffer in XLA interfaces.
Executable, TransferManager, and AllocationTracker now use ShapedBuffer to hold device memory addresses holding XLA data. Most of the change is straightforward, with the exception of AllocationTracker, which was mostly rewritten (and simplified), and some refactoring in the CPU executable.

Also, have ShapedBuffer hold on-host and on-device Shapes which are the shapes of the representation of the data on the host and device, respectively. This is necessary because with cl/178624364 the on-host and on-device shape may no longer be equal.

--
179265385 by A. Unique TensorFlower:

Return error rather than CHECK fail in Executable::ExecuteOnStreamWrapper

--
179264551 by dandelion:

Internal fixes.

--

PiperOrigin-RevId: 179277894
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
4b636957604faa3361a799dd9d8749a6b85afff7 22-Nov-2017 Sanjoy Das <sanjoy@google.com> Place HloProfilePrinter and HloProfileIndexMap in Executable

This refactoring will later allow XlaCompiledCpuFunction to pull out the
HloProfilePrinter from Executable and use that to display the hlo execution
profile. A de/serialized HloProfilePrinter will let AOT compiled binaries
display their Hlo execution profile.

PiperOrigin-RevId: 176689528
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
96c415ad77c20e1cf2da5e61f85e24fd6c36eb28 03-Nov-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA] Use maps with a deterministic iteration order for HloInstruction*.

Convert a bunch of std::maps with HloInstruction* and const HloInstruction* keys to use a comparator that is based on the unique_id of the instruction rather than the pointer value.
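
A minimal sketch of the pattern, assuming HloInstruction's unique_id() accessor; UniqueIdCompare is a hypothetical name used for illustration:

```c++
#include <cstdint>
#include <map>
#include "tensorflow/compiler/xla/service/hlo_instruction.h"

// Order HloInstruction* keys by unique_id() rather than by pointer value,
// so std::map iteration order is deterministic across runs.
struct UniqueIdCompare {
  bool operator()(const HloInstruction* a, const HloInstruction* b) const {
    return a->unique_id() < b->unique_id();
  }
};

using InstructionMap =
    std::map<const HloInstruction*, int64_t, UniqueIdCompare>;
```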

PiperOrigin-RevId: 174474868
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
0a7be5a2f58fe5470fa7526c9de1404cb16fe3dc 31-Oct-2017 Sanjoy Das <sanjoy@google.com> Rename (Add|Get)ProfileResult to something more specific; NFC

PiperOrigin-RevId: 174084570
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
8c748bdb7cbf435925675d6b7a3d75ecbefa3351 27-Sep-2017 A. Unique TensorFlower <gardener@tensorflow.org> Add more `const`s to xla::Executable. No functional change.

PiperOrigin-RevId: 170252047
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
06deeea373c93ea36547648481c5daf4dc56126f 27-Sep-2017 Mark Heffernan <meheff@google.com> For tuple-shaped data, change ShapedBuffer (an abstraction holding on-device data of a given shape) to also hold an array of pointers representing the tuple structure in the device memory. Previously ShapedBuffer only held array-shaped data at the leaves of the tuple shape. Construction of these array-of-pointers is handled by TransferManager which has to construct array-of-pointers anyway to transfer literals to the device. This change makes ShapedBuffer match the native representative of tuple-shaped data passed into XLA computations. This is the first step to migrating XLA interfaces away from using naked device memory pointers (DeviceMemoryBase) to using more expressive ShapedBuffers instead.
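
To make the new layout concrete, here is a simplified stand-in (illustrative types only, not the real ShapedBuffer API): a tuple-shaped buffer now carries a device pointer for every node of the tuple tree, with the root holding the array-of-pointers table.

```c++
#include <cstdint>
#include <map>
#include <vector>

// Illustrative stand-ins for DeviceMemoryBase and a shape index.
struct DeviceMemory { void* opaque = nullptr; uint64_t size = 0; };
using ShapeIndex = std::vector<int64_t>;  // path into the tuple tree

struct TupleShapedBufferSketch {
  // {}    -> root table of element pointers (the array-of-pointers)
  // {0}   -> first element's buffer, {1} -> second,
  // {0,1} -> buffer nested inside a tuple element, and so on.
  std::map<ShapeIndex, DeviceMemory> buffers;
};
```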

This change enables tuple-shaped parameters in computations run through the LocalClient interface.

Also, change LocalClient interfaces to return ScopedShapedBuffers, as these are generally easier to deal with ownership-wise than ShapedBuffers. They are analogous to std::unique_ptr, while ShapedBuffers are analogous to bare pointers.

This change includes a couple other cleanups found along the way:

* move cpu/gpu/interpreter transfer managers into their respective directories under xla/service.

* Make the generic transfer manager take a pointer size. Previously it would just use sizeof(void*) which might not be exactly what is needed.

PiperOrigin-RevId: 170133015
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
1196fbc824b45679cba9fae8daa35cc6c02d3599 01-Sep-2017 A. Unique TensorFlower <gardener@tensorflow.org> Use boolean literals where appropriate instead of narrowing ints

PiperOrigin-RevId: 167314054
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
5ead76420dee762a5f710fda6893075f1292d5d3 19-Aug-2017 A. Unique TensorFlower <gardener@tensorflow.org> Reduce XLA compile time by ~7% for a convolutional image model:

* Added CompactPointerSet<T>, which is optimized for set size <= 1 (see the sketch after this list).
* Changed expensive CHECKs to DCHECKS in buffer_assignment.cc
* Reserve space in DFS state array before starting DFS.
* Use unsigned arithmetic in DFS state maintenance.
* HloInstruction:
- Moved frequently used fields to start for better cache locality.
- Use InlinedVector instead of vector for operand array.
- Use InlinedVector instead of vector for DFS stack.
* Pre-compute "is array" and "is tuple" for LogicalBuffer.
* PointsToSet:
- Combine two ShapeTrees into one.
- Use CompactPointerSet instead of std::set to hold sources.
- Use CompactPointerSet instead of std::set to hold flattened buffers.
* ShapeTree: use unique_ptr instead of optional for shape storage
(reduces size and destruction overhead).
* Add proper const qualifiers to some FlatSet iterator methods.
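
The size <= 1 optimization might look roughly like this; TinyPointerSet is a hypothetical illustration, not the actual CompactPointerSet code:

```c++
#include <cstddef>
#include <set>

// Small-set optimization: the common 0-or-1 element case is stored inline;
// only sets that grow past one element pay for a heap-backed std::set.
template <typename T>
class TinyPointerSet {
 public:
  void insert(T* p) {
    if (overflow_.empty()) {
      if (single_ == nullptr || single_ == p) { single_ = p; return; }
      overflow_.insert(single_);  // spill the inline element
      single_ = nullptr;
    }
    overflow_.insert(p);
  }
  bool contains(T* p) const { return p == single_ || overflow_.count(p) > 0; }
  size_t size() const { return (single_ != nullptr ? 1 : 0) + overflow_.size(); }

 private:
  T* single_ = nullptr;    // inline storage for the common case
  std::set<T*> overflow_;  // fallback for size > 1
};
```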

Co-author=jeff
PiperOrigin-RevId: 165759117
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
b882d686ff00f73425a846c47e29a7c336435f25 01-Aug-2017 Bjarke Hammersholt Roune <broune@google.com> Allow cost estimates to differ per backend and include the estimates into the HLO profile. Add a summary table for what categories have the most opportunity for optimization left in them.

PiperOrigin-RevId: 163780413
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
34cbf161d7b1191ad5c1b3bc02fc52d338e8b175 27-Jul-2017 Jiri Simsa <jsimsa@google.com> Update Dataset API documentation.

PiperOrigin-RevId: 163349457
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
90d6421c5e0898fb840197d9533c2f8ba1a7c651 11-Jul-2017 Shanqing Cai <cais@google.com> Merge changes from github.
END_PUBLIC

---
Commit d0f53f77f authored by Penghao Cen<scorpiocph@gmail.com>
Committed by Shanqing Cai<cais@google.com>:
Fix minor typo (#11323)

---
Commit 02fcf564e authored by Chris Song<sjhshy@gmail.com>
Committed by Chris Song<sjhshy@gmail.com>:
Fix misspellings.

---
Commit 764c9b6b4 authored by Louis Tiao<ltiao@users.noreply.github.com>
Committed by GitHub<noreply@github.com>:
Fixed typo in docstring
---
Commit f8cd1283e authored by Shanqing Cai<cais@google.com>
Committed by Shanqing Cai<cais@google.com>:
Chaser

---
Commit 01383b946 authored by Shanqing Cai<cais@google.com>
Committed by Shanqing Cai<cais@google.com>:
Adapt TensorFlowTestCase.setUp() to new reset_default_graph() semantics

Avoid calling reset_default_graph() directly to prevent exceptions in
cases where test methods error out from within nested graph contexts,
which can leave _default_graph_stack non-empty in certain Python
versions.

---
Commit 0ffc37890 authored by Amit Patankar<amitpatankar@google.com>
Committed by Amit Patankar<amitpatankar@google.com>:
Removing second declaration of functions.

---
Commit f9c9cacb0 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Refactor ElementalIrEmitter's slice index finding code into
IrArray::Index::SourceIndexOfSlice().

PiperOrigin-RevId: 161140653

---
Commit ba297aec9 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Update ops-related pbtxt files.

PiperOrigin-RevId: 161138258

---
Commit 68d666737 authored by Alexandre Passos<apassos@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fixes a reentrant lock issue with tensors using ndarray memory which uses tensor memory.

PiperOrigin-RevId: 161137788

---
Commit a2ee8bca3 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Add support for int8 x int8 -> int32 matrix multiplication via cublasGemmEx to stream_executor.
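
The call shape for this path is roughly as follows; cublasGemmEx is the real cuBLAS entry point, but the leading-dimension choices and default algorithm here are illustrative and error handling is elided:

```c++
#include <cstdint>
#include <cublas_v2.h>

// C = A * B with int8 inputs accumulated into int32 (column-major, no
// transposes). alpha/beta are int32 because the compute type is CUDA_R_32I;
// note the int8 path imposes alignment constraints on m, n, k, and the
// leading dimensions.
void Int8GemmSketch(cublasHandle_t handle, int m, int n, int k,
                    const int8_t* a, const int8_t* b, int32_t* c) {
  const int32_t alpha = 1;
  const int32_t beta = 0;
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
               &alpha, a, CUDA_R_8I, /*lda=*/m,
               b, CUDA_R_8I, /*ldb=*/k,
               &beta, c, CUDA_R_32I, /*ldc=*/m,
               CUDA_R_32I, CUBLAS_GEMM_DFALT);
}
```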

PiperOrigin-RevId: 161137741

---
Commit 755fa7b50 authored by Mark Daoust<markdaoust@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Block generate_test and docs generation from running in python3.

- Doc generation is currently unsupported in python3

- These both end in errors in python 3.5.1+

PiperOrigin-RevId: 161137467

---
Commit 97cbcac45 authored by Peter Hawkins<phawkins@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[TF:XLA] Fix failure in functionalize_control_flow rewrite for Enter nodes that are unused. Make sure we ignore such nodes without producing an error.

PiperOrigin-RevId: 161136545

---
Commit dabcb60bc authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[XLA] Add reasonable error messages to Builder::Build for bad parameter numbers.

PiperOrigin-RevId: 161136262

---
Commit 0cbd249e8 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Add complex tensors support to `matrix_determinant`.

PiperOrigin-RevId: 161132422

---
Commit 335f1f14d authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Extend static shape inference for SparseTensors with dense_shapes constructed using slicing.

PiperOrigin-RevId: 161132391

---
Commit 53604916e authored by Jianwei Xie<xiejw@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Fixed the missing labels test in TPUEstimator.

PiperOrigin-RevId: 161131282

---
Commit 9f57dc8dd authored by Bruno Rosa<bruno.rosa@eldorado.org.br>
Committed by Bruno Rosa<bruno.rosa@eldorado.org.br>:
Use mcpu instead of march for ppc64le

march is not supported by gcc on ppc64le

---
Commit 7d5c74a9c authored by Skye Wanderman-Milne<skyewm@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Move duplicate detection logic from Graph to FunctionLibraryDefinition

Turns out this is more useful, since there are many function libraries
that don't belong to a graph. This will be used in a future
change. Note that this maintains the current behavior of Graph.

In addition, updates FunctionDefsEqual() to handle unset attr entries
(I ran into this when using this in said future change).

PiperOrigin-RevId: 161126628

---
Commit 2caec3af1 authored by Shanqing Cai<cais@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Disable more timeseries py tests failing in OSS PIP GPU builds

PiperOrigin-RevId: 161124799

---
Commit 0b5cce367 authored by Eugene Brevdo<ebrevdo@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Get TopK op working on GPU again. Extend using cub's radix sort.

1. Undo rollback of Andreas Kirsch's initial implementation.
2. Use cub segmented radix sort instead of Andreas' heap-based impl
for large k and small num_cols (thresholds of k=100, n=1000
determined empirically); a sketch of the segmented sort follows this list.
3. Use cub segmented radix sort if k == num_cols (this case is always faster).
4. Added benchmarks.
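
The segmented-sort step might be sketched as follows; cub::DeviceSegmentedRadixSort::SortPairsDescending is the real CUB API, while buffer setup and the final top-k slice per row are elided:

```c++
#include <cstddef>
#include <cub/cub.cuh>

// Sorts each row (segment) of a [num_rows, row_len] matrix in descending
// key order, carrying column indices along as values; the first k outputs
// of each segment are then that row's top-k.
void SegmentedSortDescending(const float* keys_in, float* keys_out,
                             const int* vals_in, int* vals_out,
                             int num_rows, int row_len,
                             const int* row_offsets,  // num_rows + 1 entries
                             cudaStream_t stream) {
  void* temp = nullptr;
  size_t temp_bytes = 0;
  // First call only computes the temp-storage size; second call sorts.
  cub::DeviceSegmentedRadixSort::SortPairsDescending(
      temp, temp_bytes, keys_in, keys_out, vals_in, vals_out,
      num_rows * row_len, num_rows, row_offsets, row_offsets + 1,
      /*begin_bit=*/0, /*end_bit=*/sizeof(float) * 8, stream);
  cudaMalloc(&temp, temp_bytes);
  cub::DeviceSegmentedRadixSort::SortPairsDescending(
      temp, temp_bytes, keys_in, keys_out, vals_in, vals_out,
      num_rows * row_len, num_rows, row_offsets, row_offsets + 1,
      /*begin_bit=*/0, /*end_bit=*/sizeof(float) * 8, stream);
  cudaFree(temp);
}
```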

Benchmarks show that the GPU implementation is up to 3x slower for small k but
can be 10x faster for large num_cols and k.

Benchmarks:

Benchmark: m_128_n_10_k_5_use_gpu_False wall_time: 0.000166 s Throughput: 0.0077 GB/s
Benchmark: m_128_n_10_k_5_use_gpu_True wall_time: 0.000796 s Throughput: 0.00161 GB/s
Benchmark: m_128_n_10_k_9_use_gpu_False wall_time: 0.00017 s Throughput: 0.00751 GB/s
Benchmark: m_128_n_10_k_9_use_gpu_True wall_time: 0.000796 s Throughput: 0.00161 GB/s
Benchmark: m_128_n_10_k_10_use_gpu_False wall_time: 0.00017 s Throughput: 0.00753 GB/s
Benchmark: m_128_n_10_k_10_use_gpu_True wall_time: 0.000775 s Throughput: 0.00165 GB/s
Benchmark: m_128_n_100_k_1_use_gpu_False wall_time: 0.000155 s Throughput: 0.0826 GB/s
Benchmark: m_128_n_100_k_1_use_gpu_True wall_time: 0.000796 s Throughput: 0.0161 GB/s
Benchmark: m_128_n_100_k_50_use_gpu_False wall_time: 0.000247 s Throughput: 0.0519 GB/s
Benchmark: m_128_n_100_k_50_use_gpu_True wall_time: 0.0008 s Throughput: 0.016 GB/s
Benchmark: m_128_n_100_k_99_use_gpu_False wall_time: 0.000261 s Throughput: 0.049 GB/s
Benchmark: m_128_n_100_k_99_use_gpu_True wall_time: 0.000794 s Throughput: 0.0161 GB/s
Benchmark: m_128_n_100_k_100_use_gpu_False wall_time: 0.000239 s Throughput: 0.0536 GB/s
Benchmark: m_128_n_100_k_100_use_gpu_True wall_time: 0.000777 s Throughput: 0.0165 GB/s
Benchmark: m_128_n_1000_k_1_use_gpu_False wall_time: 0.000324 s Throughput: 0.395 GB/s
Benchmark: m_128_n_1000_k_1_use_gpu_True wall_time: 0.000916 s Throughput: 0.14 GB/s
Benchmark: m_128_n_1000_k_10_use_gpu_False wall_time: 0.00042 s Throughput: 0.305 GB/s
Benchmark: m_128_n_1000_k_10_use_gpu_True wall_time: 0.000902 s Throughput: 0.142 GB/s
Benchmark: m_128_n_1000_k_500_use_gpu_False wall_time: 0.0011 s Throughput: 0.116 GB/s
Benchmark: m_128_n_1000_k_500_use_gpu_True wall_time: 0.00097 s Throughput: 0.132 GB/s
Benchmark: m_128_n_1000_k_990_use_gpu_False wall_time: 0.00133 s Throughput: 0.0962 GB/s
Benchmark: m_128_n_1000_k_990_use_gpu_True wall_time: 0.000993 s Throughput: 0.129 GB/s
Benchmark: m_128_n_1000_k_1000_use_gpu_False wall_time: 0.00102 s Throughput: 0.126 GB/s
Benchmark: m_128_n_1000_k_1000_use_gpu_True wall_time: 0.000964 s Throughput: 0.133 GB/s
Benchmark: m_128_n_10000_k_10_use_gpu_False wall_time: 0.002 s Throughput: 0.64 GB/s
Benchmark: m_128_n_10000_k_10_use_gpu_True wall_time: 0.00288 s Throughput: 0.445 GB/s
Benchmark: m_128_n_10000_k_100_use_gpu_False wall_time: 0.00233 s Throughput: 0.549 GB/s
Benchmark: m_128_n_10000_k_100_use_gpu_True wall_time: 0.00325 s Throughput: 0.394 GB/s
Benchmark: m_128_n_10000_k_5000_use_gpu_False wall_time: 0.0127 s Throughput: 0.101 GB/s
Benchmark: m_128_n_10000_k_5000_use_gpu_True wall_time: 0.00381 s Throughput: 0.336 GB/s
Benchmark: m_128_n_10000_k_9900_use_gpu_False wall_time: 0.015 s Throughput: 0.0853 GB/s
Benchmark: m_128_n_10000_k_9900_use_gpu_True wall_time: 0.00438 s Throughput: 0.292 GB/s
Benchmark: m_128_n_10000_k_10000_use_gpu_False wall_time: 0.0104 s Throughput: 0.123 GB/s
Benchmark: m_128_n_10000_k_10000_use_gpu_True wall_time: 0.00427 s Throughput: 0.3 GB/s
Benchmark: m_128_n_100000_k_100_use_gpu_False wall_time: 0.0148 s Throughput: 0.865 GB/s
Benchmark: m_128_n_100000_k_100_use_gpu_True wall_time: 0.0262 s Throughput: 0.488 GB/s
Benchmark: m_128_n_100000_k_1000_use_gpu_False wall_time: 0.0201 s Throughput: 0.636 GB/s
Benchmark: m_128_n_100000_k_1000_use_gpu_True wall_time: 0.0263 s Throughput: 0.486 GB/s
Benchmark: m_128_n_100000_k_50000_use_gpu_False wall_time: 0.214 s Throughput: 0.0599 GB/s
Benchmark: m_128_n_100000_k_50000_use_gpu_True wall_time: 0.0322 s Throughput: 0.398 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_False wall_time: 0.262 s Throughput: 0.0489 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_True wall_time: 0.0377 s Throughput: 0.34 GB/s
Benchmark: m_128_n_100000_k_100000_use_gpu_False wall_time: 0.118 s Throughput: 0.108 GB/s
Benchmark: m_128_n_100000_k_100000_use_gpu_True wall_time: 0.0365 s Throughput: 0.351 GB/s

END_PUBLIC

BEGIN_PUBLIC
Automated g4 rollback of changelist 157169178

PiperOrigin-RevId: 161476569
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
b760c0cade03186a8f194390f6cba46fb363bfca 07-Jul-2017 A. Unique TensorFlower <gardener@tensorflow.org> Update xla compiler after upstream Orc API change r307350:
https://reviews.llvm.org/rL307350

PiperOrigin-RevId: 161195744
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
3b41352a3177c2fe8a1329e8981b285bb6aacf8b 19-Jun-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA:CPU] Thread-parallel CPU backend (work in progress).
*) Partitions HLO instructions along outer dimensions, based on simple cost model.
*) Emits loop nests with dynamic outer loop bounds (for partitions), leaves inner loop bounds static (for optimizations).
*) Dispatches parallel tasks on thread pool for execution (simplified sketch below).
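
A simplified illustration of the dispatch, using plain std::thread; the real backend emits LLVM IR for the loop nests and schedules the tasks on a runtime thread pool:

```c++
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Splits [0, outer_dim) into contiguous chunks, one task per thread. Each
// task runs the loop-nest body with dynamic outer bounds [start, limit);
// inner loop bounds stay static so they can still be optimized.
void ParallelForOuterDim(int64_t outer_dim, int num_threads,
                         const std::function<void(int64_t, int64_t)>& body) {
  std::vector<std::thread> workers;
  const int64_t chunk = (outer_dim + num_threads - 1) / num_threads;
  for (int t = 0; t < num_threads; ++t) {
    const int64_t start = t * chunk;
    const int64_t limit = std::min(start + chunk, outer_dim);
    if (start >= limit) break;
    workers.emplace_back([&body, start, limit] { body(start, limit); });
  }
  for (std::thread& w : workers) w.join();
}
```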

Simple element-wise fusion benchmark:

CPU: Intel Sandybridge with HyperThreading (16 cores) dL1:32KB dL2:256KB dL3:20MB
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------
BM_ParallelFusion/T1 16821490 16740939 100 237.791MB/s
BM_ParallelFusion/T2 9175467 17826232 100 435.945MB/s
BM_ParallelFusion/T4 5106019 18875761 100 783.389MB/s
BM_ParallelFusion/T8 2833598 19624622 233 1.379GB/s
BM_ParallelFusion/T16 1995259 26541594 344 1.958GB/s

Performance on some select model benchmarks (more work is needed here, but we wanted to get this CL in and iterate).
Benchmark runs with 16 threads and wall time reported in seconds.

InceptionResnetV2.inception_resnet_v2_200x200x20x1000_inference_xla_cpu
wall_time(old): 7.97818803787
wall_time(new): 4.328297019

InceptionV3.inception_v3_200x200x20x1000_inference_xla_cpu
wall_time(old): 2.96792650223
wall_time(new): 1.21296644211

InceptionResnetV2.inception_resnet_v2_200x200x20x1000_training_xla_cpu
wall_time(old): 42.0342495441
wall_time(new): 17.9182584286

InceptionV3.inception_v3_200x200x20x1000_training_xla_cpu
wall_time(old): 6.99778497219
wall_time(new): 3.95318603516

BenchmarkRNN.rnn_basic_lstm_64x512_4x20_xla_cpu_forward
wall_time(old): 11.869822979
wall_time(new): 7.89778208733

BenchmarkRNN.rnn_basic_lstm_64x512_4x20_xla_cpu_forward_backward
wall_time(old): 38.1911079884
wall_time(new): 29.8181960583

PiperOrigin-RevId: 159474444
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
05412bd367198ec491ca034b4bc634784c03125c 07-Jun-2017 Mark Heffernan <meheff@google.com> [XLA] Simplify Shape traversal visitors.
Simplify shape traversal visitors in ShapeUtil and ShapeTree. Add a non-Status form because most uses of the traversal methods do not need Status, and remove the is_leaf parameter from ShapeTree.ForEach* as it is not frequently used.

PiperOrigin-RevId: 158201574
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
95719e869c61c78a4b0ac0407e1fb04e60daca35 23-May-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA] Teach Executable to do its own profiling (patch 1/4).

Presently, ExecuteOnStreamWrapper is a method on xla::Service, where it doesn't really conceptually belong -- note that it doesn't use anything from the containing Service object, but it does have an Executable object as its first parameter that it could easily be a method on instead. The only reason that it needs to be on Service is that it needs to access a Backend object in order to call backend->compiler()->shape_size_function(), and simply moving that into Executable would introduce a dependency cycle.

Thus, this patch (the first part of a sequence to address this) teaches Executable and its derivatives to compute shape_size_function. In the CPU cases, this is simply a static function. However, in the GPU case, we need to pass in the shape_size_function to the constructor, since it depends on a pointer size computed in the GpuCompiler.
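
A hedged sketch of the hook: xla::Shape and xla::ShapeUtil::ByteSizeOf are real XLA names, while the factory function below is illustrative of how the GPU case closes over the compiler-chosen pointer size.

```c++
#include <cstdint>
#include <functional>
#include "tensorflow/compiler/xla/shape_util.h"  // xla::Shape, xla::ShapeUtil

using ShapeSizeFunction = std::function<int64_t(const xla::Shape&)>;

// CPU backends can bind a static function; the GPU backend must construct
// the function with the pointer size computed by the GpuCompiler.
ShapeSizeFunction MakeShapeSizeFunction(int64_t pointer_size) {
  return [pointer_size](const xla::Shape& shape) {
    return xla::ShapeUtil::ByteSizeOf(shape, pointer_size);
  };
}
```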

PiperOrigin-RevId: 156807318
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
e84588a3e9e67d91d6ae32e64469f890d217c5dd 18-May-2017 Eli Bendersky <eliben@google.com> [XLA] Attach an HloModuleConfig to HloModule, obviating the need to pass them around as a pair.

This cuts through a bunch of critical XLA APIs, but it's time... The background for this change is to make flags/options more easily pipe-able from the TF/XLA boundary deep into the XLA compiler and other components.

The situation after this CL is still not perfect; there are a number of places with chicken-and-egg scenarios when a module has to be constructed before a config (to register the result shape), but the situation is strictly better than before. Future CLs will clean things up even more.

PiperOrigin-RevId: 156469639
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
a20ebced22db1be959cdc9875f1a797fd3367712 12-May-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA:CPU] Prep work for thread-parallel XLA CPU backend.
*) Plumbs intra op thread parallelism value through to XLA backend.
*) Service execution uses inter/intra op pools from backend.
*) LocalService execution uses intra op pool from backend for XLA parallel ops,
and intra op pool passed in ExecutableRunOptions for eigen ops.

PiperOrigin-RevId: 155891730
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
3088d3664a99e7cb81ee190f4d65f4bd10407f42 29-Mar-2017 David Majnemer <majnemer@google.com> [XLA] Move kPad from GpuElementalIrEmitter::MakeElementGenerator to ElementalIrEmitter::MakeElementGenerator

There is nothing GPU-specific in GpuElementalIrEmitter::MakeElementGenerator
for kPad. Move it into the base implementation so that all subclasses have it
as an implementation.
Change: 151564674
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
fce9ecb9cb44e03989c98367aa5a3b73c644a606 28-Mar-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA:CPU] Implements LocalClient support for parallel CPU backend.
Change: 151446054
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
738143e6cd7f9eba0a0e77b44c6cc5ae4e1781ad 07-Mar-2017 Peter Hawkins <phawkins@google.com> [TF:XLA] Remove support for client-allocated result buffers.

This code path is unused; TensorFlow ended up settling on having XLA allocate result buffers using TensorFlow's allocator. Remove it to reduce the proliferation of ExecuteXYZ() methods.
Change: 149423775
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
112a534b50c0a23dec95382941ac0556f2866b29 03-Mar-2017 A. Unique TensorFlower <gardener@tensorflow.org> [XLA:GPU] Cache GPU substreams across executions
Change: 149063035
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
af2c7253bb1f9d135ad9b0c6a271741205ab57fd 02-Mar-2017 David Majnemer <majnemer@google.com> [XLA] Add support for profiling multiple computations

While we are here, add support for getting the cost analysis for call HLOs.
Change: 148952748
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
8ff1c465c87fc3967c9d480646fac6d6205f856c 07-Feb-2017 A. Unique TensorFlower <gardener@tensorflow.org> [TF:XLA] Change buffer assignment to combine temp buffers into one allocation.

This lays the groundwork for future CLs to reduce overall memory usage, but
doesn't accomplish that goal yet. I.e. this is step 1.

The main change is in the semantics of BufferAllocation. Previously we'd only
assign non-interfering (i.e. disjoint in liveness) LogicalBuffers to a single
BufferAllocation. This meant that each BufferAllocation represented a unique
address range in the working memory of the compiled program.

Now we allow assignment of LogicalBuffers that overlap in liveness to the same
BufferAllocation, by ensuring they occupy disjoint address ranges within the
allocation. Bookkeeping of each address range is accomplished by associating
each LogicalBuffer with an offset and size.

We take advantage of these new semantics to combine all temp buffers into a
single BufferAllocation, by laying them end-to-end in a postprocessing step -
see BufferAssigner::CombineTempAllocations. This is the same logic that
TempBufferOffsets used on the GPU side; that class has been removed.
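
The end-to-end layout step can be illustrated with a hypothetical helper (not the BufferAssigner code): each temp buffer receives a disjoint (offset, size) range within the single combined allocation.

```c++
#include <cstdint>
#include <vector>

struct Slice { int64_t offset; int64_t size; };

// Lays buffers out back-to-back, rounding each offset up to `alignment`;
// the final `offset` is the total size of the combined allocation.
std::vector<Slice> CombineEndToEnd(const std::vector<int64_t>& sizes,
                                   int64_t alignment) {
  std::vector<Slice> slices;
  int64_t offset = 0;
  for (int64_t size : sizes) {
    offset = (offset + alignment - 1) / alignment * alignment;  // align up
    slices.push_back({offset, size});
    offset += size;
  }
  return slices;
}
```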

Entry parameters (inputs) and maybe_live_out (outputs) are unchanged, and may
still occupy multiple BufferAllocations.

The rest of the CL deals with the consequences of these changes.
Change: 146800348
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
8a8c9a185d268074fc6abb8a56466a57a295e98e 23-Jan-2017 Eli Bendersky <eliben@google.com> [XLA] Replace TODO(name) by TODO(b/...)
Change: 145314878
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
1e67c90e2caceeff82d09793d1ef5fa0300d219b 09-Jan-2017 Peter Hawkins <phawkins@google.com> Initial open-source release of XLA: Accelerated Linear Algebra.

XLA is a compiler-based linear algebra execution engine that targets CPUs, GPUs and custom accelerators.

XLA is still experimental; we are releasing it early to get the community involved.
Change: 143990941
/external/tensorflow/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc