7e23850c2145ed565c668d6ba327dbcf064d4ed8 |
|
09-Feb-2018 |
Gunhan Gulsoy <gunan@google.com> |
Remove header dependence on cuda_config.h to fix opensource custom op support. Fixes #14454 Fixes #12860 PiperOrigin-RevId: 185194924
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
df982b8dea49eba273e33e4283c3b14eab171b04 |
|
09-Feb-2018 |
Guangda Lai <laigd@google.com> |
Split gpu_id.h and GpuIdManager out from build target //tensorflow/core:gpu_runtime to reduce the size of dependencies, so that lightweight libraries like the grappler utils can use the TfToCudaGpuId translation function without depending on things like stream executor and the cuda libraries. PiperOrigin-RevId: 185175757
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
d90054e7c0f41f4bab81df0548577a73b939a87a |
|
07-Feb-2018 |
Michael Case <mikecase@google.com> |
Merge changes from github. PiperOrigin-RevId: 184897758
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
993398d23d60efbc153c6ee7af33b0ecab85d33b |
|
06-Feb-2018 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Add local interconnect data to DeviceLocality. This information can be used within a distributed implementation when deciding how to route data transfers that might involve more than one hop. By default the new fields are populated according to StreamExecutor::CanEnablePeerAccessTo(), however a platform-specific implementation can augment them with more detailed values. Do some refactoring of gpu_device and gpu_device_factory, making GetDeviceLocalities() and GetInterconnectMaps() into virtual functions. PiperOrigin-RevId: 184698821
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
7149a2e2e2f549035f23e21224ee41afe8df3876 |
|
30-Jan-2018 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Cleanup: Ran clang-format on files in tensorflow/core/.../*.{cc,h}. PiperOrigin-RevId: 183848459
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
19993eff100fc8041bbab974105d494561845352 |
|
17-Jan-2018 |
Guangda Lai <laigd@google.com> |
Log all valid visible cuda gpu ids in one line. PiperOrigin-RevId: 182121746
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
71896cc7e5bd3d1b8b5bb615eac7bebf86fa998c |
|
04-Jan-2018 |
Raghuraman Krishnamoorthi <raghuramank@google.com> |
Merge changes from github. PiperOrigin-RevId: 180746153
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
14cb8e14a8fb1e78e2ce623e4198972762e6e253 |
|
19-Dec-2017 |
Guangda Lai <laigd@google.com> |
Added virtual gpu support. PiperOrigin-RevId: 179504116
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
87cfa5696122c2173902accd47418ee4f25995d7 |
|
13-Dec-2017 |
Guangda Lai <laigd@google.com> |
Refactor helper functions a bit for virtual gpu changes later. PiperOrigin-RevId: 178826426
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
c381794b2fc3227bfee9cf085e26bafb33da8f4b |
|
11-Dec-2017 |
Xiaoqiang Zheng <zhengxq@google.com> |
Support different threading modes in GPU device. All modes are experimental for now. The goal is to find the best setting, and change the default to pick that. PiperOrigin-RevId: 178662212
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
487558cf40f85539f740959dd54a4f5eee8e0560 |
|
01-Dec-2017 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Log the error status for failed CUDA EnablePeerAccess. This would help debug issues like this: #14759 PiperOrigin-RevId: 177625164
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
5439c1e2de01a8684b62aba224d44c392176ac32 |
|
20-Nov-2017 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Remove nonfunctional and accidentally pessimizing value category casts. The conditional expression only has ONE fixed value category. Because in the code as written the two operands are of different value categories, the result is in fact a prvalue, i.e. a copy. This seems unintended, and we should simply preserve the existing lvalue. If we do want to allow moving, we need multiple statements: if (num == 1) { f(std::move(copier)); } else { f(copier); } PiperOrigin-RevId: 176414503
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
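The value-category pitfall this commit fixes can be sketched as follows. `Copier` and `f` here are hypothetical stand-ins for the identifiers named in the commit message, and `f` is assumed to take its argument by const reference; the counts show the extra copy the mixed-category conditional materializes:

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for the `copier` object named in the commit.
struct Copier {
  std::string tag;
  int copies = 0;  // how many copy-constructions produced this object
  explicit Copier(std::string t) : tag(std::move(t)) {}
  Copier(const Copier& o) : tag(o.tag), copies(o.copies + 1) {}
  Copier(Copier&& o) noexcept : tag(std::move(o.tag)), copies(o.copies) {}
};

int f(const Copier& c) { return c.copies; }

// Pessimizing form: because the two operands have different value
// categories, the whole conditional expression is a prvalue, i.e. a fresh
// temporary, so the lvalue branch pays for a copy that plain f(copier)
// would never make.
int mixed(bool last_use) {
  Copier copier("c");
  return f(last_use ? std::move(copier) : copier);
}

// Fix from the commit message: separate statements preserve each
// operand's own value category, so no temporary is materialized.
int split(bool last_use) {
  Copier copier("c");
  if (last_use) {
    return f(std::move(copier));
  }
  return f(copier);
}
```

With C++17 guaranteed copy elision the temporary still appears in `mixed`, because the prvalue must be materialized to bind the reference.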
|
4b9238b2bce08dbf6dc433d1c4911043dd60403f |
|
13-Nov-2017 |
Eugene Brevdo <ebrevdo@google.com> |
Support non-scalar variant device copy. PiperOrigin-RevId: 175546097
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
21a07d7ca86e7b7429d4e75ed1dfc440b94ef3bd |
|
07-Nov-2017 |
Guangda Lai <laigd@google.com> |
Automated g4 rollback of changelist 174735029 PiperOrigin-RevId: 174796480
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
4535bd5df4d077072a8f207146bf4cd051971237 |
|
06-Nov-2017 |
Yangzihao Wang <yangzihao@google.com> |
Force CUDA runtime initialization only when device count is larger than 0. PiperOrigin-RevId: 174767565
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
e76519c75cf9e64fc1023c80d8b10ee712418e13 |
|
06-Nov-2017 |
Guangda Lai <laigd@google.com> |
Refactor helper functions a bit for virtual gpu changes later. PiperOrigin-RevId: 174735029
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
bfa539c03cd1555024fc04f4974e531c46b24e07 |
|
26-Oct-2017 |
Malcolm Reynolds <mareynolds@google.com> |
Automated g4 rollback of changelist 173456597 PiperOrigin-RevId: 173542536
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
5fe90b57748714341d02b2b44a7ec8ff27123bc0 |
|
26-Oct-2017 |
Yangzihao Wang <yangzihao@google.com> |
Force the CUDA runtime initialization before device creation. This is to avoid silent failure and garbage results produced when launching two TensorFlow programs simultaneously in two different processes. PiperOrigin-RevId: 173456597
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
ba5a5bfc23065086990ec3057caa2ded0c8a8dbf |
|
17-Oct-2017 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Add the op->IsExpensive() argument to tracing calls. PiperOrigin-RevId: 172422580
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
0a11eaffc985ad6abd3a0e792061e1880766674a |
|
03-Oct-2017 |
Eugene Brevdo <ebrevdo@google.com> |
Internal Variant API allowing registering Variants to be copied from/to GPU. Adds a test in the variant_op_copy_test. Modifies the base GPUDevice to use this registry if it sees a singleton variant. Modifies the rendezvous manager to do the same. PiperOrigin-RevId: 170908757
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
eb0808fb95567c1f5b7ce48d29f47edfd988aff8 |
|
23-Aug-2017 |
Benoit Steiner <bsteiner@google.com> |
Converted LOG(FATAL) into regular errors to prevent the process from crashing on error. PiperOrigin-RevId: 166257105
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
28ce1d163eeffe618a6972c5245be0e660d94e85 |
|
15-Aug-2017 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Merge changes from github. END_PUBLIC
--- Commit 9f81374c3 authored by raymondxyang<zihao.yang@microsoft.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: Add option for build more python tests in Cmake (#11853) * Ignore Windows built project * Fix deprecated methods in tf.contrib.python * Fix regex match for Windows build in contrib.keras * Fix Regex match for Windows build in session_bundle * * Fix deprecated methods * Fix regex match for Windows * Fix compatibility issue with Python 3.x * Add missing ops into Windows build for test * Enabled more testcases for Windows build * Clean code and fix typo * Add conditional cmake mode for enabling more unit testcase * Add Cmake mode for major Contrib packages * Add supplementary info in README for new cmake option * * Update tf_tests after testing with TF 1.3 * Clean code and resolve conflicts * Fix unsafe regex matches and format code * Update exclude list after testing with latest master branch * Fix missing module
--- Commit 98f0e1efe authored by Yong Tang<yong.tang.github@outlook.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: Dynamic ksize and strides with MaxPool (#11875) * Dynamic ksize with max_pool This fix tries to fix the issue raised in 4746 where ksize is static (attr) with max_pool. This fix changes ksize to input tensor so that it is dynamic now. This fix fixes 4746. Signed-off-by: Yong Tang <yong.tang.github@outlook.com> * Add dynamic ksize to MaxPoolGrad and MaxPoolGradGrad Signed-off-by: Yong Tang <yong.tang.github@outlook.com> * Add test cases for max_pool_v2 Signed-off-by: Yong Tang <yong.tang.github@outlook.com> * Fix GPU Jenkins issue. Signed-off-by: Yong Tang <yong.tang.github@outlook.com> * Enable MaxPoolV2 in GPU Signed-off-by: Yong Tang <yong.tang.github@outlook.com> * Hide MaxPoolV2 and other fixes. Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
--- Commit 02d6bc185 authored by Bairen Yi<byronyi@users.noreply.github.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: remove useless variable (#12212)
--- Commit ed6b0d905 authored by namrata-ibm<bhavenamrata@gmail.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: Adding support for s390x in calculation of cpu_frequency (#12201)
--- Commit 627dfc9dd authored by Taehoon Lee<taehoonlee@snu.ac.kr> Committed by Taehoon Lee<taehoonlee@snu.ac.kr>: Fix typos
--- Commit c0f9b0a91 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: In fast-math mode emit a tanh that has a faster min/max. PiperOrigin-RevId: 164943597
--- Commit 87605f3d6 authored by Kay Zhu<kayzhu@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [TF:XLA] Use HloEvaluator for ComputeConstant, remove the need of a dedicated compute constant backend. PiperOrigin-RevId: 164940970
--- Commit 881de45c2 authored by Taehoon Lee<me@taehoonlee.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: Add bool type supports for GPU kernels (#11927) * Add bool type supports for GPU kernels * Add bool type test codes for GPU kernels
--- Commit eeacdcdb1 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add missing "CPU" suffix in registrations. PiperOrigin-RevId: 164939527
--- Commit de01be952 authored by namrata-ibm<bhavenamrata@gmail.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: Adding support for Big Endian in graph_constructor_test and wav_io (#12179)
--- Commit 26719d29f authored by QingYing Chen<pkudysj@126.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: Implement CRF decode (Viterbi decode) for tensor (#12056) * Implement CRF decoding for tensors * add test code for tensor version's CRF decoding * made modifications according to pylint * add some comments for crf decode * remove useless code * add comments at the top comment of crf module and add more comments in crf_test * capitalize first char of first word in comments * replace crf_decode test code with a deterministic example
--- Commit f9a81ca2f authored by Pete Warden<pete@petewarden.com> Committed by gunan<gunan@google.com>: Create CI build script for Raspberry Pi (#12190) * Create CI build script for Raspberry Pi * Moved location of Pi build script
--- Commit e2a163a90 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Merge code from PR #11940 with internal changes from cl/164796436, and update Python tests to also run on GPU. PiperOrigin-RevId: 164929133
--- Commit 08bbfa187 authored by Taehoon Lee<me@taehoonlee.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: Fix typos (#12195)
--- Commit ab96f41fb authored by Luke Iwanski<luke@codeplay.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: [OpenCL] Extends matmul_benchmark.py to cover SYCL (#11697) * [OpenCL] Extends matmul_benchmark.py to cover SYCL * Fixed typo * /gpu:0 -> /device:GPU:0 * Fixes control_flow_ops_py_test * /gpu: -> /device:GPU: * Fixes //tensorflow/python/profiler/internal:run_metadata_test * gpu: -> GPU: * Fixes tfprof_node * [OpenCL] Fixes device path to name with many colons (#123) The device path is constructed from a device name by replacing all colons with underscores. Some device names contain more than one colon, for example 'device:SYCL:0' which gives a path 'device_SYCL_0'. The previous code would not convert this back to the original device name, but rather to 'device:SYCL_0'. An alternative fix would be to convert all underscores to colons in the device name (i.e. remove the restriction inside `replace("_", ":", 1)`), however I'm not sure if there are any device names which contain underscores. * If no gpu device available fake one * gpu: -> device:GPU * Fixes profiler test * /gpu:x -> /device:GPU:x * Fixes debug_io_utils_test.cc test * Fixes device_name_utils_test.cc
--- Commit 35e7a3665 authored by Yong Tang<yong.tang.github@outlook.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: Remove unneeded casting of int64 for reverse_sequence (#12192) This fix removes an unneeded cast of int64 for reverse_sequence: ``` lengths = math_ops.to_int64(lengths) ``` as int32 has already been enabled for reverse_sequence. Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
--- Commit 9fba8c185 authored by Anna R<annarev@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add benchmark dashboard link to benchmarks doc. Also, I added a link and description for Benchmarks page to Community index page. PiperOrigin-RevId: 164924906
--- Commit bb6f32fa7 authored by Mark Heffernan<meheff@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Make HloAliasAnalysis updatable after changes to the HLO graph. As part of this change make HloAliasAnalysis a thinner layer which basically only holds a map from HloValue to HloBuffer and vice versa. PiperOrigin-RevId: 164923041
--- Commit 9103096c1 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by Thomas Köppe<tkoeppe@google.com>: Merged commit includes the following changes: 164923041 by meheff: Make HloAliasAnalysis updatable after changes to the HLO graph. As part of this change make HloAliasAnalysis a thinner layer which basically only holds a map from HloValue to HloBuffer and vice versa. -- PiperOrigin-RevId: 164923041
--- Commit 822603aed authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Merging sibling fusion instruction using multi_output_fusion PiperOrigin-RevId: 164920220
--- Commit c035aa2a8 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 164917891
--- Commit e1e81d9ba authored by Luke Iwanski<luke@codeplay.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: [OpenCL] Fixes double memcpy bug (#151) (#12173) * [OpenCL] Fixes double memcpy bug (#151) As the debug CopyOp is called on a Tensor without type, we need to use the DataType enum to get type information, and use this to pass the type on to Eigen. This is a workaround for Eigen's need to have a type when calling memcpy. If the Eigen memcpy can be provided without a type requirement, then the memcpy in sycl_util is unnecessary. * Acts on feedback from: #12173/files/32cb12a9001b672425867b5a3110fd98e737a20b#r132496277
--- Commit d9ca2d86d authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Internal change PiperOrigin-RevId: 164916465
--- Commit b8d13d218 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove more parts of DCASGD missed in the first pass. (47949b) PiperOrigin-RevId: 164914552
--- Commit 73b3d52c7 authored by Alexandre Passos<apassos@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: cmake fix PiperOrigin-RevId: 164911656
--- Commit 2173b5b0a authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Allow TFE_TensorHandleCopyToDevice to have the same device as src and destination. It will reuse the same underlying buffer in those cases. PiperOrigin-RevId: 164909906
--- Commit 13eb3b90e authored by Alexandre Passos<apassos@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Experimental C and Python APIs to invoke TensorFlow kernels on concrete values. PiperOrigin-RevId: 164902588
--- Commit 7dfabcc01 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Initialize ExecutionOptions in ComputeConstant to default values. PiperOrigin-RevId: 164894867
--- Commit c8897e9bc authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Static required time computation PiperOrigin-RevId: 164894645
--- Commit 076158f9b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Enable implicit->explicit conversion by default. PiperOrigin-RevId: 164890915
--- Commit 58c4a4cb1 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Bugfix: number of input channels is not necessarily in the last dimension, after introduction of data_format param. PiperOrigin-RevId: 164889729
--- Commit 8f9b1af8a authored by Igor Saprykin<isaprykin@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Recover MonitoredSession when the Coordinator is requested to stop with one of the _PREEMPTION_ERRORS. When SyncReplicasOptimizer is used, a preemption in the Coordinator may result in two cases: Case 1) the session gets silently marked as complete; Case 2) the session gets stuck. This CL aims to solve and verify solutions for both of these problems. Fix 1 changes the should_stop logic. Fix 2 changes the CoordinatedSession.run() logic. SyncReplicasOptimizer runs a separate set of threads using a Coordinator instance. Those threads do FIFOQueue.enqueue; the main thread does a blocking FIFOQueue.dequeue. The `sync_token_q` FIFOQueue is on parameter-servers. When one of the PS instances gets preempted, an AbortedError causes the Coordinator to stop via request_stop(ex). That by itself changes the state of MonitoredSession.should_stop() to True (Fix 1). Results of the blocking Dequeue operation are sent to the chief worker via Recv. What happens next depends on the amount of tokens in `sync_token_q`. If there are enough for the next call to Dequeue to return, then the low-level "tf session run() call" returns. The next iteration of the `while not MonitoredSession.should_stop()` loop decides that the training is complete (Case 1). If there are not enough tokens in `sync_token_q`, then the blocking Dequeue is going to keep waiting for them. This results in the graph execution getting stuck and the whole session getting garbage collected after 10 minutes (Case 2). We decided to fix that by re-creating a session after it gets garbage collected (Fix 2). An alternative was to try to cancel the pending Dequeue operation, but it's not clear that it is the right thing to do and it is also not easy. PiperOrigin-RevId: 164888390
--- Commit 46e4de6e5 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Undo loop fusion changes for now as they seem to be altering a few results. END_PUBLIC RELNOTES: n/a BEGIN_PUBLIC BEGIN_PUBLIC Automated g4 rollback of changelist 164825735 PiperOrigin-RevId: 165340331
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
851c11e3cc8857b3938b8babf304005340a6a5f4 |
|
08-Aug-2017 |
Jingyue Wu <jingyue@google.com> |
is_gpu_available checks minimum compute capability. Add "compute capability: X.Y" to the short device description. This CL doesn't break backward compatibility because min_cuda_compute_capability is an optional argument. PiperOrigin-RevId: 164534861
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
ca200aeafa834f51ec4c0437c291acdfa08baec5 |
|
07-Aug-2017 |
Toby Boyd <tobyboyd@google.com> |
Reduce GPU info messages PiperOrigin-RevId: 164509907
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
5e6cdbd68aa6b94883738b782eed1b59193850bf |
|
04-Aug-2017 |
Toby Boyd <tobyboyd@google.com> |
Reduce devices not peered info logging PiperOrigin-RevId: 164193299
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
8d31d4d909ef854e53d2cce5ff201e2bc6bc87a2 |
|
08-Jul-2017 |
Eugene Brevdo <ebrevdo@google.com> |
Double MinSystemMemory for non-opt builds. This gives more complicated kernels (e.g., CUB reduces) more GPU memory in which to launch. Prior to this, calling WhereOp with the BFC allocator in non-opt mode led to an "out of memory" error when launching the kernel. PiperOrigin-RevId: 161255358
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
996605b0e4ef96e6732f7496abf44b6e5e1eb504 |
|
07-Jul-2017 |
Skye Wanderman-Milne <skyewm@google.com> |
PiperOrigin-RevId: 161240586
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
9cf44446550b6d2c3141074013509875649b0fd5 |
|
06-Jul-2017 |
Eugene Brevdo <ebrevdo@google.com> |
Bugfixes for GPU WhereOp. 1. Set the cuda context properly within ComputeAsync. Also set the cuda context properly in the WhereOp GPU callback. 2. Ensure report_uninitialized_variables runs on CPU. This avoids intermediate copying of data to GPU after getting the variables' state and before returning it. PiperOrigin-RevId: 161092040
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
6ada43366663210beb0159b8c1a67b26ebfe6cb7 |
|
23-Jun-2017 |
Geoffrey Irving <geoffreyi@google.com> |
Prepare to not include node_def.proto.h in node_def_util.h The goal is to make kernels mostly independent of proto headers, which will let us lock down our .so imports. This CL makes a bunch of .cc files either include node_def.proto.h themselves or not need the definition of NodeDef; a second CL will make node_def_util.h not include node_def.proto.h. RELNOTES: n/a PiperOrigin-RevId: 159982117
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
5c6015f89c3d500bb6d5f4145572aaaab3432bb1 |
|
13-Jun-2017 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Allocate 25 MiB less on GPUs with <2GB RAM, to avoid running out of memory when launching kernels. PiperOrigin-RevId: 158889089
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
49476a62cb06bab2ff78e8295707016c6a12d728 |
|
30-May-2017 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Remove unused namespace aliases PiperOrigin-RevId: 157468609
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
cfbc9d26d6f75ba70ac1e566774b2b2d487bef6e |
|
27-May-2017 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Annotate overriding functions with "override" or "final" (and not with "virtual") PiperOrigin-RevId: 157284709
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
fccaac3d1bf1391756fae67f1979afe598d10ed1 |
|
26-May-2017 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Force GPU device objects that refer to the same physical card using the same stream id to use the same cuda stream objects. This avoids confusing the per-device memory allocator in ways that cause memory corruption. Fixes https://github.com/tensorflow/serving/issues/335. PiperOrigin-RevId: 157258318
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
fdb4eba5b1cd0f2a2b10f83042a7e0eec1a41548 |
|
06-May-2017 |
A. Unique TensorFlower <gardener@tensorflow.org> |
- fixing comments to reflect reality Change: 155256914
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
f28935a7d280b6ba75fe93fe35783d87b9cc2ec9 |
|
05-May-2017 |
Brennan Saeta <saeta@google.com> |
Implement ClusterSpec Propagation in TF Master ClusterSpec propagation is a capability upgrade for TensorFlow that should make it much easier to (1) build distributed TensorFlow clusters, and (2) handle node failures. The ClusterSpec propagation capability allows TensorFlow workers to be booted independently of each other, and with no knowledge about others. The client can then construct a ClusterDef (ClusterSpec), and then send it to the TF master at session creation. The master in turn then propagates the ClusterDef along to all of the workers. Change: 155159972
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
0a9b39caefd437fec742ae48b25061abd6e2699b |
|
29-Apr-2017 |
Vijay Vasudevan <vrv@google.com> |
When allocating GPU constants, check to see if the destination tensor is initialized early (because we ran out of memory) and report it as such. Fixes #7025. Change: 154603030
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
698a961a5309230a87c3c02ef534b1756be32af3 |
|
17-Apr-2017 |
Benoit Steiner <bsteiner@google.com> |
Error out when running out of GPU memory instead of triggering a fatal error. Change: 153357257
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
d146e67e20f1288cf7ea4441eb3c3301cf7fad43 |
|
14-Apr-2017 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Adding tracing of TensorFlow operations executed on the CPU. Such tracing is disabled by default. Change: 153128900
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
a42b3fc598cd028308ee9ae9dc9ed3034a04a0bf |
|
05-Apr-2017 |
Suharsh Sivakumar <suharshs@google.com> |
Remove warning that was only happening with config cuda. Change: 152189205
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
f3405c2d73196e409041d52bbf30748b2a64493b |
|
22-Feb-2017 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Change nccl_manager to use ncclCommInitAll. Change: 148169806
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
55d1978d73ece0f72347f27bba134b5975f4d0c8 |
|
20-Feb-2017 |
Jeffrey A. Dean <jeff@google.com> |
A couple of performance-oriented changes:
(1) GraphView now allocates a single flat array of bytes to hold all the NodeItem data for a graph, and can therefore use 32-bit offsets into this array to hold the mapping from node id to its associated NodeItem (rather than 64-bit NodeItem* pointers). This halves the footprint of this data structure, which is touched several times on every node execution.
(2) For BaseGPUDevice::Compute, we were touching op_kernel->name() and op_kernel->type_string() on every kernel execution unconditionally, even when tracing via port::Tracing::ScopedAnnotations was off (the common case). This caused extra cache lines to be touched to access these fields in op_kernel. Instead, added a new port::Tracing::ScopedAnnotation::Enabled() call and used this so that we now have two paths through the BaseGPUDevice::Compute code, with the common-case faster path avoiding touching op_kernel->name() and op_kernel->type_string().
Speeds up an InceptionV3 model on my desktop GPU card by approximately 1% (75.83 images/sec to 76.533 images/sec). Change: 147979592
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
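Change (1) above can be sketched roughly as follows. The `NodeItem` fields and the `FlatGraphView` name are hypothetical simplifications; the point is the single flat byte buffer plus a 32-bit offset table in place of 64-bit pointers:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, much-simplified NodeItem; the real one carries far more state.
struct NodeItem {
  int node_id;
  int num_inputs;
};

// One flat allocation holds every NodeItem; a node-id -> 32-bit byte-offset
// table replaces the 64-bit NodeItem* mapping, halving that structure.
class FlatGraphView {
 public:
  explicit FlatGraphView(const std::vector<NodeItem>& items)
      : space_(items.size() * sizeof(NodeItem)) {
    offsets_.reserve(items.size());
    for (size_t i = 0; i < items.size(); ++i) {
      const uint32_t off = static_cast<uint32_t>(i * sizeof(NodeItem));
      offsets_.push_back(off);  // 4 bytes per node instead of 8
      std::memcpy(space_.data() + off, &items[i], sizeof(NodeItem));
    }
  }

  NodeItem node(int id) const {
    NodeItem item;
    std::memcpy(&item, space_.data() + offsets_[id], sizeof(NodeItem));
    return item;
  }

 private:
  std::vector<unsigned char> space_;  // single flat array of bytes
  std::vector<uint32_t> offsets_;     // node id -> byte offset into space_
};
```

Using `memcpy` into and out of the byte buffer sidesteps alignment and aliasing questions that a `reinterpret_cast` would raise in a sketch like this.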
|
a2f9e9961051ace596b2d5139cc6b3ee6e4ff700 |
|
12-Jan-2017 |
David G. Andersen <dga@google.com> |
Automated rollback of change 144344623 Change: 144351555
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
1f409de8c85ad1b47a50daf87089758df3989fd1 |
|
12-Jan-2017 |
David G. Andersen <dga@google.com> |
Fail earlier and more clearly if dtype in parsed proto is invalid. Change: 144344623
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
945331a503b1e39312fb7c3c1336276154b73aa6 |
|
06-Dec-2016 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Change non-NUMA OS warning from LOG(ERROR) to LOG(INFO). Change: 141230079
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
cfccd7ce1b9092eec98bb0989eb55a11e9d2b894 |
|
04-Nov-2016 |
Vijay Vasudevan <vrv@google.com> |
GPUDevice: if enabling peer access fails, log a warning but do not return an error. On some systems and GPUs, the driver may report being able to enable peer access between two devices, but trying to do so still fails. The system can still run, though possibly slower than if peer access were enabled. Since we cannot disambiguate between supported and unsupported cases in which this happens, we demote this to a warning, with the exception being that if *no* device could enable peering even though it should be possible, we still return an error. Fixes #5362, hopefully. Change: 138141024
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
a95f95a60b20fb48fbfcae8da4afaa9412582746 |
|
02-Nov-2016 |
Gunhan Gulsoy <gunan@google.com> |
Remove references to gcudacc. Change: 137888607
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
e2d51a87f0727f8537b46048d8241aeebb6e48d6 |
|
28-Oct-2016 |
Xiaoqiang Zheng <zhengxq@google.com> |
Merge changes from github. Change: 137532946
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
79228c74e64a639aeb5692b442522d4aa279f885 |
|
20-Oct-2016 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Replace enum BusAdjacency with protobuf DeviceLocality for describing the topological neighborhood of a device. Change: 136663586
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
3e8c4fd7403659ec32b9fec90a78831043aa0786 |
|
31-Aug-2016 |
Vijay Vasudevan <vrv@google.com> |
Don't establish contexts on gpus not on visible_device_list. Moves all the initialization code out of gpu_init.h and into gpu_device.cc, because we want the code that establishes peer mappings between GPUs to reside close to where the device selection order is made. Now the initialization code does nothing but call the StreamExecutor platform initialization. This also checks that there are no duplicate entries in the visible_device_list. I tested this by running the following program:
import time
import tensorflow as tf
c = tf.ConfigProto()
c.gpu_options.visible_device_list="1"
s = tf.Session(config=c)
time.sleep(5)
# nvidia-smi showed the context was established on device 1 but NOT 0
del s
c.gpu_options.visible_device_list="1,0"
s = tf.Session(config=c)
time.sleep(30)
# nvidia-smi showed the context was established on both device 0 and 1,
# and the logs showed that the device ordering was 1->/gpu:0 and 0->/gpu:1,
# as well as the fact that it tried to establish the peer mapping.
del s
c.gpu_options.visible_device_list="1,0,1"
s = tf.Session(config=c)  # failed
Fixes #1888 Change: 131785661
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
b0bdff4827f867a67f572ed99d85f9a847788326 |
|
26-Aug-2016 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Merge changes from github. Change: 131437429
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
8faa6f04f341a749e56f6475404c82cc704d08e2 |
|
26-Aug-2016 |
Vijay Vasudevan <vrv@google.com> |
Automated rollback of change 131356339 Change: 131410487
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
10d6f0eed8f4704589d036bcc09ba000c5c91e8a |
|
26-Aug-2016 |
Vijay Vasudevan <vrv@google.com> |
Automated rollback of change 131340536 Change: 131356339
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
fa042c7063b922f8048630de7084b2a28e59827c |
|
26-Aug-2016 |
Vijay Vasudevan <vrv@google.com> |
Add a 'visible to virtual' GPU device remapping to ConfigProto. Allows one to remap the visible GPUs to virtual GPUs on a per-session basis. For example, if the visible devices are 0,1,2,3,4,5,6,8, setting visible_device_list='5,3' means that visible device 5 gets mapped to /gpu:0 and visible device 3 maps to /gpu:1. Tested manually on my local machine: our tests are single GPU only so there's no good way to test this ongoing. Fixes #1888. Change: 131340536
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
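The visible-to-virtual remapping described in the commit above can be sketched outside TensorFlow. This is a minimal illustration with a hypothetical helper name, not the actual C++ implementation in gpu_device.cc; the duplicate-entry check shown here landed in a follow-up commit.

```python
def map_visible_to_virtual(visible_device_list):
    """Sketch of ConfigProto.gpu_options.visible_device_list semantics.

    "5,3" means visible GPU 5 -> /gpu:0 and visible GPU 3 -> /gpu:1.
    Hypothetical helper name; the real logic lives in gpu_device.cc.
    """
    visible_ids = [int(tok) for tok in visible_device_list.split(",") if tok]
    if len(set(visible_ids)) != len(visible_ids):
        # Duplicate entries are rejected (check added in a later commit).
        raise ValueError("visible_device_list contains duplicate entries")
    # Virtual id N (i.e. /gpu:N) maps to the N-th listed visible device.
    return {virtual: visible for virtual, visible in enumerate(visible_ids)}
```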
|
2c598e874e6a7b6b3185846ce9bac97a7d5d0169 |
|
25-Aug-2016 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Merge changes from github. Change: 131310818
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
d0789e85a24af6f608a7bcc2c7928028cc0ff8a6 |
|
11-Aug-2016 |
Vijay Vasudevan <vrv@google.com> |
Change GPU initialization to avoid crashing on errors (as much as possible). Change: 129926913
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
2d0d126749d6d0cf82fb86691362c923a1bfbfe4 |
|
09-Aug-2016 |
Vijay Vasudevan <vrv@google.com> |
Change DeviceFactory functions that create devices to propagate Statuses, so that failures to initialize devices don't crash the program. Changes swig for device_lib to be a lot simpler, thanks to mrry@ and keveman@'s help. Change allocation of eigen scratch memory to go through the allocator. Re-enable test for local devices now that python3 issue is fixed. Change: 129678132
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
b8ccc5a20d0e6fc04ef9a854bb391c69f99ca907 |
|
18-Jul-2016 |
A. Unique TensorFlower <gardener@tensorflow.org> |
Reduce overhead of running kernels: - Change ExecutorState::Entry to construct the Tensor val late, avoiding default construction and moving the tensor in favor of calling the constructor directly. - Change DeviceContextMap to be vector, and make the check cheaper when no nodes are registered. - Cache in Params whether the device requires registering tensor accesses. Change: 127714854
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
404b6faa3846c75e22daf12d7a979d942577a657 |
|
09-Jun-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Remove unused MultiOpActivation from stream executor. Change: 124421014
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
8f10920915b8442fcd8ea43d1fd7a595b97ebf46 |
|
08-Jun-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Change a couple of error messages related to getting the NUMA affinity of a GPU to suggest that the kernel may be lacking NUMA support. Change: 124389318
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
03dba169c65953266713800fd05d6d151194d4f7 |
|
08-Jun-2016 |
Benoit Steiner <benoit.steiner.goog@gmail.com> |
Improved the performance of full reductions on GPU.

NEW
BM_fullReduction/10       4591     4595   153149   20.8M items/s
BM_fullReduction/64       5073     5075   100000  770.0M items/s
BM_fullReduction/512      9067     9070    75263   26.9G items/s
BM_fullReduction/4k     243984   244125     2868   64.0G items/s
BM_fullReduction/5k     359125   359273     1951   64.8G items/s

OLD
BM_fullReduction/10       9085     9087    74395   10.5M items/s
BM_fullReduction/64       9478     9478    72014  412.1M items/s
BM_fullReduction/512     14643    14646    46902   16.7G items/s
BM_fullReduction/4k     260338   260384     2678   60.0G items/s
BM_fullReduction/5k     385076   385178     1818   60.5G items/s

Change: 124290852
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
61f89ece63493f603d4f55725aba4ef4fb0dd6dd |
|
07-Jun-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
The minimum number of multiprocessors in a GPU is now set to the maximum number found in the visible GPUs or 8, whichever is smaller. GPUs with a smaller number of multiprocessors are ignored. This check is now performed regardless of how many GPUs there are. Change: 124256972
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
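The multiprocessor threshold described in the entry above can be sketched as follows. Function and parameter names are illustrative assumptions, not the actual gpu_device.cc code: the threshold is the smaller of 8 and the maximum multiprocessor count found among the visible GPUs, and GPUs below it are ignored.

```python
def visible_gpu_ids(multiprocessor_counts, default_min=8):
    """Sketch of the minimum-multiprocessor check (hypothetical names).

    multiprocessor_counts[i] is the multiprocessor count of visible GPU i.
    Returns the ids of GPUs that pass the threshold
    min(max(counts), default_min).
    """
    if not multiprocessor_counts:
        return []
    threshold = min(max(multiprocessor_counts), default_min)
    return [gpu_id for gpu_id, count in enumerate(multiprocessor_counts)
            if count >= threshold]
```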
|
c8b59c046895fa5b6d79f73e0b5817330fcfbfc1 |
|
02-Jun-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Update copyright for 3p/tf/core. Change: 123900938
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
acc40ff24b167477944c1aa2ee82c0eef39a0138 |
|
02-Jun-2016 |
Xiaoqiang Zheng <zhengxq@google.com> |
Add TF_EXTRA_CUDA_CAPABILITIES to support extra CUDA compute capabilities. The list of extra CUDA capabilities must be passed as a comma-separated sequence. With bazel, the build argument to include "3.0" is: --copt=-DTF_EXTRA_CUDA_CAPABILITIES=3.0 Change: 123810335
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
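Parsing a comma-separated capability list like the one TF_EXTRA_CUDA_CAPABILITIES accepts can be sketched as below. This is an illustrative helper, not the actual build-configuration code, and the function name is an assumption.

```python
def parse_extra_cuda_capabilities(spec):
    """Sketch of parsing a capability list such as "3.0" or "3.0,5.2".

    Returns (major, minor) pairs; illustrative only, not the build code.
    """
    capabilities = []
    for token in spec.split(","):
        major, minor = token.strip().split(".")
        capabilities.append((int(major), int(minor)))
    return capabilities
```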
|
892ca4ddc12852a7b4633fd08f163941356cb4e6 |
|
23-May-2016 |
Derek Murray <mrry@google.com> |
Merge changes from github. Change: 123026122
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
df7276a15c5f9b9d1bca6794159e049103c0e2be |
|
12-May-2016 |
Benoit Steiner <benoit.steiner.goog@gmail.com> |
Upgraded to the latest version of Eigen, which speeds up full reductions on fp16 by about 3 orders of magnitude, as well as some partial reductions by 30%, when using CUDA 7.5 or above. Change: 122191448
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
885cc6bf55745142b8ecc578c61c5f03ff45e6ce |
|
10-May-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Where GPUStreamExecutor fails to find a GPU NUMA node and returns -1, log an error message, then reset the value to 0 where it is used in GPUDevice. Getting the NUMA node correct is only necessary for multi-socket/bus architectures, which are probably in the minority among the TensorFlow user community. This fix will unblock users for whom the StreamExecutor cannot read the NUMA data correctly, while still providing an error message warning of the problem for users who might be affected. Change: 121970499
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
049040a6dc710b54a566085fc8d2f6608dbbd6a2 |
|
10-May-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Adds the StreamExecutor as a protected member of BaseGPUDevice. Change: 121946595
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
e9db74626ea9e46a93eb0c15c37a2e138c83fade |
|
22-Apr-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Replace dynamic_cast with static_cast in ReinitializeDevice. Change: 120542152
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
ab6ffc92992f12456d2378f872be17f0ed274083 |
|
12-Mar-2016 |
Vijay Vasudevan <vrv@google.com> |
TensorFlow: allow growth in the GPU BFC allocator. This allows an option to start the BFC allocator small and grow it over time as needed. This can lead to increased fragmentation, but the benefit is that only as much memory as "needed" is reserved. This option defaults to off, but can be turned on by passing an option to the first Session. This is done by adding one more layer of indirection when mapping a ChunkHandle to a pointer, by introducing the concept of AllocationRegions, which are contiguous memory regions that mimic the previous implementation in their indexing (constant-time indexing within an AllocationRegion). The drawback is that we must introduce one more lookup to find out which allocation region a pointer is a part of. This implementation uses a sorted vector and upper_bound to do a binary search based on end_ptr. Its impact is relatively low based on the microbenchmarks below, and if it were a cause for later concern, we can try to map the 'page tables' of multiple regions into one very large AllocationRegion, and hope that there are no holes between address spaces so that the ChunkHandle map is not too large for memory. That being said, this change appears not to slow down the ptb_word_lm benchmark, which was the initial impetus for most of the recent changes to this class, so this appears safe. Microbenchmarks I ran showed no real difference, even when there were multiple regions, and the ptb_word_lm benchmark also didn't change. 
The following numbers bear this out: At HEAD: (consumes 5.8GiB on my Titan Black) Epoch: 1 Learning rate: 1.000 0.004 perplexity: 6119.287 speed: 679 wps 0.104 perplexity: 849.526 speed: 5743 wps 0.204 perplexity: 629.677 speed: 6935 wps 0.304 perplexity: 509.189 speed: 7461 wps 0.404 perplexity: 438.585 speed: 7760 wps 0.504 perplexity: 392.459 speed: 7953 wps 0.604 perplexity: 352.998 speed: 8081 wps 0.703 perplexity: 325.909 speed: 8182 wps 0.803 perplexity: 304.531 speed: 8261 wps 0.903 perplexity: 284.988 speed: 8322 wps Epoch: 1 Train Perplexity: 270.398 Epoch: 1 Valid Perplexity: 178.860 Epoch: 2 Learning rate: 1.000 0.004 perplexity: 212.458 speed: 8836 wps 0.104 perplexity: 151.131 speed: 9039 wps 0.204 perplexity: 158.768 speed: 8950 wps 0.304 perplexity: 153.650 speed: 8925 wps 0.404 perplexity: 150.586 speed: 8910 wps 0.504 perplexity: 148.136 speed: 8817 wps 0.604 perplexity: 143.511 speed: 8778 wps 0.703 perplexity: 141.382 speed: 8773 wps 0.803 perplexity: 139.401 speed: 8775 wps 0.903 perplexity: 135.706 speed: 8777 wps Epoch: 2 Train Perplexity: 133.618 Epoch: 2 Valid Perplexity: 143.462 Epoch: 3 Learning rate: 1.000 0.004 perplexity: 146.292 speed: 8947 wps 0.104 perplexity: 104.901 speed: 9325 wps 0.204 perplexity: 114.335 speed: 9108 wps 0.304 perplexity: 111.434 speed: 9046 wps 0.404 perplexity: 110.328 speed: 9014 wps 0.504 perplexity: 109.455 speed: 8995 wps 0.604 perplexity: 106.877 speed: 8984 wps 0.703 perplexity: 106.158 speed: 8978 wps 0.803 perplexity: 105.532 speed: 8966 wps 0.903 perplexity: 103.284 speed: 8965 wps Epoch: 3 Train Perplexity: 102.326 Epoch: 3 Valid Perplexity: 132.332 Epoch: 4 Learning rate: 1.000 0.004 perplexity: 116.748 speed: 8990 wps 0.104 perplexity: 85.032 speed: 9172 wps 0.204 perplexity: 93.827 speed: 9051 wps 0.304 perplexity: 91.716 speed: 9010 wps 0.404 perplexity: 91.088 speed: 8966 wps 0.504 perplexity: 90.654 speed: 8955 wps 0.604 perplexity: 88.841 speed: 8952 wps 0.703 perplexity: 88.550 speed: 8943 
wps 0.803 perplexity: 88.268 speed: 8932 wps 0.903 perplexity: 86.610 speed: 8924 wps Epoch: 4 Train Perplexity: 86.030 Epoch: 4 Valid Perplexity: 127.415 Epoch: 5 Learning rate: 1.000 0.004 perplexity: 98.907 speed: 8952 wps 0.104 perplexity: 73.707 speed: 9238 wps 0.204 perplexity: 81.525 speed: 9112 wps 0.304 perplexity: 79.768 speed: 9074 wps 0.404 perplexity: 79.366 speed: 9060 wps 0.504 perplexity: 79.199 speed: 9039 wps 0.604 perplexity: 77.728 speed: 9037 wps 0.703 perplexity: 77.630 speed: 9037 wps 0.803 perplexity: 77.596 speed: 9033 wps 0.903 perplexity: 76.270 speed: 9005 wps Epoch: 5 Train Perplexity: 75.907 Epoch: 5 Valid Perplexity: 126.183 Epoch: 6 Learning rate: 0.500 0.004 perplexity: 88.458 speed: 8816 wps 0.104 perplexity: 64.231 speed: 9143 wps 0.204 perplexity: 69.896 speed: 9050 wps 0.304 perplexity: 67.342 speed: 9016 wps 0.404 perplexity: 66.162 speed: 8989 wps 0.504 perplexity: 65.290 speed: 8952 wps 0.604 perplexity: 63.331 speed: 8945 wps 0.703 perplexity: 62.617 speed: 8942 wps 0.803 perplexity: 61.883 speed: 8943 wps 0.903 perplexity: 60.149 speed: 8934 wps Epoch: 6 Train Perplexity: 59.222 Epoch: 6 Valid Perplexity: 119.635 Epoch: 7 Learning rate: 0.250 0.004 perplexity: 73.009 speed: 8941 wps 0.104 perplexity: 53.369 speed: 9241 wps 0.204 perplexity: 58.193 speed: 9115 wps 0.304 perplexity: 55.957 speed: 9091 wps 0.404 perplexity: 54.885 speed: 9073 wps 0.504 perplexity: 54.052 speed: 9059 wps 0.604 perplexity: 52.298 speed: 9053 wps 0.703 perplexity: 51.598 speed: 9036 wps 0.803 perplexity: 50.858 speed: 9024 wps With this change: (Consumes 700MiB on my TitanBlack) Epoch: 1 Learning rate: 1.000 0.004 perplexity: 6220.805 speed: 649 wps 0.104 perplexity: 847.498 speed: 5631 wps 0.204 perplexity: 628.919 speed: 6853 wps 0.304 perplexity: 506.395 speed: 7391 wps 0.404 perplexity: 435.559 speed: 7675 wps 0.504 perplexity: 389.903 speed: 7883 wps 0.604 perplexity: 351.013 speed: 8033 wps 0.703 perplexity: 324.474 speed: 8144 wps 0.803 
perplexity: 303.551 speed: 8230 wps 0.903 perplexity: 284.267 speed: 8300 wps Epoch: 1 Train Perplexity: 269.826 Epoch: 1 Valid Perplexity: 178.575 Epoch: 2 Learning rate: 1.000 0.004 perplexity: 214.660 speed: 8880 wps 0.104 perplexity: 152.258 speed: 9222 wps 0.204 perplexity: 159.331 speed: 9072 wps 0.304 perplexity: 154.358 speed: 9036 wps 0.404 perplexity: 151.455 speed: 9019 wps 0.504 perplexity: 148.906 speed: 9008 wps 0.604 perplexity: 144.203 speed: 8990 wps 0.703 perplexity: 142.134 speed: 8979 wps 0.803 perplexity: 140.096 speed: 8971 wps 0.903 perplexity: 136.424 speed: 8968 wps Epoch: 2 Train Perplexity: 134.372 Epoch: 2 Valid Perplexity: 144.896 Epoch: 3 Learning rate: 1.000 0.004 perplexity: 146.571 speed: 9008 wps 0.104 perplexity: 105.991 speed: 9277 wps 0.204 perplexity: 114.965 speed: 9151 wps 0.304 perplexity: 112.041 speed: 9101 wps 0.404 perplexity: 110.948 speed: 9057 wps 0.504 perplexity: 110.141 speed: 9050 wps 0.604 perplexity: 107.539 speed: 9043 wps 0.703 perplexity: 106.877 speed: 9040 wps 0.803 perplexity: 106.181 speed: 9040 wps 0.903 perplexity: 103.940 speed: 9025 wps Epoch: 3 Train Perplexity: 103.023 Epoch: 3 Valid Perplexity: 132.966 Epoch: 4 Learning rate: 1.000 0.004 perplexity: 117.296 speed: 8990 wps 0.104 perplexity: 85.532 speed: 9764 wps 0.204 perplexity: 94.076 speed: 9784 wps 0.304 perplexity: 91.875 speed: 9773 wps 0.404 perplexity: 91.423 speed: 9689 wps 0.504 perplexity: 91.090 speed: 9546 wps 0.604 perplexity: 89.244 speed: 9460 wps 0.703 perplexity: 89.004 speed: 9399 wps 0.803 perplexity: 88.732 speed: 9352 wps 0.903 perplexity: 87.097 speed: 9312 wps Epoch: 4 Train Perplexity: 86.571 Epoch: 4 Valid Perplexity: 128.440 Epoch: 5 Learning rate: 1.000 0.004 perplexity: 100.152 speed: 8973 wps 0.104 perplexity: 74.050 speed: 9271 wps 0.204 perplexity: 81.658 speed: 9157 wps 0.304 perplexity: 79.822 speed: 9115 wps 0.404 perplexity: 79.594 speed: 9061 wps 0.504 perplexity: 79.486 speed: 9020 wps 0.604 perplexity: 78.066 
speed: 8990 wps 0.703 perplexity: 78.046 speed: 8974 wps 0.803 perplexity: 77.968 speed: 8963 wps 0.903 perplexity: 76.702 speed: 8946 wps Epoch: 5 Train Perplexity: 76.386 Epoch: 5 Valid Perplexity: 127.245 Change: 117032081
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
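The region lookup described in the entry above (a sorted vector plus upper_bound binary search on end_ptr) can be sketched in Python with the bisect module. Class and method names are hypothetical; this mirrors the idea, not the C++ BFC allocator code.

```python
import bisect

class RegionIndex:
    """Sketch of AllocationRegion lookup by pointer (hypothetical names).

    Regions are kept sorted by end address; a pointer resolves to its
    region with one binary search on end_ptr, the extra lookup the
    commit message mentions.
    """

    def __init__(self):
        self._ends = []     # sorted region end addresses (exclusive)
        self._regions = []  # (start, end) tuples, parallel to _ends

    def add_region(self, start, end):
        i = bisect.bisect_left(self._ends, end)
        self._ends.insert(i, end)
        self._regions.insert(i, (start, end))

    def find(self, ptr):
        # First region whose end is > ptr; verify it actually contains ptr.
        i = bisect.bisect_right(self._ends, ptr)
        if i < len(self._regions):
            start, end = self._regions[i]
            if start <= ptr < end:
                return (start, end)
        return None
```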
|
3b55e1f4f4be8fd4a6a5084edf9daf01e0990c3c |
|
12-Mar-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Change safe_strto32 and safe_strto64 to accept StringPiece. Updates callers to pass the StringPiece values. Change: 117027762
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
ec1403e7dc2b919531e527d36d28659f60621c9e |
|
02-Mar-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Add optional comprehensive logging of memory allocation/deallocation events. When enabled, the following events are recorded: The start of a step, with the numerical step_id and a textual handle describing the step. A Tensor allocation, including the step_id, the name of the OpKernel, the data type, shape, allocation size, allocation_id, data pointer location, and allocator used (the allocation_id is local to an allocator). A Tensor deallocation, including the allocation_id and allocator used. A raw memory allocation, including the step_id, the name of the component (e.g. Eigen), the number of bytes, data pointer location, allocation_id and allocator used. A raw memory deallocation, including the step_id, the name of the component (e.g. Eigen), allocation_id and allocator used. For now many Tensor allocations show 'unknown' for the kernel and step_id. These mostly come from Tensors allocated by the system from protocol buffers, and Tensors allocated by Ops using the Tensor constructor directly instead of calling OpKernelContext::allocate_temp. The latter can in principle be cleaned up one by one as necessary. The former would require some plumbing to associate an allocation with the appropriate step_id. With this CL memory logging is enabled by raising the VLOG level to 1. Once there is an ability to set process-wide options programmatically it would make sense to update the machinery to do that. Currently recorded events are logged as INFO, and they can all be retrieved by filtering the log for lines including __LOG_MEMORY__. 
Some example lines are as follows: I0301 13:38:55.797563 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: -6 kernel_name: "Unknown (from Proto)" tensor { dtype: DT_FLOAT shape { } allocation_description { requested_bytes: 4 allocated_bytes: 4 allocator_name: "cuda_host" allocation_id: 2 has_single_reference: true ptr: 8717861408 } } } I0301 13:38:55.802245 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: -6 kernel_name: "Unknown" tensor { dtype: DT_FLOAT shape { } allocation_description { requested_bytes: 4 allocated_bytes: 256 allocator_name: "gpu_bfc" allocation_id: 1 has_single_reference: true ptr: 47378989056 } } } I0301 13:38:55.802347 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocation_id: 2 allocator_name: "cuda_host" } [...] I0301 13:38:55.806454 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogStep { step_id: 1 handle: "->/init;0" } I0301 13:38:55.806659 81220 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 1 kernel_name: "random_normal/shape" tensor { dtype: DT_INT32 shape { dim { size: 4 } } allocation_description { requested_bytes: 16 allocated_bytes: 16 allocator_name: "cuda_host" allocation_id: 1 ptr: 8717860896 } } } [...] 
I0301 13:38:56.362898 81218 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: 1 kernel_name: "conv1/truncated_normal" tensor { dtype: DT_FLOAT shape { dim { size: 11 } dim { size: 11 } dim { size: 3 } dim { size: 96 } } allocation_description { requested_bytes: 139392 allocated_bytes: 139520 allocator_name: "gpu_bfc" allocation_id: 36 has_single_reference: true ptr: 47379030016 } } } I0301 13:38:56.362894 81217 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocation_id: 24 allocator_name: "gpu_bfc" } I0301 13:38:56.362903 81213 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 1 kernel_name: "conv5/truncated_normal/mul" tensor { dtype: DT_FLOAT shape { dim { size: 3 } dim { size: 3 } dim { size: 1024 } dim { size: 1024 } } allocation_description { requested_bytes: 37748736 allocated_bytes: 37748736 allocator_name: "gpu_bfc" allocation_id: 34 ptr: 48512711168 } } } [...] I0229 16:39:57.482980 76558 log_memory.cc:18] __LOG_MEMORY__ MemoryLogRawAllocation { step_id: 13 operation: "xentropy/EigenAllocator" num_bytes: 64 ptr: 47386857472 allocation_id: 625 allocator_name: "gpu_bfc" } I0229 16:39:57.483147 76558 log_memory.cc:18] __LOG_MEMORY__ MemoryLogRawDeallocation { step_id: 13 operation: "xentropy/EigenAllocator" allocation_id: 625 allocator_name: "gpu_bfc" deferred: true } I0229 16:39:57.483197 76558 log_memory.cc:18] __LOG_MEMORY__ MemoryLogRawDeallocation { step_id: 13 operation: "xentropy/EigenAllocator" allocation_id: 625 allocator_name: "gpu_bfc" } Change: 116065112
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
c1e0d8d469ed467f7421a5a528895feb096a4a07 |
|
12-Feb-2016 |
Benoit Steiner <benoit.steiner.goog@gmail.com> |
Avoid calling the default constructor of GpuDevice since it puts in motion a lot of expensive stream executor machinery. Moreover the stream executor initialization was done for nothing as the device always ends up being reinitialized before its first use. Change: 114488347
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
88b0cb44a468ca8c26b20f008d3414a1474a3f8e |
|
11-Feb-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Force the executor to wait until queued operations have finished executing before allowing the SINK Op to complete. This probably better matches the expected behavior, and avoids misleading output from benchmarks, as well as marginally improving the reported timing of some of our existing benchmarks, and reducing the amount of "burn-in" time needed before getting useful readings from benchmarks. A common pattern in benchmark python code is: start_time = time.time() _ = session.run(tf.group(*target)) duration = time.time() - start_time Since no tensor is fetched by session.run the call returns as soon as the SINK Op has finished its Compute method. With the current codebase this does not wait for GPU-queued operations to complete, and as a consequence steps that should execute sequentially are overlapped. By modifying benchmark_overfeat (in this case with batch size 2750) to print out timings for every step including "burn-in" we see the behavior, where between 5 and 6 steps are overlapped before backpressure kicks in (the direct session thread pool has 5 entries) and we see the steady state time stabilize: 2016-02-11 10:38:13.556584: step 1, duration = 0.005 2016-02-11 10:38:13.561401: step 2, duration = 0.005 2016-02-11 10:38:13.565904: step 3, duration = 0.004 2016-02-11 10:38:13.570100: step 4, duration = 0.004 2016-02-11 10:38:15.434168: step 5, duration = 1.864 2016-02-11 10:38:23.401597: step 6, duration = 7.967 2016-02-11 10:38:31.372718: step 7, duration = 7.971 2016-02-11 10:38:39.344853: step 8, duration = 7.972 2016-02-11 10:38:47.333486: step 9, duration = 7.989 2016-02-11 10:38:55.297039: step 10, duration = 7.963 2016-02-11 10:39:03.267716: step 11, duration = 7.971 2016-02-11 10:39:11.243412: step 12, duration = 7.976 2016-02-11 10:39:19.198777: step 13, duration = 7.955 2016-02-11 10:39:27.175440: step 14, duration = 7.977 With the change the output (same batch size) shows: 2016-02-11 10:35:24.421367: step 1, duration = 7.832 2016-02-11 
10:35:32.269099: step 2, duration = 7.848 2016-02-11 10:35:40.118505: step 3, duration = 7.849 2016-02-11 10:35:47.946552: step 4, duration = 7.828 2016-02-11 10:35:55.802172: step 5, duration = 7.856 2016-02-11 10:36:03.637670: step 6, duration = 7.835 2016-02-11 10:36:11.505289: step 7, duration = 7.868 2016-02-11 10:36:19.331914: step 8, duration = 7.827 2016-02-11 10:36:27.162643: step 9, duration = 7.831 2016-02-11 10:36:34.994089: step 10, duration = 7.831 2016-02-11 10:36:42.850624: step 11, duration = 7.856 2016-02-11 10:36:50.707724: step 12, duration = 7.857 2016-02-11 10:36:58.574334: step 13, duration = 7.867 No steps are overlapped, so the first step reports meaningful timing, and as a bonus the overall reported step time is slightly less, presumably because we are able to use all 5 CPU threads in the thread pool for a step. The maximum batch size before OOM is the same before and after the change. Change: 114458853
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
bc6f617bd636c4e371d7c03d153e9ce9b8b0e3c3 |
|
10-Feb-2016 |
Xiaoqiang Zheng <zhengxq@google.com> |
* Change gpu_count to gpu_device_enabled. * Switch to having the BaseGPUDevice constructor enable GPU support, instead of relying on the entry point. This makes it possible for TensorFlow to use pinned memory for GPU/CPU memory copies for all entry points. Change: 114364920
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
241698b6ba6cd9b13d606a9e4603baa4f33891f2 |
|
06-Feb-2016 |
Xiaoqiang Zheng <zhengxq@google.com> |
Move CPU/GPU memory copies into their own streams. Change: 114000504
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
545dee2e9897e424b641f092eec9ffd4a277f9d1 |
|
03-Feb-2016 |
Xiaoqiang Zheng <zhengxq@google.com> |
Put device-to-device GPU memory copies on a different stream. Change: 113784244
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
d821f6aeb66a93501673ac5314685bd7d58151f8 |
|
02-Feb-2016 |
Xiaoqiang Zheng <zhengxq@google.com> |
Disable tensor tracking when only one GPU stream is used. Change: 113579306
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
f8fa35b8a1910772d6d6ba7b621f905358640c2c |
|
26-Jan-2016 |
Josh Levenberg <josh11b@tensorflow.org> |
Global search & replace to move to the new location for tensorflow/core/ files and build targets. Change: 113080048
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
ff8522de343a90813fc4e5cbb249e308c1819f1d |
|
25-Jan-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Eliminate per-op allocation of gpu device wrapper. The PerOpGpuDevice is allocated once in the OpKernelContext::Params struct, then re-used every time a new OpKernelContext uses the Params. Thus in the executor, as long as there is more work to do the PerOpGpuDevice is not freed. Change: 112909215
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
b481783fe0e00a86f6feb20a8dcad5fc4fc936a4 |
|
21-Jan-2016 |
Josh Levenberg <josh11b@tensorflow.org> |
Move #include <vector> out of port.h to users of std::vector<>. After this we can replace port.h with types.h. Change: 112727463
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
c8eaac926c929e07ac8db69f67803a2223ff2d93 |
|
20-Jan-2016 |
Josh Levenberg <josh11b@tensorflow.org> |
Many tensorflow/core build clean ups. Change: 112523833
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
f592f23775e2a6ac75496829db5005d3bb70a3d2 |
|
19-Jan-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Replacing reference 'names' variable with 'example_names' variable. Change: 112481326
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
6f62e435ab6c36dfdfdef1acd580b5f278f6723c |
|
18-Jan-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Adds enough auditing to make it possible to track tensor buffers throughout an execution, and build a cost model of memory usage. There are two main components: 1) GPU allocators now assign to each allocated tensor buffer a unique ID so its use can be tracked within and across steps. 2) The checkin cleans up the tracking of usage of Tensor buffers, and makes it work for both sync and async kernels (async kernels did not previously track gpu memory correctly). Each use is now tracked by the OpKernelContext (for allocators that need this support) in a single uniquified set of TensorReferences. When the kernel finishes, the executor retrieves the list of references, logs it if needed in the nodeexecstats, then passes it to the device, which may add an additional reference to keep the memory from being reused until the execution completes. When the tensor is logged in the nodeexecstats a flag is set if there is a single remaining reference to the buffer, which means that the memory will be freed once the Op completes. Change: 112375683
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
8e388a5546f1466aa0d2afa00e5a015997a23a2b |
|
14-Jan-2016 |
Vijay Vasudevan <vrv@google.com> |
TensorFlow: Get rid of legacy command line flags use in TensorFlow. Change: 112105282
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
94538a944ed71eb8c0c22213fef245e09165f935 |
|
13-Jan-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Added OpKernel::is_internal and used it to avoid a memcmp against the OpKernel::type_string() for every node executed on a GPU. Change: 111998968
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
3ffa307e49e5b150934a71386194d7ed621e3e98 |
|
07-Jan-2016 |
Josh Levenberg <josh11b@tensorflow.org> |
#include third_party/tensorflow/core/platform/macros.h directly so we can drop it from port.h. Change: 111621646
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
1c579361cd1e088dd5e05a394b1561a73e3667ba |
|
05-Jan-2016 |
A. Unique TensorFlower <nobody@tensorflow.org> |
Added 'logging' import to control_flow_ops which is used in the file but not imported. Change: 110842260
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
f9d3e9d03c69bfac77a2fe1ad80f7c5aa517e0f0 |
|
06-Dec-2015 |
Vijay Vasudevan <vrv@google.com> |
TensorFlow: upstream latest changes to git. Change 109537918 TensorFlow pip setup: wheel >= 0.26 for python3 pip install Change 109505848 Fix distortion default value to 1.0 in fixed_unigram_candidate_sampler. This means we default to the actual provided unigram distribution, instead of to the uniform (as it is currently). Change 109470494 Bugfix in gradients calculation when the ys rely on each other. Change 109467619 Fix CIFAR-10 model to train on all the training data instead of just 80% of it. Fixes #396. Change 109467557 Replaced checkpoint file with binary GraphDef. Change 109467433 Updates to C++ tutorial section. Change 109465269 TensorFlow: update documentation for tutorials to not assume use of bazel (when possible). Change 109462916 A tutorial for image recognition to coincide with the release of the latest Inception image classification model. Change 109462342 Clear control dependencies in variable_scope.get_variable() when creating ops for the initializer. Add tests of various error conditions. Change 109461981 Various performance improvements in low-level node execution code paths. Speeds up ptb_word_lm on my desktop with a Titan X from 3638 words per second to 3751 words per second (3.1% speedup). Changes include: o Avoided many strcmp operations per node execution and extra touches of cache lines in executor.cc, by making all the various IsMerge, IsSwitch, IsSend, etc. operations instead be based on an internal enum value that is pre-computed at Node construction time, rather than doing string comparisons against node->type_string(). We were doing about 6 such comparisons per executed node. o Removed mutex_lock in executor.cc in ExecutorState::Process. The lock was not needed and the comment about the iterations array being potentially resized is not true (the iterations arrays are created with a fixed size). Checked with yuanbyu to confirm this. 
o Added new two-argument port::Tracing::ScopedAnnotation constructor that takes two StringPiece arguments, and only concatenates them lazily if tracing is enabled. Also changed the code in platform/tracing.{h,cc} so that the ScopedAnnotation constructor and the TraceMe constructor can be inlined. o In BaseGPUDevice::Compute, used the two-argument ScopedAnnotation constructor to avoid doing StrCat(opkernel->name(), ":", op_kernel->type_string()) on every node execution on a GPU. o Introduced a new TensorReference class that just holds a reference to an underlying TensorBuffer, and requires an explicit Unref(). o Changed the EventMgr interface to take a vector of TensorReference objects for EventMgr::ThenDeleteTensors, rather than a vector of Tensor objects. o Used TensorReference in a few places in gpu_util.cc o Minor: switched to using InlinedVectors in a few places to get better cache locality. Change 109456692 Updated the label_image example to use the latest Inception model Change 109456545 Provides classify_image which performs image recognition on a 1000 object label set. $ ./classify_image giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.88493) indri, indris, Indri indri, Indri brevicaudatus (score = 0.00878) lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00317) custard apple (score = 0.00149) earthstar (score = 0.00127) Change 109455002 TensorFlow: make the helper libraries for various models available in the pip package so that when users type: python translate.py ... the absolute import works. This change is supposed to help make our tutorials run without the *need* to use bazel. Change 109450041 TensorFlow: remove cifar and convolutional binary copies from pip install. Adds embedding and some other models to the list. Change 109448520 Move the description of a failing invariant from a comment into the dcheck-fail message text. 
Change 109447577 TensorBoard has release tagging (tensorboard/TAG) Also track TensorBoard changes (tensorboard/CHANGES) Change 109444161 Added ParseSingleSequenceExample + python wrappers + unit tests. Change 109440864 Update all the TensorFlow Dockerfiles, and simplify GPU containers. This change updates all four of our Dockerfiles to match the targets discussed in https://github.com/tensorflow/tensorflow/issues/149. The most notable change here is moving the GPU images to use the NVidia containers which include cudnn and other build-time dependencies, dramatically simplifying both the build and run steps. A description of which tags exist and get pushed where will be in a follow-up. Change 109432591 Some pylint and pydoc changes in saver. Change 109430127 Remove unused hydrogen components Change 109419354 The RNN api, although moved into python/ops/, remains undocumented. It may still change at any time. Base CL: 109538006
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
9c3043ff3bf31a6a81810b4ce9e87ef936f1f529 |
|
20-Nov-2015 |
Manjunath Kudlur <keveman@gmail.com> |
TensorFlow: Improve performance of Alexnet.

Changes:
* error message that refers to removed `DefaultSession` method.
* -Wnull-conversion warnings.
* the "_start_time" attr for recvs when the flag "--brain_enable_scheduling_for_recvs" is set.
* typo in tutorial data download progress message.
* a typo ("however their installing" => "however installing").
* typo, rename "TensorFlow Mechanics" to "How To" to be consistent with the website.
* a typo ("subtact" => "subtract").
* protobuf examples in comments in tensorflow::Example.proto.
* formula formatting in MNIST beginner tutorial.
* negative fraction-of-queue-full stats.
* protobuf inclusion path so that Android demo will build under Blaze.
* small typo ("moderatly" => "moderately").
* Session.run() to check that tensor arguments come from the session's graph.
* another six import.
* seq2seq typo in bazel command.

Base CL: 108349164
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
56313def004795f75ef8281a0294c958d28f1e06 |
|
16-Nov-2015 |
Vijay Vasudevan <vrv@google.com> |
TensorFlow: Doc and linter fixes, some additional tests and error handling, updates to website.

Changes:
- Removes redundant reshape from image models by @mrry.
- Default TensorBoard to localhost by @danmane.
- Reformatting of tensorflow/core by @josh11b.
- Make tutorials backwards compatible to 0.5.0 by @girving.
- Improve print documentation (md files not updated).
- Add proper scrolling to sitemap by @martinwicke.

Base CL: 107956254
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
4dffee7f62d81ec9173aba1b0ef6b96e47f8037c |
|
12-Nov-2015 |
Vijay Vasudevan <vrv@google.com> |
TensorFlow: Minor updates to docs, BUILD, GPU config / perf, etc.

Changes:
- Updates to op documentation and index by Josh.
- More changes to BUILD files for python 3 support by @girving.
- Fix to Eigen to use DenseIndex everywhere by @jiayq.
- Enable configuration for cuda compute capability by @zheng-xq, including updates to docs.
- Route aggregation method through optimizer by schuster.
- Updates to install instructions for bazel 0.1.1.

Base CL: 107702099
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
f2102f4e2c1c87f1d1bf9ab856a2849c54478760 |
|
12-Nov-2015 |
Vijay Vasudevan <vrv@google.com> |
TensorFlow: upstream changes from the afternoon.

Changes:
- futurize --stage2 changes for Python 3 compatibility by @girving.
- Small updates to documentation by @vrv, schuster and others.
- Account for failure of std::thread::hardware_concurrency by @ebrevdo.
- More changes for backwards-compatibility tests by Josh.
- Updates to python op doc generation by Josh.
- Added support for using the best-fit allocator via ConfigProto by @vrv.
- Rename LocalSession to DirectSession, since local was a bad name for it.
- Enable tf.nn.moments() to work with tensors of unknown shape by @mrry. GITHUB_ISSUE: 139
- Changes for Android build by Andrew.

Base CL: 107645181
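The hardware_concurrency item above reflects a documented quirk: std::thread::hardware_concurrency() is permitted to return 0 when the core count cannot be determined. A minimal fallback sketch (the helper name `NumThreadsOrDefault` is hypothetical, not TensorFlow's actual code):

```cpp
#include <thread>

// std::thread::hardware_concurrency() may legally return 0 when the
// value is not computable; substitute a safe default of 1 in that case
// so thread-pool sizing never ends up with zero threads.
inline int NumThreadsOrDefault() {
  unsigned int n = std::thread::hardware_concurrency();
  return n > 0 ? static_cast<int>(n) : 1;
}
```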
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|
f41959ccb2d9d4c722fe8fc3351401d53bcf4900 |
|
07-Nov-2015 |
Manjunath Kudlur <keveman@gmail.com> |
TensorFlow: Initial commit of TensorFlow library. TensorFlow is an open source software library for numerical computation using data flow graphs. Base CL: 107276108
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
|