History log of /external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
Revision Date Author Comments
7e23850c2145ed565c668d6ba327dbcf064d4ed8 09-Feb-2018 Gunhan Gulsoy <gunan@google.com> Remove header dependence on cuda_config.h to fix opensource custom op support.
Fixes #14454
Fixes #12860

PiperOrigin-RevId: 185194924
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
df982b8dea49eba273e33e4283c3b14eab171b04 09-Feb-2018 Guangda Lai <laigd@google.com> Split gpu_id.h and GpuIdManager out from build target //tensorflow/core:gpu_runtime to reduce the size of dependencies, so that when other lightweight libraries like the grappler utils need the TfToCudaGpuId translation function, they don't need to depend on things like the stream executor and CUDA libraries.

PiperOrigin-RevId: 185175757
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
d90054e7c0f41f4bab81df0548577a73b939a87a 07-Feb-2018 Michael Case <mikecase@google.com> Merge changes from github.

PiperOrigin-RevId: 184897758
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
993398d23d60efbc153c6ee7af33b0ecab85d33b 06-Feb-2018 A. Unique TensorFlower <gardener@tensorflow.org> Add local interconnect data to DeviceLocality.

This information can be used within a distributed implementation
when deciding how to route data transfers that might involve more
than one hop. By default the new fields are populated according to
StreamExecutor::CanEnablePeerAccessTo(); however, a platform-specific
implementation can augment them with more detailed values.

Do some refactoring of gpu_device and gpu_device_factory, making
GetDeviceLocalities() and GetInterconnectMaps() into virtual
functions.
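
As a hedged illustration of that refactoring (names and signatures are
simplified, not the exact TensorFlow API), making the interconnect map a
virtual function lets a platform-specific factory override the default
peer-access-based population:

#include <vector>

// Sketch only; the InterconnectMap fields here are illustrative.
struct InterconnectMap {
  int device_a;
  int device_b;
  int strength;  // e.g. quality of the link between the two devices
};

class BaseGPUDeviceFactory {
 public:
  virtual ~BaseGPUDeviceFactory() = default;
  // Default: derive links from peer-access queries such as
  // StreamExecutor::CanEnablePeerAccessTo().
  virtual std::vector<InterconnectMap> GetInterconnectMaps() {
    return {};
  }
};

class VendorGPUDeviceFactory : public BaseGPUDeviceFactory {
 public:
  // A platform-specific implementation can augment the defaults
  // with more detailed, measured values.
  std::vector<InterconnectMap> GetInterconnectMaps() override {
    return {{0, 1, /*strength=*/100}};
  }
};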

PiperOrigin-RevId: 184698821
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
7149a2e2e2f549035f23e21224ee41afe8df3876 30-Jan-2018 A. Unique TensorFlower <gardener@tensorflow.org> Cleanup: Ran clang-format on files in tensorflow/core/.../*.{cc,h}.

PiperOrigin-RevId: 183848459
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
19993eff100fc8041bbab974105d494561845352 17-Jan-2018 Guangda Lai <laigd@google.com> Log all valid visible CUDA GPU ids in one line.

PiperOrigin-RevId: 182121746
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
71896cc7e5bd3d1b8b5bb615eac7bebf86fa998c 04-Jan-2018 Raghuraman Krishnamoorthi <raghuramank@google.com> Merge changes from github.

PiperOrigin-RevId: 180746153
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
14cb8e14a8fb1e78e2ce623e4198972762e6e253 19-Dec-2017 Guangda Lai <laigd@google.com> Added virtual gpu support.

PiperOrigin-RevId: 179504116
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
87cfa5696122c2173902accd47418ee4f25995d7 13-Dec-2017 Guangda Lai <laigd@google.com> Refactor helper functions a bit for virtual gpu changes later.

PiperOrigin-RevId: 178826426
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
c381794b2fc3227bfee9cf085e26bafb33da8f4b 11-Dec-2017 Xiaoqiang Zheng <zhengxq@google.com> Support different threading modes in GPU device.
All modes are experimental for now. The goal is to find the best setting, and
then change the default to use it.

PiperOrigin-RevId: 178662212
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
487558cf40f85539f740959dd54a4f5eee8e0560 01-Dec-2017 A. Unique TensorFlower <gardener@tensorflow.org> Log the error status for failed CUDA EnablePeerAccess.

This would help debug issues like this:
#14759

PiperOrigin-RevId: 177625164
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
5439c1e2de01a8684b62aba224d44c392176ac32 20-Nov-2017 A. Unique TensorFlower <gardener@tensorflow.org> Remove nonfunctional and accidentally pessimizing value category casts.

The conditional expression only has ONE fixed value category. Because the two operands in the code as written are of different value categories, the result is in fact a prvalue, i.e. a copy. This seems unintended, and we should simply preserve the existing lvalue.

If we do want to allow moving, we need multiple statements:

if (num == 1) {
  f(std::move(copier));
} else {
  f(copier);
}
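
A hedged, self-contained illustration of the copy described above (the
Copier type and f here are hypothetical stand-ins, not the TensorFlow code):

#include <iostream>
#include <utility>

struct Copier {
  Copier() = default;
  Copier(const Copier&) { std::cout << "copy constructed\n"; }
  Copier(Copier&&) noexcept { std::cout << "move constructed\n"; }
};

void f(const Copier&) {}  // binds lvalues and temporaries alike

int main() {
  Copier copier;
  int num = 2;
  // Mixed value categories (xvalue vs. lvalue): the conditional
  // expression is a prvalue, so a temporary is materialized; on the
  // num != 1 branch that is an unintended extra copy.
  f(num == 1 ? std::move(copier) : copier);  // prints "copy constructed"
  // A plain lvalue operand binds directly, with no copy:
  f(copier);  // prints nothing
  return 0;
}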

PiperOrigin-RevId: 176414503
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
4b9238b2bce08dbf6dc433d1c4911043dd60403f 13-Nov-2017 Eugene Brevdo <ebrevdo@google.com> Support non-scalar variant device copy.

PiperOrigin-RevId: 175546097
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
21a07d7ca86e7b7429d4e75ed1dfc440b94ef3bd 07-Nov-2017 Guangda Lai <laigd@google.com> Automated g4 rollback of changelist 174735029

PiperOrigin-RevId: 174796480
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
4535bd5df4d077072a8f207146bf4cd051971237 06-Nov-2017 Yangzihao Wang <yangzihao@google.com> Force CUDA runtime initialization only when device count is larger than 0.

PiperOrigin-RevId: 174767565
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
e76519c75cf9e64fc1023c80d8b10ee712418e13 06-Nov-2017 Guangda Lai <laigd@google.com> Refactor helper functions a bit for virtual gpu changes later.

PiperOrigin-RevId: 174735029
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
bfa539c03cd1555024fc04f4974e531c46b24e07 26-Oct-2017 Malcolm Reynolds <mareynolds@google.com> Automated g4 rollback of changelist 173456597

PiperOrigin-RevId: 173542536
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
5fe90b57748714341d02b2b44a7ec8ff27123bc0 26-Oct-2017 Yangzihao Wang <yangzihao@google.com> Force the CUDA runtime initialization before device creation.
This is to avoid silent failures and garbage results produced when launching two TensorFlow programs simultaneously in two different processes.
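
As a hedged aside (this shows a well-known CUDA idiom, not necessarily the
mechanism this change uses): the CUDA runtime initializes lazily, and a
harmless call such as cudaFree(nullptr) is a common way to force context
creation up front so that driver problems surface immediately:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // cudaFree(nullptr) is a no-op for the memory system but still
  // triggers CUDA runtime/context initialization.
  cudaError_t err = cudaFree(nullptr);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  return 0;
}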

PiperOrigin-RevId: 173456597
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
ba5a5bfc23065086990ec3057caa2ded0c8a8dbf 17-Oct-2017 A. Unique TensorFlower <gardener@tensorflow.org> Add the op->IsExpensive() argument to tracing calls.

PiperOrigin-RevId: 172422580
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
0a11eaffc985ad6abd3a0e792061e1880766674a 03-Oct-2017 Eugene Brevdo <ebrevdo@google.com> Internal Variant API allowing registering Variants to be copied from/to GPU.

Adds a test in the variant_op_copy_test.

Modifies the base GPUDevice to use this registry if it sees a singleton variant.

Modifies the rendezvous manager to do the same.

PiperOrigin-RevId: 170908757
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
eb0808fb95567c1f5b7ce48d29f47edfd988aff8 23-Aug-2017 Benoit Steiner <bsteiner@google.com> Converted LOG(FATAL) into regular errors to prevent the process from crashing
on error.

PiperOrigin-RevId: 166257105
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
28ce1d163eeffe618a6972c5245be0e660d94e85 15-Aug-2017 A. Unique TensorFlower <gardener@tensorflow.org> Merge changes from github.
END_PUBLIC

---
Commit 9f81374c3 authored by raymondxyang<zihao.yang@microsoft.com>
Committed by Rasmus Munk Larsen<rmlarsen@google.com>:
Add option to build more Python tests in CMake (#11853)

* Ignore Windows built project

* Fix deprecated methods in tf.contrib.python

* Fix regex match for Windows build in contrib.keras

* Fix Regex match for Windows build in session_bundle

* * Fix deprecated methods
* Fix regex match for Windows
* Fix compatibility issue with Python 3.x

* Add missing ops into Windows build for test

* Enabled more testcases for Windows build

* Clean code and fix typo

* Add conditional cmake mode for enabling more unit testcases

* Add Cmake mode for major Contrib packages

* Add supplementary info in README for new cmake option

* * Update tf_tests after testing with TF 1.3
* Clean code and resolve conflicts

* Fix unsafe regex matches and format code

* Update exclude list after testing with latest master branch

* Fix missing module

---
Commit 98f0e1efe authored by Yong Tang<yong.tang.github@outlook.com>
Committed by Rasmus Munk Larsen<rmlarsen@google.com>:
Dynamic ksize and strides with MaxPool (#11875)

* Dynamic ksize with max_pool

This fix addresses the issue raised in #4746, where ksize
was static (an attr) with max_pool.
It changes ksize to an input tensor so that it is dynamic now.

This fixes #4746.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Add dynamic ksize to MaxPoolGrad and MaxPoolGradGrad

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Add test cases for max_pool_v2

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Fix GPU Jenkins issue.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Enable MaxPoolV2 in GPU

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Hide MaxPoolV2 and other fixes.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

---
Commit 02d6bc185 authored by Bairen Yi<byronyi@users.noreply.github.com>
Committed by Rasmus Munk Larsen<rmlarsen@google.com>:
remove useless variable (#12212)

---
Commit ed6b0d905 authored by namrata-ibm<bhavenamrata@gmail.com>
Committed by Rasmus Munk Larsen<rmlarsen@google.com>:
Adding support for s390x in calculation of cpu_frequency (#12201)

---
Commit 627dfc9dd authored by Taehoon Lee<taehoonlee@snu.ac.kr>
Committed by Taehoon Lee<taehoonlee@snu.ac.kr>:
Fix typos

---
Commit c0f9b0a91 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
In fast-math mode emit a tanh that has a faster min/max.

PiperOrigin-RevId: 164943597

---
Commit 87605f3d6 authored by Kay Zhu<kayzhu@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[TF:XLA] Use HloEvaluator for ComputeConstant, remove the need of a dedicated
compute constant backend.

PiperOrigin-RevId: 164940970

---
Commit 881de45c2 authored by Taehoon Lee<me@taehoonlee.com>
Committed by Rasmus Munk Larsen<rmlarsen@google.com>:
Add bool type supports for GPU kernels (#11927)

* Add bool type supports for GPU kernels

* Add bool type test codes for GPU kernels

---
Commit eeacdcdb1 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Add missing "CPU" suffix in registrations.

PiperOrigin-RevId: 164939527

---
Commit de01be952 authored by namrata-ibm<bhavenamrata@gmail.com>
Committed by Rasmus Munk Larsen<rmlarsen@google.com>:
Adding support for Big Endian in graph_constructor_test and wav_io (#12179)

---
Commit 26719d29f authored by QingYing Chen<pkudysj@126.com>
Committed by Rasmus Munk Larsen<rmlarsen@google.com>:
Implement CRF decode (Viterbi decode) for tensor (#12056)

* Implement CRF decoding for tensors

* add test code for tensor version's CRF decoding

* made modifications according to pylint

* add some comments for crf decode

* remove useless code

* add comments at the top comment of crf module and add more comments in crf_test

* capitalize first char of first word in comments

* replace crf_decode test code with a deterministic example

---
Commit f9a81ca2f authored by Pete Warden<pete@petewarden.com>
Committed by gunan<gunan@google.com>:
Create CI build script for Raspberry Pi (#12190)

* Create CI build script for Raspberry Pi

* Moved location of Pi build script

---
Commit e2a163a90 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Merge code from PR #11940 with internal changes from cl/164796436, and update Python tests to also run on GPU.

PiperOrigin-RevId: 164929133

---
Commit 08bbfa187 authored by Taehoon Lee<me@taehoonlee.com>
Committed by Rasmus Munk Larsen<rmlarsen@google.com>:
Fix typos (#12195)

---
Commit ab96f41fb authored by Luke Iwanski<luke@codeplay.com>
Committed by Rasmus Munk Larsen<rmlarsen@google.com>:
[OpenCL] Extends matmul_benchmark.py to cover SYCL (#11697)

* [OpenCL] Extends matmul_benchmark.py to cover SYCL

* Fixed typo

* /gpu:0 -> /device:GPU:0

* Fixes control_flow_ops_py_test

* /gpu: -> /device:GPU:

* Fixes //tensorflow/python/profiler/internal:run_metadata_test

* gpu: -> GPU:

* Fixes tfprof_node

* [OpenCL] Fixes device path to name with many colons (#123)

The device path is constructed from a device name by replacing all
colons with underscores. Some device names contain more than one colon,
for example 'device:SYCL:0' which gives a path 'device_SYCL_0'. The
previous code would not convert this back to the original device name,
but rather to 'device:SYCL_0'.

An alternative fix would be to convert all underscores to colons in the
device name (i.e. remove the restriction inside `replace("_", ":", 1)`),
however I'm not sure if there are any device names which contain
underscores.

* If no GPU device is available, fake one

* gpu: -> device:GPU

* Fixes profiler test

* /gpu:x -> /device:GPU:x

* Fixes debug_io_utils_test.cc test

* Fixes device_name_utils_test.cc

---
Commit 35e7a3665 authored by Yong Tang<yong.tang.github@outlook.com>
Committed by Rasmus Munk Larsen<rmlarsen@google.com>:
Remove unneeded casting of int64 for reverse_sequence (#12192)

This fix removes an unneeded cast to int64 for reverse_sequence:
```
lengths = math_ops.to_int64(lengths)
```
as int32 has already been enabled for reverse_sequence.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
---
Commit 9fba8c185 authored by Anna R<annarev@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Add benchmark dashboard link to benchmarks doc. Also, I added a link and
description for Benchmarks page to Community index page.

PiperOrigin-RevId: 164924906

---
Commit bb6f32fa7 authored by Mark Heffernan<meheff@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Make HloAliasAnalysis updatable after changes to the HLO graph.
As part of this change make HloAliasAnalysis a thinner layer which
basically only holds a map from HloValue to HloBuffer and vice versa.

PiperOrigin-RevId: 164923041

---
Commit 9103096c1 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by Thomas Köppe<tkoeppe@google.com>:
Merged commit includes the following changes:
164923041 by meheff:

Make HloAliasAnalysis updatable after changes to the HLO graph.
As part of this change make HloAliasAnalysis a thinner layer which
basically only holds a map from HloValue to HloBuffer and vice versa.

--

PiperOrigin-RevId: 164923041

---
Commit 822603aed authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Merging sibling fusion instruction using multi_output_fusion

PiperOrigin-RevId: 164920220

---
Commit c035aa2a8 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Go: Update generated wrapper functions for TensorFlow ops.

PiperOrigin-RevId: 164917891

---
Commit e1e81d9ba authored by Luke Iwanski<luke@codeplay.com>
Committed by Rasmus Munk Larsen<rmlarsen@google.com>:
[OpenCL] Fixes double memcpy bug (#151) (#12173)

* [OpenCL] Fixes double memcpy bug (#151)

As the debug CopyOp is called on a Tensor without a type, we need to use
the DataType enum to get type information, and use this to pass the type
on to Eigen. This is a workaround for Eigen's need to have a type when
calling memcpy. If the Eigen memcpy can be provided without a type
requirement, then the memcpy in sycl_util is unnecessary.

* Acts on feedback from: #12173/files/32cb12a9001b672425867b5a3110fd98e737a20b#r132496277

---
Commit d9ca2d86d authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Internal change

PiperOrigin-RevId: 164916465

---
Commit b8d13d218 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Remove more parts of DCASGD missed in the first pass. (47949b)

PiperOrigin-RevId: 164914552

---
Commit 73b3d52c7 authored by Alexandre Passos<apassos@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
cmake fix

PiperOrigin-RevId: 164911656

---
Commit 2173b5b0a authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Allow TFE_TensorHandleCopyToDevice to have the same device as src and
destination. It will reuse the same underlying buffer in those cases.

PiperOrigin-RevId: 164909906

---
Commit 13eb3b90e authored by Alexandre Passos<apassos@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Experimental C and Python APIs to invoke TensorFlow kernels on concrete values.

PiperOrigin-RevId: 164902588

---
Commit 7dfabcc01 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Initialize ExecutionOptions in ComputeConstant to default values.

PiperOrigin-RevId: 164894867

---
Commit c8897e9bc authored by Benoit Steiner<bsteiner@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Static required time computation

PiperOrigin-RevId: 164894645

---
Commit 076158f9b authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Enable implicit->explicit conversion by default.

PiperOrigin-RevId: 164890915

---
Commit 58c4a4cb1 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Bugfix: number of input channels is not necessarily in the last dimension, after introduction of data_format param.

PiperOrigin-RevId: 164889729

---
Commit 8f9b1af8a authored by Igor Saprykin<isaprykin@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Recover MonitoredSession when the Coordinator is requested to stop with one of the _PREEMPTION_ERRORS.

When SyncReplicasOptimizer is used, a preemption in the Coordinator may result in two cases:
Case 1) the session gets silently marked as complete
Case 2) the session gets stuck

This CL aims to solve and verify solutions for both of these problems. Fix 1 changes the should_stop logic. Fix 2 changes the CoordinatedSession.run() logic.

SyncReplicasOptimizer runs a separate set of threads using a Coordinator instance. Those threads do FIFOQueue.enqueue; the main thread does a blocking FIFOQueue.dequeue.

`sync_token_q` FIFOQueue is on parameter-servers. When one of the PS instances gets preempted, an AbortedError causes the Coordinator to stop via request_stop(ex). That by itself changes the state of MonitoredSession.should_stop() to True (Fix 1).

Results of the blocking Dequeue operation are sent to the chief worker via Recv. What happens next depends on the amount of tokens in `sync_token_q`. If there are enough for the next call to Dequeue to return, then the low-level "tf session run() call" returns. The next iteration of the `while not MonitoredSession.should_stop()` loop decides that the training is complete (Case 1).

If there are not enough tokens in `sync_token_q`, then the blocking Dequeue is going to keep waiting for them. This results in the graph execution getting stuck and the whole session getting garbage collected after 10 minutes (Case 2).

We decided to fix that by re-creating a session after it gets garbage collected (Fix 2). An alternative was to try to cancel the pending Dequeue operation, but it's not clear that it is the right thing to do and it is also not easy.

PiperOrigin-RevId: 164888390

---
Commit 46e4de6e5 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Undo loop fusion changes for now as they seem to be altering a few results.
END_PUBLIC
RELNOTES: n/a

BEGIN_PUBLIC
Automated g4 rollback of changelist 164825735

PiperOrigin-RevId: 165340331
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
851c11e3cc8857b3938b8babf304005340a6a5f4 08-Aug-2017 Jingyue Wu <jingyue@google.com> is_gpu_available checks minimum compute capability.

Add "compute capability: X.Y" to the short device description. This CL
doesn't break backward compability because min_cuda_compute_capability is an
optional argument.

PiperOrigin-RevId: 164534861
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
ca200aeafa834f51ec4c0437c291acdfa08baec5 07-Aug-2017 Toby Boyd <tobyboyd@google.com> Reduce GPU info messages

PiperOrigin-RevId: 164509907
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
5e6cdbd68aa6b94883738b782eed1b59193850bf 04-Aug-2017 Toby Boyd <tobyboyd@google.com> Reduce info logging for devices that are not peered

PiperOrigin-RevId: 164193299
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
8d31d4d909ef854e53d2cce5ff201e2bc6bc87a2 08-Jul-2017 Eugene Brevdo <ebrevdo@google.com> Double MinSystemMemory for non-opt builds. This gives more complicated kernels
(i.e., CUB reduces) more GPU memory in which to launch. Prior to this, calling
WhereOp with the BFC allocator in non-opt mode led to an "out of memory" error
when launching the kernel.

PiperOrigin-RevId: 161255358
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
996605b0e4ef96e6732f7496abf44b6e5e1eb504 07-Jul-2017 Skye Wanderman-Milne <skyewm@google.com> PiperOrigin-RevId: 161240586
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
9cf44446550b6d2c3141074013509875649b0fd5 06-Jul-2017 Eugene Brevdo <ebrevdo@google.com> Bugfixes for GPU WhereOp.

1. Set the cuda context properly within ComputeAsync.
Also set the cuda context properly in the WhereOp GPU callback.
2. Ensure report_uninitialized_variables runs on CPU. This avoids intermediate
copying of data to GPU after getting the variables' state and before
returning it.

PiperOrigin-RevId: 161092040
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
6ada43366663210beb0159b8c1a67b26ebfe6cb7 23-Jun-2017 Geoffrey Irving <geoffreyi@google.com> Prepare to not include node_def.proto.h in node_def_util.h

The goal is to make kernels mostly independent of proto headers, which will let
us lock down our .so imports. This CL makes a bunch of .cc files
either include node_def.proto.h themselves or not need the definition of
NodeDef; a second CL will make node_def_util.h not include node_def.proto.h.

RELNOTES: n/a
PiperOrigin-RevId: 159982117
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
5c6015f89c3d500bb6d5f4145572aaaab3432bb1 13-Jun-2017 A. Unique TensorFlower <gardener@tensorflow.org> Allocate 25 MiB less on GPUs with <2GB RAM, to avoid running
out of memory when launching kernels.

PiperOrigin-RevId: 158889089
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
49476a62cb06bab2ff78e8295707016c6a12d728 30-May-2017 A. Unique TensorFlower <gardener@tensorflow.org> Remove unused namespace aliases

PiperOrigin-RevId: 157468609
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
cfbc9d26d6f75ba70ac1e566774b2b2d487bef6e 27-May-2017 A. Unique TensorFlower <gardener@tensorflow.org> Annotate overriding functions with "override" or "final" (and not with "virtual")

PiperOrigin-RevId: 157284709
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
fccaac3d1bf1391756fae67f1979afe598d10ed1 26-May-2017 A. Unique TensorFlower <gardener@tensorflow.org> Force GPU device objects that refer to the same physical card using the
same stream id to use the same cuda stream objects. This avoids confusing
the per-device memory allocator in ways that cause memory corruption.

Fixes https://github.com/tensorflow/serving/issues/335.

PiperOrigin-RevId: 157258318
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
fdb4eba5b1cd0f2a2b10f83042a7e0eec1a41548 06-May-2017 A. Unique TensorFlower <gardener@tensorflow.org> - fixing comments to reflect reality
Change: 155256914
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
f28935a7d280b6ba75fe93fe35783d87b9cc2ec9 05-May-2017 Brennan Saeta <saeta@google.com> Implement ClusterSpec Propagation in TF Master

ClusterSpec propagation is a capability upgrade for TensorFlow that should make
it much easier to (1) build distributed TensorFlow clusters, and (2) handle
node failures. The ClusterSpec propagation capability allows TensorFlow workers
to be booted independently of each other, and with no knowledge about others.
The client can then construct a ClusterDef (ClusterSpec) and send it
to the TF master at session creation. The master in turn propagates the
ClusterDef along to all of the workers.
Change: 155159972
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
0a9b39caefd437fec742ae48b25061abd6e2699b 29-Apr-2017 Vijay Vasudevan <vrv@google.com> When allocating GPU constants, check to see if the destination
tensor is initialized early (because we ran out of memory) and report
it as such.

Fixes #7025.
Change: 154603030
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
698a961a5309230a87c3c02ef534b1756be32af3 17-Apr-2017 Benoit Steiner <bsteiner@google.com> Error out when running out of GPU memory instead of triggering a fatal error.
Change: 153357257
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
d146e67e20f1288cf7ea4441eb3c3301cf7fad43 14-Apr-2017 A. Unique TensorFlower <gardener@tensorflow.org> Adding tracing of TensorFlow operations executed on the CPU. Such tracing is disabled by default.
Change: 153128900
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
a42b3fc598cd028308ee9ae9dc9ed3034a04a0bf 05-Apr-2017 Suharsh Sivakumar <suharshs@google.com> Remove a warning that only happens with config=cuda.
Change: 152189205
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
f3405c2d73196e409041d52bbf30748b2a64493b 22-Feb-2017 A. Unique TensorFlower <gardener@tensorflow.org> Change nccl_manager to use ncclCommInitAll.
Change: 148169806
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
55d1978d73ece0f72347f27bba134b5975f4d0c8 20-Feb-2017 Jeffrey A. Dean <jeff@google.com> A couple of performance oriented changes:

(1) GraphView now allocates a single flat array of bytes to hold all
the NodeItem data for a graph, and can therefore use 32-bit offsets into
this array to hold the mapping from node id to its associated NodeItem
(rather than 64-bit NodeItem* pointers). This halves the footprint of
this data structure that is touched several times on every node execution.

(2) For BaseGPUDevice::Compute, we were touching op_kernel->name() and
op_kernel->type_string() on every kernel execution unconditionally,
even when tracing via port::Tracing::ScopedAnnotations was off (the
common case). This caused extra cache lines to be touched to access
these fields in op_kernel. Instead, added a new
port::Tracing::ScopedAnnotation::Enabled() call and used it so that
we now have two paths through the BaseGPUDevice::Compute code, with the
common-case faster path avoiding touching op_kernel->name() and
op_kernel->type_string().

Speeds up an InceptionV3 model on my desktop GPU card by approximately
1% (75.83 images/sec to 76.533 images/sec).
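
A minimal sketch of the flat-array mapping described in (1), with
hypothetical names (the real GraphView/NodeItem layout is more involved):

#include <cstdint>
#include <vector>

struct NodeItem { /* per-node execution metadata */ };

class GraphView {
 public:
  const NodeItem* node(int id) const {
    // 32-bit byte offsets into one flat buffer replace 64-bit
    // NodeItem* pointers, halving the id -> item mapping footprint.
    return reinterpret_cast<const NodeItem*>(buf_.data() + offsets_[id]);
  }

 private:
  std::vector<char> buf_;          // all NodeItems, packed back to back
  std::vector<uint32_t> offsets_;  // node id -> byte offset into buf_
};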
Change: 147979592
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
a2f9e9961051ace596b2d5139cc6b3ee6e4ff700 12-Jan-2017 David G. Andersen <dga@google.com> Automated rollback of change 144344623
Change: 144351555
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
1f409de8c85ad1b47a50daf87089758df3989fd1 12-Jan-2017 David G. Andersen <dga@google.com> Fail earlier and more clearly if dtype in parsed proto is invalid.
Change: 144344623
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
945331a503b1e39312fb7c3c1336276154b73aa6 06-Dec-2016 A. Unique TensorFlower <gardener@tensorflow.org> Change non-NUMA OS warning from LOG(ERROR) to LOG(INFO).
Change: 141230079
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
cfccd7ce1b9092eec98bb0989eb55a11e9d2b894 04-Nov-2016 Vijay Vasudevan <vrv@google.com> GPUDevice: if enabling peer access fails, log a warning but do not
return an error.

On some systems and GPUs, the driver may report being able to enable
peer access between two devices, but trying to do so still fails.

The system can still run, though possibly slower than if peer access
were enabled. Since we cannot disambiguate between supported
and unsupported cases in which this happens, we demote this
to a warning, with the exception being that if *no* device
could enable peering even though it should be possible, we
still return an error.

Fixes #5362, hopefully.
Change: 138141024
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
a95f95a60b20fb48fbfcae8da4afaa9412582746 02-Nov-2016 Gunhan Gulsoy <gunan@google.com> Remove references to gcudacc.
Change: 137888607
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
e2d51a87f0727f8537b46048d8241aeebb6e48d6 28-Oct-2016 Xiaoqiang Zheng <zhengxq@google.com> Merge changes from github.
Change: 137532946
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
79228c74e64a639aeb5692b442522d4aa279f885 20-Oct-2016 A. Unique TensorFlower <gardener@tensorflow.org> Replace enum BusAdjacency with protobuf DeviceLocality for describing the
topological neighborhood of a device.
Change: 136663586
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
3e8c4fd7403659ec32b9fec90a78831043aa0786 31-Aug-2016 Vijay Vasudevan <vrv@google.com> Don't establish contexts on gpus not on visible_device_list.

Moves all the initialization code out of gpu_init.h and into gpu_device.cc,
because we want the code that establishes peer mappings between GPUs to
reside close to where the device selection order is made.

Now the initialization code does nothing but call the StreamExecutor
platform initialization.

This also checks that there are no duplicate entries in the visible_device_list.

I tested this by running the following program:

import time
import tensorflow as tf

c = tf.ConfigProto()
c.gpu_options.visible_device_list="1"

s = tf.Session(config=c)
time.sleep(5)

# nvidia-smi showed the context was established on device 1 but NOT 0

del s
c.gpu_options.visible_device_list="1,0"
s = tf.Session(config=c)
time.sleep(30)

# nvidia-smi showed the context was established on both device 0 and 1,
# and the logs showed that the device ordering was 1->/gpu:0 and 0->/gpu:1,
# as well as the fact that it tried to establish the peer mapping.

del s
c.gpu_options.visible_device_list="1,0,1"
s = tf.Session(config=c) # failed

Fixes #1888
Change: 131785661
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
b0bdff4827f867a67f572ed99d85f9a847788326 26-Aug-2016 A. Unique TensorFlower <gardener@tensorflow.org> Merge changes from github.
Change: 131437429
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
8faa6f04f341a749e56f6475404c82cc704d08e2 26-Aug-2016 Vijay Vasudevan <vrv@google.com> Automated rollback of change 131356339
Change: 131410487
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
10d6f0eed8f4704589d036bcc09ba000c5c91e8a 26-Aug-2016 Vijay Vasudevan <vrv@google.com> Automated rollback of change 131340536
Change: 131356339
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
fa042c7063b922f8048630de7084b2a28e59827c 26-Aug-2016 Vijay Vasudevan <vrv@google.com> Add a 'visible to virtual' GPU device remapping to ConfigProto.

Allows one to remap the visible GPUs to virtual GPUs on
a per-session basis.

For example, if the visible devices are 0,1,2,3,4,5,6,8,
setting visible_device_list='5,3' means that visible device
5 gets mapped to /gpu:0 and visible device 3 maps to /gpu:1.
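
A hedged sketch of that remapping (a hypothetical helper, not the actual
TensorFlow parser): the list is parsed in order, and position in the list
becomes the virtual /gpu:N id.

#include <sstream>
#include <string>
#include <vector>

std::vector<int> ParseVisibleDeviceList(const std::string& list) {
  std::vector<int> visible;  // index = virtual /gpu:N, value = visible id
  std::stringstream ss(list);
  std::string tok;
  while (std::getline(ss, tok, ',')) {
    visible.push_back(std::stoi(tok));
  }
  return visible;  // "5,3" -> {5, 3}: /gpu:0 -> 5, /gpu:1 -> 3
}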

Tested manually on my local machine: our tests are single GPU
only, so there's no good way to test this on an ongoing basis.

Fixes #1888.
Change: 131340536
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
2c598e874e6a7b6b3185846ce9bac97a7d5d0169 25-Aug-2016 A. Unique TensorFlower <gardener@tensorflow.org> Merge changes from github.
Change: 131310818
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
d0789e85a24af6f608a7bcc2c7928028cc0ff8a6 11-Aug-2016 Vijay Vasudevan <vrv@google.com> Change GPU initialization to avoid crashing on errors (as much as possible).
Change: 129926913
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
2d0d126749d6d0cf82fb86691362c923a1bfbfe4 09-Aug-2016 Vijay Vasudevan <vrv@google.com> Change DeviceFactory functions that create devices to propagate
Statuses, so that failures to initialize devices don't crash
the program.

Changes swig for device_lib to be a lot simpler, thanks to mrry@
and keveman@'s help.

Change allocation of eigen scratch memory to go through the allocator.

Re-enable test for local devices now that python3 issue is fixed.
Change: 129678132
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
b8ccc5a20d0e6fc04ef9a854bb391c69f99ca907 18-Jul-2016 A. Unique TensorFlower <gardener@tensorflow.org> Reduce overhead of running kernels:
- Change ExecutorState::Entry to construct the Tensor val late, avoiding
default construction and moving the tensor in favor of calling the
constructor directly.
- Change DeviceContextMap to be vector, and make the check cheaper when no
nodes are registered.
- Cache in Params whether the device requires registering tensor accesses.
Change: 127714854
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
404b6faa3846c75e22daf12d7a979d942577a657 09-Jun-2016 A. Unique TensorFlower <nobody@tensorflow.org> Remove unused MultiOpActivation from stream executor.
Change: 124421014
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
8f10920915b8442fcd8ea43d1fd7a595b97ebf46 08-Jun-2016 A. Unique TensorFlower <nobody@tensorflow.org> Change a couple of error messages related to getting the NUMA affinity
of a GPU to suggest that the kernel may be lacking NUMA support.
Change: 124389318
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
03dba169c65953266713800fd05d6d151194d4f7 08-Jun-2016 Benoit Steiner <benoit.steiner.goog@gmail.com> Improved the performance of full reductions on GPU.

NEW
BM_fullReduction/10 4591 4595 153149 20.8M items/s
BM_fullReduction/64 5073 5075 100000 770.0M items/s
BM_fullReduction/512 9067 9070 75263 26.9G items/s
BM_fullReduction/4k 243984 244125 2868 64.0G items/s
BM_fullReduction/5k 359125 359273 1951 64.8G items/s

OLD
BM_fullReduction/10 9085 9087 74395 10.5M items/s
BM_fullReduction/64 9478 9478 72014 412.1M items/s
BM_fullReduction/512 14643 14646 46902 16.7G items/s
BM_fullReduction/4k 260338 260384 2678 60.0G items/s
BM_fullReduction/5k 385076 385178 1818 60.5G items/s
Change: 124290852
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
61f89ece63493f603d4f55725aba4ef4fb0dd6dd 07-Jun-2016 A. Unique TensorFlower <nobody@tensorflow.org> The minimum number of multiprocessors in a GPU is now set to the maximum number
found in the visible GPUs or 8, whichever is smaller. GPUs with a smaller
number of multiprocessors are ignored.
This check is now performed regardless of how many GPUs there are.
Change: 124256972
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
c8b59c046895fa5b6d79f73e0b5817330fcfbfc1 02-Jun-2016 A. Unique TensorFlower <nobody@tensorflow.org> Update copyright for 3p/tf/core.
Change: 123900938
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
acc40ff24b167477944c1aa2ee82c0eef39a0138 02-Jun-2016 Xiaoqiang Zheng <zhengxq@google.com> Add TF_EXTRA_CUDA_CAPABILITIES to support extra CUDA compute capabilities. The
list of extra CUDA capabilities must be passed in a sequence separated
by commas.

With bazel, the build command argument to include "3.0" is:
--copt=-DTF_EXTRA_CUDA_CAPABILITIES=3.0
Change: 123810335
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
892ca4ddc12852a7b4633fd08f163941356cb4e6 23-May-2016 Derek Murray <mrry@google.com> Merge changes from github.
Change: 123026122
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
df7276a15c5f9b9d1bca6794159e049103c0e2be 12-May-2016 Benoit Steiner <benoit.steiner.goog@gmail.com> Upgraded to the latest version of Eigen, which speeds up full reductions on fp16
by about 3 orders of magnitude, as well as some partial reductions by 30%, when using CUDA 7.5 or above.
Change: 122191448
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
885cc6bf55745142b8ecc578c61c5f03ff45e6ce 10-May-2016 A. Unique TensorFlower <nobody@tensorflow.org> Where GPUStreamExecutor fails to find a GPU NUMA node and returns -1,
log an error message, then reset the value to 0 where it is used in GPUDevice.

Getting the NUMA node correct is only necessary for multi-socket/bus
architectures which are probably in the minority among the TensorFlow user
community. This fix will unblock users for whom the StreamExecutor cannot
read the NUMA data correctly, and still provide an error message warning
of the problem for users who might be affected.
Change: 121970499
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
049040a6dc710b54a566085fc8d2f6608dbbd6a2 10-May-2016 A. Unique TensorFlower <nobody@tensorflow.org> Adds the StreamExecutor as a protected member of BaseGPUDevice.
Change: 121946595
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
e9db74626ea9e46a93eb0c15c37a2e138c83fade 22-Apr-2016 A. Unique TensorFlower <nobody@tensorflow.org> Replace dynamic_cast with static_cast in ReinitializeDevice.
Change: 120542152
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
ab6ffc92992f12456d2378f872be17f0ed274083 12-Mar-2016 Vijay Vasudevan <vrv@google.com> TensorFlow: allow growth in the GPU BFC allocator. This adds
an option to start the BFC allocator small and grow it over time
as needed. This can lead to increased fragmentation, but
the benefit is that only as much memory as "needed" is reserved.

This option defaults to off, but can be turned on by passing
an option to the first Session.

This is done by adding one more layer of indirection between
mapping a ChunkHandle to a pointer by introducing the concept
of AllocationRegions, which are contiguous memory regions that
mimic the previous implementation in their indexing (constant time
indexing within an AllocationRegion).

The drawback is that we must introduce one more lookup to find
out which allocation region a pointer is a part of. This implementation
uses a sorted vector and upper_bound to do a binary search based on
end_ptr. Its impact is relatively low based on the microbenchmarks below,
and if it were a cause for later concern, we can try to map the 'page tables'
of multiple regions into one very large AllocationRegion, and hope that there
are no holes between address spaces so that the ChunkHandle map is not too large
for memory.
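
A hedged sketch of that lookup (hypothetical types, not the actual BFC
allocator code): keep the regions sorted by end_ptr and binary search
with upper_bound to find the owning region for a pointer.

#include <algorithm>
#include <cstdint>
#include <vector>

struct AllocationRegion {
  uintptr_t start;  // first byte of the contiguous region
  uintptr_t end;    // one past the last byte; also the sort key
  // ... the region's ChunkHandle 'page table' would live here ...
};

const AllocationRegion* RegionFor(
    const std::vector<AllocationRegion>& regions,  // sorted by 'end'
    const void* ptr) {
  uintptr_t p = reinterpret_cast<uintptr_t>(ptr);
  // Find the first region whose end lies strictly beyond p.
  auto it = std::upper_bound(
      regions.begin(), regions.end(), p,
      [](uintptr_t value, const AllocationRegion& r) { return value < r.end; });
  if (it != regions.end() && p >= it->start) return &*it;
  return nullptr;  // pointer not owned by any region
}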

That being said, this change appears to not slow down the ptb_word_lm benchmark,
which was the initial impetus for most of the recent changes to this class, so
this appears safe.

Microbenchmarks I ran showed no real difference, even
when there were multiple regions, and the ptb_word_lm
benchmark also didn't change. The following numbers bear this out:

At HEAD: (consumes 5.8GiB on my Titan Black)

Epoch: 1 Learning rate: 1.000
0.004 perplexity: 6119.287 speed: 679 wps
0.104 perplexity: 849.526 speed: 5743 wps
0.204 perplexity: 629.677 speed: 6935 wps
0.304 perplexity: 509.189 speed: 7461 wps
0.404 perplexity: 438.585 speed: 7760 wps
0.504 perplexity: 392.459 speed: 7953 wps
0.604 perplexity: 352.998 speed: 8081 wps
0.703 perplexity: 325.909 speed: 8182 wps
0.803 perplexity: 304.531 speed: 8261 wps
0.903 perplexity: 284.988 speed: 8322 wps
Epoch: 1 Train Perplexity: 270.398
Epoch: 1 Valid Perplexity: 178.860
Epoch: 2 Learning rate: 1.000
0.004 perplexity: 212.458 speed: 8836 wps
0.104 perplexity: 151.131 speed: 9039 wps
0.204 perplexity: 158.768 speed: 8950 wps
0.304 perplexity: 153.650 speed: 8925 wps
0.404 perplexity: 150.586 speed: 8910 wps
0.504 perplexity: 148.136 speed: 8817 wps
0.604 perplexity: 143.511 speed: 8778 wps
0.703 perplexity: 141.382 speed: 8773 wps
0.803 perplexity: 139.401 speed: 8775 wps
0.903 perplexity: 135.706 speed: 8777 wps
Epoch: 2 Train Perplexity: 133.618
Epoch: 2 Valid Perplexity: 143.462
Epoch: 3 Learning rate: 1.000
0.004 perplexity: 146.292 speed: 8947 wps
0.104 perplexity: 104.901 speed: 9325 wps
0.204 perplexity: 114.335 speed: 9108 wps
0.304 perplexity: 111.434 speed: 9046 wps
0.404 perplexity: 110.328 speed: 9014 wps
0.504 perplexity: 109.455 speed: 8995 wps
0.604 perplexity: 106.877 speed: 8984 wps
0.703 perplexity: 106.158 speed: 8978 wps
0.803 perplexity: 105.532 speed: 8966 wps
0.903 perplexity: 103.284 speed: 8965 wps
Epoch: 3 Train Perplexity: 102.326
Epoch: 3 Valid Perplexity: 132.332
Epoch: 4 Learning rate: 1.000
0.004 perplexity: 116.748 speed: 8990 wps
0.104 perplexity: 85.032 speed: 9172 wps
0.204 perplexity: 93.827 speed: 9051 wps
0.304 perplexity: 91.716 speed: 9010 wps
0.404 perplexity: 91.088 speed: 8966 wps
0.504 perplexity: 90.654 speed: 8955 wps
0.604 perplexity: 88.841 speed: 8952 wps
0.703 perplexity: 88.550 speed: 8943 wps
0.803 perplexity: 88.268 speed: 8932 wps
0.903 perplexity: 86.610 speed: 8924 wps
Epoch: 4 Train Perplexity: 86.030
Epoch: 4 Valid Perplexity: 127.415
Epoch: 5 Learning rate: 1.000
0.004 perplexity: 98.907 speed: 8952 wps
0.104 perplexity: 73.707 speed: 9238 wps
0.204 perplexity: 81.525 speed: 9112 wps
0.304 perplexity: 79.768 speed: 9074 wps
0.404 perplexity: 79.366 speed: 9060 wps
0.504 perplexity: 79.199 speed: 9039 wps
0.604 perplexity: 77.728 speed: 9037 wps
0.703 perplexity: 77.630 speed: 9037 wps
0.803 perplexity: 77.596 speed: 9033 wps
0.903 perplexity: 76.270 speed: 9005 wps
Epoch: 5 Train Perplexity: 75.907
Epoch: 5 Valid Perplexity: 126.183
Epoch: 6 Learning rate: 0.500
0.004 perplexity: 88.458 speed: 8816 wps
0.104 perplexity: 64.231 speed: 9143 wps
0.204 perplexity: 69.896 speed: 9050 wps
0.304 perplexity: 67.342 speed: 9016 wps
0.404 perplexity: 66.162 speed: 8989 wps
0.504 perplexity: 65.290 speed: 8952 wps
0.604 perplexity: 63.331 speed: 8945 wps
0.703 perplexity: 62.617 speed: 8942 wps
0.803 perplexity: 61.883 speed: 8943 wps
0.903 perplexity: 60.149 speed: 8934 wps
Epoch: 6 Train Perplexity: 59.222
Epoch: 6 Valid Perplexity: 119.635
Epoch: 7 Learning rate: 0.250
0.004 perplexity: 73.009 speed: 8941 wps
0.104 perplexity: 53.369 speed: 9241 wps
0.204 perplexity: 58.193 speed: 9115 wps
0.304 perplexity: 55.957 speed: 9091 wps
0.404 perplexity: 54.885 speed: 9073 wps
0.504 perplexity: 54.052 speed: 9059 wps
0.604 perplexity: 52.298 speed: 9053 wps
0.703 perplexity: 51.598 speed: 9036 wps
0.803 perplexity: 50.858 speed: 9024 wps

With this change:

(Consumes 700MiB on my TitanBlack)

Epoch: 1 Learning rate: 1.000
0.004 perplexity: 6220.805 speed: 649 wps
0.104 perplexity: 847.498 speed: 5631 wps
0.204 perplexity: 628.919 speed: 6853 wps
0.304 perplexity: 506.395 speed: 7391 wps
0.404 perplexity: 435.559 speed: 7675 wps
0.504 perplexity: 389.903 speed: 7883 wps
0.604 perplexity: 351.013 speed: 8033 wps
0.703 perplexity: 324.474 speed: 8144 wps
0.803 perplexity: 303.551 speed: 8230 wps
0.903 perplexity: 284.267 speed: 8300 wps
Epoch: 1 Train Perplexity: 269.826
Epoch: 1 Valid Perplexity: 178.575
Epoch: 2 Learning rate: 1.000
0.004 perplexity: 214.660 speed: 8880 wps
0.104 perplexity: 152.258 speed: 9222 wps
0.204 perplexity: 159.331 speed: 9072 wps
0.304 perplexity: 154.358 speed: 9036 wps
0.404 perplexity: 151.455 speed: 9019 wps
0.504 perplexity: 148.906 speed: 9008 wps
0.604 perplexity: 144.203 speed: 8990 wps
0.703 perplexity: 142.134 speed: 8979 wps
0.803 perplexity: 140.096 speed: 8971 wps
0.903 perplexity: 136.424 speed: 8968 wps
Epoch: 2 Train Perplexity: 134.372
Epoch: 2 Valid Perplexity: 144.896
Epoch: 3 Learning rate: 1.000
0.004 perplexity: 146.571 speed: 9008 wps
0.104 perplexity: 105.991 speed: 9277 wps
0.204 perplexity: 114.965 speed: 9151 wps
0.304 perplexity: 112.041 speed: 9101 wps
0.404 perplexity: 110.948 speed: 9057 wps
0.504 perplexity: 110.141 speed: 9050 wps
0.604 perplexity: 107.539 speed: 9043 wps
0.703 perplexity: 106.877 speed: 9040 wps
0.803 perplexity: 106.181 speed: 9040 wps
0.903 perplexity: 103.940 speed: 9025 wps
Epoch: 3 Train Perplexity: 103.023
Epoch: 3 Valid Perplexity: 132.966
Epoch: 4 Learning rate: 1.000
0.004 perplexity: 117.296 speed: 8990 wps
0.104 perplexity: 85.532 speed: 9764 wps
0.204 perplexity: 94.076 speed: 9784 wps
0.304 perplexity: 91.875 speed: 9773 wps
0.404 perplexity: 91.423 speed: 9689 wps
0.504 perplexity: 91.090 speed: 9546 wps
0.604 perplexity: 89.244 speed: 9460 wps
0.703 perplexity: 89.004 speed: 9399 wps
0.803 perplexity: 88.732 speed: 9352 wps
0.903 perplexity: 87.097 speed: 9312 wps
Epoch: 4 Train Perplexity: 86.571
Epoch: 4 Valid Perplexity: 128.440
Epoch: 5 Learning rate: 1.000
0.004 perplexity: 100.152 speed: 8973 wps
0.104 perplexity: 74.050 speed: 9271 wps
0.204 perplexity: 81.658 speed: 9157 wps
0.304 perplexity: 79.822 speed: 9115 wps
0.404 perplexity: 79.594 speed: 9061 wps
0.504 perplexity: 79.486 speed: 9020 wps
0.604 perplexity: 78.066 speed: 8990 wps
0.703 perplexity: 78.046 speed: 8974 wps
0.803 perplexity: 77.968 speed: 8963 wps
0.903 perplexity: 76.702 speed: 8946 wps
Epoch: 5 Train Perplexity: 76.386
Epoch: 5 Valid Perplexity: 127.245
Change: 117032081
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
3b55e1f4f4be8fd4a6a5084edf9daf01e0990c3c 12-Mar-2016 A. Unique TensorFlower <nobody@tensorflow.org> Change safe_strto32 and safe_strto64 to accept StringPiece. Updates callers to
pass the StringPiece values.
Change: 117027762
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
ec1403e7dc2b919531e527d36d28659f60621c9e 02-Mar-2016 A. Unique TensorFlower <nobody@tensorflow.org> Add optional comprehensive logging of memory allocation/deallocation events. When enabled, the following events are recorded:

The start of a step, with the numerical step_id and a textual handle describing the step.

A Tensor allocation, including the step_id, the name of the OpKernel, the data type, shape, allocation size, allocation_id, data pointer location, and allocator used (the allocation_id is local to an allocator).

A Tensor deallocation, including the allocation_id and allocator used.

A raw memory allocation, including the step_id, the name of the component (e.g. Eigen), the number of bytes, data pointer location, allocation_id and allocator used.

A raw memory deallocation, including the step_id, the name of the component (e.g. Eigen), allocation_id and allocator used.

For now many Tensor allocations show 'unknown' for the kernel and step_id. These mostly come from Tensors allocated by the system from protocol buffers, and Tensors allocated by Ops using the Tensor constructor directly instead of calling OpKernelContext::allocate_temp. The latter can in principle be cleaned up one by one as necessary. The former would require some plumbing to associate an allocation with the appropriate step_id.

With this CL memory logging is enabled by raising the VLOG level to 1. Once there is an ability to set process-wide options programmatically it would make sense to update the machinery to do that. Currently recorded events are logged as INFO, and they can all be retrieved by filtering the log for lines including __LOG_MEMORY__.

Some example lines are as follows:

I0301 13:38:55.797563 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: -6 kernel_name: "Unknown (from Proto)" tensor { dtype: DT_FLOAT shape { } allocation_description { requested_bytes: 4 allocated_bytes: 4 allocator_name: "cuda_host" allocation_id: 2 has_single_reference: true ptr: 8717861408 } } }
I0301 13:38:55.802245 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: -6 kernel_name: "Unknown" tensor { dtype: DT_FLOAT shape { } allocation_description { requested_bytes: 4 allocated_bytes: 256 allocator_name: "gpu_bfc" allocation_id: 1 has_single_reference: true ptr: 47378989056 } } }
I0301 13:38:55.802347 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocation_id: 2 allocator_name: "cuda_host" }

[...]

I0301 13:38:55.806454 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogStep { step_id: 1 handle: "->/init;0" }
I0301 13:38:55.806659 81220 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 1 kernel_name: "random_normal/shape" tensor { dtype: DT_INT32 shape { dim { size: 4 } } allocation_description { requested_bytes: 16 allocated_bytes: 16 allocator_name: "cuda_host" allocation_id: 1 ptr: 8717860896 } } }

[...]

I0301 13:38:56.362898 81218 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: 1 kernel_name: "conv1/truncated_normal" tensor { dtype: DT_FLOAT shape { dim { size: 11 } dim { size: 11 } dim { size: 3 } dim { size: 96 } } allocation_description { requested_bytes: 139392 allocated_bytes: 139520 allocator_name: "gpu_bfc" allocation_id: 36 has_single_reference: true ptr: 47379030016 } } }
I0301 13:38:56.362894 81217 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocation_id: 24 allocator_name: "gpu_bfc" }
I0301 13:38:56.362903 81213 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 1 kernel_name: "conv5/truncated_normal/mul" tensor { dtype: DT_FLOAT shape { dim { size: 3 } dim { size: 3 } dim { size: 1024 } dim { size: 1024 } } allocation_description { requested_bytes: 37748736 allocated_bytes: 37748736 allocator_name: "gpu_bfc" allocation_id: 34 ptr: 48512711168 } } }

[...]

I0229 16:39:57.482980 76558 log_memory.cc:18] __LOG_MEMORY__ MemoryLogRawAllocation { step_id: 13 operation: "xentropy/EigenAllocator" num_bytes: 64 ptr: 47386857472 allocation_id: 625 allocator_name: "gpu_bfc" }
I0229 16:39:57.483147 76558 log_memory.cc:18] __LOG_MEMORY__ MemoryLogRawDeallocation { step_id: 13 operation: "xentropy/EigenAllocator" allocation_id: 625 allocator_name: "gpu_bfc" deferred: true }
I0229 16:39:57.483197 76558 log_memory.cc:18] __LOG_MEMORY__ MemoryLogRawDeallocation { step_id: 13 operation: "xentropy/EigenAllocator" allocation_id: 625 allocator_name: "gpu_bfc" }
Change: 116065112
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
c1e0d8d469ed467f7421a5a528895feb096a4a07 12-Feb-2016 Benoit Steiner <benoit.steiner.goog@gmail.com> Avoid calling the default constructor of GpuDevice since it puts in motion a
lot of expensive stream executor machinery. Moreover the stream executor
initialization was done for nothing as the device always ends up being
reinitialized before its first use.
Change: 114488347
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
88b0cb44a468ca8c26b20f008d3414a1474a3f8e 11-Feb-2016 A. Unique TensorFlower <nobody@tensorflow.org> Force the executor to wait until queued operations have finished executing
before allowing the SINK Op to complete. This probably better matches the expected behavior, and avoids misleading output from benchmarks, as well as marginally improving the reported timing of some of our existing benchmarks, and reducing the amount of "burn-in" time needed before getting useful readings from benchmarks.

A common pattern in benchmark python code is:

start_time = time.time()
_ = session.run(tf.group(*target))
duration = time.time() - start_time

Since no tensor is fetched by session.run the call returns as soon as the SINK Op has finished its Compute method. With the current codebase this does not wait for GPU-queued operations to complete, and as a consequence steps that should execute sequentially are overlapped. By modifying benchmark_overfeat (in this case with batch size 2750) to print out timings for every step including "burn-in" we see the behavior, where between 5 and 6 steps are overlapped before backpressure kicks in (the direct session thread pool has 5 entries) and we see the steady state time stabilize:

2016-02-11 10:38:13.556584: step 1, duration = 0.005
2016-02-11 10:38:13.561401: step 2, duration = 0.005
2016-02-11 10:38:13.565904: step 3, duration = 0.004
2016-02-11 10:38:13.570100: step 4, duration = 0.004
2016-02-11 10:38:15.434168: step 5, duration = 1.864
2016-02-11 10:38:23.401597: step 6, duration = 7.967
2016-02-11 10:38:31.372718: step 7, duration = 7.971
2016-02-11 10:38:39.344853: step 8, duration = 7.972
2016-02-11 10:38:47.333486: step 9, duration = 7.989
2016-02-11 10:38:55.297039: step 10, duration = 7.963
2016-02-11 10:39:03.267716: step 11, duration = 7.971
2016-02-11 10:39:11.243412: step 12, duration = 7.976
2016-02-11 10:39:19.198777: step 13, duration = 7.955
2016-02-11 10:39:27.175440: step 14, duration = 7.977

With the change the output (same batch size) shows:

2016-02-11 10:35:24.421367: step 1, duration = 7.832
2016-02-11 10:35:32.269099: step 2, duration = 7.848
2016-02-11 10:35:40.118505: step 3, duration = 7.849
2016-02-11 10:35:47.946552: step 4, duration = 7.828
2016-02-11 10:35:55.802172: step 5, duration = 7.856
2016-02-11 10:36:03.637670: step 6, duration = 7.835
2016-02-11 10:36:11.505289: step 7, duration = 7.868
2016-02-11 10:36:19.331914: step 8, duration = 7.827
2016-02-11 10:36:27.162643: step 9, duration = 7.831
2016-02-11 10:36:34.994089: step 10, duration = 7.831
2016-02-11 10:36:42.850624: step 11, duration = 7.856
2016-02-11 10:36:50.707724: step 12, duration = 7.857
2016-02-11 10:36:58.574334: step 13, duration = 7.867

No steps are overlapped, so the first step reports meaningful timing, and as a bonus the overall reported step time is slightly less, presumably because we are able to use all 5 CPU threads in the thread pool for a step.

The maximum batch size before OOM is the same before and after the change.
Change: 114458853
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
bc6f617bd636c4e371d7c03d153e9ce9b8b0e3c3 10-Feb-2016 Xiaoqiang Zheng <zhengxq@google.com> * Change gpu_count to gpu_device_enabled.
* Switch to having the BaseGPUDevice constructor enable GPU support, instead
of relying on the entry point.

This makes it possible for TensorFlow to use pinned memory for GPU/CPU memory
copies for all entry points.
Change: 114364920
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
241698b6ba6cd9b13d606a9e4603baa4f33891f2 06-Feb-2016 Xiaoqiang Zheng <zhengxq@google.com> Move CPU/GPU memory copies into their own streams.
Change: 114000504
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
545dee2e9897e424b641f092eec9ffd4a277f9d1 03-Feb-2016 Xiaoqiang Zheng <zhengxq@google.com> Put device-to-device GPU memory copies on a different stream.
Change: 113784244
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
d821f6aeb66a93501673ac5314685bd7d58151f8 02-Feb-2016 Xiaoqiang Zheng <zhengxq@google.com> Disable tensor tracking when only one GPU stream is used.
Change: 113579306
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
f8fa35b8a1910772d6d6ba7b621f905358640c2c 26-Jan-2016 Josh Levenberg <josh11b@tensorflow.org> Global search & replace to move to the new location for
tensorflow/core/ files and build targets.
Change: 113080048
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
ff8522de343a90813fc4e5cbb249e308c1819f1d 25-Jan-2016 A. Unique TensorFlower <nobody@tensorflow.org> Eliminate per-op allocation of gpu device wrapper. The PerOpGpuDevice is allocated once in the OpKernelContext::Params struct, then re-used every time a new OpKernelContext uses the Params. Thus in the executor, as long as there is more work to do the PerOpGpuDevice is not freed.
Change: 112909215
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
b481783fe0e00a86f6feb20a8dcad5fc4fc936a4 21-Jan-2016 Josh Levenberg <josh11b@tensorflow.org> Move #include <vector> out of port.h to users of std::vector<>.
After this we can replace port.h with types.h.
Change: 112727463
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
c8eaac926c929e07ac8db69f67803a2223ff2d93 20-Jan-2016 Josh Levenberg <josh11b@tensorflow.org> Many tensorflow/core build clean ups.
Change: 112523833
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
f592f23775e2a6ac75496829db5005d3bb70a3d2 19-Jan-2016 A. Unique TensorFlower <nobody@tensorflow.org> Replacing reference 'names' variable with 'example_names' variable.
Change: 112481326
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
6f62e435ab6c36dfdfdef1acd580b5f278f6723c 18-Jan-2016 A. Unique TensorFlower <nobody@tensorflow.org> Adds enough auditing to make it possible to track tensor buffers throughout an execution, and build a cost model of memory usage.

There are two main components:

1) GPU allocators now assign each allocated tensor buffer a unique ID so its use can be tracked within and across steps.

2) The check-in cleans up the tracking of Tensor buffer usage and makes it work for both sync and async kernels (async kernels did not previously track GPU memory correctly). Each use is now tracked by the OpKernelContext (for allocators that need this support) in a single deduplicated set of TensorReferences. When the kernel finishes, the executor retrieves the list of references, logs it if needed in the NodeExecStats, then passes it to the device, which may add an additional reference to keep the memory from being reused until the execution completes. When the tensor is logged in the NodeExecStats, a flag is set if there is a single remaining reference to the buffer, which means the memory will be freed once the Op completes.
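
A simplified sketch of the two components, assuming an atomic allocation-ID counter and an explicit-Unref reference type (both types below are illustrative stand-ins for TensorFlow's TensorBuffer and TensorReference):

    #include <atomic>
    #include <cstdint>

    // 1) Each buffer receives a unique ID at allocation time so its use
    //    can be correlated within and across steps.
    struct TensorBuffer {
      TensorBuffer() : alloc_id(next_id.fetch_add(1)), refs(1) {}
      void Ref() { refs.fetch_add(1); }
      void Unref() {
        if (refs.fetch_sub(1) == 1) delete this;  // last reference frees
      }
      const int64_t alloc_id;
      std::atomic<int> refs;
      static std::atomic<int64_t> next_id;
    };
    std::atomic<int64_t> TensorBuffer::next_id{0};

    // 2) A reference pins the buffer until explicitly released; the kernel
    //    context collects these, and the device can hold one extra
    //    reference until the GPU work for the op has actually finished.
    class TensorReference {
     public:
      explicit TensorReference(TensorBuffer* buf) : buf_(buf) { buf_->Ref(); }
      void Unref() const { buf_->Unref(); }
      int64_t alloc_id() const { return buf_->alloc_id; }
     private:
      TensorBuffer* buf_;
    };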
Change: 112375683
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
8e388a5546f1466aa0d2afa00e5a015997a23a2b 14-Jan-2016 Vijay Vasudevan <vrv@google.com> TensorFlow: Get rid of legacy command line flags use in TensorFlow.
Change: 112105282
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
94538a944ed71eb8c0c22213fef245e09165f935 13-Jan-2016 A. Unique TensorFlower <nobody@tensorflow.org> Added OpKernel::is_internal and used it to avoid a memcmp against the
OpKernel::type_string() for every node executed on a GPU.
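
The pattern amounts to precomputing a flag at construction so the executor reads a bool instead of comparing strings per node. A tiny sketch, where the leading-underscore predicate is only an illustrative guess at what "internal" means:

    #include <string>

    class OpKernel {
     public:
      explicit OpKernel(std::string type)
          : type_(std::move(type)),
            // Computed once here, not memcmp'd per executed node.
            is_internal_(!type_.empty() && type_[0] == '_') {}
      bool is_internal() const { return is_internal_; }
      const std::string& type_string() const { return type_; }
     private:
      std::string type_;
      bool is_internal_;
    };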
Change: 111998968
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
3ffa307e49e5b150934a71386194d7ed621e3e98 07-Jan-2016 Josh Levenberg <josh11b@tensorflow.org> #include third_party/tensorflow/core/platform/macros.h
directly so we can drop it from port.h.
Change: 111621646
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
1c579361cd1e088dd5e05a394b1561a73e3667ba 05-Jan-2016 A. Unique TensorFlower <nobody@tensorflow.org> Added the 'logging' import to control_flow_ops; it is used in the file but was not imported.
Change: 110842260
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
f9d3e9d03c69bfac77a2fe1ad80f7c5aa517e0f0 06-Dec-2015 Vijay Vasudevan <vrv@google.com> TensorFlow: upstream latest changes to git.

Change 109537918
TensorFlow pip setup: wheel >= 0.26 for python3 pip install
Change 109505848
Fix distortion default value to 1.0 in fixed_unigram_candidate_sampler. This means we default to the actual provided unigram distribution, instead of the uniform distribution (which is the current behavior).
Change 109470494
Bugfix in gradients calculation when the ys rely on each other.
Change 109467619
Fix CIFAR-10 model to train on all the training data instead of just 80% of it. Fixes #396.
Change 109467557
Replaced checkpoint file with binary GraphDef.
Change 109467433
Updates to C++ tutorial section.
Change 109465269
TensorFlow: update documentation for tutorials to not assume use of bazel
(when possible).
Change 109462916
A tutorial for image recognition to coincide with the release of the latest Inception image classification model.
Change 109462342
Clear control dependencies in variable_scope.get_variable() when creating
ops for the initializer.

Add tests of various error conditions.
Change 109461981
Various performance improvements in low-level node execution code paths.

Speeds up ptb_word_lm on my desktop with a Titan X from
3638 words per second to 3751 words per second (3.1% speedup).

Changes include:

o Avoided many strcmp operations per node execution, and extra touches
of cache lines in executor.cc, by basing the various IsMerge,
IsSwitch, IsSend, etc. predicates on an internal enum value that is
pre-computed at Node construction time rather than on string
comparisons against node->type_string(). We were doing about
6 such comparisons per executed node; see the sketch after this list.

o Removed mutex_lock in executor.cc in ExecutorState::Process. The
lock was not needed and the comment about the iterations array being
potentially resized is not true (the iterations arrays are created
with a fixed size). Checked with yuanbyu to confirm this.

o Added new two-argument port::Tracing::ScopedAnnotation constructor
that takes two StringPiece arguments, and only concatenates them
lazily if tracing is enabled. Also changed the code in
platform/tracing.{h,cc} so that the ScopedAnnotation constructor and
the TraceMe constructor can be inlined.

o In BaseGPUDevice::Compute, used the two-argument ScopedAnnotation
constructor to avoid doing StrCat(op_kernel->name(), ":",
op_kernel->type_string()) on every node execution on a GPU.

o Introduced a new TensorReference class that just holds a reference to an
underlying TensorBuffer, and requires an explicit Unref().

o Changed the EventMgr interface to take a vector of TensorReference objects
for EventMgr::ThenDeleteTensors, rather than a vector of Tensor objects.

o Used TensorReference in a few places in gpu_util.cc

o Minor: switched to using InlinedVectors in a few places to get better
cache locality.
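
As a sketch of the first item above, the enum-based classification might look roughly like this (the enum names and the set of classes are illustrative):

    #include <string>

    enum class NodeClass { kDefault, kMerge, kSwitch, kSend, kRecv };

    // Runs once, at Node construction time.
    NodeClass ClassifyNode(const std::string& type) {
      if (type == "Merge") return NodeClass::kMerge;
      if (type == "Switch") return NodeClass::kSwitch;
      if (type == "_Send") return NodeClass::kSend;
      if (type == "_Recv") return NodeClass::kRecv;
      return NodeClass::kDefault;
    }

    struct Node {
      std::string type;
      NodeClass node_class = NodeClass::kDefault;  // cached at construction
    };

    // Hot path: one integer compare instead of ~6 string compares per node.
    inline bool IsMerge(const Node* n) {
      return n->node_class == NodeClass::kMerge;
    }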
Change 109456692
Updated the label_image example to use the latest Inception model
Change 109456545
Provides classify_image, which performs image recognition on a 1000-object label set.

$ ./classify_image
giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.88493)
indri, indris, Indri indri, Indri brevicaudatus (score = 0.00878)
lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00317)
custard apple (score = 0.00149)
earthstar (score = 0.00127)

Change 109455002
TensorFlow: make the helper libraries for various models available
in the pip package so that when users type:

python translate.py ...

the absolute import works.

This change is supposed to help make our tutorials run without the
*need* to use bazel.
Change 109450041
TensorFlow: remove cifar and convolutional binary copies from pip install.
Adds embedding and some other models to the list.
Change 109448520
Move the description of a failing invariant from a comment into the dcheck-fail message text.
Change 109447577
TensorBoard has release tagging (tensorboard/TAG)
Also track TensorBoard changes (tensorboard/CHANGES)
Change 109444161
Added ParseSingleSequenceExample + python wrappers + unit tests.
Change 109440864
Update all the TensorFlow Dockerfiles, and simplify GPU containers.

This change updates all four of our Dockerfiles to match the targets discussed
in https://github.com/tensorflow/tensorflow/issues/149. The most notable
change here is moving the GPU images to use the NVIDIA containers, which
include cuDNN and other build-time dependencies, dramatically simplifying both
the build and run steps.

A description of which tags exist and get pushed where will be in a follow-up.
Change 109432591
Some pylint and pydoc changes in saver.
Change 109430127
Remove unused hydrogen components
Change 109419354
The RNN API, although moved into python/ops/, remains undocumented.

It may still change at any time.

Base CL: 109538006
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
9c3043ff3bf31a6a81810b4ce9e87ef936f1f529 20-Nov-2015 Manjunath Kudlur <keveman@gmail.com> TensorFlow: Improve performance of Alexnet

Changes:

* error message that refers to removed `DefaultSession` method.
* -Wnull-conversion warnings
* the "_start_time" attr for recvs when the flag "--brain_enable_scheduling_for_recvs" is set.
* typo in tutorial data download progress message.
* a typo ("however their installing"=>"however installing").
* typo, rename "TensorFlow Mechanics" to "How To" to be consistent with the website.
* a typo ("subtact"=>"subtract").
* protobuf examples in comments in tensorflow::Example.proto.
* formula formatting in MNIST beginner tutorial
* negative fraction-of-queue-full stats
* protobuf inclusion path so that Android demo will build under Blaze.
* small typo (moderatly > moderately)
* Session.run() to check that tensor arguments come from the session's graph.
* another six import
* seq2seq typo in bazel command

Base CL: 108349164
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
56313def004795f75ef8281a0294c958d28f1e06 16-Nov-2015 Vijay Vasudevan <vrv@google.com> TensorFlow: Doc and linter fixes, some additional tests and
error handling, updates to website.

Changes:
- Removes redundant reshape from image models by @mrry
- Default TensorBoard to localhost by @danmane
- Reformatting of tensorflow/core by @josh11b
- Make tutorials backwards compatible to 0.5.0 by @girving
- Improve print documentation (md files not updated).
- Add proper scrolling to sitemap by @martinwicke

Base CL: 107956254
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
4dffee7f62d81ec9173aba1b0ef6b96e47f8037c 12-Nov-2015 Vijay Vasudevan <vrv@google.com> TensorFlow: Minor updates to docs, BUILD, GPU config / perf, etc.

Changes:
- Updates to op documentation and index by Josh

- More changes to BUILD files for python 3 support by @girving

- Fix to Eigen to use DenseIndex everywhere by @jiayq

- Enable configuration for cuda compute capability by @zheng-xq,
including updates to docs.

- Route aggregation method through optimizer by schuster

- Updates to install instructions for bazel 0.1.1.

Base CL: 107702099
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
f2102f4e2c1c87f1d1bf9ab856a2849c54478760 12-Nov-2015 Vijay Vasudevan <vrv@google.com> TensorFlow: upstream changes from the afternoon.

Changes:

- futurize --stage2 changes for Python 3 compatibility by @girving.

- Small updates to documentation by @vrv, schuster and others

- Account for failure of std::thread::hardware_concurrency by @ebrevdo.

- More changes for backwards-compatibility tests by Josh

- Updates to python op doc generation by Josh

- Added support for using the best-fit allocator via ConfigProto by @vrv.

- Rename LocalSession to DirectSession, since local was a bad name for
it.

- Enable tf.nn.moments() to work with tensors of unknown shape by @mrry.
GITHUB_ISSUE: 139

- Changes for Android build by Andrew.

Base CL: 107645181
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
f41959ccb2d9d4c722fe8fc3351401d53bcf4900 07-Nov-2015 Manjunath Kudlur <keveman@gmail.com> TensorFlow: Initial commit of TensorFlow library.
TensorFlow is an open source software library for numerical computation
using data flow graphs.

Base CL: 107276108
/external/tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc