When I started training a neural network, it hit a CUDA_ERROR_OUT_OF_MEMORY, yet the training continued without any error. Because I wanted the process to use only as much GPU memory as it actually needs, I set gpu_options.allow_growth = True.
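For reference, this is roughly how I set the option (a minimal sketch assuming the TensorFlow 1.x Session API; the actual model and training loop are omitted):

import tensorflow as tf

# Allocate GPU memory incrementally as it is needed, instead of
# reserving almost all of it up front (TensorFlow 1.x API).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # ... build the graph and run the training loop here ...
    pass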
The logs are as follows:
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1
memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.00G (4294967296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Iter 20, Minibatch Loss= 40491.636719 ...
Running the nvidia-smi command then shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27                 Driver Version: 367.27                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0     Off |                  N/A |
| 40%   61C    P2    46W / 180W |   8107MiB /  8111MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   40C    P0    40W / 180W |      0MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     22932    C   python                                     8105MiB    |
+-----------------------------------------------------------------------------+
After I commented out gpu_options.allow_growth = True, I trained the net again and everything was normal. There was no CUDA_ERROR_OUT_OF_MEMORY problem. Finally, running the nvidia-smi command shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27                 Driver Version: 367.27                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0     Off |                  N/A |
| 40%   61C    P2    46W / 180W |   7793MiB /  8111MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   40C    P0    40W / 180W |      0MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     22932    C   python                                     7791MiB    |
+-----------------------------------------------------------------------------+
I have two questions about this. Why did CUDA_ERROR_OUT_OF_MEMORY appear even though the training proceeded normally? And why did the memory usage become smaller after commenting out allow_growth = True?