Memory issue (?): failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED #15

Open
@ericj974

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04 LTS
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.3.0
Python version: 2.7.12
CUDA/cuDNN version: 8.0/6.0.21
GPU model and memory: Nvidia Tegra X2

Describe the problem

I'm trying to run inference using ResNet-50 as a feature encoder (semantic segmentation with 2 classes). Depending on the memory load, I get the following error log sooner or later:

2017-11-10 05:10:43.484563: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Invalid reduction dimension (-1146944963 for input with 4 dimension(s)
2017-11-10 05:10:44.646881: E tensorflow/stream_executor/cuda/cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED
2017-11-10 05:10:44.646946: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0x30eb3d0: CUDA_ERROR_LAUNCH_FAILED
2017-11-10 05:10:44.646975: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0x30eb3d0: CUDA_ERROR_LAUNCH_FAILED
2017-11-10 05:10:44.647369: E tensorflow/stream_executor/cuda/cuda_blas.cc:551] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
2017-11-10 05:10:44.647478: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 1000163558 of dimension 0 out of bounds.
2017-11-10 05:10:44.647529: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 1021428837 of dimension 0 out of bounds.
2017-11-10 05:10:44.647573: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 1004492442 of dimension 0 out of bounds.

This happens whether or not a swapfile is in use. Once it happens, no further inference run is possible, even with a network that has a small memory footprint. I'm wondering whether this is a memory issue and, if so, how to deal with it.

For info, I get a similar error log on a TX1 (both source-compiled and binary TensorFlow were tried, with the same OS / TF configuration as above).
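
To check whether the failure really tracks memory load, the available system memory could be logged right before each inference run. Below is a minimal sketch, assuming a Linux `/proc/meminfo`; the helper names (`parse_meminfo`, `available_mb`, `read_available_mb`) are hypothetical and not part of TensorFlow. It only reports memory and does not fix anything.

```python
import re


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kilobytes}."""
    info = {}
    for line in text.splitlines():
        # Lines look like "MemTotal:  8048164 kB"; some fields have no unit.
        match = re.match(r"^(\w+):\s+(\d+)(?:\s+kB)?\s*$", line)
        if match:
            info[match.group(1)] = int(match.group(2))
    return info


def available_mb(text):
    """Return available memory in MB from /proc/meminfo-style text."""
    info = parse_meminfo(text)
    # Assumption: fall back to MemFree if the kernel does not report MemAvailable.
    kb = info.get("MemAvailable", info.get("MemFree", 0))
    return kb / 1024.0


def read_available_mb(path="/proc/meminfo"):
    """Read the given meminfo file and return available memory in MB."""
    with open(path) as f:
        return available_mb(f.read())
```

Printing `read_available_mb()` just before each `sess.run(...)` call would show whether the CUDA_ERROR_LAUNCH_FAILED runs coincide with low available memory. Since the GPU on Tegra boards shares physical memory with the CPU, this figure should roughly reflect what is left for both.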
