RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

2024-11-22 Source: 个人技术集锦

Everything works on a single GPU, but wrapping the model in torch.nn.DataParallel produces the following error:

RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

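Narrowing down the error:

Setting this environment variable makes CUDA kernel launches synchronous, so the error is reported at the call that actually caused it. It must be set before CUDA is initialized, for example at the very top of the script:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"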
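The ordering matters here, so as a minimal sketch (the `model` name in the comments is hypothetical and not from the original script), the variable is assigned before anything touches CUDA:

```python
import os

# Set the variable first: the CUDA runtime reads it when it initializes,
# so assigning it after the first CUDA call has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Only now import torch and build the model, e.g. (hypothetical names):
#   import torch
#   model = torch.nn.DataParallel(model).cuda()

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow training down noticeably, so this setting is best kept for debugging runs only.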

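Enabling device-side assertions can help catch errors inside kernels. Note that TORCH_USE_CUDA_DSA is a compile-time option for building PyTorch itself, so setting it as an environment variable at runtime should have no effect on a prebuilt package; the line is therefore left commented out.
#os.environ["TORCH_USE_CUDA_DSA"] = "1"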

Running again:

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

So the failure is actually raised inside cuDNN.

Fix: disable cuDNN

torch.backends.cudnn.enabled = False
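Disabling cuDNN globally makes PyTorch fall back to its own kernels, which is usually slower. If only some layers trigger the failure, cuDNN can instead be switched off for just that region with the `torch.backends.cudnn.flags` context manager. The following is a sketch under that assumption; the small convolution and input tensor are placeholders, not the original model:

```python
import torch
import torch.nn as nn

# Placeholder layer and input; runs on CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
x = torch.randn(2, 3, 16, 16, device=device)

previous = torch.backends.cudnn.enabled

# Inside this block cuDNN is disabled, so PyTorch uses its native kernels.
with torch.backends.cudnn.flags(enabled=False):
    y = conv(x)

# On exit the previous setting is restored.
restored = torch.backends.cudnn.enabled
print(tuple(y.shape), previous, restored)
```

Whether a scoped disable is enough for the DataParallel case described here has not been verified; the global switch above is the setting that was confirmed to work.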

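With this change the error no longer occurs.

Investigation: the likely cause is a version mismatch between cuDNN and CUDA. The exact cause still needs to be analyzed.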
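To check the version-mismatch hypothesis, a first step is to print which CUDA and cuDNN versions the installed PyTorch build reports. This is a small diagnostic sketch; the helper name `cudnn_report` is made up for illustration, and it returns early if PyTorch is not installed:

```python
import importlib.util


def cudnn_report() -> dict:
    """Collect the version numbers relevant to a CUDA/cuDNN mismatch."""
    if importlib.util.find_spec("torch") is None:
        return {"torch": None}  # PyTorch is not installed in this environment

    import torch

    return {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against (None on CPU-only builds)
        "cuda_build": torch.version.cuda,
        # cuDNN version as an integer (None if cuDNN is unavailable)
        "cudnn": torch.backends.cudnn.version(),
        "cudnn_available": torch.backends.cudnn.is_available(),
        "gpus": [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ],
    }


report = cudnn_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Comparing these values with the driver version shown by `nvidia-smi` and with any system-wide cuDNN installation can show whether two different cuDNN builds are present on the machine.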
