Description
I have a recurring issue: inference hangs repeatedly, regardless of which model I use, and only occasionally runs to completion. This is the warning I get:
[rank1]:[W511 10:03:36.323250251 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
[rank0]:[W511 10:03:36.453588556 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
Could you advise a quick fix?
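For reference, the warning itself suggests passing device_id to init_process_group(). A minimal sketch of what that could look like is below. It assumes the processes are started by a launcher such as torchrun that exports LOCAL_RANK, and a PyTorch release recent enough to accept the device_id argument. The helper names local_rank_from_env and init_distributed are hypothetical, and whether this actually resolves the hang is not confirmed.

```python
import os


def local_rank_from_env(env=None):
    """Return the per-node rank exported as LOCAL_RANK (assumed set by torchrun)."""
    env = os.environ if env is None else env
    # Fall back to 0 so a single-process run still picks GPU 0.
    return int(env.get("LOCAL_RANK", "0"))


def init_distributed():
    """Bind this process to one GPU and tell NCCL which device it uses."""
    # Imported lazily so local_rank_from_env works without PyTorch installed.
    import torch
    import torch.distributed as dist

    local_rank = local_rank_from_env()
    device = torch.device("cuda", local_rank)

    # Select the GPU before any NCCL communicator is created.
    torch.cuda.set_device(device)

    # device_id is the argument named in the warning; it makes the
    # rank-to-GPU mapping explicit instead of leaving it to be guessed.
    dist.init_process_group(backend="nccl", device_id=device)
    return device
```

With this in place, init_distributed() would be called once at the start of the inference script, for example when launching with torchrun --nproc_per_node=2, and the model moved to the returned device.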