Hello,
We found an issue with the mem_trace tool on multi-context workloads. I'm running the latest NVBit release (1.7.2) with CUDA 12.4, the mem_trace tool included with NVBit, and this sample workload (https://github.com/cesar-avalos3/simple_multi_gpu). Running the workload with mem_trace.so triggers the following assertion failure:
"ASSERT FAIL: nvbit_imp.cpp:582:void Nvbit::create_ctx(CUcontext): FAIL !(tmp_dir != nullptr) MSG: temporary directory cannot be created, please make sure /tmp is writable!"
If we try the 1.5.5 release of NVBit and mem_trace, this works perfectly fine.
We tried to work around the error by interposing on the offending mkdtemp call via LD_PRELOAD, which resulted in a deadlock. Turning nvbit_at_ctx_term into a no-op allowed us to finish tracing "successfully", with the side effect that the second context is invisible to the tracer.
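For anyone trying to reproduce this, a small probe like the one below may help rule out a genuine /tmp permission problem before looking at NVBit itself. It is only a sketch: `probe_tmpdir` and the `probe_XXXXXX` template are hypothetical names, and it assumes (based on the assert message and the mkdtemp call mentioned above) that the failing step is a plain mkdtemp on a path under /tmp. It does not reproduce whatever NVBit does internally.

```c
#define _POSIX_C_SOURCE 200809L  /* expose mkdtemp() under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of NVBit): try to create, then remove,
 * a temporary directory under `parent`.
 * Returns 0 on success, or the errno value of the call that failed.
 */
int probe_tmpdir(const char *parent) {
    assert(parent != NULL);

    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/probe_XXXXXX", parent);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        return ENAMETOOLONG;  /* template did not fit in the buffer */
    }

    /* mkdtemp() replaces the trailing XXXXXX and creates the directory. */
    if (mkdtemp(path) == NULL) {
        int err = errno;
        fprintf(stderr, "mkdtemp under %s failed: %s\n", parent, strerror(err));
        return err;
    }

    /* Clean up so the probe leaves nothing behind. */
    if (rmdir(path) != 0) {
        return errno;
    }
    return 0;
}
```

Calling `probe_tmpdir("/tmp")` from the same user and environment as the traced workload should return 0 if /tmp is writable. If it does, the assert is more likely caused by something inside the tool than by directory permissions, which would be consistent with 1.5.5 working on the same machines.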
We saw this behaviour on our own servers (V100s) as well as on Lambda Labs (A100) servers.
(Probably related to #133)
Thanks!