Multi-context workloads and mem_trace failing

Hello, 
We found an issue with the mem_trace tool on multi-context workloads. I'm running the latest NVBit release (1.7.2) with CUDA 12.4, the mem_trace tool included with NVBit, and this sample workload ([https://github.com/cesar-avalos3/simple_multi_gpu](https://github.com/cesar-avalos3/simple_multi_gpu)). Running the workload with mem_trace.so triggers the following assertion failure:
`"ASSERT FAIL: nvbit_imp.cpp:582:void Nvbit::create_ctx(CUcontext): FAIL !(tmp_dir != nullptr) MSG: temporary directory cannot be created, please make sure /tmp is writable!"`
With the 1.5.5 release of NVBit and its mem_trace, the same workload runs fine.
We tried to work around the error by overriding the offending mkdtemp call via LD_PRELOAD, which resulted in a deadlock. Turning nvbit_at_ctx_term into a no-op allowed us to finish tracing "successfully", with the side effect that the second context is invisible to the tracer.
We saw this behaviour on our own servers (V100s) as well as on Lambda Labs (A100) machines.
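For anyone trying to reproduce this, a standalone check along the following lines can help rule out a genuinely unwritable /tmp before looking at NVBit itself. It is only a minimal sketch: the helper name `try_mkdtemp` and the directory template are made up for illustration, and the template NVBit actually passes to mkdtemp is not known here.

```c
#define _DEFAULT_SOURCE /* exposes mkdtemp() under strict -std= modes on glibc */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper (not part of NVBit): try to create, then remove,
 * a unique directory under `base`.
 * Returns 0 on success, or the errno value of the call that failed. */
int try_mkdtemp(const char *base) {
    char path[4096];

    /* mkdtemp() requires a writable template ending in "XXXXXX". */
    int n = snprintf(path, sizeof path, "%s/mkdtemp_check_XXXXXX", base);
    if (n < 0 || (size_t)n >= sizeof path) {
        return ENAMETOOLONG;
    }

    if (mkdtemp(path) == NULL) {
        return errno; /* e.g. EACCES, ENOENT, ENOSPC */
    }

    /* Clean up so repeated runs do not leave directories behind. */
    if (rmdir(path) != 0) {
        return errno;
    }
    return 0;
}
```

Calling `try_mkdtemp("/tmp")` from a small `main` and printing `strerror()` of a non-zero result shows whether mkdtemp works outside the tracer. If it returns 0 while the assert still fires under mem_trace.so, the message about /tmp not being writable is likely misleading for this case.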
(Probably related to #133)
Thanks!



Multi-context workloads and mem_trace failing #137
