
Use benchmark_cls for checking precision. #6375

Merged
zpcore merged 1 commit into master from ysiraichi/dont-load-for-precision
Jan 25, 2024

Conversation

@ysiraichi
Collaborator

@ysiraichi ysiraichi commented Jan 24, 2024

This PR makes it so we don't have to call load_benchmark just to check which precision should be used.

cc @miladm @JackCaoG

Comment thread: benchmarks/torchbench_model.py
@zpcore
Member

zpcore commented Jan 24, 2024

Refer to the issue here for context: #6286.
Thanks for making the fix.

The key point, I think, is to avoid leaving behind a dangling object that has, e.g., moved a model to an XLA device. `del benchmark` doesn't resolve the issue because the object has already claimed the PJRT runtime. This triggers the stack-dump error:

```
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0.
```

@zpcore zpcore requested a review from will-cromar January 24, 2024 21:29
@zpcore
Member

zpcore commented Jan 24, 2024

Since we only need to detect the precision, we can fetch that information directly without invoking:

```python
benchmark_cls(
    test=self.benchmark_experiment.test,
    device=device,
    batch_size=self.benchmark_experiment.batch_size,
)
```

I think we can call the following load_benchmark_precision instead of load_benchmark to get the precision directly:

```python
import importlib


def load_benchmark_precision(self):
  # Import the benchmark module without instantiating (and thus without
  # loading) the model itself.
  try:
    module = importlib.import_module(
        f"torchbenchmark.models.{self.model_name}")
  except ModuleNotFoundError:
    module = importlib.import_module(
        f"torchbenchmark.models.fb.{self.model_name}")
  # Read the precision defaults straight off the class attributes.
  benchmark_train_precision = getattr(
      module.Model, "DEFAULT_TRAIN_CUDA_PRECISION", None)
  benchmark_eval_precision = getattr(
      module.Model, "DEFAULT_EVAL_CUDA_PRECISION", None)
  return benchmark_train_precision, benchmark_eval_precision
```

WDYT?


@ysiraichi
Collaborator Author

Right. Correct me if I'm misunderstanding things, but isn't that exactly what I'm doing here?

@zpcore
Member

zpcore commented Jan 25, 2024

> Right. Correct me if I'm misunderstanding things, but isn't that exactly what I'm doing here?

Hah, you are right. I didn't notice that you called benchmark_cls instead.

Now it LGTM!

@zpcore zpcore self-requested a review January 25, 2024 17:41
@zpcore zpcore merged commit a1e51e4 into master Jan 25, 2024
@lezcano lezcano changed the title from "Use benchmark_cls for checking precision.`" to "Use benchmark_cls for checking precision." Feb 5, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024


Development

Successfully merging this pull request may close these issues.

benchmarks/torchbench_model: some benchmarks fail to load and kill experiment_runner's main process
