Avoid re-computing computation hashes #8976
Merged: rpsilva-aws merged 1 commit into pytorch:master (Apr 15, 2025)
Conversation
Force-pushed 9f7b5d1 to 7de630d
tengyifei (Collaborator) approved these changes on Apr 15, 2025:

Looks like you need to update some tests that check for this kind of log.
Force-pushed 7de630d to 1b06c82, then 1b06c82 to fd585f1
jeffhataws reviewed and approved these changes on Apr 15, 2025
rpsilva-aws added a commit to rpsilva-aws/xla that referenced this pull request on Apr 15, 2025
rpsilva-aws added a commit to rpsilva-aws/xla that referenced this pull request on Apr 16, 2025
zpcore pushed a commit that referenced this pull request on Apr 16, 2025
jeffhataws pushed a commit that referenced this pull request on Apr 18, 2025
Currently, we recompute the hash of the underlying computation on every hash lookup, solely to log it in two places. For small models where tracing is not negligible, this has a measurable cost, particularly since we deserialize the protobuf deterministically (which requires ordering the entries of unordered dictionaries/maps). The logging itself is unchanged, but the underlying deserialization logic is relatively slow because it must guarantee deterministic hashes for user computations. C++ evaluates stream operands eagerly, so the cost is paid regardless of the logging level.
This is only observed when the model is tracing-bound. We recently saw a ~5% throughput impact for small BERT models.
Note that this hash is only used to provide a unique string that a hash key maps to. The hash of the protobuf is only meaningful for UserComputation computations, where it is factored into the hash key. In all other cases it is unnecessary and serves merely as a unique (debug) identifier, and the user can still verify the mapping for any given graph hash key by enabling post_compilation_analysis. We compute it during hash lookup, which runs every time, and also in Compile, though there it runs only for the very first computation (across all instances). The user can still access the computation proto hash by enabling PT_XLA_DEBUG. E.g., for BERT HF pretraining (20 steps), with 48 metrics of 27 samples each, the collective tracing of each hash computation metric is as follows: