
Conversation

@snnn (Contributor) commented Jan 31, 2025

Description

The latest driver for Linux A10 machines is 535.
The latest driver for Windows A10 machines is 550.
We have some unused A100 quota. If we can replace the A10 machines with A100 machines, we might be able to upgrade our CUDA from 12.2 to 12.4.

Motivation and Context

@snnn snnn marked this pull request as ready for review February 19, 2025 21:03
@snnn snnn merged commit 4dff07d into main Feb 19, 2025
96 of 98 checks passed
@snnn snnn deleted the snnn/replace_pool branch February 19, 2025 21:04
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025
tianleiwu added a commit that referenced this pull request Jan 5, 2026
Update GQA benchmark to support bfloat16 and default to testing the
first configuration (fast mode).

Note that test_sparse_attention.py was removed in
#23547. It is referenced by
the benchmark script, so it is added back here, with the test disabled in
pipeline mode.

Example output from H200 GPU:
```
prompt-sm90-Llama3-8B-b1-h32_8x128-float16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       0.781751                 0.571226
1             32.0       0.893813                 0.684198
2             64.0       1.434056                 1.589263
3            128.0       1.142192                 1.681969
4            256.0       1.503483                 2.225498
5            512.0       1.045732                 1.878660
6           1024.0       2.334924                 0.916745
7           2048.0       2.229924                 3.001290
8           4096.0       4.309678                 3.198855
9           8192.0       7.932211                 7.910411

token-sm90-Llama3-8B-b1-h32_8_d128-float16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       1.751966                 0.780081
1                  32.0       1.302806                 0.043939
2                  64.0       2.301024                 2.207282
3                 128.0       2.294556                 3.010107
4                 256.0       2.931330                 1.781768
5                 512.0       1.210220                 2.799579
6                1024.0       2.767142                 2.660434
7                2048.0       1.420229                 0.091433
8                4096.0       0.860655                 0.801022
9                8191.0       0.749525                 0.820858

prompt-sm90-Llama3-8B-b1-h32_8x128-bfloat16:
   sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0             16.0       1.085427                 0.666664
1             32.0       1.714795                 0.931262
2             64.0       1.729093                 1.438733
3            128.0       1.071263                 2.486135
4            256.0       1.957349                 1.342417
5            512.0       1.159680                 1.591321
6           1024.0       0.743702                 2.035150
7           2048.0       1.452736                 1.788801
8           4096.0       4.029917                 4.041565
9           8192.0       7.934485                 7.931600

token-sm90-Llama3-8B-b1-h32_8_d128-bfloat16:
   past_sequence_length  ORT-GQA-Dense  ORT-GQA-Dense-PackedQKV
0                  16.0       0.044354                 0.043983
1                  32.0       0.040715                 0.044061
2                  64.0       0.045586                 0.044071
3                 128.0       0.062204                 0.061418
4                 256.0       0.074764                 4.874854
5                 512.0       2.472094                 2.102259
6                1024.0       4.911269                 1.396149
7                2048.0       4.898032                 1.684034
8                4096.0       2.523432                 2.192279
9                8191.0       1.651366                 3.427370
```
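As a rough illustration of what the float16-vs-bfloat16 comparison above measures, the sketch below times a grouped-query attention call in both dtypes with PyTorch. It is not the PR's benchmark script: the head counts mirror the Llama3-8B-style labels in the tables (32 query heads, 8 KV heads, head size 128), but the batch size, sequence length, and iteration count are assumptions, and it relies on the `enable_gqa` flag of `scaled_dot_product_attention`, which requires PyTorch 2.5 or newer and a CUDA device.

```python
# Minimal sketch (not the PR's benchmark script): time a grouped-query
# attention call in float16 vs bfloat16 on CUDA. Shapes follow the
# Llama3-8B-style labels above (32 query heads, 8 KV heads, head size 128);
# batch size, sequence length, and iteration count are illustrative choices.
import time
import torch

def time_gqa(dtype, batch=1, q_heads=32, kv_heads=8, head_dim=128,
             seq_len=1024, iters=20):
    device = "cuda"
    q = torch.randn(batch, q_heads, seq_len, head_dim, device=device, dtype=dtype)
    k = torch.randn(batch, kv_heads, seq_len, head_dim, device=device, dtype=dtype)
    v = torch.randn(batch, kv_heads, seq_len, head_dim, device=device, dtype=dtype)

    # enable_gqa (PyTorch 2.5+) broadcasts the 8 KV heads across the 32 query heads.
    for _ in range(3):  # warm-up iterations so kernel compilation is not timed
        torch.nn.functional.scaled_dot_product_attention(
            q, k, v, is_causal=True, enable_gqa=True)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        torch.nn.functional.scaled_dot_product_attention(
            q, k, v, is_causal=True, enable_gqa=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # average latency in ms

if __name__ == "__main__":
    for dtype in (torch.float16, torch.bfloat16):
        print(f"{dtype}: {time_gqa(dtype):.3f} ms")
```

The warm-up loop and the explicit `torch.cuda.synchronize()` calls around the timed region matter: CUDA kernel launches are asynchronous, so timing without synchronization would measure launch overhead rather than attention latency.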