
Conversation

@andrewchernyh
Contributor

Fix issue described in #2300

Only updating MAX_OUT_TOKES doesn't help; the main issue is that temp_buf was allocated just after the output tensor.

Then, in attention_unfused:
T* workspace = (T*)output + bsz * seq_len * heads * k;
where output is the temp_buf from ds_softmax_context.
As a result, temp_buf overlaps query_cont, which is allocated in ds_softmax_context.

I've moved temp_buf to just after kv_cache, which fixes the problem.

Minimal steps to reproduce:

python3 benchmarks/inference/gpt-bench.py -m EleutherAI/gpt-neo-125M --kernel-inject --deepspeed --dtype=fp32 --max-tokens=1020 --trials 1

@ghost

ghost commented Sep 22, 2022

CLA assistant check
All CLA requirements met.

@andrewchernyh
Contributor Author

No longer relevant, see #2300 (comment)
