I am trying to use a non-causal attention mask with a Llama model in Unsloth, and I am reading through the code to find the best way to keep all the other speedups while doing so.
I would appreciate pointers to where code changes would be needed before I start stepping through with a debugger.
I see that the attention forward still seems to accept an attention mask:
https://github.com/unslothai/unsloth/blob/main/unsloth/models/llama.py#L375
But the mask may be ignored in the forward function:
https://github.com/unslothai/unsloth/blob/main/unsloth/models/llama.py#L589
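To make clear what I mean by a non-causal mask, here is a minimal pure-Python sketch of the additive mask I would like to pass in. It is only an illustration: `build_additive_mask` is a hypothetical helper of my own, not an Unsloth or transformers function, and in a real model the mask would be a tensor rather than nested lists.

```python
NEG_INF = float("-inf")

def build_additive_mask(padding_mask, causal):
    """Hypothetical helper: build a (seq_len x seq_len) additive mask.

    padding_mask: list of 1 (real token) or 0 (padding), one per position.
    causal: if True, also block attention to future positions.
    An entry of 0.0 means "may attend"; -inf means "blocked".
    """
    n = len(padding_mask)
    mask = []
    for q in range(n):          # query position
        row = []
        for k in range(n):      # key position
            blocked = padding_mask[k] == 0 or (causal and k > q)
            row.append(NEG_INF if blocked else 0.0)
        mask.append(row)
    return mask

padding = [1, 1, 1, 0]  # last position is padding
causal_mask = build_additive_mask(padding, causal=True)
bidirectional_mask = build_additive_mask(padding, causal=False)
```

With the causal mask, position 0 cannot attend to position 2. With the bidirectional mask it can, and only the padding position stays blocked. That second behaviour is what I would like to get while keeping the rest of the fast path.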