- maskTensor, err := ctx.Input().FromFloatSlice(mask, length, batchSize)
+ maskTensor, err := ctx.Input().FromFloatSlice(mask, batchSize, length)
I think there is an issue because we are swapping the order of dimensions here but the actual mask data is still laid out in the original order:
https://github.com/dhiltgen/ollama/blob/1296b3999ec5d4c15f32f5ac8311da94cdb808c4/kvcache/causal.go#L243
This works for GGML because the mask is in its native format and we just swap the arguments of FromFloatSlice back. However, it's probably at least part of the cache drift issue in MLX since the mask is not actually row order.
Most of the other inputs (which are the ones that aren't in the backend's native format) are only a single dimension, so the order doesn't make a difference. However, the mask is 2D.
I think we should change the mask generation to be row order native and in GGML do a permute in FromFloatSlice and FromIntSlice for multidimensional tensors. For the mask specifically, we may not actually need a contiguous, which would make it fast, though that is probably not generically true for all inputs.
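To illustrate the general concern, here is a minimal, standalone sketch in plain Go (it does not use the Ollama `ml` package, and the helper names `at` and `transpose` are hypothetical). It only shows that, under a fixed row-major convention, relabeling a flat 2D buffer with swapped dimensions is not the same as transposing it; it makes no claim about which layout the mask in `kvcache/causal.go` actually uses.

```go
package main

import "fmt"

// at reads element (r, c) from a flat buffer interpreted as a
// row-major matrix with the given number of columns.
func at(data []float32, cols, r, c int) float32 {
	return data[r*cols+c]
}

// transpose copies a rows x cols row-major buffer into a
// cols x rows row-major buffer, physically moving the elements.
func transpose(data []float32, rows, cols int) []float32 {
	out := make([]float32, len(data))
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			out[c*rows+r] = data[r*cols+c]
		}
	}
	return out
}

func main() {
	// Hypothetical 2 x 3 buffer, written row by row.
	data := []float32{0, 1, 2, 3, 4, 5}

	// Read as 2 x 3: element (1, 0).
	fmt.Println(at(data, 3, 1, 0))

	// Same buffer relabeled as 3 x 2 without moving data: element (0, 1).
	fmt.Println(at(data, 2, 0, 1))

	// Transposed copy, 3 x 2: element (0, 1) now matches the original (1, 0).
	t := transpose(data, 2, 3)
	fmt.Println(at(t, 2, 0, 1))
}
```

The first and third reads return the same value (3), while the relabeled read returns a different element (1). For a 1D input the distinction disappears, which is why only the 2D mask is affected.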
Moving back to draft status. Matmul has replaced Mulmat and now conforms to the behavior of PyTorch matmul; however, the current implementation has a significant performance hit. Once I can get it back to comparable performance, I'll take it back out of draft.
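For reference, the 2D case of the PyTorch-style convention can be sketched in plain Go as follows. This is a naive illustration of the shape contract only (`[m, k]` times `[k, n]` gives `[m, n]`, all row-major); `matmul2D` is a hypothetical helper, not the backend implementation, and it ignores batching and broadcasting.

```go
package main

import "fmt"

// matmul2D multiplies a row-major [m, k] matrix by a row-major [k, n]
// matrix and returns the row-major [m, n] result.
func matmul2D(a, b []float32, m, k, n int) []float32 {
	out := make([]float32, m*n)
	for i := 0; i < m; i++ {
		for p := 0; p < k; p++ {
			aip := a[i*k+p]
			// Walk one row of b at a time so memory access stays sequential.
			for j := 0; j < n; j++ {
				out[i*n+j] += aip * b[p*n+j]
			}
		}
	}
	return out
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6}    // shape [2, 3]
	b := []float32{7, 8, 9, 10, 11, 12} // shape [3, 2]
	fmt.Println(matmul2D(a, b, 2, 3, 2)) // shape [2, 2]
}
```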
This change switches the model API (and backend) to be row-order to make it easier to port model definitions from other frameworks that use row-order patterns.
@dhiltgen Could other Ollama engineers help? 🙂
Considering that this is blocking the MLX code, please move this forward soon.
Obsoleted by the new MLX-based engine.
@dhiltgen where is this MLX-based engine? |
Replaces #8731 on main.
This change switches the model API (and backend) to be row-order to make it easier to port model definitions from other frameworks that use row-order patterns. I've made the following changes to the Backend API interface definitions:
This requires a number of notable changes in the GGML backend:
1 as no-ops, so in order to retain the correct number of dimensions if the leading dimension has a shape of 1 (thus reversed to be the trailing dimension), this tracking is used instead of the underlying GGML reported number of dimensions.
-1 as a dimension consistent with other APIs, where the value will be calculated and filled in automatically.
Other potential refinements that aren't currently included but which may make sense:
The cache also required some adjustments based on these changes.
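As an illustration of the `-1` dimension behavior mentioned above, here is a minimal sketch of how such a placeholder is typically resolved. `resolveShape` is a hypothetical helper written for this example; it is not taken from the PR and the real backend may handle validation differently.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveShape is a hypothetical helper (not part of the Ollama API).
// It returns a copy of shape in which a single -1 entry is replaced by
// the value that makes the shape hold exactly total elements.
func resolveShape(total int, shape []int) ([]int, error) {
	out := make([]int, len(shape))
	copy(out, shape)

	known, inferAt := 1, -1
	for i, d := range out {
		switch {
		case d == -1:
			if inferAt != -1 {
				return nil, errors.New("at most one dimension may be -1")
			}
			inferAt = i
		case d <= 0:
			return nil, fmt.Errorf("invalid dimension %d at index %d", d, i)
		default:
			known *= d
		}
	}

	if inferAt == -1 {
		// Nothing to infer; just check that the shape is consistent.
		if known != total {
			return nil, fmt.Errorf("shape %v does not hold %d elements", shape, total)
		}
		return out, nil
	}
	if total%known != 0 {
		return nil, fmt.Errorf("cannot infer dimension: %d is not divisible by %d", total, known)
	}
	out[inferAt] = total / known
	return out, nil
}

func main() {
	shape, err := resolveShape(24, []int{2, -1, 4})
	fmt.Println(shape, err)
}
```

With 24 elements and a requested shape of `[2, -1, 4]`, the placeholder resolves to 3. Requests with more than one `-1`, or where the known dimensions do not divide the element count, return an error instead of guessing.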