Import matrix-multiply from a separate wasm module #56
abhi-agg merged 18 commits into browsermt:master
Conversation
|
@kpu @XapaJIaMnu I have made this change for one function (int8PrepareA) of the matrix multiply interface initially, to get early feedback from you. I can make similar changes for the rest of the functions of this interface after your feedback. For int8PrepareA, I have made the following changes (based on my understanding):
Please let me know if this needs to change and I will make the required changes 👍
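As a rough illustration of the approach being discussed, the snippet below sketches how a function like int8PrepareA could be declared as a wasm import with a native fallback. All names, the signature, and the ABI here are illustrative assumptions, not the actual interface from this PR; the fallback body just does plain quantization, which is the behavior PrepareA is described as having later in this thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical signature modelled on the int8PrepareA idea discussed above:
// quantize a float matrix (row-major) into int8 with a scale and zero point.
#if defined(__wasm__)
// On the wasm target this would be an import satisfied by the embedder
// (e.g. the browser). Module/function names are illustrative only.
__attribute__((import_module("wasm_gemm"), import_name("int8_prepare_a")))
void int8PrepareA(const float* input, float scale, float zero_point,
                  uint32_t rows, uint32_t cols, int8_t* output);
#else
// Native fallback: plain quantization, usable on non-wasm targets.
void int8PrepareA(const float* input, float scale, float zero_point,
                  uint32_t rows, uint32_t cols, int8_t* output) {
  for (uint32_t i = 0; i < rows * cols; ++i)
    output[i] = static_cast<int8_t>(std::lround(input[i] * scale - zero_point));
}
#endif
```

On a wasm build the import would be resolved at instantiation time; on any other build the fallback above is compiled in, so callers see one signature either way.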
|
This looks like a bunch of WASM binding code. I'm not sure how to review it. One comment: the fallback (or not) should be atomic for the entirety of the API. Mixing a native prepareA with a fallback WASM 128-bit multiply won't work because they can have different opaque representations. Perhaps use the existing CPU model dispatch mechanism. Also, the fallback should be easy to implement by calling the existing function rather than aborting.
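The "atomic for the entirety of the API" point can be sketched as selecting one complete table of functions rather than polyfilling each function individually. This is a minimal illustration of the design principle, not code from the PR; all names are hypothetical.

```cpp
#include <cassert>

// Hypothetical function table: either every entry points at the native
// implementation or every entry points at the fallback, never a mix,
// because PrepareA/PrepareB/Multiply share an opaque data layout.
struct GemmApi {
  int (*prepareA)(int);
  int (*multiply)(int);
};

namespace native   { int prepareA(int x) { return x * 2; } int multiply(int x) { return x + 1; } }
namespace fallback { int prepareA(int x) { return x * 2; } int multiply(int x) { return x + 1; } }

// Select the whole table at once. A per-function choice could pair a
// native prepareA with a fallback multiply and produce wrong results.
GemmApi selectApi(bool nativeAvailable) {
  if (nativeAvailable) return {native::prepareA, native::multiply};
  return {fallback::prepareA, fallback::multiply};
}
```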
That's fine. I guess you can ignore that part. I will add @yurydelendik and @eqrion to review that part of the code.
I didn't follow you here. It is unclear to me where I am mixing a native prepareA with a fallback WASM 128-bit multiply. Could you point it out? That would be helpful. As I already mentioned, I made this change for just one function (int8PrepareA) of the matrix multiply interface to get early feedback from you. I can make similar changes for the rest of the functions of this interface after your feedback. So either the native implementation will be called for the entire wasm matrix multiplication interface, or the fallback implementation will be called. Using a preprocessor directive, I separated the code for the wasm target from non-wasm targets for matrix multiplication. The non-wasm targets are unchanged (i.e. they still use intgemm directly), while for the wasm target I am now calling the new matrix multiply interface. Specifically, this is the part of the code where I am looking for your feedback 👍
|
I think Marian is the wrong layer of abstraction for this. Why not just hide it behind the dispatcher in intgemm, treating the presence of JS intrinsics as a CPU model? https://github.com/kpu/intgemm/blob/master/intgemm/intgemm.cc That way you get a matrix multiply library that happens to run faster if there's a JS API. And MXNet will just happen to get a faster matrix multiply too. PrepareA just quantizes, so it's compatible across architectures anyway. But if, for example, you had a native PrepareB with CPUID dispatch that picked up 256-bit AVX2 and a WASM Multiply that does 128-bit, you'll get wrong results. The function-level polyfill design you have here carries that risk, whereas something that checks for all functions to decide what to use does not. That said, I presume Firefox would implement all-or-nothing, so the risk is low.
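The "treat JS intrinsics as a CPU model" suggestion can be sketched as one extra entry in a CPUID-style dispatch enum. intgemm's real dispatcher is structured differently; the names and detection stub below are purely illustrative.

```cpp
#include <cassert>

// Sketch: extend CPUID-style dispatch with a pseudo "CPU model" that
// represents the presence of browser-provided intrinsics.
enum class CpuModel { SSSE3, AVX2, AVX512, WasmIntrinsics };

// Stub for whatever probe would detect embedder-provided intrinsics;
// a real implementation would test for the imports at startup.
bool wasmIntrinsicsPresent() { return false; }

CpuModel detectCpuModel() {
  // Prefer the embedder's intrinsics when present, otherwise fall back
  // to ordinary CPUID detection (AVX2 here is just a placeholder).
  if (wasmIntrinsicsPresent()) return CpuModel::WasmIntrinsics;
  return CpuModel::AVX2;
}
```

The appeal of this shape is that every caller goes through one dispatch point, so a mixed native/fallback combination cannot arise by construction.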
I am confused now. If I hide everything behind the dispatcher in intgemm, then that would mean I can just throw away the gemm interface that we recently designed. There are still some subtle differences between the intgemm interface and the one that we recently designed (callbacks is one of them). With these changes in place, it is difficult to abstract at the intgemm level. Additionally, Marian can link and work with different gemm libraries (intgemm, fbgemm). Wouldn't Marian be the right place to link and use another gemm library that runs efficiently on the wasm target? This also keeps open the possibility of using different backend implementations if one doesn't want to link to intgemm at all.
As you also mentioned, Firefox will implement all-or-nothing (all of us also agreed in the beginning that implementing only a subset of the gemm functions wouldn't give us the performance gain that we want). In this scenario, we will not have the risk that you mentioned. On a side note, I have updated this PR to refactor the Marian code for the rest of the functions of the gemm interface as well.
Yes, exactly. Now you understand why I was protesting an unnecessary redesign of the interface. If you want an intgemm interface you can have it. If you want a matrix multiply interface that's standards-ready for others to use, implement WebNN. The middle does not serve much purpose. |
|
Sorry for chiming in late. I don't fully understand many of the concerns mentioned above, probably because different terminology is being used.
Maybe, or maybe not. The ideal location for the fallback implementation is a separate wasm module in a completely new project, which you would have to start and maintain as a dependency. Keeping the fallback logic in Marian for the first iteration will also save you some of the time you would otherwise spend dealing with wasm memory instantiation when multiple wasm modules are used.
I don't think we are redesigning an interface. We are building an interface suitable for wasm intrinsics. You will continue to use the existing interface outside the wasm target.
As I have stated a few times previously, the current API is not a redesign of intgemm but an interface that is suitable for intrinsics. This mainly involved changing intgemm's callback/function-pointer style to plain arguments. We had to do it because there is currently no good way to express a function pointer (@yurydelendik can confirm this).
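The callback-to-arguments change described above can be illustrated with a toy pair of signatures. These are not the real interface functions; they only show why an argument-style signature is expressible as a wasm import while a callback-style one is not.

```cpp
#include <cassert>
#include <functional>

// Callback style (intgemm-like): post-processing is supplied as a callable.
// A function pointer like this cannot cross a wasm import boundary.
int multiplyWithCallback(int a, int b, const std::function<int(int)>& postprocess) {
  return postprocess(a * b);
}

// Argument style (intrinsics-friendly): the post-processing parameters are
// plain scalars, so the whole signature consists of wasm value types and
// can be declared as an import.
int multiplyWithArgs(int a, int b, int addend) {
  return a * b + addend;
}
```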
There are currently two layers of dispatch: Marian and CPUID in intgemm. In Marian, you manually configure which gemm library (intgemm or fbgemm) you want via the Options object, and they have peculiar behaviors, like whether they handle the output matrix. There is currently no automatic dispatch mechanism that switches between the two. intgemm favors CPUID dispatch, but one can call the routines directly via the class for that CPUID. The templates would admittedly try to force the existence of functions you haven't implemented (and don't need for our particular case). This appears to insert a third layer in between, with dispatch controlled by a compiler predefine. Is there any means of sensing browser support, or will it always be two different builds? To give a different example: with the current wormhole instruction, I could (but have not gotten around to) run the wormhole instruction, inspect the output, and then use that as a CPUID of sorts to dispatch different intgemm implementations. Is there an analogous way to sense the presence of an optimized implementation here, or is it always two different builds?
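The wormhole-style sensing described above amounts to running a small probe and inspecting its result to decide whether the optimized path exists. A minimal sketch of that pattern, with an entirely hypothetical probe and magic value:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical probe: a real one would execute the special instruction
// (or a tiny optimized routine) on known input. The stub returns 0,
// standing in for the "no optimized implementation" case.
uint32_t probeOptimizedRoutine() { return 0; }

// Decide dispatch by inspecting the probe output, analogous to using
// the wormhole instruction's result as a CPUID of sorts.
bool optimizedPathAvailable() {
  const uint32_t kExpectedSignature = 0x5EEDu;  // illustrative magic value
  return probeOptimizedRoutine() == kExpectedSignature;
}
```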
|
Hi @kpu, we think we should have a meeting to clarify these issues. I'm sending you an email to have it scheduled. Thanks
|
@kpu @XapaJIaMnu This PR is up for review. I made all the changes as per our meeting and updated the description of this PR to summarize those changes 👍 |
|
@kpu @XapaJIaMnu Could you folks please review the PR? |
kpu left a comment:
Overall I think the principle is there. Good job.
There are places where code was copied, and it looks like a narrower ifdef would have avoided copying code.
|
@abhi-agg Are we good? I think the changes should be quick.
@kpu Actually, I used a narrower ifdef in the beginning but it caused a compilation failure on Windows. So I had to do it this way.
- The fallback implementation of the PrepareA interface is called successfully
- The fallback implementations of these functions are called successfully
- Abort for all other configurations: Int8::PrepareA and Int8::Multiply are not supported; gemm operations on 16-bit operands are not supported
- The flags are link-time flags only, not compile-time flags
|
Can you be more specific about the problem with Windows? Something to do with an unused typedef?
The error is not related to an unused typedef. It is the following: error C2121. I found the explanation here: https://docs.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2121?view=msvc-160 So the problem is with using a preprocessor directive inside a macro argument. I had found https://stackoverflow.com/questions/28187619/using-preprocessor-conditionals-in-macro-functions-with-double-parentheses but this solution doesn't really solve the problem of code duplication. If you have some suggestion then I am all ears 👍
|
NodeOp is a very simple macro. Just expand it, then use ifdefs inside. (And don't feel bad about the abstraction; there are already places in the code doing that.)
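The MSVC failure (error C2121) comes from placing `#ifdef` inside a macro argument, whose behavior is undefined; expanding the macro by hand sidesteps it. A minimal illustration, where NODE_OP is a simplified stand-in for Marian's NodeOp, not the real macro:

```cpp
#include <cassert>

// Simplified stand-in for Marian's NodeOp: wraps a statement in a lambda.
#define NODE_OP(body) [](){ body; }()

// Broken (MSVC rejects this with C2121): a preprocessor directive
// inside a macro argument.
//
//   int r = NODE_OP(
//   #ifdef USE_WASM        // <-- directive inside macro argument
//     return fastPath();
//   #else
//     return fallbackPath();
//   #endif
//   );

// Fix: expand the macro by hand, then branch with #ifdef inside the
// expansion, where directives are perfectly legal.
int dispatch() {
#ifdef USE_WASM
  return [](){ return 1; }();  // expanded NODE_OP for the wasm path
#else
  return [](){ return 2; }();  // expanded NODE_OP for the fallback path
#endif
}
```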
(force-pushed from b51654a to 0b72e37)
|
@kpu I have pushed the new commits to narrow the ifdef scopes 👍
Provides an ARM backend for matrix multiplies using google/ruy, and math functions through simde (https://simd-everywhere.github.io/blog/about/), effectively getting marian-decoder to run on ARM.

The following cmake flags are added:
- USE_INTGEMM (to switch intgemm on/off)
- USE_RUY (to switch ruy on/off)
- USE_ONNX_SGEMM (use the onnx sgemm added by wasm to provide the attention matrix multiply, which currently relies on a BLAS library)
- USE_SIMDE (swaps out existing Intel-based functions by using SIMDE instead)

The built marian-decoder is tested on an Oracle Cloud ARM machine with the following specs:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Vendor ID: ARM
Model name: Neoverse-N1
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

A CI check on GitHub Actions is added to cross-compile with the android-ndk, targeting arm64-v8a. The built binary is tested to work on an Android phone using termux (Samsung M30s). A successful Android build additionally requires a patch (sentencepiece -> protobuf). See opencv/opencv#17282 and opencv/opencv#19049. -Werror etc. causes issues with ruy (-Wmulti-line-comment) and is disabled.

The following minor changes are also applied:
- Remove M32_BINARIES; use COMPILE_WASM for -m32
- Hide -msse4.1 if the platform is unknown
- faiss was previously hardcoded for platforms with SSE available. This has been mitigated by adding a reference standard cpp implementation of the missing function.
- Exclude packed_gemm_....cpp from sources if USE_FBGEMM=off
- MSVC workaround following browsermt#56 (comment)
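The backend flags listed above are mutually exclusive compile-time switches. As a hypothetical sketch (not the actual marian-dev source), this is the kind of preprocessor gating such cmake options typically drive:

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: cmake options like USE_RUY / USE_INTGEMM become
// compile definitions that select the gemm backend at build time.
#if defined(USE_RUY)
std::string gemmBackend() { return "ruy"; }      // ARM path via google/ruy
#elif defined(USE_INTGEMM)
std::string gemmBackend() { return "intgemm"; }  // x86 path via intgemm
#else
std::string gemmBackend() { return "none"; }     // no 8-bit gemm backend compiled in
#endif
```

Because the choice is made by the preprocessor, exactly one backend's sources need to be compiled and linked, which is why the corresponding flags also control the source lists in cmake.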
Description
This PR resolves all but the last task of the issue browsermt/bergamot-translator#204
List of changes:
For the generic matrix multiply interface:
Please note that:
- For the --int8shiftAlphaAll configuration, 0.0 is being passed as the zero point everywhere.

Added dependencies: none
How to test
Follow the README instructions to run the wasm test page and observe the translated sentences in the browser console logs. This confirms that the matrix-multiply implementation is imported from a separate wasm module.
Checklist