
Import matrix-multiply from a separate wasm module#56

Merged
abhi-agg merged 18 commits into browsermt:master from abhi-agg:import-wasm-gemm-dynamically
Oct 18, 2021

Conversation

@abhi-agg

@abhi-agg abhi-agg commented Sep 9, 2021

Description

This PR resolves all but the last task of issue browsermt/bergamot-translator#204

List of changes:
For generic matrix multiply interface:

  • Restructure marian-dev so that it can import different implementations of the generic matrix-multiply interface for the wasm target
  • Add a fallback (polyfill) dummy implementation of this interface in marian-dev for the wasm target and export it from the marian wasm module
    • The fallback dummy implementation may throw initially
  • Change the emscripten-generated JS glue code to plug in this fallback implementation (until an optimized implementation lands in Firefox)
  • Verify the restructuring by running the wasm artifacts on the wasm test page
  • Provide valid (non-dummy) fallback implementations for the entire interface

Please note that:

  1. We currently import only the functions required to support the --int8shiftAlphaAll configuration.
  2. The value 0.0 is passed as the zero point everywhere.
  3. The wasm target doesn't support 16-bit matrix multiplication (i.e. we abort on integer 16-bit matrix multiplication for the wasm target).

Added dependencies: none
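As an illustration of the fallback idea described above, here is a minimal stand-alone sketch of what one function of the interface might look like. The name `int8PrepareAFallback`, its signature, and the clamping behavior are assumptions for illustration, not the actual marian/bergamot API:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical shape of one function from the generic gemm interface.
// On the wasm target the optimized version would be imported from a
// separate wasm module; this is a plain C++ fallback for illustration.
extern "C" void int8PrepareAFallback(const float* input, float scale,
                                     float zeroPoint, int rows, int cols,
                                     int8_t* output) {
  // PrepareA is essentially quantization: scale each float, add the
  // zero point (0.0 everywhere, per the note above), round, and clamp.
  for (int i = 0; i < rows * cols; ++i) {
    float q = std::round(input[i] * scale + zeroPoint);
    if (q > 127.0f) q = 127.0f;
    if (q < -128.0f) q = -128.0f;
    output[i] = static_cast<int8_t>(q);
  }
}
```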

How to test

Follow the README instructions to run the wasm test page and observe the translated sentences in the browser console logs. This confirms that the matrix-multiply implementation is imported from a separate wasm module.

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

@abhi-agg abhi-agg requested review from XapaJIaMnu and kpu September 9, 2021 05:11
@abhi-agg
Author

abhi-agg commented Sep 9, 2021

@kpu @XapaJIaMnu I have made this change for one function (int8PrepareA) of the matrix-multiply interface initially to get early feedback from you. I can make similar changes for the rest of the functions of this interface after your feedback.

For int8PrepareA, I have made the following changes (based on my understanding):

  1. Both intgemm::Int8Shift::PrepareA and intgemm::Int8::PrepareA calls are routed to int8PrepareA for the wasm target.
  2. The value 0.0 is currently passed as the zero point.

Please let me know if this needs to change and I will make the required changes 👍
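For illustration, the routing in point 1 might look roughly like this sketch. Names and signatures are assumptions, and the stand-in definition of `int8PrepareA` exists only to make the example self-contained (a real build would import it from the separate gemm wasm module):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Stand-in for the function imported from the separate gemm wasm module.
// Signature and behavior here are assumed for illustration.
extern "C" void int8PrepareA(const float* input, float scale, float zeroPoint,
                             int rows, int cols, int8_t* output) {
  for (int i = 0; i < rows * cols; ++i)
    output[i] = static_cast<int8_t>(std::round(input[i] * scale + zeroPoint));
}

// Sketch of the routing described above: both the shifted
// (Int8Shift::PrepareA) and unshifted (Int8::PrepareA) paths funnel into
// the same imported function, always passing 0.0 as the zero point.
inline void prepareAShifted(const float* in, float scale, int rows, int cols,
                            int8_t* out) {
  int8PrepareA(in, scale, /*zeroPoint=*/0.0f, rows, cols, out);
}

inline void prepareAUnshifted(const float* in, float scale, int rows, int cols,
                              int8_t* out) {
  int8PrepareA(in, scale, /*zeroPoint=*/0.0f, rows, cols, out);
}
```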

@kpu
Member

kpu commented Sep 17, 2021

This looks like a bunch of WASM binding code. I'm not sure how to review it.

One comment: whether or not to use the fallback should be atomic for the entirety of the API. Mixing a native PrepareA with a fallback WASM 128-bit multiply won't work because they can have different opaque representations.

Perhaps use the existing CPU-model dispatch mechanism in intgemm.cc for your polyfill? Just sense the added functions, then treat them like a CPU model.

Also, the fallback should be easy to implement by calling the existing function rather than an abort.
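kpu's suggestion of treating the imported functions like a CPU model could be sketched like this. The enum, the probe, and all names are hypothetical, not intgemm's real dispatch code:

```cpp
#include <cassert>

// Treat "browser provides optimized gemm imports" as one more CPU model
// in intgemm-style dispatch.
enum class GemmBackend { Native, WasmImported };

// Pretend probe: a real build would test whether the optimized wasm
// imports were actually linked in (e.g. a flag set from the JS side).
bool wasmGemmAvailable(bool simulatedBrowserSupport) {
  return simulatedBrowserSupport;
}

// Pick the backend once, the same way CPUID dispatch picks an ISA level.
GemmBackend selectBackend(bool simulatedBrowserSupport) {
  return wasmGemmAvailable(simulatedBrowserSupport)
             ? GemmBackend::WasmImported
             : GemmBackend::Native;
}
```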

@abhi-agg
Author

abhi-agg commented Sep 20, 2021

> This looks like a bunch of WASM binding code. I'm not sure how to review it.

That's fine. I guess you can ignore that part. I will add @yurydelendik and @eqrion to review that part of the code.

> One comment: whether or not to use the fallback should be atomic for the entirety of the API. Mixing a native PrepareA with a fallback WASM 128-bit multiply won't work because they can have different opaque representations.

I didn't follow you here. It is a bit unclear to me where I am mixing a native PrepareA with a fallback WASM 128-bit multiply. Could you please point it out? That would be helpful.

As I already mentioned, I made this change for just one function (int8PrepareA) of the matrix-multiply interface to get early feedback from you. I can make similar changes for the rest of the functions of this interface after your feedback.

So, either the native implementation will be called for the entire wasm matrix multiplication interface or the fallback implementation will be called.

Using preprocessor directives, I separated the wasm-target code from the non-wasm-target code for matrix multiplication. The non-wasm targets are unchanged (i.e. they use intgemm directly), while the wasm target now calls the new matrix-multiply interface. This is specifically the part of the code where I am looking for your feedback 👍

@kpu
Member

kpu commented Sep 20, 2021

I think Marian is the wrong layer of abstraction for this. Why not just hide it behind the dispatcher in intgemm, treating the presence of JS intrinsics as a CPU model?

https://github.com/kpu/intgemm/blob/master/intgemm/intgemm.cc

That way you get a matrix multiply library that happens to run faster if there's a JS API. And MXNet will just happen to get a faster matrix multiply too.

PrepareA just quantizes, so it's compatible across architectures anyway. But if, e.g., you had a native PrepareB with CPUID dispatch that picked up 256-bit AVX2 and a WASM Multiply that does 128-bit, you'd get wrong results. The function-level polyfill design you have here carries that risk, whereas something that checks for all functions before deciding what to use does not. That said, I presume Firefox would implement all-or-nothing, so the risk is low.
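The all-or-nothing check kpu describes might look roughly like this sketch (the struct and member names are illustrative, not the real interface):

```cpp
#include <cassert>

// Table of function pointers for the imported gemm interface. In a real
// build these would be filled in from the separate wasm module.
struct GemmApi {
  void* prepareA = nullptr;
  void* prepareB = nullptr;
  void* multiply = nullptr;
};

// Only use the imported implementation if *every* function is present,
// so a native PrepareB is never mixed with an imported Multiply whose
// opaque data layout may differ.
bool useImported(const GemmApi& api) {
  return api.prepareA && api.prepareB && api.multiply;
}
```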

@abhi-agg
Author

abhi-agg commented Sep 20, 2021

> I think Marian is the wrong layer of abstraction for this. Why not just hide it behind the dispatcher in intgemm, treating the presence of JS intrinsics as a CPU model?
>
> https://github.com/kpu/intgemm/blob/master/intgemm/intgemm.cc
>
> That way you get a matrix multiply library that happens to run faster if there's a JS API. And MXNet will just happen to get a faster matrix multiply too.

I am confused now. If I hide everything behind the dispatcher in intgemm, that would mean throwing away the gemm interface we recently designed. There are still some subtle differences between the intgemm interface and the one we recently designed (callbacks being one of them). With these changes in place, it is difficult to abstract at the intgemm level.

Additionally, Marian can link against and work with different gemm libraries (intgemm, fbgemm). Wouldn't Marian be the right place to link in and use another gemm library that runs efficiently on the wasm target? This also keeps open the possibility of using different backend implementations if one doesn't want to link to intgemm at all.

> PrepareA just quantizes, so it's compatible across architectures anyway. But if, e.g., you had a native PrepareB with CPUID dispatch that picked up 256-bit AVX2 and a WASM Multiply that does 128-bit, you'd get wrong results. The function-level polyfill design you have here carries that risk, whereas something that checks for all functions before deciding what to use does not. That said, I presume Firefox would implement all-or-nothing, so the risk is low.

As you also mentioned, Firefox will implement all-or-nothing (we all agreed at the beginning that implementing only a subset of the gemm functions wouldn't give us the performance gain we want). In this scenario, the risk you mentioned does not arise.

On a side note, I have updated this PR to refactor the marian code for the rest of the functions of the gemm interface as well.

@kpu
Member

kpu commented Sep 20, 2021

> If I hide everything behind the dispatcher in intgemm, that would mean throwing away the gemm interface we recently designed. There are still some subtle differences between the intgemm interface and the one we recently designed (callbacks being one of them). With these changes in place, it is difficult to abstract at the intgemm level.

Yes, exactly. Now you understand why I was protesting an unnecessary redesign of the interface. If you want an intgemm interface, you can have it. If you want a matrix-multiply interface that's standards-ready for others to use, implement WebNN. The middle ground does not serve much purpose.

@yurydelendik

Sorry for chiming in late. I don't fully understand many of the concerns mentioned above, probably because different terminology is being used.

> I think Marian is the wrong layer of abstraction for this.

Maybe, or maybe not. The ideal location for the fallback implementation is a separate wasm module in a completely new project, which you would have to start and maintain as a dependency. Keeping the fallback logic in Marian for the first iteration will also save you the time you would otherwise spend dealing with wasm memory instantiation when multiple wasm modules are used.

> Now you understand why I was protesting an unnecessary redesign of the interface.

I don't think we are redesigning the interface. We are building an interface suitable for wasm intrinsics. You will continue to use the existing interface outside the wasm target.

@abhi-agg
Author

>> If I hide everything behind the dispatcher in intgemm, that would mean throwing away the gemm interface we recently designed. There are still some subtle differences between the intgemm interface and the one we recently designed (callbacks being one of them). With these changes in place, it is difficult to abstract at the intgemm level.
>
> Yes, exactly. Now you understand why I was protesting an unnecessary redesign of the interface. If you want an intgemm interface, you can have it. If you want a matrix-multiply interface that's standards-ready for others to use, implement WebNN. The middle ground does not serve much purpose.

As I have stated a few times previously, the current API is not a redesign of intgemm but an interface that is suitable for intrinsics. This mainly involved changing intgemm's callback/function-pointer style to plain arguments. We had to do it because there is currently no good way to express function pointers (@yurydelendik can confirm this).
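The callback-versus-arguments difference described above can be illustrated with two toy stand-ins; neither is the real intgemm or wasm-interface API, they only contrast the two styles:

```cpp
#include <cassert>
#include <cstdint>

// intgemm-style: the result is delivered through a callback, which relies
// on C++ templates/function pointers and cannot cross a wasm import.
template <class Callback>
void multiplyWithCallback(int32_t a, int32_t b, Callback cb) {
  cb(a * b);
}

// wasm-interface style: the result goes to a plain output argument, a
// shape a wasm import can express directly.
extern "C" void multiplyToOutput(int32_t a, int32_t b, int32_t* out) {
  *out = a * b;
}
```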

@abhi-agg abhi-agg changed the title Import matrix-multiply (PrepareA) from a separate wasm module Import matrix-multiply from a separate wasm module Sep 21, 2021
@kpu
Member

kpu commented Sep 21, 2021

> Additionally, Marian can link against and work with different gemm libraries (intgemm, fbgemm). Wouldn't Marian be the right place to link in and use another gemm library that runs efficiently on the wasm target? This also keeps open the possibility of using different backend implementations if one doesn't want to link to intgemm at all.

There are currently two layers of dispatch: Marian and CPUID in intgemm.

In Marian, you manually configure which gemm library (intgemm or fbgemm) you want via the Options object, and they have peculiar behaviors, like whether they handle the output matrix. There is currently no automatic dispatch mechanism that switches between the two.

intgemm favors CPUID dispatch, but one can call the routines directly via the class for that CPUID. The templates would admittedly force the existence of functions you haven't implemented (and don't need for our particular case).

This appears to insert a third layer in between, with dispatch controlled by a compiler predefine. Is there any means of sensing browser support, or will it always be two different builds? If yes, is WASM the correct define for this switch?

To give a different example: with the current wormhole instruction, I could (but have not gotten around to it) run the wormhole instruction, inspect the output, and then use that as a CPUID of sorts to dispatch different intgemm implementations. Is there an analogous way to sense the presence of an optimized implementation here, or is it always two different builds?

@andrenatal

Hi @kpu, we think we should have a meeting to clarify these issues. I'm sending you an email to schedule it. Thanks.

@abhi-agg
Author

abhi-agg commented Sep 27, 2021

@kpu @XapaJIaMnu This PR is up for review. I made all the changes as per our meeting and updated the description of this PR to summarize those changes 👍

@abhi-agg
Author

abhi-agg commented Oct 6, 2021

@kpu @XapaJIaMnu Could you folks please review the PR?

Comment thread src/tensors/cpu/intgemm_interface.h Outdated
Comment thread src/tensors/cpu/intgemm_interface.h
Comment thread src/tensors/cpu/intgemm_interface.h Outdated
Member

@kpu kpu left a comment


Overall I think the principle is there, good job.

There are places where code was copied and it looks like a narrower ifdef would have avoided copying code.

@kpu
Member

kpu commented Oct 11, 2021

@abhi-agg We good? I think the changes should be quick.

@abhi-agg
Author

> There are places where code was copied and it looks like a narrower ifdef would have avoided copying code.

@kpu Actually, I used a narrower ifdef in the beginning, but it caused a compilation failure on Windows. So I had to do it this way.

@abhi-agg abhi-agg requested a review from kpu October 18, 2021 08:44
…ically

  - The fallback implementations of these functions are called
    successfully
  - Abort for all other configurations
    - Int8::PrepareA and Int8::Multiply are not supported
    - gemm operations for 16-bit operands are not supported
  - The flags are link-time flags only, not compile-time flags
@kpu
Member

kpu commented Oct 18, 2021

Can you be more specific about the problem with Windows? Something to do with an unused typedef?

  #if defined(WASM)
    return {NodeOp(
      quantMult_ = *child(1)->val()->data();
    // snip
  #else
    typedef typename intgemm_<vtype>::type Integer;
    return {NodeOp(
      quantMult_ = *child(1)->val()->data();
      typedef typename intgemm_<vtype>::type Integer;

@abhi-agg
Author

abhi-agg commented Oct 18, 2021

> Can you be more specific about the problem with Windows? Something to do with an unused typedef?

The error is not related to an unused typedef. It is the following:
error C2121: '#': invalid character: possibly the result of a macro expansion

I found the explanation here: https://docs.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2121?view=msvc-160

So the issue is using a preprocessor directive inside the NodeOp macro expansion. I believe that is why even the COMPILE_CPU conditional has a corresponding empty else part.

I had found https://stackoverflow.com/questions/28187619/using-preprocessor-conditionals-in-macro-functions-with-double-parentheses, but that solution doesn't really solve the code-duplication problem.

If you have some suggestion then I am all ears 👍

@kpu
Member

kpu commented Oct 18, 2021

NodeOp is a very simple macro.

#define NodeOp(op) [=]() { op; }

Just expand it then use ifdefs inside.
https://github.com/marian-nmt/marian-dev/blob/c29cc83dc4d234f0e3a00a46a729053132b408b8/src/graph/chainable.h#L11

(And don't feel bad about abstraction, there are already places in the code doing that)
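Applied to the snippet discussed above, kpu's fix might look like this minimal sketch: the lambda is written out by hand so the `#if` sits inside the lambda body rather than inside a macro argument. The function name and the simplified body are assumptions; only the `NodeOp` definition and the `WASM` define come from the discussion:

```cpp
#include <cassert>
#include <functional>

// NodeOp is just a lambda wrapper: #define NodeOp(op) [=]() { op; }
// Expanding it by hand lets the preprocessor conditional live inside the
// lambda body, which MSVC accepts (error C2121 only occurs when '#'
// appears inside a macro argument).
std::function<void()> makeForwardOp(float* quantMult, const float* childVal) {
  return [=]() {
    *quantMult = *childVal;  // shared between both branches, written once
#if defined(WASM)
    // wasm-specific path would go here
#else
    // native path (e.g. the intgemm typedef and calls) would go here
#endif
  };
}
```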

@abhi-agg abhi-agg force-pushed the import-wasm-gemm-dynamically branch from b51654a to 0b72e37 Compare October 18, 2021 15:31
@abhi-agg
Author

@kpu I have pushed the new commits to narrow the ifdef scopes 👍

@abhi-agg abhi-agg merged commit a1a82ff into browsermt:master Oct 18, 2021
@abhi-agg abhi-agg deleted the import-wasm-gemm-dynamically branch November 1, 2021 10:52
jerinphilip added a commit to jerinphilip/marian that referenced this pull request Mar 3, 2022
jerinphilip added a commit to jerinphilip/marian that referenced this pull request Mar 9, 2022
Provides an ARM backend for matrix multiplies using google/ruy, and math
functions through simde (https://simd-everywhere.github.io/blog/about/),
effectively getting marian-decoder to run on ARM.

The following cmake flags are added:

- USE_INTGEMM (to switch intgemm on/off)
- USE_RUY (to switch ruy on/off)
- USE_ONNX_SGEMM (use the onnx sgemm added by wasm to provide the attention
  matrix multiply, which currently relies on a BLAS library)
- USE_SIMDE (swaps out existing Intel-based functions by using SIMDE
  instead)

The built marian-decoder is tested on an Oracle Cloud ARM Machine with
the following specs:

  Architecture   : aarch64
  CPU op-mode(s) : 32-bit, 64-bit
  Byte Order     : Little Endian
  Vendor ID      : ARM
  Model name     : Neoverse-N1
  Flags          : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
                   asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

A CI check on GitHub Actions is added to cross-compile with the android-ndk,
targeting arm64-v8a. The built binary is tested to work on an Android
phone using termux (Samsung M30s).

A successful Android build additionally requires a patch (sentencepiece ->
protobuf). See opencv/opencv#17282 and
opencv/opencv#19049.

-Werror etc. cause issues with ruy (-Wmulti-line-comment) and are
disabled.

The following minor changes are also applied:

 - Remove M32_BINARIES; use COMPILE_WASM for -m32
 - Hide -msse4.1 if the platform is unknown
 - faiss was previously hardcoded for platforms with SSE available. This
   has been mitigated by adding a reference standard C++ implementation of
   the missing function.
 - Exclude packed_gemm_....cpp from sources if USE_FBGEMM=off
 - MSVC workaround following
   browsermt#56 (comment)
