
Import matrix-multiply from a separate wasm module#56

Merged
abhi-agg merged 18 commits into browsermt:master from abhi-agg:import-wasm-gemm-dynamically
Oct 18, 2021

Conversation

@abhi-agg

@abhi-agg abhi-agg commented Sep 9, 2021

Description

This PR resolves all but the last task of issue browsermt/bergamot-translator#204

List of changes:
For generic matrix multiply interface:

  • Restructure marian-dev so that it can import different implementations of the generic matrix-multiply interface for the wasm target
  • Add a fallback (polyfill) dummy implementation of this interface in marian-dev for the wasm target and export it from the marian wasm module
    • The fallback dummy implementation may throw initially
  • Change the emscripten-generated JS glue code to plug in this fallback implementation (until an optimized implementation lands in Firefox)
  • Verify the restructuring by running the wasm artifacts on the wasm test page
  • Provide valid (non-dummy) fallback implementations for the entire interface

Please note that:

  1. We currently import only the functions required to support the --int8shiftAlphaAll configuration.
  2. The value 0.0 is passed as the zero point everywhere.
  3. The wasm target doesn't support 16-bit matrix multiplication (i.e. we abort on integer 16-bit matrix multiplication for the wasm target).

Added dependencies: none
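As an illustration of the fallback idea described above, here is a minimal stand-alone sketch of what one function of the interface might look like. The name `int8PrepareAFallback`, its signature, and the clamping behavior are assumptions for illustration, not the actual marian/bergamot API:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical shape of one function from the generic gemm interface.
// On the wasm target the optimized version would be imported from a
// separate wasm module; this is a plain C++ fallback for illustration.
extern "C" void int8PrepareAFallback(const float* input, float scale,
                                     float zeroPoint, int rows, int cols,
                                     int8_t* output) {
  // PrepareA is essentially quantization: scale each float, add the
  // zero point (0.0 everywhere, per the note above), round, and clamp.
  for (int i = 0; i < rows * cols; ++i) {
    float q = std::round(input[i] * scale + zeroPoint);
    if (q > 127.0f) q = 127.0f;
    if (q < -128.0f) q = -128.0f;
    output[i] = static_cast<int8_t>(q);
  }
}
```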

How to test

Follow the README instructions to run the wasm test page and observe the translated sentences in the browser console logs. This confirms that the matrix-multiply implementation is imported from a separate wasm module.

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

@abhi-agg abhi-agg requested review from XapaJIaMnu and kpu September 9, 2021 05:11
@abhi-agg
Author

abhi-agg commented Sep 9, 2021

@kpu @XapaJIaMnu I have made this change for one function (int8PrepareA) of the matrix-multiply interface initially to get early feedback from you. I can make similar changes for the rest of the functions of this interface after your feedback.

For int8PrepareA, I have made the following changes (based on my understanding):

  1. Both intgemm::Int8Shift::PrepareA and intgemm::Int8::PrepareA calls are routed to int8PrepareA for the wasm target.
  2. The value 0.0 is currently passed as the zero point.

Please let me know if this needs to change and I will make the required changes 👍
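For illustration, the routing in point 1 might look roughly like this sketch. Names and signatures are assumptions, and the stand-in definition of `int8PrepareA` exists only to make the example self-contained (a real build would import it from the separate gemm wasm module):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Stand-in for the function imported from the separate gemm wasm module.
// Signature and behavior here are assumed for illustration.
extern "C" void int8PrepareA(const float* input, float scale, float zeroPoint,
                             int rows, int cols, int8_t* output) {
  for (int i = 0; i < rows * cols; ++i)
    output[i] = static_cast<int8_t>(std::round(input[i] * scale + zeroPoint));
}

// Sketch of the routing described above: both the shifted
// (Int8Shift::PrepareA) and unshifted (Int8::PrepareA) paths funnel into
// the same imported function, always passing 0.0 as the zero point.
inline void prepareAShifted(const float* in, float scale, int rows, int cols,
                            int8_t* out) {
  int8PrepareA(in, scale, /*zeroPoint=*/0.0f, rows, cols, out);
}

inline void prepareAUnshifted(const float* in, float scale, int rows, int cols,
                              int8_t* out) {
  int8PrepareA(in, scale, /*zeroPoint=*/0.0f, rows, cols, out);
}
```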

@kpu
Member

kpu commented Sep 17, 2021

This looks like a bunch of WASM binding code. I'm not sure how to review it.

One comment: whether or not to use the fallback should be atomic for the entirety of the API. Mixing a native PrepareA with a fallback WASM 128-bit multiply won't work because they can have different opaque representations.

Perhaps use the existing CPU-model dispatch mechanism in intgemm.cc for your polyfill? Just sense the added functions, then treat them like a CPU model.

Also, the fallback should be easy to implement by calling the existing function rather than an abort.
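kpu's suggestion of treating the imported functions like a CPU model could be sketched like this. The enum, the probe, and all names are hypothetical, not intgemm's real dispatch code:

```cpp
#include <cassert>

// Treat "browser provides optimized gemm imports" as one more CPU model
// in intgemm-style dispatch.
enum class GemmBackend { Native, WasmImported };

// Pretend probe: a real build would test whether the optimized wasm
// imports were actually linked in (e.g. a flag set from the JS side).
bool wasmGemmAvailable(bool simulatedBrowserSupport) {
  return simulatedBrowserSupport;
}

// Pick the backend once, the same way CPUID dispatch picks an ISA level.
GemmBackend selectBackend(bool simulatedBrowserSupport) {
  return wasmGemmAvailable(simulatedBrowserSupport)
             ? GemmBackend::WasmImported
             : GemmBackend::Native;
}
```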

@abhi-agg
Author

abhi-agg commented Sep 20, 2021

> This looks like a bunch of WASM binding code. I'm not sure how to review it.

That's fine. I guess you can ignore that part. I will add @yurydelendik and @eqrion to review that part of the code.

> One comment: whether or not to use the fallback should be atomic for the entirety of the API. Mixing a native PrepareA with a fallback WASM 128-bit multiply won't work because they can have different opaque representations.

I didn't follow you here. It is a bit unclear to me where I am mixing a native PrepareA with a fallback WASM 128-bit multiply. Could you please point it out? That would be helpful.

As I already mentioned, I made this change for just one function (int8PrepareA) of the matrix-multiply interface to get early feedback from you. I can make similar changes for the rest of the functions of this interface after your feedback.

So, either the native implementation will be called for the entire wasm matrix multiplication interface or the fallback implementation will be called.

Using preprocessor directives, I separated the wasm-target code from the non-wasm-target code for matrix multiplication. The non-wasm targets are unchanged (i.e. they use intgemm directly), while the wasm target now calls the new matrix-multiply interface. This is specifically the part of the code where I am looking for your feedback 👍

@kpu
Member

kpu commented Sep 20, 2021

I think Marian is the wrong layer of abstraction for this. Why not just hide it behind the dispatcher in intgemm, treating the presence of JS intrinsics as a CPU model?

https://github.com/kpu/intgemm/blob/master/intgemm/intgemm.cc

That way you get a matrix multiply library that happens to run faster if there's a JS API. And MXNet will just happen to get a faster matrix multiply too.

PrepareA just quantizes, so it's compatible across architectures anyway. But if, e.g., you had a native PrepareB with CPUID dispatch that picked up 256-bit AVX2 and a WASM Multiply that does 128-bit, you'd get wrong results. The function-level polyfill design you have here carries that risk, whereas something that checks for all functions before deciding what to use does not. That said, I presume Firefox would implement all-or-nothing, so the risk is low.
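The all-or-nothing check kpu describes might look roughly like this sketch (the struct and member names are illustrative, not the real interface):

```cpp
#include <cassert>

// Table of function pointers for the imported gemm interface. In a real
// build these would be filled in from the separate wasm module.
struct GemmApi {
  void* prepareA = nullptr;
  void* prepareB = nullptr;
  void* multiply = nullptr;
};

// Only use the imported implementation if *every* function is present,
// so a native PrepareB is never mixed with an imported Multiply whose
// opaque data layout may differ.
bool useImported(const GemmApi& api) {
  return api.prepareA && api.prepareB && api.multiply;
}
```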

@abhi-agg
Author

abhi-agg commented Sep 20, 2021

> I think Marian is the wrong layer of abstraction for this. Why not just hide it behind the dispatcher in intgemm, treating the presence of JS intrinsics as a CPU model?
>
> https://github.com/kpu/intgemm/blob/master/intgemm/intgemm.cc
>
> That way you get a matrix multiply library that happens to run faster if there's a JS API. And MXNet will just happen to get a faster matrix multiply too.

I am confused now. If I hide everything behind the dispatcher in intgemm, that would mean throwing away the gemm interface we recently designed. There are still some subtle differences between the intgemm interface and the one we recently designed (callbacks being one of them). With these changes in place, it is difficult to abstract at the intgemm level.

Additionally, Marian can link against and work with different gemm libraries (intgemm, fbgemm). Wouldn't Marian be the right place to link in and use another gemm library that runs efficiently on the wasm target? This also keeps open the possibility of using different backend implementations if one doesn't want to link to intgemm at all.

> PrepareA just quantizes, so it's compatible across architectures anyway. But if, e.g., you had a native PrepareB with CPUID dispatch that picked up 256-bit AVX2 and a WASM Multiply that does 128-bit, you'd get wrong results. The function-level polyfill design you have here carries that risk, whereas something that checks for all functions before deciding what to use does not. That said, I presume Firefox would implement all-or-nothing, so the risk is low.

As you also mentioned, Firefox will implement all-or-nothing (we all agreed at the beginning that implementing only a subset of the gemm functions wouldn't give us the performance gain we want). In this scenario, the risk you mentioned does not arise.

On a side note, I have updated this PR to refactor the marian code for the rest of the functions of the gemm interface as well.

@kpu
Member

kpu commented Sep 20, 2021

> If I hide everything behind the dispatcher in intgemm, that would mean throwing away the gemm interface we recently designed. There are still some subtle differences between the intgemm interface and the one we recently designed (callbacks being one of them). With these changes in place, it is difficult to abstract at the intgemm level.

Yes, exactly. Now you understand why I was protesting an unnecessary redesign of the interface. If you want an intgemm interface, you can have it. If you want a matrix-multiply interface that's standards-ready for others to use, implement WebNN. The middle ground does not serve much purpose.

@yurydelendik

Sorry for chiming in late. I don't fully understand many of the concerns mentioned above, probably because different terminology is being used.

> I think Marian is the wrong layer of abstraction for this.

Maybe, or maybe not. The ideal location for the fallback implementation is a separate wasm module in a completely new project, which you would have to start and maintain as a dependency. Keeping the fallback logic in Marian for the first iteration will also save you the time you would otherwise spend dealing with wasm memory instantiation when multiple wasm modules are used.

> Now you understand why I was protesting an unnecessary redesign of the interface.

I don't think we are redesigning the interface. We are building an interface suitable for wasm intrinsics. You will continue to use the existing interface outside the wasm target.

@abhi-agg
Author

>> If I hide everything behind the dispatcher in intgemm, that would mean throwing away the gemm interface we recently designed. There are still some subtle differences between the intgemm interface and the one we recently designed (callbacks being one of them). With these changes in place, it is difficult to abstract at the intgemm level.
>
> Yes, exactly. Now you understand why I was protesting an unnecessary redesign of the interface. If you want an intgemm interface, you can have it. If you want a matrix-multiply interface that's standards-ready for others to use, implement WebNN. The middle ground does not serve much purpose.

As I have stated a few times previously, the current API is not a redesign of intgemm but an interface that is suitable for intrinsics. This mainly involved changing intgemm's callback/function-pointer style to plain arguments. We had to do it because there is currently no good way to express function pointers (@yurydelendik can confirm this).
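The callback-versus-arguments difference described above can be illustrated with two toy stand-ins; neither is the real intgemm or wasm-interface API, they only contrast the two styles:

```cpp
#include <cassert>
#include <cstdint>

// intgemm-style: the result is delivered through a callback, which relies
// on C++ templates/function pointers and cannot cross a wasm import.
template <class Callback>
void multiplyWithCallback(int32_t a, int32_t b, Callback cb) {
  cb(a * b);
}

// wasm-interface style: the result goes to a plain output argument, a
// shape a wasm import can express directly.
extern "C" void multiplyToOutput(int32_t a, int32_t b, int32_t* out) {
  *out = a * b;
}
```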

@abhi-agg abhi-agg changed the title Import matrix-multiply (PrepareA) from a separate wasm module Import matrix-multiply from a separate wasm module Sep 21, 2021
@kpu
Member

kpu commented Sep 21, 2021

> Additionally, Marian can link against and work with different gemm libraries (intgemm, fbgemm). Wouldn't Marian be the right place to link in and use another gemm library that runs efficiently on the wasm target? This also keeps open the possibility of using different backend implementations if one doesn't want to link to intgemm at all.

There are currently two layers of dispatch: Marian and CPUID in intgemm.

In Marian, you manually configure which gemm library (intgemm or fbgemm) you want via the Options object, and they have peculiar behaviors, like whether they handle the output matrix. There is currently no automatic dispatch mechanism that switches between the two.

intgemm favors CPUID dispatch, but one can call the routines directly via the class for that CPUID. The templates would admittedly force the existence of functions you haven't implemented (and don't need for our particular case).

This appears to insert a third layer in between, with dispatch controlled by a compiler predefine. Is there any means of sensing browser support, or will it always be two different builds? If yes, is WASM the correct define for this switch?

To give a different example: with the current wormhole instruction, I could (but have not gotten around to it) run the wormhole instruction, inspect the output, and then use that as a CPUID of sorts to dispatch different intgemm implementations. Is there an analogous way to sense the presence of an optimized implementation here, or is it always two different builds?

@andrenatal

Hi @kpu, we think we should have a meeting to clarify these issues. I'm sending you an email to schedule it. Thanks.

@abhi-agg
Author

abhi-agg commented Sep 27, 2021

@kpu @XapaJIaMnu This PR is up for review. I made all the changes as per our meeting and updated the description of this PR to summarize those changes 👍

@abhi-agg
Author

abhi-agg commented Oct 6, 2021

@kpu @XapaJIaMnu Could you folks please review the PR?

Comment thread src/tensors/cpu/intgemm_interface.h Outdated
Comment thread src/tensors/cpu/intgemm_interface.h
Comment thread src/tensors/cpu/intgemm_interface.h Outdated
Member

@kpu kpu left a comment


Overall I think the principle is there, good job.

There are places where code was copied and it looks like a narrower ifdef would have avoided copying code.

@kpu
Member

kpu commented Oct 11, 2021

@abhi-agg We good? I think the changes should be quick.

@abhi-agg
Author

> There are places where code was copied and it looks like a narrower ifdef would have avoided copying code.

@kpu Actually, I used a narrower ifdef in the beginning, but it caused a compilation failure on Windows. So I had to do it this way.

@abhi-agg abhi-agg requested a review from kpu October 18, 2021 08:44
…ically

  - The fallback implementations of these functions are called
    successfully
  - Abort for all other configurations
    - Int8::PrepareA and Int8::Multiply are not supported
    - gemm operations for 16-bit operands are not supported
  - The flags are link-time flags only, not compile-time flags
@kpu
Member

kpu commented Oct 18, 2021

Can you be more specific about the problem with Windows? Something to do with an unused typedef?

  #if defined(WASM)
    return {NodeOp(
      quantMult_ = *child(1)->val()->data();
    // snip
  #else
    typedef typename intgemm_<vtype>::type Integer;
    return {NodeOp(
      quantMult_ = *child(1)->val()->data();
      typedef typename intgemm_<vtype>::type Integer;

@abhi-agg
Author

abhi-agg commented Oct 18, 2021

> Can you be more specific about the problem with Windows? Something to do with an unused typedef?

The error is not related to an unused typedef. It is the following:
error C2121: '#': invalid character: possibly the result of a macro expansion

I found the explanation here: https://docs.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2121?view=msvc-160

So the issue is using a preprocessor directive inside the NodeOp macro expansion. I believe that is why even the COMPILE_CPU conditional has a corresponding empty else part.

I had found https://stackoverflow.com/questions/28187619/using-preprocessor-conditionals-in-macro-functions-with-double-parentheses, but that solution doesn't really solve the code-duplication problem.

If you have some suggestion then I am all ears 👍

@kpu
Member

kpu commented Oct 18, 2021

NodeOp is a very simple macro.

#define NodeOp(op) [=]() { op; }

Just expand it then use ifdefs inside.
https://github.com/marian-nmt/marian-dev/blob/c29cc83dc4d234f0e3a00a46a729053132b408b8/src/graph/chainable.h#L11

(And don't feel bad about abstraction, there are already places in the code doing that)
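Applied to the snippet discussed above, kpu's fix might look like this minimal sketch: the lambda is written out by hand so the `#if` sits inside the lambda body rather than inside a macro argument. The function name and the simplified body are assumptions; only the `NodeOp` definition and the `WASM` define come from the discussion:

```cpp
#include <cassert>
#include <functional>

// NodeOp is just a lambda wrapper: #define NodeOp(op) [=]() { op; }
// Expanding it by hand lets the preprocessor conditional live inside the
// lambda body, which MSVC accepts (error C2121 only occurs when '#'
// appears inside a macro argument).
std::function<void()> makeForwardOp(float* quantMult, const float* childVal) {
  return [=]() {
    *quantMult = *childVal;  // shared between both branches, written once
#if defined(WASM)
    // wasm-specific path would go here
#else
    // native path (e.g. the intgemm typedef and calls) would go here
#endif
  };
}
```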

@abhi-agg abhi-agg force-pushed the import-wasm-gemm-dynamically branch from b51654a to 0b72e37 Compare October 18, 2021 15:31
@abhi-agg
Author

@kpu I have pushed the new commits to narrow the ifdef scopes 👍

@abhi-agg abhi-agg merged commit a1a82ff into browsermt:master Oct 18, 2021
@abhi-agg abhi-agg deleted the import-wasm-gemm-dynamically branch November 1, 2021 10:52
jerinphilip added a commit to jerinphilip/marian that referenced this pull request Mar 3, 2022
jerinphilip added a commit to jerinphilip/marian that referenced this pull request Mar 9, 2022
Provides an ARM backend for matrix multiplies using google/ruy, and math
functions through simde (https://simd-everywhere.github.io/blog/about/),
effectively getting marian-decoder to run on ARM.

The following cmake flags are added:

- USE_INTGEMM (to switch intgemm on/off)
- USE_RUY (to switch ruy on/off)
- USE_ONNX_SGEMM (use the onnx sgemm added by wasm to provide the attention
  matrix multiply, which currently relies on a BLAS library)
- USE_SIMDE (swaps out existing Intel-based functions by using SIMDE
  instead)

The built marian-decoder is tested on an Oracle Cloud ARM Machine with
the following specs:

  Architecture   : aarch64
  CPU op-mode(s) : 32-bit, 64-bit
  Byte Order     : Little Endian
  Vendor ID      : ARM
  Model name     : Neoverse-N1
  Flags          : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
                   asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

A CI check on GitHub Actions is added to cross-compile with the android-ndk,
targeting arm64-v8a. The built binary is tested to work on an Android
phone using termux (Samsung M30s).

A successful Android build additionally requires a patch (sentencepiece ->
protobuf). See opencv/opencv#17282 and
opencv/opencv#19049.

-Werror etc. cause issues with ruy (-Wmulti-line-comment) and are
disabled.

The following minor changes are also applied:

 - Remove M32_BINARIES; use COMPILE_WASM for -m32
 - Hide -msse4.1 if the platform is unknown
 - faiss was previously hardcoded for platforms with SSE available. This
   has been mitigated by adding a reference standard C++ implementation of
   the missing function.
 - Exclude packed_gemm_....cpp from sources if USE_FBGEMM=off
 - MSVC workaround following
   browsermt#56 (comment)
