Fixes big endian arch bugs. by abalib · Pull Request #26383 · pytorch/pytorch

abalib · 2019-09-18T01:23:04Z

serialization.cpp fails on big-endian machines.
This patch fixes the endianness bugs and also makes PyTorch
model files portable across architectures with different byte orders:
a model file generated on x86 can be read on s390x.

The first problem is that serialization.cpp does not convert the "size"
value of the storage elements to the native byte order.
As a result, torch.load raises a RuntimeError
(see the stack trace below).

The second problem is that when the model is read from storage (doRead),
the values are decoded to little endian, which is the wrong order
on a big-endian machine. The decode should target
THP_nativeByteOrder() instead
(see the model dumps below).
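
The two fixes described above can be illustrated with a small Python sketch. This is not the code in serialization.cpp: the record layout (an 8-byte little-endian element count followed by float32 data) and the name `read_float_storage` are assumptions made only for illustration. The point is that both the size header and the payload are converted from the file's byte order to the host's.

```python
import io
import struct


def read_float_storage(f, storage_size):
    """Read one hypothetical storage record: int64 count, then float32 data."""
    # Fix 1: the on-disk count is little-endian. "<q" decodes it to a native
    # Python int regardless of the host byte order.
    (count,) = struct.unpack("<q", f.read(8))
    if count != storage_size:
        # Wording mirrors the error message shown in the stack trace.
        raise RuntimeError(
            f"storage has wrong size: expected {count} got {storage_size}"
        )
    # Fix 2: decode the payload from little-endian into native floats.
    payload = f.read(4 * count)
    return list(struct.unpack(f"<{count}f", payload))


# Build a little-endian record in memory, as a writer on x86 would produce it.
values = [0.5, -1.25, 3.0]
record = struct.pack("<q", len(values)) + struct.pack(f"<{len(values)}f", *values)

loaded = read_float_storage(io.BytesIO(record), storage_size=len(values))
print(loaded)
```

Because the format strings name the file's byte order explicitly, the same reader gives the same result on little-endian and big-endian hosts.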

File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 422, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 616, in _load
deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 2305843009213693952 got 32
	(the very long number is actually 32 in the wrong endianness)
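
The size in the error message can be checked by hand. The sketch below assumes the size is stored as a 64-bit little-endian integer and shows what happens when those eight bytes are read as big-endian without conversion:

```python
import struct

# 32 written as a 64-bit little-endian integer, as on x86.
raw = struct.pack("<q", 32)

# Reading the same eight bytes as big-endian moves the 0x20 byte
# to the most significant position.
misread = struct.unpack(">q", raw)[0]

print(misread)            # 2305843009213693952
print(misread == 2**61)   # True: 0x20 << 56 == 2**5 * 2**56
```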

Model file load on x86 (correct output)

>>> torch.load('400f2k_best.model', map_location=torch.device("cpu"))
{'epoch': 24, 'model_type': 'emb_aec', 'classifier_model': OrderedDict([('model.0.weight', tensor([[ 2.4608e-01, -1.1174e-01, -1.0854e-01,  4.0124e-01, -1.5261e-02,
         -1.2206e-01,  1.3229e-01, -1.2615e-01, -5.2773e-01,  2.6333e-01,
         -3.1462e-03, -1.4902e-01,  9.8545e-02, -1.5789e-01, -2.2625e-01,
         -1.0776e-01, -9.0895e-02, -3.8530e-01,  9.1152e-01, -3.9720e-01,
         -8.5848e-01, -4.7837e-02, -1.5178e-01,  8.5023e-02,  1.5013e-01,
         -9.9294e-02, -2.7422e-01, -4.3986e-01, -4.4297e-01, -3.9570e-01,

Model file load on s390x (wrong endianness; notice the exponents)

>>> torch.load( "400f2k_best.model", map_location=torch.device("cpu"))
{'epoch': 24, 'model_type': 'emb_aec', 'classifier_model': OrderedDict([('model.0.weight', tensor([[ 9.2780e+21, -9.7722e-11,  4.1350e+33,  7.782e+34,  4.2056e-31,
          9.0784e+18,  1.1846e-32,  3.3320e-32, -4.8288e-28, -7.2679e+12,
          1.5379e-16, -5.2604e+12, -4.7240e+17,  4.6092e-21, -1.8360e-20,
         -2.7712e-31,  1.4548e-16, -2.5089e-27,  7.9094e-10,  7.1977e+34,
          1.1930e+26,  8.4536e+15,  2.7757e+23, -5.8455e-10, -1.5611e+09,
         -1.1311e-23,  6.6451e+19, -2.0970e+20,  3.4878e-19, -1.0857e-12,
          7.8098e+22,  5.3998e-35],
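
The implausible exponents are what byte-swapped float32 values typically look like. As a rough illustration (it does not reproduce the exact numbers in the dump above), the sketch below packs a few of the x86 weights as little-endian float32 and then interprets the same bytes in both byte orders:

```python
import struct

# First three weights from the x86 output.
weights = [0.24608, -0.11174, -0.10854]
raw = struct.pack("<3f", *weights)

correct = struct.unpack("<3f", raw)   # bytes read in the order they were written
swapped = struct.unpack(">3f", raw)   # same bytes read in the opposite order

for w, c, s in zip(weights, correct, swapped):
    print(f"written {w:+.5f}  little-endian read {c:+.5f}  big-endian read {s:+.4e}")
```

The little-endian read recovers the weights up to float32 rounding, while the big-endian read yields values with unrelated magnitudes.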


abalib · 2019-09-18T01:28:27Z

@geert56 FYI

resistor · 2019-09-19T00:03:31Z

Does this fix an existing functional test on your big endian target, or should we add a dedicated test for it?

abalib · 2019-09-19T01:07:10Z

> Does this fix an existing functional test on your big endian target, or should we add a dedicated test for it?

@resistor It fixes an existing functional test, so I don't think we need to add a new one. The fix works as expected on our system. We also tested it on x86 (little endian) to make sure we didn't break anything.

torch/csrc/generic/serialization.cpp

resistor

Approved modulo one nit.


facebook-github-bot

@resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2019-09-20T04:08:01Z

This pull request has been merged in afa5d08.

pytorchbot added the "module: internals" label (related to internal abstractions in c10 and ATen) Sep 18, 2019

ezyang added the open source label Sep 18, 2019

resistor self-requested a review September 19, 2019 00:18

resistor reviewed Sep 19, 2019

resistor approved these changes Sep 19, 2019

Commit 4dcc2a2: Do not initialize nsize, so that ASan will correctly detect erroneous usage if it ever arises.

facebook-github-bot reviewed Sep 19, 2019

facebook-github-bot closed this in afa5d08 Sep 20, 2019

facebook-github-bot added the merged label Sep 20, 2019

mruberry added the Merged label Oct 28, 2020

abalib deleted the s390x/big-endian-fix branch July 25, 2025 13:03

abalib wants to merge 2 commits into pytorch:master from abalib:s390x/big-endian-fix

6 participants
