Skip to content

TST: Reproducer for SkylakeX Issue gh-13401#13466

Closed
tylerjereddy wants to merge 4 commits intonumpy:masterfrom
tylerjereddy:intel-mac-sde-azure
Closed

TST: Reproducer for SkylakeX Issue gh-13401#13466
tylerjereddy wants to merge 4 commits intonumpy:masterfrom
tylerjereddy:intel-mac-sde-azure

Conversation

@tylerjereddy
Copy link
Contributor

CI-based Intel SDE (SkylakeX emulation) reproducer for #13401.

Many thanks to @isuruf and @martin-frbg for help upstream. For gory details: OpenMathLib/OpenBLAS#2108

This only works for Mac because our MacPython (ecosystem) Linux OpenBLAS was built via maylinux1 containers where the gcc version is not high enough to empower the AVX switch that triggers the problem in OpenBLAS. So, from the standpoint of our recently-released wheels, Linux should be ok for this matter, consistent with pip install reports in the original issue only failing on Mac.

conda is another story--there you can induce the failure on Linux by using recent OpenBLAS built outside manylinux1 constraints as @isuruf observed today.

This is really just the beginning of something that is going somewhere (?) and perhaps not in NumPy repo proper. When BLIS does this they go through a for loop of emulation archs: https://github.com/flame/blis/blob/master/travis/do_sde.sh

Perhaps the short-term objective here is simply a public CI run that fails because of the reported OpenBLAS issue, and provides a platform for us to try the latest OpenBLAS binary (build in our MacPython ecosystem) to see if it helps for that specific issue.

Plans beyond that are perhaps contentious, as discussed in #13401, but maybe seeing this will help drive that discussion in one direction or another. One problem upstream is that there may not be a maintainer with sufficient OpenBLAS repo access rights to set up new CI for Azure, etc. But they are open to the idea, it seems.

* draft of new test and Azure CI job
to detect the OpenBLAS 0.3.5.dev issue
on Mac versions of NumPy 1.16.3 wheel

* use Intel SDE to emulate / force SkylakeX
usage by OpenBLAS in Azure job
@tylerjereddy
Copy link
Contributor Author

The expected / desired behavior, if / when I try the newer OpenBLAS binary on our rackspace server in this PR, is likely just that the SkylakeX instructions are not used and there's a fallback to an older arch along with test passing, and NOT that the AVX instruction stuff is fixed, since that appears to be the case for the temporary fix upstream.

I also don't know if test_sdot_bug_8577 would always fail under emulation, or if it is actually failing because of recent / pertinent OpenBLAS issue.

* use stricter threshold for test_issue_13401
and take absolute value of mean result

* skip a failing sdot bug test that is not
currently compatible with Intel SDE emulation

* switch to "full" linalg test suite for Intel SDE
Azure CI job
Copy link
Contributor Author

@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised with another commit based on feedback & what I thought was still needed here.

If this works as intended, CI should be all green except for 1 failure in the SDE Azure job for the issue we are concerned with.

The next step, if that happens, would be to bump the version of MacPython OpenBLAS used in CI (probably for all matrix entries, but especially for the SDE). If that solves the issue, great, otherwise the are already more patches related to SkylakeX stuff upstream & also a C-level reproducer, though not in their CI yet I don't think.

pytest.skip('Numpy xerbla not linked in.')


@pytest.mark.skip(reason='gh-13478')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate, but I couldn't manage to get i.e., an environment variable to pass through the SDE for selective skipping in the CI job.

It would also be useful if I could use runtests.py -s "linalg" -- -k "not test_sdot_bug_8577", but I don't think mixing and matching test selection between runtests.py and the args fed to -- (pytest) is fool-proof.

A further complication is the potential special meaning of -- after the emulation command, and not sure how well emulation works with pytest-xdist if we want to use many cores...

* bump the version of MacPython ecosystem
OpenBLAS used in Azure CI to see if the SkylakeX
issue is resolved for Intel SDE job, while expecting
all other Azure entries to also pass CI with latest
ecosystem OpenBLAS
@tylerjereddy
Copy link
Contributor Author

Since the CI did indeed produce only the single failure related to #13401 under emulation, I've now pushed the commit that bumps OpenBLAS to v0.3.7.dev in all Azure CI entries. Note that the filename from the MacPython ecosystem is deceptive to the true version number, but we have the hash in the filename & the ctypes version check in some CI entries.

The result on my fork, which I suspect we'll see here again as well--is that CI will be all-green for everything except the emulation case, which actually gets worse with the upstream patch on Skylake & produces 15 fails intead of 1 in linalg suite.

That probably means we need to consider another round of MacPython openblas builds with recent OpenBLAS again--someone else should feel free to do that if they want, otherwise I'll get back to it at some point. OpenMathLib/OpenBLAS#2111 may help

Once there's a working patch for SDE CI entry, I'll probably have to replicate some of this effort downstream in SciPy before I can justify the OpenBLAS churn there as well.

@martin-frbg
Copy link

It is unclear to me what you are doing here - is that report about 13401 still failing and an additional 15 failures based on the full OpenBLAS PR2111 as merged in current OpenBLAS-develop or on its earlier state with only one of the two SkylakeX xCOPY kernels disabled ?

@tylerjereddy
Copy link
Contributor Author

@martin-frbg It is an earlier state with only one of the kernels disabled, so your most recent fixes aren't included in the OpenBLAS used here. Some people had speculated that the earlier fix might be sufficient, but as you noted there was certainly no guarantee of that, and this demonstrates that indeed it was premature to assume that that fix was sufficient for NumPy.

@tylerjereddy
Copy link
Contributor Author

I've tried bumping OpenBLAS to latest develop branch commit again in MacPython/openblas-libs#3 -- if that builds then I'll try again here later with the newer version

* another bump of Azure CI OpenBLAS version
as we attempt to incorporate appropriate patches
for SkylakeX-related linalg issues
@tylerjereddy
Copy link
Contributor Author

Testing on my fork / Azure suggests that the most recent MacPython ecosystem build of OpenBLAS is now passing linalg tests, including the new one here, with NumPy under SkylakeX emulation. I've pushed the most recent OpenBLAS version bump to this PR for reviewer confirmation.

I think this means we may be in good shape for wheels for NumPy again, and probably for SciPy too (though I should likely replicate this effort downstream to be sure).

@tylerjereddy tylerjereddy marked this pull request as ready for review May 7, 2019 17:05
@tylerjereddy tylerjereddy reopened this May 7, 2019
@tylerjereddy
Copy link
Contributor Author

Alright, I've lifted the draft status since Azure results look good with most recent OpenBLAS + SkylakeX emulation. Whether this belongs in NumPy proper is not a debate I'm super-interested in though--should go somewhere in the ecosystem.

For now, this allows me to move forward with other things--I've preserved the git history for the CI results demo of fail -> pass, for now anyway. They get purged for Azure in 30 days, I think.

@charris charris changed the title TST: reproducer for Issue 13401 TST: Reproducer for skylake Issue gh-13401 May 11, 2019
@charris charris changed the title TST: Reproducer for skylake Issue gh-13401 TST: Reproducer for SkylakeX Issue gh-13401 May 11, 2019
@charris
Copy link
Member

charris commented May 11, 2019

This check is moving to OpenBLAS, so closing.

@charris charris closed this May 11, 2019
@tylerjereddy tylerjereddy deleted the intel-mac-sde-azure branch May 15, 2019 21:51
@tylerjereddy
Copy link
Contributor Author

The regression guard is now in place in upstream CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants