TST: Reproducer for SkylakeX Issue gh-13401#13466
TST: Reproducer for SkylakeX Issue gh-13401#13466tylerjereddy wants to merge 4 commits intonumpy:masterfrom
Conversation
* draft of new test and Azure CI job to detect the OpenBLAS 0.3.5.dev issue on Mac versions of NumPy 1.16.3 wheel * use Intel SDE to emulate / force SkylakeX usage by OpenBLAS in Azure job
|
The expected / desired behavior, if / when I try the newer OpenBLAS binary on our rackspace server in this PR, is likely just that the SkylakeX instructions are not used and there's a fallback to an older arch along with test passing, and NOT that the AVX instruction stuff is fixed, since that appears to be the case for the temporary fix upstream. I also don't know if |
* use stricter threshold for test_issue_13401 and take absolute value of mean result * skip a failing sdot bug test that is not currently compatible with Intel SDE emulation * switch to "full" linalg test suite for Intel SDE Azure CI job
tylerjereddy
left a comment
There was a problem hiding this comment.
Revised with another commit based on feedback & what I thought was still needed here.
If this works as intended, CI should be all green except for 1 failure in the SDE Azure job for the issue we are concerned with.
The next step, if that happens, would be to bump the version of MacPython OpenBLAS used in CI (probably for all matrix entries, but especially for the SDE). If that solves the issue, great, otherwise the are already more patches related to SkylakeX stuff upstream & also a C-level reproducer, though not in their CI yet I don't think.
| pytest.skip('Numpy xerbla not linked in.') | ||
|
|
||
|
|
||
| @pytest.mark.skip(reason='gh-13478') |
There was a problem hiding this comment.
This is unfortunate, but I couldn't manage to get i.e., an environment variable to pass through the SDE for selective skipping in the CI job.
It would also be useful if I could use runtests.py -s "linalg" -- -k "not test_sdot_bug_8577", but I don't think mixing and matching test selection between runtests.py and the args fed to -- (pytest) is fool-proof.
A further complication is the potential special meaning of -- after the emulation command, and not sure how well emulation works with pytest-xdist if we want to use many cores...
* bump the version of MacPython ecosystem OpenBLAS used in Azure CI to see if the SkylakeX issue is resolved for Intel SDE job, while expecting all other Azure entries to also pass CI with latest ecosystem OpenBLAS
|
Since the CI did indeed produce only the single failure related to #13401 under emulation, I've now pushed the commit that bumps OpenBLAS to v0.3.7.dev in all Azure CI entries. Note that the filename from the MacPython ecosystem is deceptive to the true version number, but we have the hash in the filename & the ctypes version check in some CI entries. The result on my fork, which I suspect we'll see here again as well--is that CI will be all-green for everything except the emulation case, which actually gets worse with the upstream patch on Skylake & produces 15 fails intead of 1 in linalg suite. That probably means we need to consider another round of MacPython openblas builds with recent OpenBLAS again--someone else should feel free to do that if they want, otherwise I'll get back to it at some point. OpenMathLib/OpenBLAS#2111 may help Once there's a working patch for SDE CI entry, I'll probably have to replicate some of this effort downstream in SciPy before I can justify the OpenBLAS churn there as well. |
|
It is unclear to me what you are doing here - is that report about 13401 still failing and an additional 15 failures based on the full OpenBLAS PR2111 as merged in current OpenBLAS-develop or on its earlier state with only one of the two SkylakeX xCOPY kernels disabled ? |
|
@martin-frbg It is an earlier state with only one of the kernels disabled, so your most recent fixes aren't included in the OpenBLAS used here. Some people had speculated that the earlier fix might be sufficient, but as you noted there was certainly no guarantee of that, and this demonstrates that indeed it was premature to assume that that fix was sufficient for NumPy. |
|
I've tried bumping OpenBLAS to latest develop branch commit again in MacPython/openblas-libs#3 -- if that builds then I'll try again here later with the newer version |
* another bump of Azure CI OpenBLAS version as we attempt to incorporate appropriate patches for SkylakeX-related linalg issues
|
Testing on my fork / Azure suggests that the most recent MacPython ecosystem build of OpenBLAS is now passing linalg tests, including the new one here, with NumPy under SkylakeX emulation. I've pushed the most recent OpenBLAS version bump to this PR for reviewer confirmation. I think this means we may be in good shape for wheels for NumPy again, and probably for SciPy too (though I should likely replicate this effort downstream to be sure). |
|
Alright, I've lifted the draft status since Azure results look good with most recent OpenBLAS + SkylakeX emulation. Whether this belongs in NumPy proper is not a debate I'm super-interested in though--should go somewhere in the ecosystem. For now, this allows me to move forward with other things--I've preserved the git history for the CI results demo of fail -> pass, for now anyway. They get purged for Azure in 30 days, I think. |
|
This check is moving to OpenBLAS, so closing. |
|
The regression guard is now in place in upstream CI. |
CI-based Intel SDE (SkylakeX emulation) reproducer for #13401.
Many thanks to @isuruf and @martin-frbg for help upstream. For gory details: OpenMathLib/OpenBLAS#2108
This only works for Mac because our MacPython (ecosystem) Linux OpenBLAS was built via
maylinux1containers where the gcc version is not high enough to empower the AVX switch that triggers the problem in OpenBLAS. So, from the standpoint of our recently-released wheels, Linux should be ok for this matter, consistent with pip install reports in the original issue only failing on Mac.condais another story--there you can induce the failure on Linux by using recent OpenBLAS built outside manylinux1 constraints as @isuruf observed today.This is really just the beginning of something that is going somewhere (?) and perhaps not in NumPy repo proper. When BLIS does this they go through a for loop of emulation archs: https://github.com/flame/blis/blob/master/travis/do_sde.sh
Perhaps the short-term objective here is simply a public CI run that fails because of the reported OpenBLAS issue, and provides a platform for us to try the latest OpenBLAS binary (build in our MacPython ecosystem) to see if it helps for that specific issue.
Plans beyond that are perhaps contentious, as discussed in #13401, but maybe seeing this will help drive that discussion in one direction or another. One problem upstream is that there may not be a maintainer with sufficient OpenBLAS repo access rights to set up new CI for Azure, etc. But they are open to the idea, it seems.