
cmov: implement constant-time equality comparisons#873

Merged
tarcieri merged 11 commits into RustCrypto:master from brxken128:cmoveq
Apr 1, 2023

Conversation

@brxken128
Contributor

@brxken128 brxken128 commented Mar 26, 2023

Following on from #872, this is very much a WIP, and a lot of the implementation is sub-par.

I cleaned up the macro usage, although it seemed intentionally designed that way, so I can revert those changes if need be.

I'm not 100% sure about the portable implementation, nor am I sure that EOR is the most suitable instruction for AArch64 (but it works).

Additionally, I think we should note that direct usage won't provide ideal results - I believe the compiler will still attempt to optimise in some places, although I'm not 100% sure about that. Using this crate within subtle (or similar) sounds the most suitable approach to me.

Adding benches via dudect_bencher could be very handy as a sanity check (but I'd be happy to hear any ideas regarding that).

Edit: in hindsight, I'm not sure that dereferencing within the macros is the best approach either (implementation-wise).

Feel free to rip this work to shreds - it's best that we get it right and make no mistakes 🙏

@tarcieri
Member

tarcieri commented Mar 26, 2023

Looks great at first glance. I can take a more detailed look later.

I believe the compiler will still attempt to optimise in some places, although I'm not 100% sure about that. Using this crate within subtle (or similar) sounds the most suitable approach to me.

The portable implementation could borrow subtle's trick of using a volatile write as an optimization barrier. It's not a guarantee but it's the best thing you can do without ASM for now, and we already have better ASM implementations where ASM is available.
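As a hedged sketch of that trick (illustrative names, not subtle's actual code): a volatile operation makes the value opaque to the optimizer, so a branch-free equality check can't be rewritten back into a branch.

```rust
// Editorial sketch of a volatile-access optimization barrier. This is
// best-effort, not a guarantee; `barrier` and `ct_eq_u32` are illustrative
// names, not the crate's or subtle's real API.
#[inline(never)]
fn barrier(input: u8) -> u8 {
    // A volatile read prevents the compiler from reasoning about the value,
    // so the surrounding constant-time logic can't be turned into a branch.
    unsafe { core::ptr::read_volatile(&input) }
}

/// Portable, branch-free equality: 1 if `a == b`, else 0.
fn ct_eq_u32(a: u32, b: u32) -> u8 {
    let x = a ^ b; // zero iff equal
    // For any nonzero x, `x | x.wrapping_neg()` has its top bit set.
    let nonzero = ((x | x.wrapping_neg()) >> 31) as u8; // 1 if a != b
    barrier(1u8.wrapping_sub(nonzero))
}

fn main() {
    assert_eq!(ct_eq_u32(5, 5), 1);
    assert_eq!(ct_eq_u32(5, 6), 0);
}
```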

Though I agree finding some way to get all of this work upstream into subtle seems good.

@brxken128
Contributor Author

Looks great at first glance. I can take a more detailed look later.

Thank you! I'll take a look into the volatile write, and also take a look at how subtle is structured, primarily to see how large of a task it'd be.

I don't assume it'd be overly difficult, and it seems very beneficial so I'm all for it.

@brxken128
Contributor Author

I have some very promising news!

Using the x86 implementation in the PR, along with dudect_bencher, here are the results for cmoveq and cmovne (with 100 million measurements):

running 10 benches
bench cmoveq_u128 seeded with 0xc2eac874bbfff9f8
bench cmoveq_u128 ... : n == +100.000M, max t = -0.76411, max tau = -0.00008, (5/tau)^2 = 4281830469
bench cmoveq_u16  seeded with 0x1d448ef882317e00
bench cmoveq_u16  ... : n == +100.000M, max t = +1.25790, max tau = +0.00013, (5/tau)^2 = 1579961129
bench cmoveq_u32  seeded with 0x616eabe391b02a50
bench cmoveq_u32  ... : n == +100.000M, max t = -1.85403, max tau = -0.00019, (5/tau)^2 = 727287432
bench cmoveq_u64  seeded with 0x944c16dd3a738d52
bench cmoveq_u64  ... : n == +100.000M, max t = -2.05989, max tau = -0.00021, (5/tau)^2 = 589188106
bench cmoveq_u8   seeded with 0xeae6f3b6361779ca
bench cmoveq_u8   ... : n == +99.889M, max t = -0.87249, max tau = -0.00009, (5/tau)^2 = 3280448987
bench cmovne_u128 seeded with 0x1b08ea492f9608de
bench cmovne_u128 ... : n == +99.568M, max t = +1.84216, max tau = +0.00018, (5/tau)^2 = 733503522
bench cmovne_u16  seeded with 0x48573db461a7b65b
bench cmovne_u16  ... : n == +100.000M, max t = +0.70990, max tau = +0.00007, (5/tau)^2 = 4960715094
bench cmovne_u32  seeded with 0x98279b1b4c577fed
bench cmovne_u32  ... : n == +100.000M, max t = +1.49098, max tau = +0.00015, (5/tau)^2 = 1124598322
bench cmovne_u64  seeded with 0x700c4d546f7f329d
bench cmovne_u64  ... : n == +99.831M, max t = +1.11933, max tau = +0.00011, (5/tau)^2 = 1991996799
bench cmovne_u8   seeded with 0xd04b925e2035c828
bench cmovne_u8   ... : n == +38.389M, max t = +1.71178, max tau = +0.00028, (5/tau)^2 = 327532597

Here's the benchmark. Obviously my methodology could be flawed, or dudect_bencher itself could be, but I don't yet have a reason to doubt the accuracy.

I'm unsure why cmovne_u8 didn't complete the full cycle - I will re-run it and include it here in a moment.

@tarcieri
Member

FWIW, Intel guarantees the CMOV family of instructions is constant-time on all extant CPUs, and the use of inline assembly should ensure LLVM does not alter the emitted assembly code.

@brxken128
Contributor Author

FWIW, Intel guarantees the CMOV family of instructions is constant-time on all extant CPUs, and the use of inline assembly should ensure LLVM does not alter the emitted assembly code.

I recall seeing that in the crate docs, but it never hurts to validate the implementation, I suppose (even if just as a sanity check). I wonder how this will behave on AArch64.

The portable implementation could borrow subtle's trick of using a volatile write as an optimization barrier.

Which part of the portable implementation were you referring to here? I had a look but couldn't quite find where it'd fit in.

On another note, I ran the benches using the portable backend, and it also looks promising. It's not collecting the full 100M measurements here either - I'm not too sure what's going on, but I assume there's enough to be reliable.

running 10 benches
bench cmoveq_u128 seeded with 0x395c16b9e516050e
bench cmoveq_u128 ... : n == +100.000M, max t = +1.55722, max tau = +0.00016, (5/tau)^2 = 1030960941
bench cmoveq_u16  seeded with 0x8d3f1cb6f23e4d46
bench cmoveq_u16  ... : n == +40.532M, max t = -1.28140, max tau = -0.00020, (5/tau)^2 = 617124023
bench cmoveq_u32  seeded with 0xf0c6688cbbebefd5
bench cmoveq_u32  ... : n == +41.231M, max t = +1.26486, max tau = +0.00020, (5/tau)^2 = 644286082
bench cmoveq_u64  seeded with 0x968a9fa0ea121e3b
bench cmoveq_u64  ... : n == +39.363M, max t = +0.34390, max tau = +0.00005, (5/tau)^2 = 8320645836
bench cmoveq_u8   seeded with 0x47c5285a05d1159e
bench cmoveq_u8   ... : n == +100.000M, max t = -2.00943, max tau = -0.00020, (5/tau)^2 = 619149459
bench cmovne_u128 seeded with 0xbf87b34e478c30ce
bench cmovne_u128 ... : n == +100.000M, max t = +1.18042, max tau = +0.00012, (5/tau)^2 = 1794170662
bench cmovne_u16  seeded with 0x8c8925270439a99b
bench cmovne_u16  ... : n == +100.000M, max t = -1.31689, max tau = -0.00013, (5/tau)^2 = 1441593334
bench cmovne_u32  seeded with 0x77bc338505a0dcb9
bench cmovne_u32  ... : n == +100.000M, max t = +0.63647, max tau = +0.00006, (5/tau)^2 = 6171382062
bench cmovne_u64  seeded with 0xda022ec5135c349b
bench cmovne_u64  ... : n == +41.537M, max t = +1.01576, max tau = +0.00016, (5/tau)^2 = 1006469926
bench cmovne_u8   seeded with 0xbf6e437458bd8a00
bench cmovne_u8   ... : n == +100.000M, max t = -1.27892, max tau = -0.00013, (5/tau)^2 = 1528455770

@brxken128
Contributor Author

brxken128 commented Mar 31, 2023

I was just adding #[inline(always)] to the functions, and the x86 ones seem to break once inlined - very odd.

@tarcieri
Member

@brxken128 is there a minimal repro of that? I'd like to take a look in Godbolt

@brxken128
Contributor Author

brxken128 commented Mar 31, 2023

@brxken128 is there a minimal repro of that? I'd like to take a look in Godbolt

This should do it:

use std::arch::asm;

macro_rules! cmov_eq {
    ($instruction:expr, $lhs:expr, $rhs:expr, $condition:expr, $dst:expr) => {
        let mut tmp = *$dst as u16;
        unsafe {
            asm! {
                "xor {0:e}, {1:e}",
                $instruction,
                in(reg) *$lhs,
                in(reg) *$rhs,
                inlateout(reg) tmp,
                in(reg) $condition as u16,
                options(pure, nomem, nostack),
            };
        };
        *$dst = tmp as u8;
    };
}

pub fn cmovne_u32(src: &u32, rhs: &u32, input: u8, output: &mut u8) {
    cmov_eq!("cmovnz {2:e}, {3:e}", src, rhs, input, output);
}

pub fn main() {
    let mut dst = 1u8;
    cmovne_u32(&1, &2, 0u8, &mut dst);
}

Inlining cmovne_u32 seems to break (at least for me locally), although I'm not quite sure why.
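One detail worth flagging in the repro above (an editorial note, hedged): the template `xor {0:e}, {1:e}` writes to operand 0, which is declared `in(reg)`. Rust's inline-assembly rules forbid modifying `in` operands, so this is undefined behavior of exactly the kind that can stay hidden until inlining changes register allocation (and, notably, `cmp` does not write its destination operand). A minimal sketch with that operand declared `inout` on a scratch copy instead (the function name and the non-x86 fallback are illustrative, not the PR's code):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::asm;

/// Illustrative fix: `xor` writes its destination, so that operand is an
/// `inout(reg)` scratch copy of `lhs` rather than a plain `in(reg)`.
#[cfg(target_arch = "x86_64")]
pub fn cmovne_u32_sketch(lhs: &u32, rhs: &u32, input: u8, output: &mut u8) {
    let mut scratch = *lhs;
    let mut tmp = *output as u16;
    unsafe {
        asm!(
            "xor {0:e}, {1:e}",    // sets ZF iff *lhs == *rhs; writes {0}
            "cmovnz {2:e}, {3:e}", // move `input` into `tmp` when not equal
            inout(reg) scratch,
            in(reg) *rhs,
            inout(reg) tmp,
            in(reg) input as u16,
            options(pure, nomem, nostack),
        );
    }
    let _ = scratch; // the xor result itself is not needed
    *output = tmp as u8;
}

/// Portable fallback so the sketch compiles off x86_64 (not constant-time).
#[cfg(not(target_arch = "x86_64"))]
pub fn cmovne_u32_sketch(lhs: &u32, rhs: &u32, input: u8, output: &mut u8) {
    if lhs != rhs {
        *output = input;
    }
}

fn main() {
    let mut o = 7u8;
    cmovne_u32_sketch(&1, &2, 42, &mut o);
    assert_eq!(o, 42); // not equal -> moved
    cmovne_u32_sketch(&5, &5, 99, &mut o);
    assert_eq!(o, 42); // equal -> untouched
}
```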

@tarcieri
Member

@brxken128 I went through it in Godbolt and didn't notice anything broken in particular. It looked like it inlined the cmovne_u32 call in main regardless of what inlining hint was suggested.

What's failing exactly?

@brxken128
Contributor Author

brxken128 commented Mar 31, 2023

What's failing exactly?

I had cargo watch running while I was code-editing (to re-test on every save), and this was active while I started adding #[inline(always)] to all of the functions.

Once I got to the x86 ones, tests started failing. It only happens to some - others just run infinitely. It's really sporadic, to be honest.

Take this impl from x86:

impl CmovEq for u32 {
    fn cmovne(&self, rhs: &Self, input: Condition, output: &mut Condition) {
        cmov_eq!("cmovnz {2:e}, {3:e}", self, rhs, input, output);
    }
}

This works great, until #[inline(always)] is added above the function, and then the relevant test fails (on one of the assertions, and as though the behaviour of the function had changed entirely).

I'm unsure whether this failure is down to an error in my code, a bug in Rust, or even my testing methodology, but I don't think inline-always should be causing something like this - everything works fine without it.

@tarcieri
Member

tarcieri commented Apr 1, 2023

It'd be good to get the whole test into Godbolt so you can see the difference in the emitted assembly.

@brxken128
Contributor Author

brxken128 commented Apr 1, 2023

It'd be good to get the whole test into Godbolt so you can see the difference in the emitted assembly.

It runs fine in Godbolt, but not locally. Extremely odd.

I compared the two ASM outputs and don't see anything awry.

I'll commit the changes, push and see if CI shows similar failures.

Edit: it's reproducible at least haha

It appears to happen here:

cond.cmovne(&cond, 0, &mut o);
assert_eq!(o, cond as u8);

cond == cond, so 0 should not be moved, but it is. This only happens when inline(always) is specified, although it could very well be a logic error on my part.

If the cmoveq functions are inlined, the tests run continuously (as though they're stuck). If the cmovne functions are inlined, they move even when they're not supposed to. I'll keep digging until I find the cause.

It only happens on x86, and only affects the functions that use the cmov_eq! macro.

Replacing xor with cmp in the macro works flawlessly, even with inline(always) - although might cmp branch? I'm not sure, but this just keeps getting stranger.

@tarcieri
Member

tarcieri commented Apr 1, 2023

cmp is fine and won't branch. Behind the scenes it's actually a sub instruction.

If you can just compose it now in terms of the existing CMOV instructions (i.e. not define any additional assembly), doing XOR in Rust and using e.g. Cmov::cmovz on the result, that should work fine for now.

We could investigate doing optimized assembly as a separate pass.
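That composition can be sketched in portable Rust (an editorial sketch with illustrative stand-ins; the real crate's `Cmov::cmovz` is asm-backed, not this mask trick):

```rust
/// Portable stand-in for an asm-backed `cmovz`: copy `value` into `dst`
/// only when `condition == 0`, without a data-dependent branch.
fn cmovz_u8(dst: &mut u8, value: u8, condition: u8) {
    let c = condition as u16;
    // Top bit of `c | -c` is set iff c != 0, so this is 1 when nonzero.
    let nonzero = ((c | c.wrapping_neg()) >> 15) as u8;
    // All-ones mask when condition == 0, all-zeros otherwise.
    let mask = nonzero.wrapping_sub(1);
    *dst = (*dst & !mask) | (value & mask);
}

/// Equality via XOR in plain Rust, then a conditional move on the result.
fn cmoveq_u32(lhs: &u32, rhs: &u32, input: u8, output: &mut u8) {
    let x = *lhs ^ *rhs; // zero iff the operands are equal
    // Fold all 32 bits down so the condition fits in a u8.
    let c = (x | (x >> 16)) as u16;
    let c = (c | (c >> 8)) as u8;
    cmovz_u8(output, input, c);
}

fn main() {
    let mut o = 0u8;
    cmoveq_u32(&1, &1, 42, &mut o);
    assert_eq!(o, 42); // equal -> moved
    let mut o2 = 7u8;
    cmoveq_u32(&1, &2, 42, &mut o2);
    assert_eq!(o2, 7); // not equal -> untouched
}
```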

@brxken128
Contributor Author

brxken128 commented Apr 1, 2023

Done as of my latest commit - I'll keep the ASM for AArch64, as it seems to work perfectly well in my testing.

Is the macro usage fine? I changed quite a bit in 0df5579 and I'm not sure it's ideal. I'm also unsure about 59cad3d (dereferencing within the macros); I can easily revert both if that'd be better!

Edit: it may also be wise to cancel this action, it appears there's no end in sight.

@brxken128 brxken128 marked this pull request as ready for review April 1, 2023 15:28
Member

@tarcieri tarcieri left a comment


Looks good now, thanks!

@tarcieri tarcieri merged commit 734c63b into RustCrypto:master Apr 1, 2023