
hashes: Epic re-write #2770

Closed
tcharding wants to merge 2 commits into rust-bitcoin:master from tcharding:hashes

Conversation

@tcharding
Member

@tcharding tcharding commented May 15, 2024

Re-write the hashes API for fun and profit.

Please note, this removes the ability to display tagged hashes backward.

Resolves the following issues:

Close: #1481
Close: #1552
Close: #1689
Close: #2197
Close: #2377
Close: #2665
Close: #2811

TODO

  • Resolve the QUESTIONS in hashes/examples/*.rs

@github-actions github-actions bot added C-bitcoin PRs modifying the bitcoin crate C-hashes PRs modifying the hashes crate test C-base58 PRs modifying the base58 crate labels May 15, 2024
@apoelstra
Member

Overall 6c7e143af7b05ff4176530d583a21592be0c2f60 looks good. The duplication of the merkle root code I don't like (but I don't think it's necessary ... in fact I suspect we could pull it into the hashes crate and keep it generic since it's not particularly Bitcoin-specific ... but anyway it doesn't need to happen as part of the initial hashes API changes).

We should have some public type aliases for HMAC-SHA256 and HMAC-SHA512 since those are the overwhelmingly most common forms of HMAC.

As for "hash newtypes should not be generic hash types", I am leaning strongly towards implementing this by adding visibility specifiers to the hash_newtype! macro, which would control how it implements the engine: a pub engine is a "generic hash type", while a pub(crate) or private engine is a hash type that can't be constructed by hashing using the public API. Or maybe we want a different macro for this. (The hash_newtype macro currently seems to just implement a bunch of byte accessors, and I wonder if it should be renamed to reflect this, and/or whether it could replace impl_array_newtype.)

@tcharding
Member Author

Leave it with me, I'm going to go for the least tricky solution that does exactly what we want.

(The hash_newtype macro currently seems to just implement a bunch of byte accessors, and I wonder if it should be renamed to reflect this, and/or whether it could replace impl_array_newtype.)

Will look into this.

@tcharding
Member Author

tcharding commented May 16, 2024

As for "hash newtypes should not be generic hash types" I am leaning strongly into implementing this by adding visibility specifiers to the hash_newtype! macro which would control how it implements the engine; a pub engine is a "generic hash type" while a pub(crate) or private engine is a hash type that can't be constructed by hashing using the public API.

lol, double lol - look at the diff required to get this functionality, 5 changed lines.

diff --git a/bitcoin/src/blockdata/block.rs b/bitcoin/src/blockdata/block.rs
index 1a660fb8..98074e10 100644
--- a/bitcoin/src/blockdata/block.rs
+++ b/bitcoin/src/blockdata/block.rs
@@ -26,7 +26,7 @@ hashes::hash_newtype! {
     /// A bitcoin block hash.
     pub struct BlockHash(sha256d);
     /// A hash of the Merkle tree branch or root for transactions.
-    pub struct TxMerkleNode(sha256d);
+    pub struct TxMerkleNode(pub(crate) sha256d);
     /// A hash corresponding to the Merkle tree root for witness data.
     pub struct WitnessMerkleNode(sha256d);
     /// A hash corresponding to the witness structure commitment in the coinbase transaction.
diff --git a/bitcoin/src/blockdata/transaction.rs b/bitcoin/src/blockdata/transaction.rs
index 4916e43a..2834be04 100644
--- a/bitcoin/src/blockdata/transaction.rs
+++ b/bitcoin/src/blockdata/transaction.rs
@@ -44,10 +44,10 @@ hashes::hash_newtype! {
     /// versions of the Bitcoin Core software itself, this and other [`sha256d::Hash`] types, are
     /// serialized in reverse byte order when converted to a hex string via [`std::fmt::Display`]
     /// trait operations. See [`hashes::Hash::DISPLAY_BACKWARD`] for more details.
-    pub struct Txid(sha256d);
+    pub struct Txid(pub(crate) sha256d);
 
     /// A bitcoin witness transaction ID.
-    pub struct Wtxid(sha256d);
+    pub struct Wtxid(pub(crate) sha256d);
 }
 impl_hashencode!(Txid);
 impl_hashencode!(Wtxid);
diff --git a/hashes/src/macros.rs b/hashes/src/macros.rs
index 45b53ae1..6a32c935 100644
--- a/hashes/src/macros.rs
+++ b/hashes/src/macros.rs
@@ -128,11 +128,11 @@ macro_rules! hash_newtype {
 
             /// Constructs a new engine.
             #[inline]
-            pub fn engine() -> $hash::Engine { $hash::Hash::engine() }
+            $field_vis fn engine() -> $hash::Engine { $hash::Hash::engine() }
 
             /// Produces a hash from the current state of a given engine.
             #[inline]
-            pub fn from_engine(e: $hash::Engine) -> Self { Self($hash::Hash::from_engine(e)) }
+            $field_vis fn from_engine(e: $hash::Engine) -> Self { Self($hash::Hash::from_engine(e)) }
 
             /// Copies a byte slice into a hash object.
             #[inline]

@tcharding
Member Author

tcharding commented May 16, 2024

We should have some public type aliases for HMAC-SHA256 and HMAC-SHA512 since those are the overwhelmingly most common forms of HMAC.

I've used public functions in the hmac module.

/// Returns an engine for computing `HMAC-SHA256`.
///
/// Equivalent to `hmac::Engine::<sha256::Engine>::new(key)`.
pub fn new_engine_sha256(key: &[u8]) -> Engine<sha256::Engine> {
    Engine::<sha256::Engine>::new(key)
}

/// Returns an engine for computing `HMAC-SHA512`.
///
/// Equivalent to `hmac::Engine::<sha512::Engine>::new(key)`.
pub fn new_engine_sha512(key: &[u8]) -> Engine<sha512::Engine> {
    Engine::<sha512::Engine>::new(key)
}
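
For comparison, the "type aliases" suggestion could look like the toy sketch below. This is self-contained and does not use the real hashes types; HmacEngine, Sha256Engine, Sha512Engine and the alias names are illustrative stand-ins, not the crate's API.

```rust
use std::marker::PhantomData;

/// Illustrative stand-in for a generic HMAC engine parameterised by an inner engine.
struct HmacEngine<E> {
    key: Vec<u8>,
    _engine: PhantomData<E>,
}

struct Sha256Engine;
struct Sha512Engine;

impl<E> HmacEngine<E> {
    /// Stores the key; a real engine would also initialise the inner/outer pads.
    fn new(key: &[u8]) -> Self {
        HmacEngine { key: key.to_vec(), _engine: PhantomData }
    }
    fn key_len(&self) -> usize { self.key.len() }
}

/// Public aliases: callers write `HmacSha256::new(key)` without spelling out
/// the inner engine type, which is the discoverability win over a generic-only API.
type HmacSha256 = HmacEngine<Sha256Engine>;
type HmacSha512 = HmacEngine<Sha512Engine>;

fn main() {
    assert_eq!(HmacSha256::new(b"secret").key_len(), 6);
    assert_eq!(HmacSha512::new(b"key").key_len(), 3);
}
```

Either way the call sites read the same; the aliases just name the two common instantiations instead of wrapping them in functions.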

@tcharding
Member Author

tcharding commented May 16, 2024

The duplication of the merkle root code I don't like (but I don't think it's necessary ... in fact I suspect we could pull it into the hashes crate and keep it generic since it's not particularly Bitcoin-specific ... but anyway it doesn't need to happen as part of the initial hashes API changes).

I agree it's ugly, but before we fix it we should ask how much change we are willing to accept to de-duplicate something that is two exact copies and likely never changes. AFAICT there is no way of doing this generically without an invasive change, hence the question.

Said another way, there is no way of going from a generic type (we only have HashEngine now) to the concrete hash type, so this cannot be done generically, IIUC.

The quick obvious thing is to use a macro, but I don't know if it's possible or what the implications of calling a macro recursively are.

@tcharding tcharding force-pushed the hashes branch 3 times, most recently from f8c33e0 to 5f57240 Compare May 16, 2024 05:05
@tcharding
Member Author

tcharding commented May 16, 2024

This isn't going to pass CI because it has a patched manifest to use rust-bitcoin/hex-conservative#90

In case it gets past you, this PR is hot as f**k. Check out the hashes::internal_macros and macros modules - gawd damn.

Massive props @apoelstra for sticking to your guns and refusing to accept the split out of rust-chf - team work makes the dream work.

@apoelstra
Member

apoelstra commented May 16, 2024

Said another way, there is no way of going from a generic type (we only have HashEngine now) to the concrete hash type so this cannot be done generically IIUC.

I'll play with it. Fine for now if we have to duplicate the code. But I think/hope we can somehow be generic over the engine.

The quick obvious thing is to use a macro but I don't know if its possible or the implications of calling a macro recursively.

I'd rather add a dummy trait before I added a macro. (Or a new Digest trait which just had a method to construct an engine and nothing else, which I think would be okay since users would almost never need to be aware of it.) But we'll see.

@tcharding
Member Author

tcharding commented May 17, 2024

Well that was pretty easy with a fresh set of eyes, I just threw this into the merkle_tree module.

/// Trait used for types that can be hashed into a merkle tree.
pub trait Hash {
    /// Constructs a new [`sha256d`] hash engine.
    fn engine() -> sha256d::Engine { sha256d::Engine::new() }

    /// Produces a hash from the current state of a given engine.
    fn from_engine(e: sha256d::Engine) -> Self;
}

impl Hash for Txid {
    fn from_engine(e: sha256d::Engine) -> Self { Self::from_engine(e) }
}

impl Hash for Wtxid {
    fn from_engine(e: sha256d::Engine) -> Self { Self::from_engine(e) }
}

And changed the trait bounds on all the functions to T: Hash + Encodable + Copy,

Voilà, the original calculate_ functions work as before.
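
The pattern can be demonstrated with a self-contained toy: a trait that ties a concrete engine to each newtype, so a single generic function can produce either output type. The ToyEngine byte-sum stands in for sha256d; all names here are illustrative, not the crate's API.

```rust
/// Toy engine: sums bytes, standing in for sha256d::Engine.
#[derive(Default)]
struct ToyEngine(u64);

impl ToyEngine {
    fn input(&mut self, data: &[u8]) {
        for &b in data { self.0 = self.0.wrapping_add(u64::from(b)); }
    }
}

/// Mirrors the sketched merkle_tree::Hash trait: a default engine
/// constructor plus a per-type finalizer.
trait MerkleHash {
    fn engine() -> ToyEngine { ToyEngine::default() }
    fn from_engine(e: ToyEngine) -> Self;
}

#[derive(Debug, PartialEq)]
struct Txid(u64);
#[derive(Debug, PartialEq)]
struct Wtxid(u64);

impl MerkleHash for Txid {
    fn from_engine(e: ToyEngine) -> Self { Txid(e.0) }
}
impl MerkleHash for Wtxid {
    fn from_engine(e: ToyEngine) -> Self { Wtxid(e.0) }
}

/// Generic "combine two children" step: works for any implementor,
/// the way the calculate_ functions are generic over T: Hash.
fn combine<T: MerkleHash>(left: &[u8], right: &[u8]) -> T {
    let mut e = T::engine();
    e.input(left);
    e.input(right);
    T::from_engine(e)
}

fn main() {
    let t: Txid = combine(b"a", b"b");
    assert_eq!(t, Txid(195)); // 97 + 98
    let w: Wtxid = combine(b"a", b"b");
    assert_eq!(w, Wtxid(195));
}
```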

@tcharding tcharding force-pushed the hashes branch 4 times, most recently from 0a10c43 to 4622bf8 Compare May 17, 2024 00:55
@tcharding
Member Author

tcharding commented May 17, 2024

I'm happy with this now, I'll just leave it sitting here until the hex release is done.

I had an idea. Although the PR title is hashes: Epic re-write, the "epic" is more a reference to how much work it took me to get here; the actual changes to the hashes surface API are minor, mainly just removal of the Hash trait in favour of inherent functions. The other changes are only deeper in the crate, i.e., the HashEngine trait. If we put a nice thorough regression test patch at the front of this PR (for each hash type: hash a hard-coded array of bytes and round trip the hard-coded hex string for those bytes) then we can basically just YOLO this in and be confident it doesn't introduce bugs. What do you reckon?

@apoelstra
Member

Chunked feeding of data to an engine is pretty common though. Perhaps at least those methods could be exposed.

Yes, pretty common, but it still seems uncommon enough that it's okay for users to need to import a trait to do it. But agreed: just like everything else, we should have inherents for this.

IIRC we have non-cryptographic hashes in the crate and using HMAC with them is at least suspicious. If you think it's not worth a marker trait I can understand that.

Yeah :/ agreed it may be strictly better to have a marker trait, but I continue to think it's not worth it. I think siphash is the only non-cryptographic hash in this crate (other than sha1 which needs to be used in cryptographic contexts for legacy/compatibility reasons, even though I would definitely not recommend its use as a crypto hash).

That seems excessive to me, but not strictly wrong. You can still accidentally feed some data into the wrong engine, but at least you can't assign an engine to an incompatible one. I don't think people are assigning engines around too much. They just use them to compute the hashes.

Though I could see a benefit of having .finalize() method.

Yeah, exactly. I'd like to have a finalize method and this means that each hash engine needs to have a unique hash that it finalizes to.

I believe inherent const methods are super useful and we need macros for that.

Yeah :( earlier in this thread I thought there was a way with "trigger traits" that we could do this without macros. It turns out that you can use these to implement inherent methods on Vec<T> for example... but not T itself.

Not really. Hashing two txids into a merkle node doesn't produce a Txid but a merkle node. Merkle node could have engine exposed but we can simply internally use raw sha256. So perhaps this is another good example of needing to publish raw hash type.

Yeah, great points. Need to think about what our Merkle tree computation function should look like then... unfortunately we need to resolve this as step 1 of this project (well, maybe step 2, after just splitting Hash into two traits but always implementing both), because all our breaking API decisions here affect what a generic Merkle tree computation will look like. Maybe the answer is that we don't want a generic function... we want something that can take either Txid or Wtxid and computes a merkle root in an ad-hoc way, which is used in Bitcoin blockheader computations.

@apoelstra
Member

I think we should open a new PR since github has started hiding comments on this one.

@apoelstra
Member

I also think that "new PR" should be one that just splits up Hash into two traits (and always implements both) since I think that's a clear step forward we can all agree on, including the compiler. And possibly moves/copies some methods out of both traits and into inherents.
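
A minimal sketch of what such a split might look like, using toy types rather than the real crate (byte arrays are shortened to four bytes and the trait names are placeholders; the real names were still under discussion): one trait covers the byte-wrapping half, the other covers hashing, and a newtype like Txid implements only the former.

```rust
/// The byte-array-wrapping half: to/from bytes, no hashing.
trait ByteArray: Sized {
    fn to_byte_array(&self) -> [u8; 4];
    fn from_byte_array(bytes: [u8; 4]) -> Self;
}

/// The hashing half: only "general" hashes get this.
trait GeneralHash: ByteArray {
    fn hash(data: &[u8]) -> Self;
}

/// A "general" toy hash: anyone can hash arbitrary data into it.
#[derive(Debug, PartialEq)]
struct ToyHash([u8; 4]);

impl ByteArray for ToyHash {
    fn to_byte_array(&self) -> [u8; 4] { self.0 }
    fn from_byte_array(bytes: [u8; 4]) -> Self { ToyHash(bytes) }
}

impl GeneralHash for ToyHash {
    /// Toy XOR-fold standing in for a real digest.
    fn hash(data: &[u8]) -> Self {
        let mut acc = [0u8; 4];
        for (i, &b) in data.iter().enumerate() { acc[i % 4] ^= b; }
        ToyHash(acc)
    }
}

/// A "normal" hash newtype: convertible to/from bytes, but with no
/// public way to hash arbitrary data into it.
#[derive(Debug, PartialEq)]
struct Txid([u8; 4]);

impl ByteArray for Txid {
    fn to_byte_array(&self) -> [u8; 4] { self.0 }
    fn from_byte_array(bytes: [u8; 4]) -> Self { Txid(bytes) }
}

fn main() {
    let h = ToyHash::hash(b"abcd");
    // "Hash casting" is still possible via the byte-array half.
    let t = Txid::from_byte_array(h.to_byte_array());
    assert_eq!(t.to_byte_array(), [97, 98, 99, 100]);
}
```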

@Kixunil
Collaborator

Kixunil commented Jun 7, 2024

I think we can provide a generic merkle tree over non-wrapped hashes (maybe in hashes) and then just have a facade that uses Txid, Wtxid and Taproot.

@apoelstra
Member

I think you're right. That's the approach I'd try first.

@tcharding
Member Author

Thanks for the input fellas, just wanted to say that I read everything (multiple times), though the PRs that fell out today may not make it seem like I did. There are six hundred things going on and I'm not exactly sure how to pull everything together. (I did read and grok your list in #2770 (comment) @apoelstra, but I went a different way - no disrespect intended.)

@tcharding
Member Author

tcharding commented Jun 14, 2024

There has been so much discussion on the hashes crate, and so much back and forth, that I do not know exactly where we are trying to get to or what we are trying to solve. Also, I don't know what the objection is to the current hmac module presented in this PR. In my opinion this PR, in its current state, fixes all the issues linked to as claimed. I agree the const N stuff is ugly, but apart from being ugly I don't know what the objection is; could you please re-state it @apoelstra?

I believe @Kixunil's observation that a past state did not correctly tie the Hmac to the hash engine was correct (we used to only have the const N generic), but now hmac::Hash is generic over the engine as well.

(Note that this PR should be re-done on top of whatever is decided for the merkle tree stuff.)

There are a few niggling problems with the current `hashes` API that we
would like to improve:

- It's not that discoverable.
- It is annoying to have to import the `Hash` trait all the time.
- New hash types (created with `hash_newtype`) are general purpose but
  we'd like to hide this from users of `rust-bitcoin`.
- The API of the `hmac` module is different to that of the other hash types.
- The `hash_newtype` and `sha256t_hash_newtype` macros are a bit hard
  to read and work on.

This re-write of the `hashes` API is an attempt to address all these
issues. Note that the public API surface of `hashes` does not change all
that much.
@apoelstra
Member

I agree the const N stuff is ugly but apart from just ugly I don't know what is the objection, could you please re-state it @apoelstra?

It needs to be generic over the hash type. If it is only generic over N (which will basically always be 32 and therefore isn't much of a generic at all) then it loses type safety vs the current API.

If it's now generic over both N and the hash engine I'll take another look, once we have the other hashes PRs merged. But it's not obvious to me that it needs to be generic over both. The HashEngine trait should have enough information, and if it doesn't, we should extend it.
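
The type-safety point can be illustrated with a self-contained toy: if HMAC is generic only over the digest length N, HMACs built from two different 32-byte hashes collapse into one type, while being generic over the engine keeps them distinct. All names here are hypothetical stand-ins, not the crate's API.

```rust
use std::marker::PhantomData;

struct Sha256Engine;
struct OtherEngine; // another hypothetical 32-byte hash

/// N-only version: Hmac<32> is one type regardless of the inner hash,
/// so nothing stops mixing digests from different hash functions.
struct HmacN<const N: usize>([u8; N]);

/// Engine-generic version: HmacE<Sha256Engine> and HmacE<OtherEngine>
/// are distinct types even though both wrap 32 bytes.
struct HmacE<E>([u8; 32], PhantomData<E>);

fn main() {
    let a: HmacN<32> = HmacN([0u8; 32]);
    let b: HmacN<32> = HmacN([1u8; 32]);
    let _same_type = [a, b]; // compiles: the inner hash is indistinguishable

    let c: HmacE<Sha256Engine> = HmacE([0u8; 32], PhantomData);
    let d: HmacE<OtherEngine> = HmacE([0u8; 32], PhantomData);
    // let _mixed = [c, d]; // would not compile: distinct types
    assert_eq!(c.0.len(), 32);
    assert_eq!(d.0.len(), 32);
}
```

Since N is "basically always 32", the N-only parameter buys almost no distinction; the engine parameter is what carries the type information.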

@tcharding
Member Author

But it's not obvious to me that it needs to be generic over both.

If you can get rid of it then I'll be happy to see how that was achieved; as far as I can tell you are going to hit all the associated const problems I've been hitting - it's a language limitation. But by all means please do try to solve it if you think I've missed something.

If it's now generic over both N and the hash engine I'll take another look

So we've been having API design discussions, with you giving me directives of things to explore and ways you want it done, and you never looked at the latest state. That is pretty frustrating, man. Even when I repeatedly said that we are hitting language limitations and that associated consts don't "just work" as one would expect.

There are so many problems with hashes, as shown by the list of issues this PR claims to fix, that I don't think the strategy of piecemeal improvements is a good one. The crate is an absolute punish to work on; that is why the only way I was able to almost maintain my sanity was just to cut it up wildly. Since we have regression tests for all hashes I believe it is ok to merge a single massive PR. Then we go over it again with a fine-tooth comb to make sure the public API is as we want it.

@apoelstra
Member

There are so many problems with hashes, as shown by the list of issues this PR claims to fix, I don't think the strategy of piecemeal improvements is a good one.

It has to be. If we try to do tons of things at once, in a PR that makes compromises due to language limitations, then it will just be stuck forever in iterations of "why did you make this ugly-looking decision" and "it was a language limitation, I don't remember the details because there were tons of them". Compare to #2864 where we were trying to do a single thing, there was disagreement about whether the language would let us do the thing I wanted, and I was able to try it without my attempt requiring 1000 other changes just to get the code to compile. And then it turned out we'd been talking past each other about the calculate_root_inline function, but within very few messages we'd figured that out and solved the issue at hand.

There's not much I can say about comments I made "when I hadn't even looked at the latest state" because Github is now hiding most of the history of this thread, so I can't see the order of specific pushes and comments. But I apologize if I was complaining about an outdated state.

@apoelstra
Member

Oh, I see. I found a branch where I was hacking on your existing state -- I immediately found that I wanted to split the Hash trait into one related to byte-array-wrapping and one related to hashing, and then found that I couldn't do this because of the merkle_tree module's confused typing, which I fixed in #2864; then I can split the Hash trait and add some extra auxiliary stuff to the byte-array half (which I suggested in one of the hidden comments above); and then it will be possible to remove the N parameter from Hmac.

But I can't relitigate this here because we're talking about too many things at once, in an unreadably-long thread implicitly contextualized by the changing state of a 4000-line diff. The piecemeal approach lets us have digestible conversations and make progress.

@tcharding
Member Author

Righto, good points about the conversation difficulty. I guess we just chalk this whole thread and the development associated with it as learning. FTR I'm not exactly sure of the target we are aiming for.

@tcharding tcharding closed this Jun 16, 2024
@apoelstra
Member

FTR I'm not exactly sure of the target we are aiming for.

  • An API where only "general hashes" (i.e. hashes defined by their algorithms) can hash arbitrary data, while "normal hashes" (e.g. Txid) can only be constructed from actual transaction data, from deserialization, and from "hash casting" by converting to/from byte arrays.
  • An API where it is easy/obvious how to just hash some data from a file or something.
  • An API where you can define strongly-typed but generic algorithms from general hashes (specifically: Merkle trees, HMAC and PBKDF). To prove this out we should have HMAC defined within the hashes lib and at least one Merkle tree impl outside of it (e.g. the one in bitcoin).
  • An API which supports BIP-340 tagged hashes in a strongly-typed way.
  • An API where you can at least construct fixed hashes in const fns ... ideally you could actually hash data, but ok, that's hard.
  • (Eventually) a way to distinguish KnownPreimage/UnknownPreimage so that we can prevent people from e.g. signing opaque hash blobs without making a solemn oath that they mean to do so. But we should revisit that later because it's a big invasive API change and will likely have its own issues with the Rust language.
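
As a rough illustration of the tagged-hash goal above, here is a hedged toy, not the real BIP-340 construction: the tag names are examples and the "digest" just packs bytes into a u64, standing in for sha256(sha256(tag) || sha256(tag) || data).

```rust
use std::marker::PhantomData;

/// A tag is a zero-sized type carrying its name as a const.
trait Tag { const NAME: &'static str; }

struct TapLeaf;
impl Tag for TapLeaf { const NAME: &'static str = "TapLeaf"; }

struct TapBranch;
impl Tag for TapBranch { const NAME: &'static str = "TapBranch"; }

/// A digest parameterised by its tag: TaggedHash<TapLeaf> and
/// TaggedHash<TapBranch> are distinct types and cannot be mixed up.
struct TaggedHash<T: Tag>(u64, PhantomData<T>);

impl<T: Tag> TaggedHash<T> {
    /// Toy digest: folds the tag name and then the data into a u64
    /// (keeps the last eight bytes), so different tags give different
    /// values for the same input data.
    fn hash(data: &[u8]) -> Self {
        let mut acc: u64 = 0;
        for &b in T::NAME.as_bytes().iter().chain(data) {
            acc = (acc << 8) | u64::from(b);
        }
        TaggedHash(acc, PhantomData)
    }
}

fn main() {
    let leaf = TaggedHash::<TapLeaf>::hash(b"script");
    let branch = TaggedHash::<TapBranch>::hash(b"script");
    assert_ne!(leaf.0, branch.0); // different tags, different digests
    // let _: TaggedHash<TapLeaf> = branch; // would not compile
}
```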

The issues we have with this right now are:

  • An overuse of the word Hash for both traits and hashes; to do anything, users need to name types like sha256::Hash, which are not discoverable from documentation.
  • An overuse of traits for things that don't/rarely need to be generic, which again hurts discoverability because users further need to import crate::Hash.
  • An overuse of macros which hides all of the above even from a user who attempts to read the source code.
  • Poorly-defined traits which don't distinguish the different uses of hashes and put inappropriate methods onto stuff (e.g. LegacySighash::all_zeros which is trivial to forge messages for).

Where we can definitely unambiguously start are:

  • Reducing/splitting the existing traits to reduce API surface.
  • Adding inherents and const fns.

Where we are hitting difficulty is trying to do stuff generically over general hashes which might have arbitrary digest sizes. I think this will be much more tractable if we do the "easy" stuff first.

@tcharding
Member Author

Thanks for writing that up man, appreciate the effort. That is something concrete to work towards.

@tcharding
Member Author

At the risk of sounding like a whinging bitch: I achieved all the goals that you wrote up with this PR, I added regression tests so that it was plausibly reliable to do a massive patch merge, I stated my intentions the whole way along, and I only ever used HashEngine as the trait bound because it was suggested in review. Yes it led to const N, but I still believe this was a step forward. It was pretty much a kick in the teeth to have it closed. This crate is horrible to work on because of how entangled everything is. I'll keep working on this, but put it on the record that we messed up badly in communication and workflow this last month, and it contributed heavily to nearly burning me out.

@Kixunil
Collaborator

Kixunil commented Jun 19, 2024

FTR I also attempted some gradual approach to splitting the trait before but got tangled in rebases & stuff.

I think to introduce it gradually we need to first bound things by HashEngine in one commit, then the next commit can move a bunch of things to inherent methods, and the next one can remove the Hash trait.

@apoelstra
Member

@tcharding I hear you, and I apologize for how much stress and apparent lack of forward progress there was on this.

But. This PR has consistently been a single commit that does multiple changes that cannot be tweaked independently and (almost?) none of which are optimal by themselves. And when I went to change things, I would find other issues (the abuse of Txid in bitcoin/src/merkle_tree/mod.rs was the most prominent) that could be fixed independently and uncontroversially and which would reduce the diff in this PR -- but which were not compatible with the existing diff.

Adding to this, there were multiple parallel conversations that Github could not keep track of (and several serial conversations -- e.g. the dead-end about trigger traits, the back-and-forth about whether we wanted zero one or two Hash traits, etc). In my view there was basically zero chance that this PR was going to get in as a single PR; it needed to be used as an API testbed until we found something that worked with the language and with our goals, and then we would need to do the "real" implementation in carefully-reviewable stages. And to that end we were making forward progress because we found several API improvements that we all agreed on.

@tcharding
Member Author

Thanks man. I'm still not convinced that we are going to be able to tease it apart gradually, but I'll give it a shot. I have a feeling some of the changes are not going to be clear steps forward. If we argue over things like #2860 it's going to make this more painful than it needs to be.

@tcharding
Member Author

I think to introduce it gradually we need to first bound things by HashEngine in a commit, the next commit can move a bunch of things to inherent methods and the next one removing the Hash trait.

Thanks @Kixunil, but while you've been away I have literally attempted cutting this crate up and re-doing it so many times I cannot remember. I have hit so many dead ends that I have gotten to a stage where I am not taking any suggestions unless they are backed up by patches that I can see actually work. Sorry about that.

@tcharding tcharding mentioned this pull request Jun 19, 2024
6 tasks
@Kixunil
Collaborator

Kixunil commented Jun 20, 2024

No problem! I wanted to do the exact thing I suggested here a few months before but IIRC I also hit some limitations. I know first hand this stuff is very messy. What might help is what I did: I commented-out some APIs to make sure I correctly modify the rest of the code (so I let the compiler guide me). It mostly worked but still took quite some time.

apoelstra added a commit that referenced this pull request Jun 25, 2024
8bd5c64 api changes for "drop GeneralHash from wrapped hash types" (Andrew Poelstra)
9126597 hashes: stop exposing engine/from_engine and general hashing methods in hash_newtype (Andrew Poelstra)
8c4899f bitcoin: remove all direct use of hashing/engines in unit tests (Andrew Poelstra)
b8d85a1 bitcoin: remove all use of engine/from_engine on opaque hash types (Andrew Poelstra)
0aa539f hashes: remove engine/from_engine from embedded test (Andrew Poelstra)
46dad84 api changes for split between Hash/GeneralHash (Andrew Poelstra)
73dcc79 hashes: split Hash trait into two (Andrew Poelstra)
1fe4c63 hashes: remove unused Hash import in embedded test (Andrew Poelstra)

Pull request description:

  I'm not thrilled with these names. Personally I would prefer having `ByteArrayWrapper` (for non-general hashes) and `Hash` (for general hashes). But using `Hash` and `GeneralHash` greatly minimizes the diff because most of our use of the `Hash` trait was only for non-general stuff.

  Maybe that tradeoff will change as we move stuff to inherents? I hope to do that in the next PR (or maybe the one after that, since I still have some "drop `GeneralHash`" work to do for tagged hashes). And after that the hashing API should be "clean" enough that we can figure out HashEngine, possibly even folding GeneralHash into it. But that's the part of #2770 that we didn't finish nailing down so I'm not sure.

  But other than naming, I like this PR. I think, if you approve of this PR except the naming, it would be best to ACK it and then we can do a followup rename-only PR, rather than dealing with the review pain of mass-renaming in rebases.

ACKs for top commit:
  tcharding:
    ACK 8bd5c64
  Kixunil:
    ACK 8bd5c64

Tree-SHA512: 3754f0f7ffae487e9f826fb6ecc48a6d9f606204a981bc56b98e118813a881b905e38a3cf5d6c3b3142aa0876ce2efc2ab75ec5a2f59fcc36fd86d148ab253bb
