primitives: single allocation witness decoder #5239

apoelstra merged 1 commit into rust-bitcoin:master
Conversation
```rust
/// Resizes the content buffer if needed, doubling the size each time.
fn resize_if_needed(&mut self, required_len: usize) {
```
This resize logic is copied from the old decoder, but should maybe delegate to some consensus_encoding defined thing like @jrakibi is working on?
```rust
// be large enough to cover most witnesses without reallocating.
let witness_index_space = witness_elements * 4;
self.cursor = witness_index_space;
self.content = alloc::vec![0u8; self.cursor + 128];
```
Why did we choose 128 here?
I copied this over from the old decoding logic, and if I had to guess, it probably covers the majority of witness sizes (e.g. P2WPKH witness is ~100 bytes). But I don't have the data and it might be worth making it a documented const.
Yeah, I was just swagging "a pubkey and a signature and some overhead". We can add a comment if you'd like. I don't think it's worth a named constant.
Beefed up the comment.
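For a rough sense of why 128 bytes covers the common case, here is a back-of-the-envelope check. The component sizes are assumed typical values for a P2WPKH witness, not numbers taken from the PR:

```rust
// Back-of-the-envelope sizing behind the 128-byte headroom; the component
// sizes below are assumed typical values, not measured data from the PR.
const DER_SIG_PLUS_SIGHASH: usize = 72; // DER signature (~71 bytes) + sighash flag
const COMPRESSED_PUBKEY: usize = 33;    // compressed secp256k1 public key
const LENGTH_PREFIXES: usize = 2;       // one length byte per witness element

fn typical_p2wpkh_witness_len() -> usize {
    DER_SIG_PLUS_SIGHASH + COMPRESSED_PUBKEY + LENGTH_PREFIXES
}

fn main() {
    // A typical two-element P2WPKH witness fits comfortably in 128 bytes.
    assert!(typical_p2wpkh_witness_len() <= 128);
}
```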
Force-pushed f513e42 to f4269c0.
I wanted to convince myself that this was an optimisation-only PR, so I moved the tests into a separate patch at the front. Three of them fail (at a minimum I would have expected

Looks good man!

Nope I did not... which makes me feel a bit uneasy! I can dig more into that tomorrow though.

$10 says my impl is buggy and yours is correct - you can pay me later. I wouldn't put too much thought into it, the tests you add look correct. I'd just mention in the PR description that these three tests don't pass before this change and that at least for two of them it's surprising.
This converts the new witness decoder to use the optimized single allocation internal memory structure which was shown back in commit 2fd0125 with the old decoder to be a lot more performant, justifying this increased complexity.
Force-pushed f4269c0 to df64705.
@tcharding I added reasoning for each test to the description.
Thanks man, good effort - as usual! |
```rust
fn resize_if_needed(&mut self, required_len: usize) {
    if required_len >= self.content.len() {
        let mut new_len = self.content.len().max(1);
        while new_len <= required_len {
            new_len *= 2;
        }
        self.content.resize(new_len, 0);
    }
}
```
Do you think we can use ~1 MB batch allocation here instead of doubling?
something like:
```rust
fn resize_if_needed(&mut self, required_len: usize) {
    if required_len > self.content.len() {
        let bytes_remaining = required_len - self.content.len();
        let batch_size = bytes_remaining.min(Self::MAX_VECTOR_ALLOCATE);
        self.content.resize(self.content.len() + batch_size, 0);
    }
}
```

If this makes sense I am happy to ACK the PR as it is, and I can address this in #5177.
Yea, I copied this allocation logic over from the old decoder. I think it would be better to address the DoS stuff more holistically in your work. One thing I am wondering is if the witness decoder should have its own strategy, which would allow the consensus_encoding stuff to be private to that crate, or if witness should delegate to some exposed logic.
I think with the new refactoring you did we can keep the consensus_encoding stuff private.
FWIW exponential back off is a pretty standard algorithm when allocating memory, so I found it easy to review and understand why it was chosen. If we choose a more domain specific algo it would need IMO more justification.
I think the current doubling strategy is still vulnerable to a DoS attack. Here's a test that demonstrates that:
```rust
#[test]
fn test_dos_attack() {
    let mut encoded = Vec::new();
    encoded.extend_from_slice(&[0xFE, 0x00, 0x09, 0x3D, 0x00]); // 4_000_000 (witness count)
    encoded.extend_from_slice(&[0xFE, 0x00, 0x09, 0x3D, 0x00]); // 4_000_000 (1st element length)
    let mut slice = encoded.as_slice();
    let mut dec = WitnessDecoder::new();
    assert!(dec.push_bytes(&mut slice).unwrap());
    assert!(dec.content.len() >= 32_000_000);
}
```

With only 10 bytes of input an attacker can force a 32 MB allocation (16 MB for the index space, which then doubles when we call `resize_if_needed`).
I guess if we were allocating 1 MB at a time this example would take 20 MB instead of 32?

It seems to me the way this works is:
- The attacker declares that there are 4 million witness elements (the maximum we allow, on the theory that maybe an entire block could be filled with empty witnesses, okay), so we allocate 16 MB to store these 4 million indices.
- Then the attacker declares that the first witness element has size 4 MB (though in this example he could say any positive value); we then need to store 16 MB + 1, so we allocate 32 MB.

But if we were allocating 1 MB at a time, we'd still be allocating 16 MB of index space and still be allocating 4 MB for the first witness.

And of course, because you aren't actually providing the claimed data, we return an error and deallocate everything. So I don't see any way to make this attack use more than 32 MB, nor any way to amplify the attack to allocate lots of 32 MBs.
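The arithmetic in this walkthrough can be checked directly. A minimal sketch, using the 4-byte index entries and 128-byte headroom from the decoder snippet quoted earlier (`doubled_alloc` is a hypothetical stand-in for the doubling loop in `resize_if_needed`):

```rust
// Model of the allocation math above: double the buffer length until it
// exceeds the required size, mirroring resize_if_needed's growth loop.
fn doubled_alloc(initial: usize, required: usize) -> usize {
    let mut new_len = initial.max(1);
    while new_len <= required {
        new_len *= 2;
    }
    new_len
}

fn main() {
    let index_space = 4_000_000usize * 4; // 16 MB of index space
    let initial = index_space + 128;      // initial single allocation
    let required = initial + 4_000_000;   // after the claimed 4 MB first element
    // Doubling from ~16 MB lands on ~32 MB, matching the 32 MB figure above.
    assert_eq!(doubled_alloc(initial, required), 32_000_256);
}
```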
I really like this unit test though. I wonder if there's a way we can turn it into a fuzz test so we can test my "you can't make it allocate more than 32Mb without providing at least 16Mb of input" belief.
BTW if we really wanted to reduce this memory (which I do not think we should do in this PR) we could be cleverer about how to store the index. At the very least we could use 3-byte lengths instead of 4-byte ones.
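A minimal sketch of what 3-byte index entries could look like, assuming element lengths stay below 2^24 (the helper names `write_len3`/`read_len3` are hypothetical, not from the PR):

```rust
// Hypothetical helpers for 3-byte little-endian length entries; witness
// element lengths are well below 2^24 bytes, so three bytes suffice and the
// index space shrinks by 25% compared to 4-byte entries.
fn write_len3(buf: &mut [u8], len: u32) {
    debug_assert!(len < 1 << 24);
    buf[..3].copy_from_slice(&len.to_le_bytes()[..3]);
}

fn read_len3(buf: &[u8]) -> u32 {
    u32::from_le_bytes([buf[0], buf[1], buf[2], 0])
}

fn main() {
    let mut entry = [0u8; 3];
    write_len3(&mut entry, 4_000_000);
    assert_eq!(read_len3(&entry), 4_000_000); // round-trips losslessly
}
```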
```rust
if required_len >= self.content.len() {
    let mut new_len = self.content.len().max(1);
    while new_len <= required_len {
        new_len *= 2;
```
In df64705:

This can overflow on a 16-bit machine. We should clamp `new_len` to `usize::MAX`. (Fine to address in a followup.)
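One way the clamping could look, as a sketch rather than the PR's actual code: keep the same growth policy but use `checked_mul` and saturate at `usize::MAX` instead of wrapping (`grown_len` is a hypothetical free-function version of the loop):

```rust
// Overflow-safe variant of the doubling loop: same growth policy as
// resize_if_needed, but saturating at usize::MAX instead of overflowing
// on targets with a small usize.
fn grown_len(current: usize, required_len: usize) -> usize {
    let mut new_len = current.max(1);
    while new_len <= required_len {
        new_len = match new_len.checked_mul(2) {
            Some(doubled) => doubled,
            None => return usize::MAX, // clamp instead of wrapping around
        };
    }
    new_len
}

fn main() {
    assert_eq!(grown_len(16, 20), 32);
    // Near the top of the address space the doubling saturates.
    assert_eq!(grown_len(usize::MAX / 2 + 1, usize::MAX), usize::MAX);
}
```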
In df64705: I also think that somewhere we should be capping … Again, this can wait for a followup, especially because I'm unsure of the exact number and I think we want to review this separately from the rest of the logic.
Follow-up issue for the allocation improvements: #5258
Add batched allocation for witness decoder. We also apply batched allocation for the index space to avoid allocating the full 16 MB upfront when an attacker claims 4 million witness_elements; instead, we allocate gradually in ~1 MB batches. This approach automatically addresses the concerns raised in rust-bitcoin#5258 and rust-bitcoin#5239 (comment).

Add batched allocation for witness decoder (data area). We allocate incrementally in ~1 MB batches. This approach automatically addresses the concerns raised in rust-bitcoin#5258 and rust-bitcoin#5239 (comment). We also add `reserve_batch` to the Mutants exclusion list: mutation testing in `WitnessDecoder::reserve_batch` changes the buffer allocation logic and causes an infinite loop during decoding.
…nessDecoder

9bf1522 consensus_encoding: implement batched allocation for WitnessDecoder (jrakibi)

Pull request description:

Currently an attacker can force a 32 MB memory allocation by providing only 10 bytes as input (see the test case that proves that in #5239 (comment)). With the current approach we reduce this attack surface to ~17 MB.

### Attack vector (test case inputs):
- `[0xFE, 0x00, 0x09, 0x3D, 0x00]` = 4_000_000 (witness count)
- `[0xFE, 0x00, 0x09, 0x3D, 0x00]` = 4_000_000 (1st element length)

10 bytes of input will trigger the allocation below.

Old approach with `resize_if_needed` (32 MB allocation):
* Index space = 16 MB (4_000_000 witness elements × 4 bytes)
* Initial buffer = 16 MB + 128 bytes
* When the 4 MB element length is decoded, `resize_if_needed` doubles the buffer
* Because `required_len = 16 MB + 128 + 4 MB = 20 MB + 128 bytes`, it doubles from 16 MB to 32 MB
* Total allocation = 32 MB

New approach with `reserve_batch` (~17 MB allocation):
* Index space = 16 MB
* Initial buffer = 16 MB + 128 bytes
* When a 4 MB element length is decoded, `reserve_batch` allocates in 1 MB batches
* Total allocation = 16 MB + 128 bytes + 1 MB = 17_000_128 bytes

This approach automatically addresses the concerns raised in #5258 and #5239 (comment).

ACKs for top commit:
  tcharding: ACK 9bf1522
  apoelstra: ACK 9bf1522; successfully ran local tests

Tree-SHA512: 81ad652f6158d333fc473846e0bb56d0726d4bf1b68a0ba82d4e673915224be306593a685b948d3e71bb114577033cb0d9f935ad69c39ba07b9ffe971a69a9d7
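The batched total stated in the description can be checked with a quick sum, using the constants given there (16 MB index space, 128-byte headroom, ~1 MB batches; `batched_total` is a hypothetical helper, not decoder code):

```rust
// Quick check of the ~17 MB batched-allocation total: index space plus
// headroom plus however many 1 MB batches the decoder has reserved so far.
fn batched_total(index_space: usize, headroom: usize, batches: usize, batch_size: usize) -> usize {
    index_space + headroom + batches * batch_size
}

fn main() {
    // One 1 MB batch is all that gets reserved before decoding notices the
    // claimed data never arrives, so the attack tops out near 17 MB.
    assert_eq!(batched_total(16_000_000, 128, 1, 1_000_000), 17_000_128);
}
```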
When attempting to simplify the `WitnessDecoder` in #5224, we noticed that we were double allocating, which erased the performance gains of the old decoder from #672. This updates `WitnessDecoder` to use a single allocation like the old decoder.

While the intention of this patch was purely performance, three of the new tests fail against the current version of the `WitnessDecoder`: `decode_empty_element`, `decode_multiple_empty_elements`, and `decode_incomplete_witness_count`.

For the first two tests, there is a functional change in how empty witness elements are handled. This patch introduces a short-circuit where the old version might get stuck in a loop forever requesting zero bytes.

The third test, `decode_incomplete_witness_count`, shows that the impl now fails more explicitly if `end` is called before the witness element count is complete, instead of just returning an empty witness.