Skip to content

Implemented instruction encoder (Closes #129)#254

Merged
athre0z merged 21 commits intozyantific:masterfrom
mappzor:encoder
Nov 9, 2021
Merged

Implemented instruction encoder (Closes #129)#254
athre0z merged 21 commits intozyantific:masterfrom
mappzor:encoder

Conversation

@mappzor
Copy link
Copy Markdown
Contributor

@mappzor mappzor commented Oct 18, 2021

Design goals

  • Semantic-driven API for encoding every possible instruction (ZydisEncoderRequest + ZydisEncoderEncodeInstruction).
  • Support for easy re-encoding of instructions (ZydisEncoderDecodedInstructionToEncoderRequest).
  • Strong correctness guarantee:
    • Every decodable instruction is also (re-)encodable.
    • Encoder must perform strict validation of encoder requests and must not produce invalid output. Output is considered valid when it can be decoded as instruction that matches semantics defined via ZydisEncoderRequest structure.

Non-goals

  • Natural consequence of semantic-driven API is lack of control over physical encoding (mostly). This means that even when decoded instruction is passed to ZydisEncoderDecodedInstructionToEncoderRequest and back to ZydisEncoderEncodeInstruction without any modifications, output generated by the encoder might be different although semantically equivalent to original instruction.
  • Performance. No benchmarks were performed and no time was spent on profiling and optimization, so there's some space for future improvements (this will be discussed later). That being said encoder tries to do its job doing as little things as possible. Process of matching encoder request against instruction definition candidates is the most time-consuming part. List of definitions is selected based on mnemonic via constant time lookup. Matching definitions has linear time complexity and biggest time saves can be found here (e.g. faster rejections based on most common criteria). Most instructions don't have many definitions per mnemonic though, so I wouldn't expect anything groundbreaking here. Current performance levels should be acceptable for most use cases.
  • Generating smallest possible instructions. There is no strong guarantee that encoder will produce smallest output possible for given encoder request. However some efforts have been made towards size-optimal outputs e.g. not using REX when it's not necessary, prioritizing 2-byte VEX when possible, not emitting useless legacy prefixes at all times, using optimal ModR/M/SIB encodings, optimizing sizes of immediates while taking advantage of sign-extending behaviors. Some possible future improvements will be discussed later.

Design

  • Encoding is performed by passing ZydisEncoderRequest structure to ZydisEncoderEncodeInstruction function.
  • Unlike decoder, encoder has no supporting structures persistent across calls.
  • Unlike decoder, encoder doesn't have features, it supports whole ISA with all extensions supported by Zydis. Exceptions: AMD branches are not supported (no place for things like that with semantic-driven API), "unsafe" nops are not supported (variants that require some ISA extensions NOT to be present)
  • ZydisEncoderDecodedInstructionToEncoderRequest facilitates one of most important use cases for this feature: decoding (-> modification) -> conversion to encoder request -> re-encoding. It's a lightweight function that performs minimal validation and simply moves data around. It will never crash but it might produce invalid encoder request when malformed ZydisDecodedInstruction is passed to it (btw you shouldn't do that in general). Corrupting some critical fields like e.g. operand count will be detected though, so ZYAN_STATUS_INVALID_ARGUMENT might be reported. Don't forget to check for errors :)
  • Most fields in ZydisEncoderRequest should be self-explanatory but some of them are results of non-trivial design decisions.
    • allowed_encodings most of the time, most users will set this to ZYDIS_ENCODABLE_ENCODING_DEFAULT. This field can be used to gain some control over physical encoding e.g. you can effectively disable AVX-512 by prohibiting EVEX encodings. This is important because many instructions can have same mnemonic but definitions for more than one prefix (e.g. VEX and EVEX).
    • branch_type gives granular control over branch size and type (near, far, wherever you are), used only for control-flow instructions.
    • address_size_hint/operand_size_hint for vast majority of instructions those hints are simply ignored. For some instructions though, they serve a very important purpose of expressing semantics that cannot be derived from explicit/implicit operands. String instructions (e.g. stos family) are a good example as they take no arguments but hidden operands can scale with address size attribute.
    • evex/mvex substructures contain information about features specific to those prefixes. Don't specify broadcast if instruction uses static broadcast. zeroing_mask should be ZYAN_TRUE if instruction uses forced zeroing mask.
    • ZydisEncoderOperand.reg.is4 that's probably one of my favourite special cases in whole ISA. is4 is meant to denote 4th source operand. If it's 4th source why is it necessary to specify that? Because with some instructions more than one configuration is possible and 4th operand becomes... 3rd operand. Brilliant and perfectly logical! This field helps to resolve that ambiguity.
  • Encoder tries to infer as much information as possible from encoder request. It decides on its own about things like 66/67 prefixes to match user-requested semantics.

Implementation

Conceptually encoding process is divided into 4 phases:

  • early sanity checks (ZydisEncoderCheckRequestSanity)
  • definition matching (ZydisFindMatchingDefinition) - parts of encoder request are matched against definition candidates, tons of checks happen here to ensure a proper instruction will be generated
  • instruction building (ZydisBuildInstruction) - if matching definition is found, encoder converts user-provided operands to members of ZydisEncoderInstruction using instruction definition as a blueprint
  • emission (ZydisEmitInstruction) - instruction is emitted byte by byte, this is the simplest part of the process, only error that can happen at this stage is running out of space in user-provided buffer

TODOs, known issues, limitations, future improvements, etc.

Item tags:

  • [Required] - this item must be completed before merge
  • [Optional] - decision has to be made before merge, resulting in some action or no action before merge
  • [Improvement] - possible future improvements

Items:

  • [Required] Integrate encoder table generation scripts with ZydisEditor (I will push them to this repo soon: https://github.com/mappzor/zydis-db). EDIT: Will be resolved by reusing existing python generator.
  • [Optional] Decide what to do about Zycore-related TODOs (there are 2)
  • [Optional] Currently fuzz target in ZydisFuzzIn can be selected via ZYDIS_FUZZ_TARGET define. I think it would be better to move shared code to a separate file and make 3 separate projects for different fuzz targets (every project would include shared source file + target-only file).
  • [Optional] Improve eded fuzz target by adding extra validation step. Semantics in encoder request should be compared with decoded instruction to make sure encoder doesn't misrepresent anything. It's not critical because ded target should catch such cases but it would be nice to have.
  • [Optional] I have over 100 crashes from fuzzing and perhaps we should use them as regression tests. I already use them that way during development.
  • [Improvement] Probably the biggest optimization opportunity: compress operand count and operand types for every definition into a single integer. Storing such integer in encoder's table will increase its size but will allow faster rejections of many unreasonable instruction candidates.
  • [Improvement] Size-optimal output. It seems that there's only one more thing to do to achieve this in almost every case: sorting encoder's table correctly. I will probably work on this in the future.
  • [Improvement] Support for compressed 8-bit displacements. Good luck to anyone wishing to implement validation for that stuff. Determining cd8_scale value in code is just hell. It's possible to cheat and add it to encoder's table but this will waste space as only EVEX/MVEX needs this. Currently 32-bit displacements are forced to avoid the problem (2 places).

Closes #129

@mappzor mappzor force-pushed the encoder branch 4 times, most recently from cfbe0aa to 5ccf3c3 Compare October 18, 2021 20:03
@flobernd flobernd requested review from athre0z and flobernd October 19, 2021 06:07
@flobernd
Copy link
Copy Markdown
Member

First of all again thank you very much for this very high quality PR! I did still not check all the code in detail yet, but already have some input:

Natural consequence of semantic-driven API is lack of control over physical encoding (mostly).

From what I see, the EncoderRequest already provides a lot of options to control the resulting physical encoding. It might be a neat thing to have a preferred_length as well (e.g. for the NOPs).

ZydisEncoderOperand.reg.is4

Do you have a list of all affected instructions?

Decide what to do about Zycore-related TODOs (there are 2)

For the INT8_MIN, ... types I would definitely like to have them added to Zycore/Types.h. The second one I have not found yet.

Size-optimal output. It seems that there's only one more thing to do to achieve this in almost every case: sorting encoder's table correctly. I will probably work on this in the future.

That was as well our approach in our early encoder draft. Should work pretty well.

Probably the biggest optimization opportunity: compress operand count and operand types for every definition into a single integer. Storing such integer in encoder's table will increase its size but will allow faster rejections of many unreasonable instruction candidates.

I like that idea. Binary size is important for some micro-controllers and stuff, but I personally would always prioritize performance over size.

Support for compressed 8-bit displacements.

I think we should brainstorm about this point. I would definitely love to fully support this somehow. Most users won't care if the resulting bytecode in the process of decoding -> modifying -> encoding is smaller than the original one (they can always fill with NOPs), but the other case might be pretty bad for automatic code translation/patching. You often have limited space and shifting instructions around is not possible.

It's possible to cheat and add it to encoder's table but this will waste space as only EVEX/MVEX needs this.

Probably we can choose a similar approach like in the decoder tables where we have different structs for LEGACY, VEX, XOP, EVEX and MVEX insturction definitions.

Another point that would be great for the future is support for the conditional feature defines to disable AVX512 / KNC entirely to reduce binay size (and probably gain a pint of additional performance).

@mappzor
Copy link
Copy Markdown
Contributor Author

mappzor commented Oct 19, 2021

I pushed scripts: https://github.com/mappzor/zydis-db/tree/encoder/ZydisEncoder
Quick guide:

Generate tables (to stdout), with my setup I use this: py generate_encoder_tables.py >..\..\zydis\src\Generated\EncoderTables.inc
Print all defs (no hidden operands): py generate_encoder_tables.py --mode print
Print all defs (with hidden operands): py generate_encoder_tables.py --mode print-all

They require Python 3, no external dependencies. Generator uses argparse, so --help works (there are few other modes but they aren't very important). Disregard instruction-order-test stuff completely, it's only for testing that main loop performs same deduplication steps as ZydisEditor (it's a critical assumption, so I was very cautious about this in early stages of development).

As mentioned main loop does same processing as ZydisEditor however it does some extra merging as well (allowed CPU modes, operand sizes, address sizes). Not everything there is very elegant but it just works. After main loop there are 2 elimination passes: one for mandatory prefixes and another one for hidden operand scaling. Hidden operand scaling is probably the biggest new concept introduced to instruction definitions and there is some non-trivial processing being done here (not in the elimination pass itself but mostly later inside get_size_hint).

You may notice that nothing gets ever removed from instruction list and that's intended. I flag instructions with extra attributes for various reasons but don't remove them. This way print mode remains very useful for debugging.

From what I see, the EncoderRequest already provides a lot of options to control the resulting physical encoding. It might be a neat thing to have a preferred_length as well (e.g. for the NOPs).

Yes, control is provided where it's absolutely necessary e.g. branching. Whole idea behind things like NEAR16, NEAR32 instead of just NEAR is that branching uses rel operands, so size of instruction must be controllable. I thought about something like preferred_length early during development however implementing it for all instructions is more challenging than it seems at first. Ultimately I made it a non-goal and reasoning was that eventually size-optimal outputs will cover most use-cases related to this (because as you said, padding is always possible) and that moves us to the next point (2 points actually)...

Size-optimal output. It seems that there's only one more thing to do to achieve this in almost every case: sorting encoder's table correctly. I will probably work on this in the future.

That was as well our approach in our early encoder draft. Should work pretty well.

Sorting process will involve calculating things like "minimal instruction size for this definition", maybe maximal as well. I believe that implementation process here will shed some light on whole "user-requested size" idea and probably help to discover funny edge cases like always.

I think we should brainstorm about this point. I would definitely love to fully support this somehow. Most users won't care if the resulting bytecode in the process of decoding -> modifying -> encoding is smaller than the original one (they can always fill with NOPs), but the other case might be pretty bad for automatic code translation/patching. You often have limited space and shifting instructions around is not possible.

Second part that's missing is of course about those damned displacements. Use-case you've mentioned was in my head since day 1. I believe it will be a very common one and I will probably even need this for myself. That's what motivated me to take some steps in this direction already. Luckily number of users that need size-optimal transformations for AVX-512 instructions is probably quite low, so while this item should get done eventually it's not the most urgent thing. The more I think about it, the more I lean towards pre-computing scales and putting them into tables. Especially after considering next point...

Probably the biggest optimization opportunity: compress operand count and operand types for every definition into a single integer. Storing such integer in encoder's table will increase its size but will allow faster rejections of many unreasonable instruction candidates.

I like that idea. Binary size is important for some micro-controllers and stuff, but I personally would always prioritize performance over size.

I fully agree. While it's tricky to strike a balance between size, performance and ISA-support levels for decoder I don't think we have that issue with encoder. I can't think of any use-case where x86 code generation happens but environment is so constrained that few kilobytes of extra data make a drastic difference. However I can think of many use-cases where a lot of code needs to be generated as fast as possible so for me it's simple, performance is king here.

Btw just to let you know what we are talking about here's some data about current tables. Under MSVC, natural size of a single entry in main table was 48 bytes. After rearranging members, dropping enum types, packing and using bitfields this size got reduced to only 8 bytes per entry and that's where we stand today. With 7708 definitions (if I remember correctly) that's
60 KB (instead of original 360 KB). I think that's very decent and we should not worry about adding more stuff if it improves performance.

It's possible to cheat and add it to encoder's table but this will waste space as only EVEX/MVEX needs this.

Probably we can choose a similar approach like in the decoder tables where we have different structs for LEGACY, VEX, XOP, EVEX and MVEX insturction definitions.

I thought about this a lot and wanted to avoid it at least for initial implementation (as we discussed some weeks ago) because it complicated lookups. I think as we go forward this might be a good step to take though. It will enable us to have compile-time features for encoder just like decoder has.

ZydisEncoderOperand.reg.is4

Do you have a list of all affected instructions?

No, I will try to fetch. It's a bunch of AMD-specific stuff if I remember correctly. Thing to consider here is that there are only 2 configurations here: normal and retarded swapped, so I'm tempted to make it a ZyanBool in main struct instead of having it for every operand. However API will break if someone at Intel/AMD decides it's funny to introduce 3rd option of having 4th operand as e.g. 5th operand.

Decide what to do about Zycore-related TODOs (there are 2)

Second was very minor. I wanted to have something like ZyanPopcnt to count set bits quickly. This would depend on compiler-specific builtin (and support of associated instructions in underlying hardware). Probably not a good idea and not worth wasting time.

@mappzor
Copy link
Copy Markdown
Contributor Author

mappzor commented Oct 19, 2021

List of offending is4 cases:

vfmaddpd
vfmaddps
vfmaddsd
vfmaddss
vfmaddsubpd
vfmaddsubps
vfmsubaddpd
vfmsubaddps
vfmsubpd
vfmsubps
vfmsubsd
vfmsubss
vfnmaddpd
vfnmaddps
vfnmaddsd
vfnmaddss
vfnmsubpd
vfnmsubps
vfnmsubsd
vfnmsubss
vpcmov
vpermil2pd
vpermil2ps
vpperm

@mrexodia
Copy link
Copy Markdown
Contributor

give

@mappzor
Copy link
Copy Markdown
Contributor Author

mappzor commented Oct 19, 2021

I added support for compressed displacements. After closer look I've changed my mind about pre-computing, so everything is handled via code.

@mappzor
Copy link
Copy Markdown
Contributor Author

mappzor commented Oct 23, 2021

Probably the biggest optimization opportunity: compress operand count and operand types for every definition into a single integer. Storing such integer in encoder's table will increase its size but will allow faster rejections of many unreasonable instruction candidates.

I tried this and as expected improvement is quite significant. For a simple benchmark I decoded and re-encoded every instruction from .text section of Win10 kernel (that's around 1.1M instructions) and repeated this process 100 times. Using operand masks for fast rejection of instruction candidates resulted in around 20% improvement in execution time. Of course results will vary depending on binary, its characteristics, etc. but this little experiment proved it's worth to add those 2 extra bytes per entry.

@flobernd
Copy link
Copy Markdown
Member

Very nice! Definitely sounds more than worth to add these few bytes.

Copy link
Copy Markdown
Member

@athre0z athre0z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The quality of this PR is just insane, mad respect. The design looks very well thought-out. You managed to emulate our code-style so close that it's hard to believe given that we don't have it written down anywhere.

I'm nowhere near having gone though all the details yet, but I figured I'd leave some preliminary comments. I'll try to do a full review within this week.

Copy link
Copy Markdown
Member

@athre0z athre0z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments, still not through with all of it!

Comment on lines +136 to +123
ZYDIS_ENCODABLE_BRANCH_TYPE_NONE,
ZYDIS_ENCODABLE_BRANCH_TYPE_SHORT,
ZYDIS_ENCODABLE_BRANCH_TYPE_NEAR16,
ZYDIS_ENCODABLE_BRANCH_TYPE_NEAR32,
ZYDIS_ENCODABLE_BRANCH_TYPE_NEAR64,
ZYDIS_ENCODABLE_BRANCH_TYPE_FAR16,
ZYDIS_ENCODABLE_BRANCH_TYPE_FAR32,
ZYDIS_ENCODABLE_BRANCH_TYPE_FAR64,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For de-deplication, would it make sense to merge this with the decoder's ZydisBranchType, extending the decoder variant with the size suffixes included in this one?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From encoder's perspective that's a desirable change. I needed a type that essentially encodes a tuple (branch_type, branch_immediate_value_size) hence ZydisEncodableBranchType was born. Merging these types would make translation between them unnecessary and that's a welcome change.

Now let's look at decoder impact. In decoder branch_type provides branch type ONLY and if anyone needs information about its size it must be extracted separately. Only downside is that if you are interested in just branch type e.g. NEAR you would need to test for NEAR16 | NEAR32 | NEAR64. Is that a big deal? Probably not.

Only piece of code that uses branch_type in decoder is this one:

                instruction->meta.branch_type = definition->branch_type;
                ZYAN_ASSERT((instruction->meta.branch_type == ZYDIS_BRANCH_TYPE_NONE) ||
                        ((instruction->meta.category == ZYDIS_CATEGORY_CALL) ||
                         (instruction->meta.category == ZYDIS_CATEGORY_COND_BR) ||
                         (instruction->meta.category == ZYDIS_CATEGORY_UNCOND_BR) ||
                         (instruction->meta.category == ZYDIS_CATEGORY_RET)));

So no major problems here as well. I'd say I'm in favor of such change for the sake of simplicity. However I will need to count on you guys because this requires generator change (definition->branch_type as seen above).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be fine with converting the decoder flag enum to flags adding a combined mask for backwards compatibility.

We just have to check if the decoder always easily knows the size (e.g. even in minimal mode). The tables only save the NEAR/FAR/SHORT because some of the instructions are affected by the 66h prefix.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this would still be a breaking change even if defining it as flags, because for the mask to work, you have to do & BRANCH_TYPE whereas existing code will most likely currently use == BRANCH_TYPE.

Unrelated to the comment above, if we do flags, instead of having a mask, we could also represent it like this:

enum ZydisBranchFlags {
  ZYDIS_BRANCH_FLAGS_16 = 1 << 0,
  ZYDIS_BRANCH_FLAGS_32 = 1 << 1,
  ZYDIS_BRANCH_FLAGS_64 = 1 << 2,

  ZYDIS_BRANCH_FLAGS_SHORT = 1 << 3,
  ZYDIS_BRANCH_FLAGS_NEAR = 1 << 4,
  ZYDIS_BRANCH_FLAGS_FAR = 1 << 5,
};

The drawback with this approach is that it would allow for nonsensical combinations of flags like short+64.

@athre0z athre0z added C-enhancement Category: Enhancement of existing features P-high Priority: High labels Nov 6, 2021
@athre0z
Copy link
Copy Markdown
Member

athre0z commented Nov 6, 2021

@mappzor @flobernd what do you think about preparing to merge this PR mostly as-is for now, converting the remaining open points into issues? For me personally, none of them are deal-breakers, just overall clean-up and enhancements. Since this PR is so big, I feel like it would make more sense to have dedicated issues for discussing the remaining stuff properly. If you both agree, I'd probably perform an interactive rebase to clean up the commit history, rebase on master and merge.

@mappzor
Copy link
Copy Markdown
Contributor Author

mappzor commented Nov 6, 2021

@athre0z I'd like to finish size-optimal outputs and regression tests as a part of this PR (there are some other minor improvements as well). I'm really close, hopefully will push this tonight.

From review items I definitely want to implement ZYDIS_ATTRIB_ instead of current encodable attributes. I'd convert branch_type to a separate issue as it will also affect the decoder and I'm not even sure if we really want this change at this point.

If you both agree, I'd probably perform an interactive rebase to clean up the commit history, rebase on master and merge.

Some commits probably can be merged but it's not like we have a ton of "fixed a typo", etc. commits. Having some history for feature-like things and bigger changes is generally a good practice.

@athre0z
Copy link
Copy Markdown
Member

athre0z commented Nov 6, 2021

I'd like to finish size-optimal outputs and regression tests as a part of this PR (there are some other minor improvements as well). I'm really close, hopefully will push this tonight.

Alright!

I'd convert branch_type to a separate issue as it will also affect the decoder and I'm not even sure if we really want this change at this point.

I agree. In fact, this is exactly what caused my suggestion to merge and discuss this in a dedicated issue! :) I typed up an alternative proposal, and before posting it, I figured that it might be better to discuss it separately.

Some commits probably can be merged but it's not like we have a ton of "fixed a typo", etc. commits. Having some history for feature-like things and bigger changes is generally a good practice.

Yes, I intend to keep the majority of them, just filtering out the fix-up commits.

], start=0)


ZydisMnemonic = IntEnum('ZydisMnemonic', [
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible future improvement: at least make ZydisMnemonic and ZydisRegister auto-generated. Ideally whole file can be auto-generated. There are two ways to go about it:

  • use existing generator to do heavy-lifting here (good for mnemonics and registers, but not for other enums)
  • at the cost of extra python dependency it's possible to parse C header files and generate all types at run-time

Copy link
Copy Markdown
Member

@flobernd flobernd Nov 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work on the regression tests :-)

I personally prefer the second option here as it reduces the amount of places where we have to update enums (especially the ones different to ZydisMnemonic and ZydisRegister). The current generator would only be able to auto-generate ZydisMnemonic. For ZydisRegister there is a second generator, but I never finished it (it was supposed to generate the C enums and as well the Delphi ones for the primary generator from a json file).

@mappzor
Copy link
Copy Markdown
Contributor Author

mappzor commented Nov 7, 2021

Some changes are still under way. I've spent more time than expected on regression tests because my idea has changed drastically. At first I just wanted to upload my crash files and make a simple script to run fuzz binaries on them.

However I thought I'd be great to have a human-readable database of tests, so I invented crash_tool 😀

@athre0z athre0z added this to the v4.0.0 milestone Nov 7, 2021
@mappzor
Copy link
Copy Markdown
Contributor Author

mappzor commented Nov 8, 2021

After today's changes we have size-optimal outputs, better fuzzer, bunch of bugs fixed (resistance against ill-formed encoder requests, found with improved fuzzer), more test cases, ZYDIS_ATTRIB_* stuff and improved tools.

As far as I'm concerned, it's ready to merge :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C-enhancement Category: Enhancement of existing features P-high Priority: High

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support instruction encoding / assembling

4 participants