Shared object (.so) files contain optimized machine code compiled from languages such as C, C++, and Rust. They implement common library routines used on Linux and other Unix-like operating systems to promote code reuse across software projects. But once source code has been compiled into these ubiquitous *.so binaries, is it possible to recover the original code that built them?

Understanding Shared Objects and Dynamic Linking

When compiled into machine code, source code from higher-level languages loses all of its original formatting, comments, and naming. But why does this one-way "information loss" occur during compilation? And what exactly happens behind the scenes when generating and linking .so files?

The Compilation Pipeline

Modern compilers for languages like C/C++ perform extensive optimization and translation to convert source code into processor-specific machine code.

Compilation phases from source to machine code

As the diagram shows, it's a multi-stage pipeline:

  • Preprocessing: Inserts header contents, expands macros
  • Compilation: Syntax and semantic analysis, then translation to assembler
  • Assembly: Converts assembler mnemonics into opcodes, fixing up addresses
  • Linking: Links libraries, finalizes addresses, produces the target binary

By the end of this pipeline only optimized machine code remains. All comments, formatting, naming and structure from the source are discarded.

Decompilers can attempt to reverse the compilation process, but won't recover the exact original formatting and naming without embedded debug information.

Static vs Dynamic Linking

Libraries in C/C++ can be either statically or dynamically linked:

  • Static linking embeds library machine code directly into the target binary at compile time. This makes executables self-contained but also larger in size.

  • Dynamic linking defers resolving library references until load time, linking them dynamically from external *.so files. This saves space but introduces runtime dependencies.

The shared object (*.so) files used for dynamic linking contain code intended to be loaded and linked at application runtime rather than compile time. This allows multiple apps to reuse common dynamically linked libraries instead of embedding redundant copies of static code into each app.

.SO Files in Linux Distros

Dynamic linking via shared objects offers several advantages in Linux distributions, leading to widespread adoption. For example, in Ubuntu 22.04 LTS:

  • Over 2550 .so system libraries provided in base image
  • 40-60% size reduction for dynamically linked executables
  • Security and bug fixes get propagated via central .so files
  • Easier for apps to use latest libraries at runtime
  • Major components like glibc dynamically linked

Statistics across distributions like Fedora, RHEL, Debian and openSUSE show similarly high utilization of dynamic linking through .so shared object files.

Analyzing Compiled Machine Code

While the original source code can't be extracted from compiled shared libraries, we can still analyze the machine code using techniques like:

Direct Machine Code Analysis

Using a debugger or hex editor, you can directly inspect the compiled machine instructions and data tables inside a .so file.

This requires understanding the target architecture (x86, ARM, etc.) and its instruction encoding. For example, x86-64 machine code uses opcodes like:

Machine Code   Mnemonic    Description
---------------------------------------------------
0xB8           MOV         Move immediate value into register
0x89           MOV         Move register value into register or memory
0xE8           CALL        Call subroutine
0xC3           RET         Return from subroutine

By tracing raw opcodes and registers, one can deduce data flow and control flow of the program logic. But substantial hardware architecture expertise is needed.

Decompilation to Higher Level Code

Instead of analyzing raw machine code, decompilers can automatically translate it into approximately equivalent C/C++ code. Popular decompilers include:

  • Ghidra: NSA developed, advanced analyses, user-friendly UI
  • Radare2: Feature-rich command-line tool with scripting bindings
  • RetDec: Produces clean readable output code
  • Hopper: User-friendly commercial GUI for macOS and Linux

For example, here is a trivial C function compiled to x64 assembly and then decompiled back to C using Ghidra:

// Original C Source Code 
int sum(int a, int b){
  return a + b; 
}
# Compiled Assembly Code
    push    rbp
    mov     rbp, rsp
    mov     DWORD PTR [rbp-0x4], edi
    mov     DWORD PTR [rbp-0x8], esi
    mov     eax, DWORD PTR [rbp-0x4]
    add     eax, DWORD PTR [rbp-0x8]
    pop     rbp
    ret
// Ghidra Decompilation Output
int sum(int param_1,int param_2)

{
  return param_1 + param_2;
}

The logic matches the original C code, but artifacts like naming and formatting are lost in compilation.

Decompilers use techniques like control-flow analysis, data-flow analysis and type inference to recover the code and data constructs present in the original source. But the output is only an approximation intended to match functionality, not recreate the original styling.

Dynamic and Static Analysis

Using tools like ldd, strace and ltrace, we can dynamically analyze a .so file's dependencies, function calls and library usage at runtime as an application executes.

Complementary static analysis examines the binary without execution. This reveals compile-time dependencies and capabilities independent of runtime inputs.

Analyzing both static and dynamic aspects gives a complete picture of a shared library's functionality.

Behavioral Analysis

Similar to dynamic analysis, we can treat the .so file as a black box and simply observe the functional behavior it produces when utilized by client programs.

Monitoring inputs and outputs gives insight into purpose and design even without seeing actual code.

Malware reverse engineering combines static examination and behavioral monitoring to understand threat capabilities without relying on source code availability.

Decompilation Challenges

While decompilers can approximate original logic, producing readable source code from machine code remains an imperfect craft, plagued with challenges like:

  • Stripped symbol tables lose naming information entirely
  • Higher-level types and constructs can't be reliably inferred
  • Dynamic dispatch and indirect jumps obscure control flow
  • Cryptographic routines contain few semantic clues
  • Code packed via protectors defies analysis attempts

As compiler optimizations and code obfuscation techniques advance, decompilers struggle to cope without higher-level context. Modern machine learning techniques show promise, learning statistical patterns from large codebases to better reconstruct original logic. But fundamental limitations persist given compilation's lossy, one-way nature.

Security Implications

The inability to recover source code from compiled shared object files has security implications:

  • Increased risk and effort to audit code without source availability
  • Easier to hide malicious code lacking transparency
  • Patches take longer when defects must be fixed blindly at the binary level
  • Code visibility helps identify backdoors, logic bombs

Proprietary software users place full trust in opaque closed-source binaries. This creates opportunities for trojans and supply chain attacks. Recent software bill-of-materials standards help track component provenance, but fall short of mandating public source code.

Legal Risks of Reverse Engineering

Attempting to decompile commercial closed-source software without explicit permission raises potential legal concerns:

  • DMCA provisions against circumventing technical protections
  • Copyright infringement from reproducing non-public APIs
  • Patents granted on implementation techniques
  • Trade secret protections of sensitive logic

Many license agreements specifically prohibit reverse engineering. While interoperability exceptions exist in some countries, legal guidance should be sought before decompiling or analyzing third party code without consent.

Ethical considerations also come into play when responsibly utilizing decompilation output or reporting discovered vulnerabilities. Customers paying for proprietary software reasonably expect authors to protect their intellectual property using obfuscation and legal protections.

Conclusion

In summary, source code compiled into Linux shared object (*.so) files can't be perfectly recovered or restored to its original pre-compiled form. Machine code generation across the multi-stage compilation pipeline permanently discards most human-readable formatting, naming and code structure.

However, with technical expertise and advanced tooling, determined engineers can analyze binaries, understand the implemented logic, and approximate equivalent functionality using decompilation. This enables examination of software components even when source code is unavailable. But inherent challenges and risks exist when reverse engineering compiled code without explicit permission.

Understanding these fundamental software principles helps illustrate why publishing open source remains vital for transparency, security and trust across our technology infrastructure.
