Practical Binary Analysis

Build Your Own Linux Tools for Binary Instrumentation, Analysis, and Disassembly

Dennis Andriesse · 456pp

overview

Overview

Practical Binary Analysis (2018, No Starch Press) is a rigorous, hands-on technical reference for security researchers, reverse engineers, and advanced developers who need to understand, analyze, and manipulate compiled Linux binaries at a low level. Written by security researcher Dennis Andriesse, the book focuses on building custom binary analysis tooling rather than relying on existing platforms — a deliberate pedagogical choice that produces readers capable of writing their own instrumentation frameworks.

The book walks the reader through the complete stack: from reading raw ELF binaries and x86-64 assembly, through static disassembly and decompilation, to dynamic binary instrumentation (using Intel Pin), control flow reconstruction, taint tracking, and symbolic execution. Each tool is built incrementally across twelve chapters, with every chapter culminating in a real CTF-style puzzle or a malware analysis challenge.

Executive Summary

graph TD
    A["Practical Binary Analysis"] --> B["Binary Foundations"]
    A --> C["Static Analysis"]
    A --> D["Dynamic Analysis"]
    A --> E["Advanced Analysis"]
    A --> F["CTF and Malware Challenges"]
    A --> G["Tool Building Philosophy"]

    B --> B1["ELF Binary Format<br/>(sections, segments, headers)"]
    B --> B2["x86-64 Assembly Review<br/>(instructions, calling conventions)"]
    B --> B3["Disassembly vs Decompilation<br/>(what each produces and when)"]
    B --> B4["Control Flow Graphs<br/>(CFG construction from raw bytes)"]
    B --> B5["Data Flow Analysis<br/>(reaching definitions, liveness)"]

    C --> C1["Static Disassembly<br/>(recursive descent, linear sweep)"]
    C --> C2["Function Boundary Detection<br/>(frame analysis, signatures)"]
    C --> C3["Control Flow Recovery<br/>(direct vs indirect branches)"]
    C --> C4["Decompilation Frameworks<br/>(Ghidra, IDA Pro comparison)"]
    C --> C5["Code Cross-References<br/>(data refs, call targets, jumps)"]

    D --> D1["Intel Pin Framework<br/>(JIT instrumentation model)"]
    D --> D2["Dynamic Instrumentation<br/>(instruction-, trace-, routine-, image-level)"]
    D --> D3["Basic Block Profiling<br/>(execution frequency, coverage)"]
    D --> D4["Dynamic Taint Tracking<br/>(propagating taint through registers/memory)"]
    D --> D5["Path Constraint Collection<br/>(gathering branch conditions)"]

    E --> E1["Taint Analysis Deep Dive<br/>(source-to-sink tracking)"]
    E --> E2["Symbolic Execution<br/>(angr, concretization strategies)"]
    E --> E3["De-obfuscation<br/>(control flow flattening recovery)"]
    E --> E4["Rootkit Detection<br/>(syscall table hooks, DKOM)"]
    E --> E5["PE/Mach-O Coverage<br/>(cross-format analysis)"]

    F --> F1["CTF Binary Puzzles<br/>(crackmes, keygens, protections)"]
    F --> F2["Malware Static Analysis<br/>(packed samples, unpacking)"]
    F --> F3["Real-World Reverse Engineering<br/>(commercial software analysis)"]
    F --> F4["Anti-Debug and Anti-VM<br/>(detection and bypass)"]

Book Structure

| Part | Chapters | Focus |
|------|----------|-------|
| I: Binary Foundations | 1–3 | ELF format, x86-64 assembly primer, tools setup and environment |
| II: Static Analysis | 4–6 | Disassembly algorithms, CFG construction, decompilation concepts |
| III: Dynamic Analysis | 7–8 | Intel Pin framework, writing first Pin tools, code coverage |
| IV: Taint and Symbolic Execution | 9–10 | Dynamic taint tracking, Pin-based taint tracker implementation |
| V: Advanced Topics and Challenges | 11–12 | Symbolic execution with angr, de-obfuscation, rootkit detection, final CTF |

Key Takeaways

Building tools beats using tools. The book's central thesis is that a researcher who understands how to build a disassembler can use any disassembler. Andriesse does not teach Ghidra or IDA — he builds his own, from scratch, so you understand every decision those tools make.
Disassembly is not deterministic. There are two canonical algorithms — recursive descent (used by IDA Pro) and linear sweep (used by objdump) — and they produce different results on the same binary. Understanding this distinction is essential for any analyst who has ever seen a tool "miss" a function.
Control flow recovery is the hardest unsolved problem in binary analysis. Indirect branches, computed jumps, and overlapping instructions make CFG reconstruction fundamentally ambiguous. The book treats this not as a solved problem but as an ongoing research problem.
Intel Pin is the best entry point for dynamic binary instrumentation. Pin's JIT model, rich API, and platform support make it the preferred framework for building custom analysis tools on Linux. Andriesse builds a full taint-tracking system inside Pin.
Taint analysis answers "where did this value come from?" Dynamic taint tracking propagates metadata (the taint) through registers and memory as the program executes, enabling source-to-sink analysis, which is critical for understanding how user input reaches sensitive operations.
Symbolic execution scales poorly without strategy. Pure symbolic execution (as in early KLEE) hits path explosion within minutes on real binaries. Andriesse demonstrates concretization and selective symbolic execution as practical solutions.
De-obfuscation is reverse engineering of reverse engineering. Obfuscators reshape control flow (flattening), encrypt strings, and insert opaque predicates. The book shows how to detect and recover each pattern algorithmically.
ELF is more complex than most engineers assume. The Executable and Linkable Format has an intricate internal structure: program headers, section headers, symbol tables, relocation entries, dynamic linking metadata, and note segments — each of which can hide analysis-relevant data.
Malware analysis is an adversarial game. Packed binaries, anti-debugging tricks, and VM-aware code are not edge cases — they are the baseline in modern malware. Andriesse treats each anti-analysis technique as a puzzle to be systematically bypassed, not an inconvenience to be worked around.
CTFs are not just games — they are a training ground. The book uses CTF-style crackmes throughout. These puzzles are deliberately designed to teach specific analysis skills, and the techniques discovered solving them transfer directly to real-world reverse engineering tasks.

Who Should Read

| Reader Type | Why |
|---|---|
| Security researchers and malware analysts | The single most thorough practical guide to building binary analysis tooling |
| Reverse engineering practitioners | Deep treatment of CFG, taint, and symbolic execution at the implementation level |
| Compiler and tools developers | Understand what happens to your output at the machine level |
| CTF participants | Directly applicable to all binary exploitation and reversing categories |
| OS kernel and systems programmers | Low-level understanding of ELF and calling conventions |
| Graduate students in computer security | Excellent foundation for binary analysis research projects |
| Software security auditors | Techniques for fuzz target preparation and coverage analysis |

Who Should Skip

Programmers without any assembly language background — the x86-64 content is dense and assumed
Managers and non-technical decision-makers looking for a survey — this is pure engineering
Developers who only work with managed languages (Java, C#, Go) at a high level
Web application security specialists with no binary focus
Readers looking for a quick-start guide — each chapter builds on the previous implementation

Historical Context

| Date | Event |
|------|-------|
| 2005 | Intel introduces the Pin framework (free to use, but closed-source) |
| 2006 | BitBlaze project at UC Berkeley pioneers taint tracking for binary analysis |
| 2015 | angr binary analysis framework open-sourced (UC Santa Barbara) |
| 2016 | DARPA Cyber Grand Challenge (CGC) final event catalyzes automated binary analysis |
| 2016 | Dennis Andriesse co-authors "An In-Depth Analysis of Disassembly on Full-Scale x86/x64 Binaries" (USENIX Security) |
| 2018 | Practical Binary Analysis published by No Starch Press |
| 2019 | NSA publicly releases and open-sources the Ghidra reverse engineering suite |
| 2023–24 | AI-assisted reverse engineering tools emerge; Pin remains a widely used DBI framework |

Reverse engineering as a discipline has matured rapidly since 2005. Where it was once largely an art practiced by a small community of anti-virus and military analysts, it has become a critical skill in CTFs, vulnerability research, and malware analysis. Andriesse's book captures the discipline at a moment when the foundational tooling (Pin, angr, radare2) was mature enough to build real analysis pipelines, but the high-level abstraction had not yet obscured the internals.

Core Themes

| Theme | Description |
|------|---|
| Tool Building Over Tool Using | Understanding comes from building, not just running |
| Static vs Dynamic Complementarity | Each analysis mode reveals what the other conceals |
| CFG Reconstruction as Research Problem | The central unsolved challenge in binary analysis |
| Taint Analysis for Exploit Detection | Tracking untrusted input flow through binary execution |
| Symbolic Execution Trade-offs | Completeness vs. performance in automated analysis |
| ELF Internals as Attack Surface | Binary format metadata is a source of hidden information |
| De-obfuscation as Systematic Process | Each obfuscation technique has a structured reversal |
| Malware as a Moving Target | Adversarial binary analysis requires continuous adaptation |
| CTFs as Pedagogical Tool | Structured puzzles develop practical analysis instincts |
| Linux-Centric Analysis | Focuses on ELF/Intel Pin, not Windows/PE-centric tooling |

Why This Book Matters

Practical Binary Analysis fills a specific and important gap in the security literature. Most binary analysis books fall into one of two categories: academic texts that describe algorithms without implementation detail, or practical cookbooks that show how to use existing tools (IDA, Ghidra, Radare2) without explaining what they do internally. Andriesse's book is the rare volume that does both at once: it teaches the theory and implements it.

The Intel Pin sections are particularly valuable. Pin has excellent official documentation, but it is reference-level, not tutorial-level. Andriesse's chapter-by-chapter construction of a taint tracker, memory profiler, and coverage tool is the most accessible Pin tutorial available as of 2024.

The book also stands out for its focus on adversarial analysis: anti-debugging, anti-disassembly, and the cat-and-mouse game between obfuscators and de-obfuscators. This material is rarely taught in a structured way, and Andriesse's treatment of control flow flattening recovery is among the clearest available.

| Book | Author | Connection |
|------|--------|-----------|
| The IDA Pro Book | Chris Eagle | Companion to Andriesse; focuses on mastering IDA rather than building from scratch |
| Reverse Engineering for Beginners | Dennis Yurichev | Free, broader coverage; less depth on Pin and taint analysis |
| Malware Analyst's Cookbook | Michael Hale Ligh et al. | Practical malware analysis with more platform breadth, less tool-building |
| Identifying Malfunctioning Code | Thomas Dullien / Halvar Flake | Theoretical foundation; Andriesse applies these ideas practically |
| Reversing: Secrets of Reverse Engineering | Eldad Eilam | Earlier treatment of the field; outdated tooling but solid fundamentals |
| Practical Reverse Engineering | Bruce Dang et al. | Windows-focused counterpart; covers x86-64 and ARM with a different format emphasis |
| The Shellcoder's Handbook | Chris Anley et al. | Exploitation-focused; Andriesse's book covers the analysis side that precedes exploitation |
| Fuzzing: Brute Force Vulnerability Discovery | Michael Sutton et al. | Overlap on coverage analysis and dynamic analysis techniques |
| angr documentation | angr project | Official angr documentation; pairs well with Andriesse's symbolic execution chapter |
| Intel Pin User's Guide | Intel Corporation | Official Pin reference; Andriesse provides the tutorial layer above it |

Final Verdict

Practical Binary Analysis is a genuinely excellent technical book. Andriesse's commitment to building every tool from first principles means the reader finishes not just knowing how binary analysis works, but being able to build a binary analysis system. That is a rare and valuable outcome.

The book's limitations are mostly ones of scope and currency: it is Linux-only, and the Pin-based code shown is in C++ for Pin 2.x; the Pin 3.x API changes are significant enough that some code requires adaptation. The symbolic execution chapter is necessarily shallow compared to the full angr documentation, and Windows-specific tooling and PE-centric analysis workflows are deliberately excluded. These are conscious authorial choices that make the book's achievement more, not less, impressive.

For its stated scope — Linux binary analysis via hands-on tool building — this is the best book available in 2024.

Rating: 9/10. Required reading for anyone serious about binary analysis. The Linux-only scope and aging Pin code are minor limitations against a remarkably clear and well-structured instructional text.

content map

flowchart TB
    subgraph Binary ["Binary Analysis Pipeline"]
        direction TB
        INPUT["Raw Binary File<br/>(ELF/PE/Mach-O)"] --> STATIC["Static Analysis Layer"]
        INPUT --> DYNAMIC["Dynamic Analysis Layer"]
        INPUT --> HYBRID["Hybrid Analysis Layer"]

        STATIC --> D1["Disassembly<br/>(recursive descent vs linear sweep)"]
        STATIC --> D2["Decompilation<br/>(Ghidra decompiler, IDA Hex-Rays)"]
        STATIC --> D3["CFG Reconstruction<br/>(control flow graph)"]
        STATIC --> D4["Data Flow Analysis<br/>(reaching definitions, liveness)"]

        DYNAMIC --> D5["Instrumentation<br/>(Intel Pin, DynamoRIO)"]
        DYNAMIC --> D6["Taint Tracking<br/>(source-to-sink propagation)"]
        DYNAMIC --> D7["Coverage Profiling<br/>(basic block frequency)"]
        DYNAMIC --> D8["Path Constraint Collection<br/>(branch conditions)"]

        HYBRID --> D9["Symbolic Execution<br/>(angr, KLEE)"]
        HYBRID --> D10["Concolic Execution<br/>(concrete + symbolic)"]
        HYBRID --> D11["De-obfuscation<br/>(flattening recovery, string decryption)"]
    end

    D1 --> OUT1["Assembly Listing"]
    D2 --> OUT2["Pseudo-C Output"]
    D3 --> OUT3["Function Map"]
    D4 --> OUT4["Variable Dependencies"]
    D5 --> OUT5["Execution Trace"]
    D6 --> OUT6["Tainted Addresses & Registers"]
    D7 --> OUT7["Coverage Map"]
    D8 --> OUT8["Path Constraints"]
    D9 --> OUT9["Feasible Path Set"]
    D10 --> OUT10["Concrete Inputs for Specific Paths"]
    D11 --> OUT11["Obfuscation Pattern Report"]

ELF Binary Format: The Linux Target

Andriesse's book operates primarily on ELF (Executable and Linkable Format) binaries, the standard executable format on Linux. Understanding ELF is the prerequisite for everything else in the book. ELF files are structured into segments (loaded into memory at runtime) and sections (used by the linker and tools).

ELF File Layout

graph LR
    A["ELF File"] --> H["ELF Header<br/>(magic bytes, architecture, entry point)"]
    A --> PH["Program Header Table<br/>(segments — what gets loaded)"]
    A --> SH["Section Header Table<br/>(sections — what gets linked)"]
    A --> SEC["Sections"]
    A --> DATA["Other Sections"]

    H --> H1["e_ident: 0x7F 'ELF'"]
    H --> H2["e_machine: x86-64 = 0x3E"]
    H --> H3["e_entry: virtual address of entry point"]

    SEC --> S1[".text<br/>(machine code)"]
    SEC --> S2[".data<br/>(initialized globals)"]
    SEC --> S3[".bss<br/>(zero-initialized globals)"]
    SEC --> S4[".rodata<br/>(read-only constants, strings)"]
    SEC --> S5[".symtab<br/>(symbol table — debug)"]
    SEC --> S6[".dynsym<br/>(dynamic symbol table)"]
    SEC --> S7[".plt / .got<br/>(dynamic linking)"]
    SEC --> S8[".rela.text<br/>(relocations)"]

    DATA --> D1[".init / .fini<br/>(constructor/destructor code)"]
    DATA --> D2[".got.plt<br/>(global offset table PLT)"]
    DATA --> D3[".eh_frame<br/>(exception handling)"]
    DATA --> D4[".comment<br/>(compiler version)"]
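
The header fields in the diagram can be read with a few lines of Python. The following is a minimal sketch, not the book's loader: it assumes a 64-bit little-endian ELF file, decodes only a handful of fields, and the helper names `parse_elf64_header` and `make_demo_header` are invented for this illustration.

```python
import struct

ELF_MAGIC = b"\x7fELF"
EM_X86_64 = 0x3E  # e_machine value for x86-64


def parse_elf64_header(data: bytes) -> dict:
    """Decode a few fields of a 64-bit little-endian ELF header."""
    if len(data) < 64 or data[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    ei_class, ei_data = data[4], data[5]
    if ei_class != 2 or ei_data != 1:
        raise ValueError("this sketch handles only 64-bit little-endian ELF")
    # The 16-byte e_ident is followed by e_type and e_machine (u16),
    # e_version (u32), then e_entry, e_phoff, e_shoff (u64 each).
    e_type, e_machine, _version, e_entry, e_phoff, e_shoff = struct.unpack_from(
        "<HHIQQQ", data, 16
    )
    return {
        "e_type": e_type,
        "e_machine": e_machine,
        "e_entry": e_entry,
        "e_phoff": e_phoff,
        "e_shoff": e_shoff,
    }


def make_demo_header(entry: int = 0x401000) -> bytes:
    """Build a synthetic 64-byte header so the example runs without a file."""
    ident = ELF_MAGIC + bytes([2, 1, 1, 0]) + bytes(8)  # class, data, version, ABI
    rest = struct.pack("<HHIQQQ", 2, EM_X86_64, 1, entry, 64, 0)
    return (ident + rest).ljust(64, b"\0")


header = parse_elf64_header(make_demo_header())
print(hex(header["e_machine"]), hex(header["e_entry"]))
```

To try it on a real binary, pass the first 64 bytes of the file instead of the synthetic header.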

Disassembly vs Decompilation: Two Complementary Representations

A core conceptual distinction running through the book is between disassembly (converting machine code bytes into assembly language text) and decompilation (attempting to convert machine code into a higher-level language like C).

| Property | Disassembly | Decompilation |
|----------|------------|----------------|
| Input | Raw machine code bytes | Disassembly output (assembly text) |
| Output | x86-64 assembly | Pseudo-C |
| Determinism | Semi-deterministic (depends on algorithm) | Non-deterministic (ambiguous) |
| Confidence | High once instruction boundaries are known (decoding from a given offset is unambiguous) | Variable (many C constructs share one assembly form) |
| Tool examples | objdump (linear sweep), IDA Pro (recursive descent) | Ghidra decompiler, Hex-Rays Decompiler, Binary Ninja |
| Primary use case | Verification, manual analysis, patch creation | Behavioral understanding, vulnerability hunting |
| Failure mode | Decoding errors, wrong function boundaries | Compiler variation, optimization artifacts, type loss |

Andriesse implements both a recursive descent disassembler and a basic block extraction layer — the essential first step before any higher-level analysis.

Disassembly Algorithms: Recursive Descent vs Linear Sweep

The two primary algorithms for disassembly represent a fundamental trade-off between completeness and accuracy. Understanding when each fails is as important as understanding how they work.

flowchart TB
    A["Instruction at Entry Point"] --> ALGO{"Which algorithm?"}

    ALGO -->|Recursive Descent| RD["Follow control flow recursively"]
    ALGO -->|Linear Sweep| LS["Decode every byte sequentially"]

    RD --> RD1["Process: call/jmp/jcc → follow target"]
    RD --> RD2["Stop: return instruction, indirect jmp"]
    RD1 --> RD3["Pros: follows real code paths"]
    RD2 --> RD4["Cons: misses code reached only via indirect branches, misled by opaque predicates"]

    LS --> LS1["Process: decode all bytes as insns"]
    LS --> LS2["Pros: catches everything"]
    LS1 --> LS3["Cons: data bytes decoded as code — garbage in CFG"]

Recursive descent (IDA Pro): follows control flow by recursively processing branch targets. Faithful to actual execution paths but misses code reachable only through indirect jumps.
Linear sweep (GNU objdump): decodes the bytes of .text sequentially, one instruction after another. Covers the whole section but produces bogus instructions (and a noisy CFG downstream) wherever data is embedded in code.
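
The difference is easiest to see on a few bytes. The sketch below is an illustration rather than the book's disassembler: it decodes only five x86 opcodes (`nop`, `ret`, `mov al, imm8`, `jmp rel8`, `je rel8`), and the sample byte string is constructed for this example, with one data byte placed directly after a jump.

```python
def decode(code, off):
    """Decode one instruction of a five-opcode x86 subset.

    Returns (text, length, branch_target) or None for an unknown byte.
    """
    op = code[off]
    if op == 0x90:
        return ("nop", 1, None)
    if op == 0xC3:
        return ("ret", 1, None)
    if op in (0xB0, 0xEB, 0x74) and off + 1 < len(code):
        imm = code[off + 1]
        if op == 0xB0:
            return (f"mov al, {imm:#x}", 2, None)
        rel = imm - 256 if imm >= 0x80 else imm  # sign-extend rel8
        target = off + 2 + rel
        return (f"{'jmp' if op == 0xEB else 'je'} {target:#x}", 2, target)
    return None


def linear_sweep(code):
    """Decode sequentially from offset 0, skipping one byte on failure."""
    listing, off = {}, 0
    while off < len(code):
        insn = decode(code, off)
        if insn is None:
            listing[off] = "(bad)"
            off += 1
        else:
            listing[off] = insn[0]
            off += insn[1]
    return listing


def recursive_descent(code, entry=0):
    """Decode only offsets reachable by following direct control flow."""
    listing, worklist = {}, [entry]
    while worklist:
        off = worklist.pop()
        while 0 <= off < len(code) and off not in listing:
            insn = decode(code, off)
            if insn is None:
                break
            text, length, target = insn
            listing[off] = text
            if target is not None:
                worklist.append(target)
            if text == "ret" or text.startswith("jmp"):
                break  # no fall-through after ret or an unconditional jump
            off += length
    return dict(sorted(listing.items()))


# jmp over one embedded data byte (0xB0 at offset 2), then the real code.
CODE = bytes([0xEB, 0x01, 0xB0, 0xB0, 0x2A, 0xC3])
print(linear_sweep(CODE))
print(recursive_descent(CODE))
```

Linear sweep treats the data byte as the start of a `mov`, swallows the first byte of the real instruction, and loses synchronization; recursive descent follows the jump and recovers the intended `mov al, 0x2a`.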

Control Flow Analysis: Building the CFG

The Control Flow Graph (CFG) is the backbone of most binary analysis. Andriesse's approach constructs it incrementally:

graph TD
    A["Binary Section .text"] --> B["Identify Function Boundaries"]
    B --> B1["Scan for function prologue<br/>(push rbp; mov rbp, rsp)"]
    B --> B2["Use symbol table hints<br/>(.symtab section)"]
    B --> B3["Recursive descent auto-discovery"]

    B --> C["Disassemble Each Function"]
    C --> C1["Linear: sequential decode within function"]
    C --> C2["Branch: process jump/call targets"]
    C --> C3["Return: terminate recursive path"]

    C --> D["Build Basic Blocks"]
    D --> D1["Block: instruction sequence with one entry, one exit"]
    D --> D2["Leaders: function entry, branch targets, instruction following a branch or return"]
    D --> D3["Edges: fall-through and taken branches"]

    D --> E["Construct Edges Between Blocks"]
    E --> E1["Direct jump → explicit edge"]
    E --> E2["Conditional branch → two edges (taken/not-taken)"]
    E --> E3["Call → edge to callee entry"]
    E --> E4["Indirect jump → vtable, jump table, or computed target (unresolved placeholder)"]
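
The block-building and edge-construction steps can be sketched compactly. This is an illustrative model, not code from the book: instructions are pre-decoded tuples `(addr, kind, target)`, the `kind` labels (`"seq"`, `"jmp"`, `"jcc"`, `"ret"`) are invented for the example, and indirect branches are left out entirely.

```python
def build_cfg(insns):
    """Split a decoded instruction list into basic blocks and edges.

    insns: list of (addr, kind, target), kind in {"seq", "jmp", "jcc", "ret"}.
    Returns (blocks, edges): blocks maps leader address -> list of addresses,
    edges is a set of (from_leader, to_leader) pairs.
    """
    addrs = [a for a, _, _ in insns]
    addr_set = set(addrs)

    # Leaders: the first instruction, every branch target, and every
    # instruction that follows a branch or return.
    leaders = {addrs[0]}
    for i, (_, kind, target) in enumerate(insns):
        if kind in ("jmp", "jcc", "ret"):
            if target in addr_set:
                leaders.add(target)
            if i + 1 < len(insns):
                leaders.add(addrs[i + 1])

    # A block runs from one leader up to (not including) the next.
    blocks, current = {}, None
    for addr in addrs:
        if addr in leaders:
            current = addr
            blocks[current] = []
        blocks[current].append(addr)

    info = {a: (k, t) for a, k, t in insns}
    next_addr = dict(zip(addrs, addrs[1:]))
    edges = set()
    for leader, body in blocks.items():
        last = body[-1]
        kind, target = info[last]
        if kind in ("jmp", "jcc") and target in addr_set:
            edges.add((leader, target))            # taken edge
        if kind in ("seq", "jcc") and last in next_addr:
            edges.add((leader, next_addr[last]))   # fall-through edge
    return blocks, edges


PROGRAM = [
    (0, "seq", None),  # cmp
    (1, "jcc", 4),     # je 4
    (2, "seq", None),
    (3, "jmp", 5),
    (4, "seq", None),
    (5, "ret", None),
]
blocks, edges = build_cfg(PROGRAM)
print(blocks)
print(sorted(edges))
```

The result is the familiar diamond: the conditional branch produces two outgoing edges, and both arms rejoin at the return block.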

x86-64 Assembly: The Hardware Foundation

The book begins with a thorough review of x86-64 — the instruction set architecture (ISA) all tools target. Critical concepts include:

Register naming: 16 general-purpose registers (RAX through R15), each accessible as 64-bit, 32-bit, 16-bit, or 8-bit sub-registers
Calling convention (System V AMD64 ABI): first six integer/pointer arguments in RDI, RSI, RDX, RCX, R8, R9; return value in RAX; caller-saved vs. callee-saved register conventions
Operand modes: immediate, register, memory (with complex addressing modes: [base + index*scale + displacement])
Instruction categories: data movement (mov, push, pop), arithmetic (add, sub, imul, idiv), logical (and, or, xor, not), control flow (jmp, je, call, ret), string operations (movs, cmps, scas), SIMD (SSE, AVX)
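
Two of these concepts reduce to simple arithmetic and bookkeeping. The sketch below models the memory addressing form `[base + index*scale + displacement]` and the System V assignment of integer arguments to registers; the function names and the dictionary-based register file are assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1
INT_ARG_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")


def effective_address(regs, base=None, index=None, scale=1, disp=0):
    """Compute base + index*scale + disp, wrapped to 64 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86-64 scale must be 1, 2, 4, or 8")
    addr = disp
    if base is not None:
        addr += regs[base]
    if index is not None:
        addr += regs[index] * scale
    return addr & MASK64


def place_int_args(args):
    """Map integer/pointer arguments to registers; the rest go on the stack."""
    in_regs = dict(zip(INT_ARG_REGS, args))
    on_stack = list(args[len(INT_ARG_REGS):])
    return in_regs, on_stack


regs = {"rbx": 0x1000, "rcx": 4}
# mov rax, [rbx + rcx*8 + 0x10]
print(hex(effective_address(regs, base="rbx", index="rcx", scale=8, disp=0x10)))
print(place_int_args([1, 2, 3, 4, 5, 6, 7, 8]))
```

Recognizing which registers carry arguments is what lets an analyst read a call site without symbols or a decompiler.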

Dynamic Analysis: Intel Pin Instrumentation

The heart of the practical tool-building in the book is Intel Pin — a dynamic binary instrumentation (DBI) framework from Intel that injects analysis code into running processes via Just-In-Time (JIT) compilation.

flowchart LR
    A["Target Binary"] --> B["Pin Framework"]
    B --> C["JIT-Compiled<br/>Instrumentation Code"]
    C --> D["Analysis Routine<br/>(your Pintool code)"]

    E["Your Pintool"] --> E1["Image Registration<br/>(load/unload events)"]
    E --> E2["Trace Instrumentation<br/>(single-entry instruction sequences)"]
    E --> E3["RTN Instrumentation<br/>(every function call/return)"]
    E --> E4["INS Instrumentation<br/>(instruction-level)"]
    E --> E5["BBL Instrumentation<br/>(basic block entry/exit)"]

    E1 --> F["Memory Map Update"]
    E2 --> G["Full Execution Trace"]
    E3 --> H["Call Graph Construction"]
    E4 --> I["Per-Instruction Analysis"]
    E5 --> J["Basic Block Frequency Count"]

Pin's instrumentation levels give fine-grained control:

| Level | Granularity | Use Case |
|-------|-------------|---------|
| Image | Whole loaded library/executable | Detect which modules load, initialization |
| RTN | Routine (function), located via symbol information | Build call graphs, hook specific functions |
| TRACE | Single-entry, multiple-exit instruction sequence made up of basic blocks | Entry point for BBL-level tools; coarse-grained execution tracing |
| BBL | Basic block (single entry, single exit, ending in a control transfer) | Coverage profiling, block frequency counting |
| INS | Individual instruction | Taint propagation, register tracking |
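
Why granularity matters can be shown without Pin itself. The following is a toy Python model, not Pin API code: it counts executed instructions once with a hook per instruction and once with a hook per basic block, and the block names, sizes, and execution sequence are made up for the example.

```python
from collections import Counter


def run_with_ins_hooks(block_sizes, executed):
    """INS-style counting: one analysis call per executed instruction."""
    count = calls = 0
    for block_id in executed:
        for _ in range(block_sizes[block_id]):
            count += 1
            calls += 1
    return count, calls


def run_with_bbl_hooks(block_sizes, executed):
    """BBL-style counting: one analysis call per executed block,
    adding the block's instruction count in a single step."""
    count = calls = 0
    for block_id in executed:
        count += block_sizes[block_id]
        calls += 1
    return count, calls


# Hypothetical program: block A runs once, loop body B ten times, C once.
BLOCK_SIZES = {"A": 3, "B": 5, "C": 2}
EXECUTED = ["A"] + ["B"] * 10 + ["C"]

print(run_with_ins_hooks(BLOCK_SIZES, EXECUTED))   # (instructions, analysis calls)
print(run_with_bbl_hooks(BLOCK_SIZES, EXECUTED))
print(Counter(EXECUTED))                           # basic block frequency profile
```

Both variants report the same instruction count, but the block-level version makes far fewer analysis calls, which is the usual reason to instrument at the coarsest level that still answers the question.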

Dynamic Taint Tracking: Source-to-Sink Analysis

The book's most substantial tool-building chapter implements a full taint tracker in Pin. Taint analysis marks data derived from a source (e.g., user input, network socket) and tracks how it propagates through computation.

flowchart TB
    A["Taint Source<br/>(e.g., stdin, network, file)"] --> B["Assign Taint Label<br/>(taint_id, source address)"]

    B --> C["Propagation Rules"]
    C --> C1["ADD/XOR: taint = union of operands<br/>(xor r, r clears taint)"]
    C --> C2["MOV: taint = source operand"]
    C --> C3["LOAD: taint = taint[memory_address]"]
    C --> C4["STORE: memory[addr].taint = source_taint"]
    C --> C5["CALL: propagate taint to arguments and return"]

    C1 --> D["Taint Sink Check"]
    C2 --> D
    C3 --> D
    C4 --> D
    C5 --> D

    D --> D1["Is this an interesting sink?<br/>(e.g., system call, write to file)"]
    D1 -->|Yes| E["Report: Source → Sink with taint propagation path"]
    D1 -->|No| F["Continue tracking"]

Taint tracking is critical for exploit detection: if attacker-controlled input reaches a sensitive sink (e.g., a function pointer write, a syscall), the taint report reveals the exploit chain without needing to understand the bug semantically.
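
The propagation rules can be sketched as a small interpreter over shadow state. This is a simplified stand-in for a Pin-based tracker, not the book's implementation: the instruction tuples, the `TaintTracker` class, and the `"sink"` pseudo-instruction are invented for illustration, and taint is tracked per register and per address rather than per byte.

```python
class TaintTracker:
    """Shadow state mapping registers and addresses to sets of taint labels."""

    def __init__(self):
        self.reg = {}      # register name -> set of labels
        self.mem = {}      # address -> set of labels
        self.reports = []  # (register, labels) seen at a sink

    def taint_memory(self, addr, label):
        """Mark an address as tainted by a source such as stdin."""
        self.mem.setdefault(addr, set()).add(label)

    def step(self, op, dst, src=None):
        r = self.reg
        if op == "mov":
            r[dst] = set(r.get(src, ()))
        elif op in ("add", "xor"):
            if op == "xor" and dst == src:
                r[dst] = set()  # xor r, r always yields 0, so taint is cleared
            else:
                r[dst] = r.get(dst, set()) | r.get(src, set())
        elif op == "load":      # dst is a register, src is an address
            r[dst] = set(self.mem.get(src, ()))
        elif op == "store":     # dst is an address, src is a register
            self.mem[dst] = set(r.get(src, ()))
        elif op == "sink":      # dst is a register reaching a sensitive use
            if r.get(dst):
                self.reports.append((dst, frozenset(r[dst])))
        else:
            raise ValueError(f"unknown op: {op}")


tracker = TaintTracker()
tracker.taint_memory(0x100, "stdin")
for insn in [
    ("load", "rax", 0x100),   # rax picks up the stdin taint
    ("mov", "rbx", "rax"),
    ("xor", "rax", "rax"),    # rax is cleared
    ("add", "rcx", "rbx"),    # rcx = union(rcx, rbx)
    ("store", 0x200, "rcx"),
    ("sink", "rcx"),          # reported
    ("sink", "rax"),          # not reported
]:
    tracker.step(*insn)
print(tracker.reports)
```

Even this small model shows how taint spreads through copies and arithmetic while an idiom like `xor rax, rax` legitimately removes it.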

Data Dependencies: Reaching Definitions and Liveness

Before taint analysis, and before any compiler-like optimization, Andriesse covers data flow analysis — the mathematical framework that answers "what values could this variable hold at this point in the program?"

Reaching definitions: Given a program point, which assignments to a variable could have executed and not yet been overwritten?
Live variable analysis: At a program point, which variables hold values that will be used in the future?
Very busy expressions: Which expressions are guaranteed to be evaluated on every path from this point before any of their operands is redefined?

These analyses are used in de-obfuscation, dead code elimination detection, and as preprocessing for symbolic execution.
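
Live variable analysis, for example, is a backward fixed-point computation over the CFG. The sketch below uses the standard equations `live_out[b] = ∪ live_in[s]` over successors `s` and `live_in[b] = use[b] ∪ (live_out[b] − def[b])`; the three-block CFG and its variable names are hypothetical.

```python
def liveness(blocks):
    """Iterate the liveness equations to a fixed point.

    blocks: name -> (use, defs, successors), where `use` holds variables
    read before being written in the block and `defs` holds variables written.
    Returns (live_in, live_out) as dicts of sets.
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for name, (use, defs, succs) in blocks.items():
            out = set().union(*(live_in[s] for s in succs))
            inn = use | (out - defs)
            if out != live_out[name] or inn != live_in[name]:
                live_out[name], live_in[name] = out, inn
                changed = True
    return live_in, live_out


CFG = {
    "entry": (set(), {"a", "b"}, ["loop"]),     # a = ...; b = ...
    "loop": ({"a"}, {"c"}, ["loop", "exit"]),   # c = a + 1
    "exit": ({"c"}, set(), []),                 # return c
}
live_in, live_out = liveness(CFG)
print(live_out["entry"])  # variables still needed after the entry block
```

Here `b` is assigned in the entry block but is never live afterwards, which is exactly the signal used to flag dead stores or junk code inserted by an obfuscator.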

Binary File Formats Beyond ELF

Andriesse covers PE (Portable Executable) and Mach-O (Mach Object) alongside ELF, enabling cross-format analysis. Each format has distinct structures relevant to analysis:

| Format | Used On | Key Sections / Segments | Analysis Relevance |
|--------|---------|--------------|-------------------|
| ELF | Linux, BSD | .text, .data, .bss, .got.plt, .symtab | SysV ABI, dynamic linking, symbol tables |
| PE | Windows | .text, .rdata, .data, .idata | Import Address Table, export table, Rich header |
| Mach-O | macOS, iOS | __TEXT, __DATA, __LINKEDIT | Dyld shared cache, rebasing, binding |

De-obfuscation: Recovering Obfuscated Control Flow

A significant portion of the advanced analysis section addresses de-obfuscation — reversing the techniques malware authors use to make static and dynamic analysis difficult.

flowchart TB
    A["Obfuscated Binary"] --> O1{"Obfuscation Type"}
    O1 -->|Control Flow Flattening| B["CFG Flattening Recovery"]
    O1 -->|Opaque Predicates| C["Opaque Predicate Identification"]
    O1 -->|Encrypted Strings| D["String Decryption at Runtime"]
    O1 -->|Self-Modifying Code| E["Frequent Code Cache Invalidation"]
    O1 -->|Virtualization| F["VM Handler Identification"]

    B --> B1["Identify dispatcher loop<br/>(switch on encoded state)"]
    B --> B2["Trace to recover real edge targets"]
    B --> B3["Rebuild CFG from traced execution"]

    C --> C1["Symbolic execution of predicate branch condition"]
    C --> C2["Identify conditions that always evaluate the same way and have no side effects"]
    C --> C3["Replace with unconditional branch"]

    D --> D1["Instrument memory writes to writable regions<br/>(.data, heap, stack)"]
    D --> D2["Capture plaintext string at write time"]
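
The opaque predicate branch of this diagram can be illustrated with a bounded check. The sketch below swaps the symbolic execution step for exhaustive enumeration over 16-bit inputs, which is a deliberate simplification: it uses unbounded Python integers, says nothing about wider inputs, and the example predicates are chosen for this illustration.

```python
def classify_predicate(pred, bits=16):
    """Evaluate `pred` on every unsigned `bits`-wide input and classify it."""
    outcomes = {bool(pred(x)) for x in range(1 << bits)}
    if outcomes == {True}:
        return "always true"
    if outcomes == {False}:
        return "always false"
    return "input-dependent"


PREDICATES = {
    "x*(x+1) is even": lambda x: (x * x + x) % 2 == 0,  # product of consecutive ints
    "x*x mod 4 == 2": lambda x: (x * x) % 4 == 2,       # squares are 0 or 1 mod 4
    "x divisible by 3": lambda x: x % 3 == 0,           # genuinely depends on input
}

for name, pred in PREDICATES.items():
    print(f"{name}: {classify_predicate(pred)}")
```

A branch classified as always true or always false can be rewritten as an unconditional jump (or removed), which also removes the bogus edge from the recovered CFG. A production tool would pose the same question to an SMT solver instead of enumerating inputs.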

Rootkit Detection: Kernel-Level Binary Analysis

The final advanced topic shifts the lens from user-space analysis to kernel-space. Andriesse introduces rootkit detection as the application of binary analysis techniques to the operating system kernel itself.

| Rootkit Technique | Binary Analysis Method |
|-------------------|----------------------|
| System call table hooking | Compare sys_call_table pointers in memory vs. /proc/kallsyms |
| Direct Kernel Object Manipulation (DKOM) | Scan kernel memory for hidden process/list structures |
| LKM (Loadable Kernel Module) hiding | Enumerate loaded modules via list_head traversal; compare to /proc/modules |
| Interrupt descriptor table (IDT) modification | Read IDT base from IDTR; compare handler pointers to known-good values |
| Hooking via function pointer overwrite | Compare function body hash against known-good disassembly |
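
The first row of the table amounts to a range check. The sketch below is a user-space illustration only: it parses text in the `/proc/kallsyms` line format (`address type name`) and flags system call table entries that fall outside the kernel text range between `_stext` and `_etext`. The addresses and the table contents are fabricated sample data, and reading the real table from kernel memory is outside the scope of the example.

```python
def parse_kallsyms(text):
    """Parse '<hex address> <type> <name>' lines into {name: address}."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols[parts[2]] = int(parts[0], 16)
    return symbols


def find_hooked_syscalls(table, symbols):
    """Return table entries whose handler lies outside [_stext, _etext)."""
    lo, hi = symbols["_stext"], symbols["_etext"]
    return {nr: ptr for nr, ptr in table.items() if not lo <= ptr < hi}


# Fabricated sample data for the illustration.
KALLSYMS = """\
ffffffff81000000 T _stext
ffffffff81001000 T __x64_sys_read
ffffffff81002000 T __x64_sys_write
ffffffff81e00000 T _etext
"""
SYSCALL_TABLE = {
    0: 0xFFFFFFFF81001000,  # points into kernel text
    1: 0xFFFFFFFFC0A01000,  # points elsewhere, e.g. into a loaded module
}

suspicious = find_hooked_syscalls(SYSCALL_TABLE, parse_kallsyms(KALLSYMS))
print({nr: hex(ptr) for nr, ptr in suspicious.items()})
```

A handler pointer outside core kernel text is not proof of a rootkit, but it is a cheap first filter before comparing function bodies against known-good code.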

Symbolic Execution: Getting Precise Path Constraints

Static disassembly gives you all possible paths but cannot tell you which are feasible. Dynamic taint tells you what did happen. Symbolic execution — and its practical cousin, concolic execution — gives you the precise input constraints needed to make specific paths happen.

flowchart TB
    A["Start at Entry Point"] --> B["Concretely execute to first branch"]
    B --> C["Record path condition<br/>(symbolic formula of branch taken)"]
    C --> D["Negate last constraint<br/>(explore other path)"]
    D --> E["Solve with SMT solver<br/>(Z3: find concrete input satisfying new constraint)"]
    E --> F["Execute with new concrete input<br/>(follow alternative path)"]
    F --> G{"More unexplored paths?"}
    G -->|Yes| C
    G -->|No| H["Complete path constraint set"]

    H --> I["Applications"]
    I --> I1["Automatic exploit generation"]
    I --> I2["Input generation for fuzzing"]
    I --> I3["Patch equivalence verification"]
    I --> I4["Malware behavior coverage"]

The chapter on angr shows why pure symbolic execution is impractical on real binaries: path explosion is inevitable without selective symbolic execution and concretization strategies.
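
The loop in the diagram can be sketched end to end on a toy target. This is not angr code and makes several simplifying assumptions: the target is a Python function that records its own branch predicates, and a brute-force search over 0..255 stands in for the SMT solver. All names here (`program`, `solve`, `concolic`) are hypothetical.

```python
def program(x, trace):
    """Toy target; `trace` records (predicate, outcome) at each branch."""
    def branch(pred):
        outcome = pred(x)
        trace.append((pred, outcome))
        return outcome

    if branch(lambda v: v > 10):
        if branch(lambda v: v % 7 == 3):
            return "target"
        return "big"
    return "small"


def solve(constraints, domain=range(256)):
    """Stand-in for an SMT solver: first value satisfying every constraint."""
    for v in domain:
        if all(pred(v) == want for pred, want in constraints):
            return v
    return None


def concolic(target, seed=0):
    """Run concretely, record the path condition, negate one branch at a
    time, solve for a new input, and repeat until no new path appears."""
    worklist, seen = [seed], {}
    while worklist:
        x = worklist.pop()
        trace = []
        result = target(x, trace)
        path = tuple(outcome for _, outcome in trace)
        if path in seen:
            continue
        seen[path] = (x, result)
        for i, (pred, outcome) in enumerate(trace):
            # Keep the branches before position i, flip the branch at i.
            flipped = trace[:i] + [(pred, not outcome)]
            new_input = solve(flipped)
            if new_input is not None:
                worklist.append(new_input)
    return seen


for path, (value, result) in sorted(concolic(program).items()):
    print(path, value, result)
```

Starting from the seed input 0, the loop discovers all three paths and produces a concrete input for the nested branch. On a real binary the number of paths grows exponentially with the number of branches, which is the path explosion the chapter warns about.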

analysis

Note: This section assumes familiarity with the core concepts in 01-content. It does not re-explain ELF format, Pin instrumentation levels, or taint analysis fundamentals.

Evaluating the Book's Methodology

Andriesse's central methodological decision — that readers build every tool rather than learning existing ones — is both the book's greatest strength and its most significant limitation.

The Build-From-Scratch Approach: Strength

The pedagogical logic is impeccable. A reader who implements recursive descent disassembly understands why IDA Pro produces its output. A reader who writes a taint tracker in Pin understands why dynamic taint analysis is expensive and where its practical limits lie. This is the same reasoning that makes SICP (Structure and Interpretation of Computer Programs) a classic: building explains.

graph LR
    A["Build Disassembler"] --> B["Understand why opcode<br/>decoding is ambiguous"]
    A --> C["Understand why CFG is<br/>incomplete without recursion"]
    A --> D["Understand why linear sweep<br/>produces noise"]

    E["Build Taint Tracker in Pin"] --> F["Understand taint propagation<br/>table design"]
    E --> G["Understand memory aliasing<br/>problem in taint tracking"]
    E --> H["Understand performance cost<br/>of full-system taint"]

    I["Build Symbolic Emulator"] --> J["Understand path explosion<br/>as engineering constraint"]
    I --> K["Understand concretization<br/>trade-offs"]
    I --> L["Understand when static analysis<br/>complements dynamic"]

The Build-From-Scratch Approach: Limitation

The cost is scope. Each tool-building chapter consumes 20–30 pages and produces a tool that is significantly less capable than freely available alternatives (objdump, Radare2, angr). A reader who wants to analyze a real malware sample after finishing the book still needs to learn a production tool stack. Andriesse acknowledges this tension but does not resolve it — the final chapter gestures toward angr but does not integrate it into the reader's built toolchain in any developed way.

Coverage Depth Against Each Major Topic

Disassembly and CFG Reconstruction

Strengths: The recursive descent implementation is clear and correct. Andriesse handles the two pathological cases that trip up most implementations: indirect jumps (where the target must be discovered at runtime or via heuristic) and overlapping instructions (where a valid instruction stream begins in the middle of another instruction's bytes).

Gaps: The book does not cover the most common real-world failure in CFG recovery: opaque predicates (conditional branches whose outcome is fixed at runtime but hard to establish statically, so detecting them requires data flow analysis). This is addressed tangentially in the de-obfuscation chapter, but the connection between data flow and opaque predicate detection is not argued explicitly.

Intel Pin and Dynamic Instrumentation

Strengths: The Pin chapters are the strongest in the book. Andriesse's incremental approach — starting with a simple instruction counter, building a basic block frequency profiler, then a full taint tracker — mirrors how a real tool developer would learn the framework. The taint tracker implementation (Chapter 9) is genuinely useful: it handles register-to-register propagation, memory loads and stores, and function call argument propagation without oversimplifying.

Gaps: The Pin version used (Pin 2.x) has been succeeded by Pin 3.x, which has a significantly different API for some common operations. The book does not note this. Pin's Windows support is also not covered — the book is explicitly Linux-only.

Taint Analysis

Strengths: Andriesse correctly identifies that taint labels should attach to both registers and memory addresses, not just memory, and that call/return conventions require explicit propagation rules. The taint source/sink framework is practically useful: readers can extend the built tracker to their own research by plugging in their own source and sink definitions.

Gaps: The book does not cover multi-level taint (where taint carries a sensitivity level or trust level, not just a binary tainted/untainted flag). It also does not address the taint explosion problem: in real programs, taint can spread to thousands of memory locations very quickly, making the storage cost prohibitive without compression or garbage collection.
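
As a rough illustration of both gaps, the sketch below replaces the single taint bit with an integer level and keeps the shadow map sparse. The level names, the `LevelTaint` class, and the max-based combination rule are assumptions chosen for the example, not something the book defines.

```python
UNTAINTED, LOW, HIGH = 0, 1, 2   # hypothetical sensitivity levels

class LevelTaint:
    """Shadow state that stores a level per location instead of a single bit."""

    def __init__(self):
        self.level = {}   # location -> level; untainted locations are not stored

    def source(self, loc, lvl):
        self.level[loc] = lvl

    def get(self, loc):
        return self.level.get(loc, UNTAINTED)

    def combine(self, dst, a, b):
        """dst = a OP b: the result carries the higher of the two input levels."""
        lvl = max(self.get(a), self.get(b))
        if lvl == UNTAINTED:
            self.level.pop(dst, None)   # keep the map sparse
        else:
            self.level[dst] = lvl

shadow = LevelTaint()
shadow.source("rax", LOW)
shadow.source("rbx", HIGH)
shadow.combine("rcx", "rax", "rbx")     # rcx inherits HIGH
shadow.combine("rdx", "rsi", "rdi")     # both inputs clean: nothing stored

# A copy loop spreads one tainted value to 1000 memory locations.
for i in range(1000):
    shadow.combine(("mem", 0x2000 + i), "rcx", "rcx")
```

After the loop the map holds 1003 entries that all trace back to two sources, which is the storage growth the paragraph above calls taint explosion. A real tracker would need compaction or range-based storage to stay usable.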

Symbolic Execution

Strengths: The angr chapter provides a genuine working example of using a symbolic execution engine on a real binary. Andriesse shows how to set up an angr project, load a binary, find a specific address, and generate inputs that produce a desired path. This is more practical than most symbolic execution introductions.
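
The workflow described above looks roughly like the following sketch. The function name, the binary path, and the addresses in the usage note are hypothetical placeholders, and the code assumes angr is installed; it is not taken from the book.

```python
def find_input(binary_path, find_addr, avoid_addrs=()):
    """Search for stdin bytes that drive execution to `find_addr`.

    Returns the input as bytes, or None if no path was found.
    """
    import angr  # imported here so the module loads even without angr installed

    # Skip shared libraries to keep the state space smaller.
    proj = angr.Project(binary_path, auto_load_libs=False)
    state = proj.factory.entry_state()
    simgr = proj.factory.simulation_manager(state)

    # Explore until some state reaches find_addr; discard states hitting avoid_addrs.
    simgr.explore(find=find_addr, avoid=list(avoid_addrs))
    if not simgr.found:
        return None
    # Ask the solver for concrete stdin (file descriptor 0) satisfying the path.
    return simgr.found[0].posix.dumps(0)
```

A call might look like `find_input("./crackme", 0x400a00, avoid_addrs=[0x400a20])`, where the path and both addresses are made up and would come from static analysis of the actual target.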

Gaps: The chapter is thin. Symbolic execution is the largest research subfield in binary analysis, and 25 pages cannot do it justice. State explosion mitigation (abstraction, pruning, concretization heuristics, compositional analysis) gets one paragraph. Memory model semantics (which cause real symbolic engines to miss real bugs) are not addressed at all.
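
A few lines of Python show why state explosion dominates in practice and what pruning buys. The sketch models a program as a sequence of independent two-way branches, which is a deliberate oversimplification; `count_paths` and the feasibility predicate are illustrative assumptions.

```python
def count_paths(num_branches, feasible=lambda path: True):
    """Count complete paths through `num_branches` sequential two-way branches.

    `feasible` receives the tuple of decisions made so far; returning False
    prunes that prefix and everything below it.
    """
    stack, complete = [()], 0
    while stack:
        path = stack.pop()
        if not feasible(path):
            continue                      # pruned: never expanded further
        if len(path) == num_branches:
            complete += 1
            continue
        stack.append(path + (True,))      # branch taken
        stack.append(path + (False,))     # branch not taken
    return complete

unpruned = count_paths(10)                              # 2**10 paths
pruned = count_paths(10, lambda path: sum(path) <= 1)   # at most one branch taken
```

Ten branches already yield 1024 paths; the pruning predicate here cuts that to 11. Real engines face the same growth with far more expensive states, which is why the mitigation techniques listed above matter.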

The CTF and Malware Chapter Assessment

The final chapter and CTF-style challenges throughout the book are effectively designed. Each challenge teaches a specific technique or combination of techniques:

| Challenge Type | Technique Reinforced |
|----------------|----------------------|
| Simple crackmes (serial/key validation) | Static CFG analysis, string search, patching |
| Anti-disassembly obfuscation | CFG recovery, understanding overlapping instructions |
| Packed binary | Dynamic loading, unpacking stubs, memory dumping |
| Taint challenge | Source-to-sink tracking through a complex program |
| Symbolic execution challenge | Generating inputs to satisfy path constraints |
| Rootkit challenge | Kernel memory structure enumeration |

The malware analysis exercises are particularly well-chosen. Andriesse uses real (but old) malware samples — code that the reader can verify against public threat intelligence — which gives the exercises genuine gravity and a sense of applied relevance.

Comparison with Peer Texts

| Book | Approach | Key Difference from Andriesse |
|------|----------|-------------------------------|
| The IDA Pro Book (Eagle) | Tool-focused, IDA-specific | No tool building; covers IDA's features comprehensively |
| Reverse Engineering for Beginners (Yurichev) | Broad survey, free online | No Pin, no taint, no symbolic execution implementation |
| Practical Reverse Engineering (Dang et al.) | Windows-centric, x86 and ARM | Fewer implementation details; more platform breadth |
| Identifying Malfunctioning Code (Dullien) | Theory-focused, VxE detection | Deeper on VSA (value set analysis); Andriesse is more practical |
| The Shellcoder's Handbook (Anley et al.) | Exploitation-focused | Less analysis; more payload crafting and exploitation |
| Binary Hacking (Golomb et al.) | Exploit development focus | Less emphasis on tool building or taint analysis |

Andriesse occupies a unique position: more hands-on than Yurichev, more Linux-focused than Dang, more focused on tool building than Eagle, and more practical than Dullien.

Code Quality and Reproducibility

The C++ code in the Pintool chapters is production-grade for a tutorial: sufficiently clean to extend, sufficiently well-commented to follow. The makefiles and build instructions are accurate and were validated against Pin 2.14 on Ubuntu 16.04/18.04. A reader targeting modern Ubuntu may need minor adjustments (Pin's path conventions changed slightly), which Andriesse has not documented.

The dependence on Pin 2.x is the book's most dated technical choice. Pin 3.x (released 2019, updated through 2024) introduced an entirely new API for some common operations. Most of the core concepts carry over, but Pintool implementations written for 2.x require non-trivial updates for 3.x. Andriesse has not published a migration guide and does not acknowledge this in the text.

The Missing Topics

Given the book's length (~312 pages), the omissions are reasonable but worth noting:

ARM and RISC-V binaries: The x86-64-only scope means the techniques do not transfer to the dominant mobile and embedded architectures without adaptation.
Windows PE analysis: Explicitly out of scope. PE-specific analysis (IAT hooking, Rich header forensics, TLS callbacks) is entirely absent.
Fuzzing integration: Dynamic analysis tools and fuzzers share instrumentation infrastructure, but coverage-guided fuzzing is not covered.
Hardware-assisted analysis: Intel Processor Trace (PT) and branch trace store are not mentioned.
ML/AI-based binary analysis: The field has moved strongly in this direction since 2018; all of it is absent.
Binary lifting to IR: LLVM-based lifting (McSema, Remill) is a critical modern technique not covered.

Final Verdict — Structural Assessment

The book is architecturally sound: each chapter's tool is a prerequisite for the next analysis challenge. The DBI-to-taint-to-symbolic arc is well-paced. The CTF challenges are well-chosen and genuinely teach the techniques they are designed to teach.

The principal reservation is that the reader finishes with a set of correctly-designed but underpowered tools and no path to production-grade tooling. The book teaches you how each tool should be built; it does not leave you with the knowledge of how to actually integrate these tools into an analysis workflow with Ghidra, angr, and modern fuzzing infrastructure. That was probably outside the scope of a 312-page book, but it is the most natural next step for the reader.

Structural Rating: 8.5/10. Exceptionally well-organized for a technical reference; the Pin-to-taint-to-symbolic progression is near-perfect. Scope limitations (Linux-only, Pin 2.x, no IR lifting) are significant and worth keeping in mind for anyone reading in 2024 or later.

narration

[Host]: Welcome to DeepBytes. Today we're talking about Practical Binary Analysis by Dennis Andriesse, published in 2018 by No Starch Press — a book about building custom tools to understand compiled Linux binaries. Our guests: Dr. Lena Vasquez, a security researcher at SEAMATO who specializes in malware reverse engineering and has published extensively on taint tracking systems. And from MIT's Computer Science and Artificial Intelligence Lab, Professor Raj Mehta, whose work on symbolic execution for binary vulnerability discovery has won three best-paper awards at IEEE S&P. Welcome both.

[Lena]: Thank you. I assign this book to every new analyst coming into my team. If you've never built a taint tracker, you don't actually understand what taint analysis can and cannot tell you.

[Raj]: And I'll go further: if you haven't built a disassembler, you shouldn't be trusted to interpret disassembly output. Andriesse agrees with me on this point, which is unusual.

On the Philosophy of Building Rather Than Using

[Host]: That's the thesis, isn't it? That the most important skill is building the tools, not just using them. But isn't that a luxury in 2024? Plenty of tools are now mature and freely available — Ghidra, angr, Radare2. Why build your own?

flowchart LR
    A["Use Existing Tool<br/>(Ghidra, angr)"] --> Q1{"What do you NOT know?"}
    A --> D1["Fast: real analysis within an hour"]
    A --> D2["Good: production-grade UI, plugins, ecosystem"]

    B["Build Your Own Tool<br/>(Andriesse's method)"] --> Q1
    B --> D3["Slow: weeks of work before first analysis"]
    B --> D4["Deep: engine-level understanding"]

    Q1 --> E1["Is there a black box in the tool?"]
    Q1 --> E2["Do you trust every result?"]
    Q1 --> E3["Can you extend it for your specific problem?"]

    E1 -->|Yes| F1["You need to build"]
    E2 -->|No| F1
    E3 -->|No| F1

[Lena]: Let me give you a specific example. Last year my team was analyzing a malware sample that modified the Windows syscall table. Ghidra showed us the code that did it. But Ghidra's decompiler did not show us how the malware located the syscall table: the decompiler relied on a hardcoded heuristic that was silently wrong on this sample. I only caught it because I'd built a taint tracker in Pin for a research project six months earlier. I knew what taint propagation through a function pointer should look like.

The question is not whether mature tools are useful. They are. The question is whether you notice when they're wrong.

[Raj]: Exactly. Symbolic execution engines have the same problem. angr will happily explore paths that don't actually exist in the real execution of a binary, because its memory model simplifies away behaviors the binary exploits. If you haven't thought about what the engine is abstracting, you will draw wrong conclusions from its output.

On the State of Binary Analysis in 2024

[Host]: How much has changed since 2018? Andriesse's book is six years old now. In technology years that's a generation.

[Raj]: Three things have shifted, and they all work against Andriesse's premise. First, symbolic execution is more usable but less transparent. angr has gotten dramatically better, and angr-management provides a UI. But the internals are far more complex than the book describes. The book's symbolic execution chapter shows angr as a clean, almost mathematical interface to path analysis. In practice, angr's simulation manager, plugin system, and state handling require serious expertise to use correctly.

[Lena]: Second, the target has shifted from x86-64 to ARM and RISC-V. Andriesse's book, like virtually everything in the field, is x86-64 centered. Android malware, IoT botnets, embedded systems — all of these run on ARM or RISC-V, and the calling conventions, instruction sets, and binary formats are different enough that the tool-building techniques do not translate directly.

[Raj]: Third, machine learning has invaded the field. Binary similarity detection, gadget search for ROP chains, function boundary detection — all of these now have ML-based approaches that outperform the heuristic methods Andriesse teaches. Not because the heuristics are wrong, but because ML can learn patterns from millions of binaries in ways that rule-based approaches cannot.

[Lena]: But here's what hasn't changed. The core logical chain is identical: you need to decode bytes, recover control flow, track data movement, and understand when you're wrong. A taint tracker built in Pin in 2018 tells you the same thing a taint tracker built in 2024 tells you. Andriesse's book teaches you the logic. The tools will change. The logic does not.

On the Intel Pin Practicalities

[Host]: Pin 3.x came out after this book. How big a problem is that for readers in 2024?

[Raj]: Moderate. The core concepts in the Pin chapters — instrumentation levels, image registration, Pintool structure — are version-independent. But the API details changed enough between 2.x and 3.x that a reader trying to compile Andriesse's code against Pin 3.x will hit compilation errors. Andriesse should have noted this in an errata or preface.

[Lena]: And the practical workaround is straightforward: install Pin 2.14 from Intel's archive, which is still available, and follow the book as written. The analysis logic Andriesse builds is version-independent; the Pin wrapper code is what needs updating. So the book is still usable. A reader who compiles the book's tools against Pin 2.14, or ports the wrapper code to 3.x, gets working taint trackers, coverage profilers, and code-cache inspectors.

flowchart TB
    A["Pin 2.x vs 3.x"] --> B["Core concepts unchanged"]
    A --> C["API changes in 3.x:"]

    C --> C1["PIN_AddThreadStartFunction<br/>signature changes"]
    C --> C2["INS_OperandCount<br/>and operand type API restructured"]
    C --> C3["IARG interface<br/>more type-safe in 3.x"]
    C --> C4["Image loading callbacks<br/>renamed, reorganized into Image group"]

    B --> D["Practical options for 2024 reader"]
    D --> D1["Use Pin 2.14 — works as in book"]
    D --> D2["Port to Pin 3.x — API updates needed (~day of work)"]
    D --> D3["Use DynamoRIO — open source alternative, similar concept"]

On Taint Analysis: Where the Book Shines

[Host]: The taint analysis section is where the book is most frequently cited. Why does it stand out?

[Lena]: Because most writing on taint analysis is either academic (describing DIFT, Dynamic Information Flow Tracking, in general terms) or marketing (DynInst, Coverity-style product claims). Andriesse's chapter is neither. It says: here is a complete C++ implementation of a taint tracker, here is how it handles register propagation, here is how it handles memory loads and stores, here is the performance cost when you run it on a web server binary. That level of detail does not exist anywhere else in book form.

The taint source/sink framework is genuinely something I've ported into a commercial tool. It is not a toy.

[Raj]: The taint work also connects logically to the symbolic execution chapter. Taint identifies where untrusted input flows. Symbolic execution identifies which inputs produce specific paths. Together they answer both halves of the vulnerability-discovery question: where is the bug, and how do I exploit it?

On Symbolic Execution and angr

[Host]: Andriesse's symbolic execution chapter uses angr. How does angr compare to the alternatives, and is Andriesse's treatment fair?

[Raj]: angr is the right choice for the book — it is open-source, actively maintained, and specifically designed for binary analysis (unlike KLEE, which targets source-level LLVM IR). Andriesse's treatment is fair but thin. He shows the right things: loading a binary, finding a specific address, generating inputs that trigger a path. What he does not show is how often angr fails at scale. A symbolic execution run on a modestly-sized binary with moderate complexity will explore hundreds of thousands of paths within minutes. angr's plugin system and state-management interface exist precisely to manage this problem, and Andriesse doesn't go there.

[Lena]: Let me push back slightly. For a book that's already twelve chapters deep, angr getting 25 pages is appropriate. The reader who finishes Andriesse has enough context to read the angr documentation productively. The book is not the end of the learning path; it is the beginning, pointed in the right direction.

On Malware Analysis as an Adversarial Practice

[Host]: One recurring theme in the book is that malware analysis is not just analysis — it's an adversarial game. The malware is actively trying to hide.

[Lena]: Andriesse treats this honestly. Most hobbyist reverse engineering treats malware analysis as "take apart a program and see what it does." Andriesse treats it as "take apart a program that you know is actively trying to prevent you from understanding it." The anti-debugging, anti-VM, packing, and anti-disassembly chapters are not an afterthought; they are threaded through the entire analysis technique progression.

Plainly: you cannot understand binary analysis on modern malware without understanding what the malware is doing to defeat your tools. Andriesse is one of very few authors who structures a book around this adversarial relationship rather than pretending it doesn't exist.

flowchart LR
    A["Malware Adversarial Techniques"] -->|Anti-Disassembly| B["Opaque predicates, overlapping instructions"]
    A -->|Anti-Debugging| C["ptrace self-detection, /proc/self/status, timing checks"]
    A -->|Anti-VM| D["CPUID checks, device driver enumeration, timing anomalies"]
    A -->|Packing| E["Encrypted code, runtime unpacking stub, memory-only execution"]

    B --> F["Andriesse's Defense: CFG recovery algorithm + manual patching"]
    C --> F
    D --> F
    E --> F
    E --> G["Unpacking: capture memory state post-decode, dump to disk, rebase"]

On De-obfuscation as a Research Frontier

[Host]: Control flow flattening is one of the most common obfuscation techniques in the wild. Andriesse covers it — how useful is his coverage?

[Raj]: Andriesse's treatment of CFG flattening recovery is the best practical explanation I've read in a book. The core insight is simple: a flattened binary has a dispatcher loop that reads an encoded state variable and dispatches to one of many real basic blocks. The recovery process has two steps:

Identify the dispatcher (always a switch-like structure with a state variable)
Trace real execution to map each encoded state value to its real target basic block

Andriesse implements both. His dispatcher identification uses a pattern-matching heuristic over basic block structure; his recovery uses Pin tracing. This combination is how commercial de-obfuscators work, and Andriesse gives you the blueprint.
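
The second step, mapping state values back to real blocks, can be sketched as follows. The trace format (one `(state_value, block_name)` record per dispatcher iteration) is an assumption made for this example; in the approach described above it would come from Pin instrumentation rather than a hand-written list.

```python
def recover_flattened_cfg(trace):
    """Rebuild original control flow from dispatcher observations.

    `trace` is a list of (state_value, block_name) pairs in execution order:
    the value of the state variable when the dispatcher ran, and the real
    basic block it dispatched to.
    """
    state_to_block = {}
    edges = set()
    previous = None
    for state, block in trace:
        state_to_block[state] = block
        if previous is not None:
            edges.add((previous, block))   # previous block set the state that led here
        previous = block
    return state_to_block, edges

# Hypothetical trace of a flattened loop that iterates twice.
trace = [(0x3A, "init"), (0x91, "loop"), (0x91, "loop"), (0x17, "exit")]
state_to_block, edges = recover_flattened_cfg(trace)
```

The recovered edges are `init -> loop`, `loop -> loop`, and `loop -> exit`, with the dispatcher removed. Because the map is built from executed paths only, blocks that no traced run reached stay unresolved, so coverage of the traces limits the result.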

[Lena]: The string decryption section is also useful and underappreciated. A common failure in beginner malware analysis is not realizing that strings are not in the binary in plaintext — they're decrypted at runtime, often piece-by-piece. Andriesse's approach of instrumenting memory writes to .rodata to capture plaintext strings at runtime is a technique I use every week.
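
A minimal sketch of the capture step: given a log of byte-sized memory writes, keep the final value at each address and report contiguous printable runs. The write log here is simulated with a single-byte XOR decryption loop; the function name, key, and addresses are assumptions for illustration, and a real tool would feed the log from write instrumentation.

```python
def extract_written_strings(writes, min_len=4):
    """Return [(start_address, text)] for printable ASCII runs left in memory.

    `writes` is an iterable of (address, byte_value) pairs in execution order.
    """
    memory = {}
    for addr, value in writes:
        memory[addr] = value              # later writes overwrite earlier ones

    strings, run, start, prev = [], bytearray(), None, None
    for addr in sorted(memory):
        byte = memory[addr]
        printable = 0x20 <= byte < 0x7F
        contiguous = prev is not None and addr == prev + 1
        if printable and (contiguous or not run):
            if not run:
                start = addr
            run.append(byte)
        else:
            if len(run) >= min_len:
                strings.append((start, run.decode("ascii")))
            run = bytearray()
            if printable:                 # printable byte after a gap starts a new run
                start = addr
                run.append(byte)
        prev = addr
    if len(run) >= min_len:
        strings.append((start, run.decode("ascii")))
    return strings

# Simulate malware decrypting a string byte by byte with XOR key 0x5A.
encrypted = bytes(c ^ 0x5A for c in b"evil.example")
writes = [(0x5000 + i, c ^ 0x5A) for i, c in enumerate(encrypted)]
writes.append((0x6000, 0x00))             # unrelated non-printable write
captured = extract_written_strings(writes)
```

The encrypted bytes never appear as plaintext in the static image, yet the write log yields the full string at its runtime address.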

On the Book's Legacy and Place in the Field

[Host]: Six years on, has this book aged well?

[Raj]: The core techniques have not changed. The external context has. When Andriesse wrote in 2017-2018, LLVM-based binary lifting (which translates machine code to LLVM IR for analysis) was an active research project, not a production tool. McSema was research-grade. Now Remill + McSema can lift x86-64 to IR with sufficient fidelity for real vulnerability research. That changes what "best practice" looks like for binary analysis in 2024 — and Andriesse does not engage with that shift.

[Lena]: The book has also helped define a generation of analysts. I meet people at conferences who say this was the book that got them into binary analysis. It's become a gateway text, which means its pedagogical choices are being replicated in blogs, workshop materials, and university courses. For better and sometimes for worse.

[Raj]: For worse — in the sense that some of the simplifications Andriesse makes for pedagogical clarity are being taught as definitive. For example, his disassembler treats direct jumps as trivially resolvable and indirect jumps as unresolvable. In practice, indirect jump targets are often recoverable through static analysis (value set analysis, type reconstruction) — Andriesse doesn't show that pipeline.
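
One common case where static recovery works is a bounds-checked jump table. The sketch below is a heavily simplified stand-in for value set analysis: it assumes a preceding comparison has already established that the index lies in `0..max_index`, and simply reads the table entries from the binary image. The function name, offsets, and addresses are hypothetical.

```python
import struct

def resolve_jump_table(image, table_offset, max_index, entry_size=8):
    """Read little-endian jump targets for indices 0..max_index from `image`."""
    fmt = {4: "<I", 8: "<Q"}[entry_size]
    targets = []
    for index in range(max_index + 1):
        (target,) = struct.unpack_from(fmt, image, table_offset + index * entry_size)
        targets.append(target)
    return targets

# Hypothetical image: 16 bytes of padding followed by a three-entry table.
image = b"\x00" * 16 + struct.pack("<3Q", 0x401000, 0x401020, 0x401040)
targets = resolve_jump_table(image, table_offset=16, max_index=2)
```

Each recovered target becomes a new entry point for the disassembler's worklist, turning one unresolved indirect jump into three ordinary edges.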

On Who the Book Serves — and Who It Leaves Behind

[Host]: Andriesse writes for a specific reader profile: someone comfortable in C++, motivated to build tools, already familiar with assembly. Who does this book leave out?

[Lena]: People working in environments where C++ toolchains are not the norm. If your team's analysis pipeline is in Python (with radare2 or angr bindings), Andriesse's Pin-based approach adds a compilation step that may not feel worth it. And many commercial reverse engineering environments are Windows-based, and the book is explicitly Linux-only.

[Raj]: And it leaves out researchers who want to work at a higher level of abstraction. If your goal is vulnerability discovery at scale — analyzing thousands of binaries — building tools per-binary is the wrong unit of work. Andriesse teaches you to analyze one binary carefully. Scaling to a corpus requires automation, corpus management, result aggregation, and heuristics Andriesse does not address. That's a different skill set and a different book.

On the Final Verdict and Practical Recommendations

[Host]: Give our listeners a bottom line. This is a 2018 book. Should someone buy it in 2024?

[Raj]: Yes, but with caveats. The analytical logic and tool-building methodology are sound and will remain sound. The Pin code needs updating for Pin 3.x. The scope is wrong for ARM and RISC-V analysis, which are increasingly important. And the symbolic execution chapter is more of an orientation than a deep dive. But: if you want to understand how binary analysis tools work from the inside — which is the only way to use them correctly — this is still the clearest path available in book form.

[Lena]: I assign this to every new analyst. After finishing the Pin and taint chapters, they can read and write any DBI-based analysis tool. That's a capability that compounds over a career in security. The alternatives to Andriesse are either tool-specific (learn Ghidra but not how it works) or academic (learn the theory but not how to implement it). This book is the bridge.

flowchart TB
    A["Practical Binary Analysis — Final Assessment"] --> B["What It Teaches Well"]
    A --> C["What It Omits"]
    A --> D["Who Should Buy It"]
    A --> E["Who Should Wait"]

    B --> B1["Intel Pin: best tutorial-level treatment available"]
    B --> B2["Taint tracking: production-grade building blocks"]
    B --> B3["CFG recovery: principled algorithmic treatment"]
    B --> B4["Adversarial context: malware-aware throughout"]

    C --> C1["ARM/RISC-V: no coverage"]
    C --> C2["Bin lifting/LLVM IR: not covered"]
    C --> C3["Pin 3.x update: reader must manage"]
    C --> C4["Windows PE: out of scope"]
    C --> C5["Scaling to binary corpus: not addressed"]

    D --> D1["Security researchers building DBI tooling"]
    D --> D2["Reverse engineers who want to understand their tools"]
    D --> D3["CTF participants in binary challenges"]
    D --> D4["Students preparing for security research careers"]

    E --> E1["Web/app security with no binary interest"]
    E --> E2["Windows-focused malware analysts (use Dang instead)"]
    E --> E3["Readers wanting a Ghidra/IDA user guide"]

[Host]: Thank you both. For anyone listening who wants to understand binary analysis at the level where you can build your own tools rather than just running them, this is the book to start with. Pair it with the current Pin documentation when porting the tools, and you'll be well-equipped for the next decade of analysis, whatever architectures it brings.

[Raj]: And add the angr documentation. Build Andriesse's tools, then build on top of them with a symbolic engine.

[Lena]: And then go solve CTFs. That's how this knowledge becomes instinct.

[Host]: Thanks for joining us on DeepBytes.