health_score returns 0 for 5/11 penalty dimensions and is capped on 2/11 at large-monorepo scale #260

@OmerGronich

Description

What happened?

On a large multi-package TypeScript monorepo, fallow health --hotspots --score returns a health_score in the B grade band while only 4 of 11 penalty dimensions track reality. The other 7 are mathematically incapable of firing at this scale:

| Dimension | Cap | State |
| --- | --- | --- |
| dead_files | 15 | ✅ honest |
| dead_exports | 15 | ✅ honest |
| complexity | 20 | ✅ honest |
| duplication | 10 | ✅ honest |
| p90_complexity | 10 | ⚫ silent (p90_cyc well below the > 10 trigger) |
| maintainability | 15 | ⚫ silent (MI_avg well above the < 70 trigger) |
| hotspots | 10 | ⚫ silent (max ranked score reaches a fraction of the 50.0 filter) |
| unit_size | 10 | ⚫ silent (very_high_risk % below the ≥ 5 % floor) |
| coupling | 5 | ⚫ silent (p95_fan_in well below the > 30 trigger) |
| unused_deps | 10 | 🔴 saturated (actual count an order of magnitude over the cap) |
| circular_deps | 10 | 🔴 saturated (actual count well over an order of magnitude over) |
| total | 130 | score lands in the B band |

One pattern explains all 7 broken dimensions: scale-blind aggregations + low absolute caps.

  • The 5 silent dimensions aggregate per-function/per-file metrics with mean / p90 / fixed-percentage operators, then trigger on a fixed threshold tuned for small/medium projects. At scale, the long tail is mathematically swallowed by the bulk of trivial code (most TS files are tiny utility/barrel/model files; most functions are 1-CC getters and lambdas), so the aggregation never crosses the floor:
    • Tens of thousands of functions live above p90, but p90 itself sits well below > 10.
    • A meaningful absolute count of files have MI < 70, but they're a tiny fraction of the total, so the mean is near 100.
    • Thousands of functions exceed 60 LOC, but they're below the 5 % floor of the function-count denominator.
    • Thousands of files are ranked as hotspots, but the within-project max-norm formula at compute_hotspot_score ((churn/max_churn) × (density/max_density) × 100) is structurally bounded — see §"Related" below.
    • p95_fan_in lands in the single digits because the bottom 95 % of files are barely imported; the actually-coupled barrels live above p99.
  • The 2 saturated dimensions use min(count, 10) on per-repo counts. Reasonable for a single-package project; a no-op in any workspace where N packages multiply the count linearly. The formula treats n=11 and n=1000 identically.
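The dilution is plain arithmetic. A minimal sketch (hypothetical population, not fallow's code; sizes and thresholds chosen to mirror the pattern above) shows 2,000 CC-25 functions hiding below every scale-blind aggregator while a per-1k density sees them immediately:

```javascript
// Hypothetical illustration of scale-blind aggregation, not fallow's code.
// 98,000 trivial 1-CC functions plus 2,000 fat functions with CC 25.
const cc = [...Array(98_000).fill(1), ...Array(2_000).fill(25)];

const sorted = [...cc].sort((a, b) => a - b);
const pct = (q) => sorted[Math.min(sorted.length - 1, Math.floor(q * sorted.length))];

const mean = cc.reduce((s, x) => s + x, 0) / cc.length;
const p90 = pct(0.90);                                     // 1  -> the "> 10" trigger never fires
const p99 = pct(0.99);                                     // 25 -> a p99 trigger would fire
const fatPct = 100 * cc.filter((x) => x > 10).length / cc.length;    // 2 %, below the 5 % floor
const fatPer1k = 1000 * cc.filter((x) => x > 10).length / cc.length; // 20 per 1k functions

console.log({ mean: mean.toFixed(2), p90, p99, fatPct, fatPer1k });
// { mean: '1.48', p90: 1, p99: 25, fatPct: 2, fatPer1k: 20 }
```

Two thousand genuinely complex functions leave the mean at 1.48 and p90 at 1; only the tail-counting aggregators (p99, count-per-1k) register them.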

Net: ~38 % of the penalty budget (50/130 pts) is silently zero, ~15 % (20/130) is pinned at the cap regardless of magnitude. A codebase with thousands of fat functions, hundreds of cycles, and hundreds of unused deps reads as B / mostly healthy.

Per-dimension evidence

p90_complexity (vital_signs.rs:319)

  • clamp(p90_cyclomatic − 10, 0, 10). At large function-population sizes, the bulk are trivial; complex functions live above p99. A p99_cyclomatic (same trigger) or functions_with_cc_above_20 / 1k_functions would survive.

maintainability (vital_signs.rs:323-325)

  • min((70 − MI_avg).max(0) × 0.5, 15). Over 98 % of files have MI ≥ 70, dragging the mean above the trigger. The actionable signal is the small absolute count with MI < 70 — invisible to a mean. maintainability_p10 or count(MI < 70) would survive.

hotspots (vital_signs.rs:331-340 + scores.rs:4)

  • Penalty: min(hotspot_count / total_files × 200, 10) where hotspot_count = files with score ≥ HOTSPOT_SCORE_THRESHOLD (= 50.0).
  • Score: (weighted_commits / max_weighted) × (complexity_density / max_density) × 100.
  • Under max-norm, the theoretical maximum of 1.0 × 1.0 × 100 = 100 is reachable only if a single file is both max-churned and max-density. In practice the top-churned file has moderate density and vice-versa, so the product is structurally bounded well below 50.0. The top-ranked hotspot reaches less than half of the threshold → hotspot_count = 0 even though thousands of files are ranked. Either expose the threshold or count "top N % of the within-project ranking".
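The ceiling is easy to see numerically. A sketch with made-up churn/density values (the formula mirrors the quoted compute_hotspot_score; the data is hypothetical):

```javascript
// Hypothetical data: the max-churned file has moderate density and vice versa.
// Score formula as quoted: (churn/maxChurn) * (density/maxDensity) * 100.
const files = [
  { churn: 400, density: 0.9 },  // hottest by churn, middling density
  { churn: 60,  density: 3.0 },  // densest file, low churn
  { churn: 120, density: 1.4 },
  { churn: 30,  density: 0.2 },
];
const maxChurn = Math.max(...files.map((f) => f.churn));     // 400
const maxDensity = Math.max(...files.map((f) => f.density)); // 3.0

const scores = files.map(
  (f) => (f.churn / maxChurn) * (f.density / maxDensity) * 100,
);
const top = Math.max(...scores);
console.log(top.toFixed(1)); // 30.0 -> below HOTSPOT_SCORE_THRESHOLD = 50.0,
                             // so hotspot_count = 0 despite a clear ranking
```

Unless one file holds both maxima simultaneously, every product lands under the filter; the ranking is informative, the absolute threshold is not.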

unit_size (vital_signs.rs:359-365)

  • min((very_high_risk_pct − 5).max(0) × 0.5, 10), very_high_risk = % of functions > 60 LOC. A substantial absolute inventory of functions over 60 LOC stays invisible because it's a small fraction of a large function-count denominator. Lower the floor (~1 %) or switch to functions_over_60_loc / 1k_functions.

coupling (vital_signs.rs:368-373)

  • min((p95_fan_in − 30).max(0) × 0.25, 5). Fan-in is heavy-tailed. p95 is in the single digits because the bottom 95 % is barely imported — not because there are no hubs. p99_fan_in (same trigger) or the already-computed coupling_high_pct (vital_signs.rs:285) would work.
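The same percentile arithmetic applies to fan-in. A sketch with a synthetic heavy-tailed distribution (hypothetical counts, not real data): a handful of heavily-imported barrels among thousands of leaf files leaves p95 in the single digits while p99 clears the trigger:

```javascript
// 10,000 files: 9,850 leaf/util files with fan-in 0..3, 150 hub barrels.
const fanIn = [
  ...Array.from({ length: 9_850 }, (_, i) => i % 4), // barely-imported leaves
  ...Array(150).fill(120),                           // the actual hubs
].sort((a, b) => a - b);

const pct = (q) => fanIn[Math.floor(q * fanIn.length)];
const p95 = pct(0.95);  // 3   -> "(p95_fan_in - 30)" penalty is 0
const p99 = pct(0.99);  // 120 -> a p99 trigger with the same "- 30" would fire
console.log({ p95, p99 }); // { p95: 3, p99: 120 }
```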

unused_deps & circular_deps (saturated) — vital_signs.rs:343-356

  • min(count, 10) for both. unused_dep_count exceeds the cap by an order of magnitude; circular_dep_count by well over an order of magnitude. Counts grow ~linearly with workspace package count; the cap was reasonable for a single-package project but is a no-op in any monorepo. Recommended replacement: per-1k-files density.
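The difference between the cap and a density can be sketched directly (hypothetical counts; the density formula is the per-1k-files replacement proposed in the table below, count / 1k files × 0.5, cap 25):

```javascript
// Current aggregator vs. the proposed per-1k-files density (hypothetical counts).
const capPenalty = (count) => Math.min(count, 10);                 // current
const densityPenalty = (count, files) =>
  Math.min((count / (files / 1000)) * 0.5, 25);                    // proposed

// Current cap: 11 cycles and 450 cycles are indistinguishable.
console.log(capPenalty(11), capPenalty(450));   // 10 10

// Density: the same bad-code density scores the same at any repo size...
console.log(densityPenalty(5, 1_000));          // 2.5
console.log(densityPenalty(500, 100_000));      // 2.5
// ...while 450 cycles across 25k files is still visibly worse.
console.log(densityPenalty(450, 25_000));       // 9
```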

Recommended fix: scale-invariant aggregations as the new default

A metric should ask "what fraction of your code is bad?" — not "are you big enough to dilute the bad code below a threshold?"

| Dimension | Current scale-blind aggregator | Scale-invariant replacement |
| --- | --- | --- |
| complexity | avg_cyclomatic (mean over all functions) | count(cc ≥ critical) / 1k functions |
| p90_complexity | p90_cyclomatic > 10 | (subsumed by complexity tail metric — drop) |
| maintainability | mean(MI) < 70 | % of files with MI < 70 |
| hotspots | count(score ≥ 50) / total_files × 200 | top 1 % of within-project hotspot ranking / total_files × 200 |
| unit_size | % of functions > 60 LOC, trigger > 5 % | count(functions > 60 LOC) / 1k functions |
| coupling | p95_fan_in − 30 | coupling_high_pct (already computed) |
| unused_deps | min(count, 10) | count / 1k files × 0.5, cap 25 |
| circular_deps | min(count, 10) | count / 1k files × 0.5, cap 25 |

Every replacement is scale-invariant by construction — bigger codebases neither get a leniency dividend nor a size penalty. A 1K-file project and a 100K-file project with the same density of bad code score identically.

A small project (e.g. 1K files, single-digit unused deps, 1-2 cycles) sees a small improvement under the new densities, not a regression — density-based aggregators are simultaneously small-project-friendly and monorepo-honest.

Fallback ask (if changing defaults is too invasive)

If shipping these as new defaults moves every existing user's grade, the minimum useful change is to expose the scale-invariant primitives as new vital_signs fields alongside the existing scale-blind ones, so dashboards and CI gates can compute honest scores externally:

vital_signs.functions_above_critical_cc_per_k   // replaces avg_cyc + p90_cyc
vital_signs.functions_above_60_loc_per_k        // replaces unit_size very_high_risk
vital_signs.maintainability_pct_below_70        // replaces maintainability_avg
vital_signs.hotspots_top_pct_count              // replaces hotspot_count
vital_signs.unused_deps_per_k_files             // replaces saturated unused_dep_count
vital_signs.circular_deps_per_k_files           // replaces saturated circular_dep_count

This moves no existing grade and lets large monorepos compute honest scores externally. It is strictly worse than fixing the defaults (fallow's own health_score would still report B when the data says D), but it is the smallest useful change.

Configurability audit (none of this is tunable today)

HealthConfig: the only score-relevant knob is health.ignore (denominator filter). All seven broken-dimension constants are hardcoded:

| Constant | Location | Value |
| --- | --- | --- |
| HOTSPOT_SCORE_THRESHOLD | scores.rs:4 | 50.0 |
| MI_DENSITY_MIN_LINES | scores.rs:24 | 50.0 |
| Per-dimension caps + floors | vital_signs.rs:295-404 | inline |
| Aggregator choice (mean / p90 / p95) | vital_signs.rs:69-288 | inline |
| HALF_LIFE_DAYS (hotspot half-life) | crates/core/src/churn.rs | 90 |

HealthConfig.maxCyclomatic / maxCognitive / maxCrap only affect finding emission, not the score — confirmed in compute_health_score, which never reads them. CLI flags --since / --min-commits widen the hotspot window but don't affect HOTSPOT_SCORE_THRESHOLD or the max-norm. No .fallowrc.json or CLI combination can move this score from B to its honest grade — scale-blindness lives in source-level constants.

Related: upstream signal defects

Two broken dimensions have defects in the upstream signal, not just in how the score consumes them. Even with compute_health_score() fixed, these will remain silent until the upstream signal is also addressed. Happy to file as companion issues.

  1. Hotspot scoring algorithm has a structural ceiling well below the threshold. compute_hotspot_score returns (weighted_commits / max_weighted) × (complexity_density / max_density) × 100. To reach 100 (or even 50), one file must be both max-churned and max-density. In real codebases the top-churned file has moderate density and vice-versa, so the product is structurally bounded well below the HOTSPOT_SCORE_THRESHOLD = 50.0 filter at vital_signs.rs:131. On any sufficiently large repo, top-ranked hotspots reach only a fraction of 50 → hotspot_count is always 0. A percentile-based filter ("files in the top 1 % of the within-project hotspot ranking") would survive max-norm compression.

  2. MI per-file formula's small-file dampening pushes most files to MI ≥ 70. compute_maintainability_index is 100 − density × 30 × dampening − dead_ratio × 20 − min(ln1p(fan_out) × 4, 15) where dampening = min(lines / MI_DENSITY_MIN_LINES, 1.0). Files under 50 LOC (barrels, models, utility) get density damped toward 0, pinning their MI near 100 regardless of internal complexity. Result: well over 98 % of scored files end up with MI ≥ 70 on any TS-heavy codebase. Fixing the score-formula aggregator alone helps but per-file MI is still inflated.
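The dampening effect can be checked against the formula as quoted (a direct transcription of the expression above; the inputs are hypothetical, with dead_ratio and fan_out set to 0 to isolate the density term):

```javascript
// MI formula as quoted: 100 - density*30*dampening - dead_ratio*20
//   - min(ln1p(fan_out)*4, 15), with dampening = min(lines / 50, 1.0).
const MI_DENSITY_MIN_LINES = 50;
const mi = (density, lines, deadRatio = 0, fanOut = 0) => {
  const dampening = Math.min(lines / MI_DENSITY_MIN_LINES, 1.0);
  return 100 - density * 30 * dampening - deadRatio * 20
    - Math.min(Math.log1p(fanOut) * 4, 15);
};

// A 10-line file with pathological complexity density 2.0 stays "healthy":
console.log(mi(2.0, 10).toFixed(1));   // 88.0  (dampening 0.2 -> MI >= 70)
// The same density in a 100-line file drops well below the trigger:
console.log(mi(2.0, 100).toFixed(1));  // 40.0  (dampening 1.0)
```

Since most files in a TS-heavy codebase are under 50 LOC, the dampening term pins their MI near 100 and drags the mean with it, which is the mechanism behind the silent maintainability dimension.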

Why this matters

scores.rs:26-49 describes health_score as a comprehensible 0–100 summary suitable for dashboards and CI gates. With 5/11 dimensions silently 0 and 2/11 saturated, the score is structurally unable to communicate "really, really bad" for any sufficiently large project. The underlying data is excellent; the problem is in how the score formula aggregates it.

Reproduction

The bug is deterministic in the formula — given inputs in the shape produced by any large TS monorepo, compute_health_score() returns a B-band score with five 0.0 penalties and two saturated 10.0 penalties. No real codebase required.

Easiest: drop a unit test into fallow's own test suite

Following the existing pattern in vital_signs.rs:1135+ (health_score_perfect, etc.):

#[test]
fn health_score_silent_and_saturated_at_monorepo_scale() {
    // Inputs in the shape produced by any large multi-package TS monorepo.
    // Small perturbations don't change the qualitative result.
    let total_files: usize = 25_000;
    let vs = VitalSigns {
        // honest dimensions
        dead_file_pct:      Some(4.0),
        dead_export_pct:    Some(9.0),
        avg_cyclomatic:     2.3,
        duplication_pct:    Some(6.0),

        // silent dimensions — every value is "long-tail-hidden"
        p90_cyclomatic:     4,
        maintainability_avg:Some(91.0),  // mean dominated by small files
        hotspot_count:      Some(0),     // none cross HOTSPOT_SCORE_THRESHOLD = 50
        unit_size_profile:  Some(RiskProfile { very_high_risk: 2.3, ..Default::default() }),
        p95_fan_in:         Some(7),

        // saturated dimensions — counts grow with workspace package count
        unused_dep_count:   Some(180),
        circular_dep_count: Some(450),
        ..Default::default()
    };
    let score = compute_health_score(&vs, total_files);
    let p = &score.penalties;

    // 4 honest dimensions
    assert!(p.dead_files.unwrap()   > 0.0 && p.dead_files.unwrap()   < 5.0);
    assert!(p.dead_exports.unwrap() > 0.0 && p.dead_exports.unwrap() < 5.0);
    assert!(p.complexity            > 0.0 && p.complexity            < 10.0);
    assert!(p.duplication.unwrap()  > 0.0 && p.duplication.unwrap()  < 5.0);

    // 5 silent dimensions
    assert_eq!(p.p90_complexity,            0.0);
    assert_eq!(p.maintainability.unwrap(),  0.0);
    assert_eq!(p.hotspots.unwrap(),         0.0);
    assert_eq!(p.unit_size.unwrap(),        0.0);
    assert_eq!(p.coupling.unwrap(),         0.0);

    // 2 saturated dimensions
    assert_eq!(p.unused_deps.unwrap(),      10.0);
    assert_eq!(p.circular_deps.unwrap(),    10.0);

    assert_eq!(score.grade, "B");
}

Self-contained, runs in milliseconds. Same test with the recommended scale-invariant aggregators should drop the score by roughly one and a half letter grades (B → D).

End-to-end: synthetic monorepo generator

The script below produces a fully synthetic TS workspace whose vital_signs reproduces the broken-dimension pattern end-to-end. Defaults generate ~21K files in ~3.5 min (mostly git churn); --commits-per-fat-file=2 runs in under a minute with the same pattern. Smaller --packages / --files-per-pkg produce the partial pattern (3-4 silent dimensions).

node generate-monorepo.mjs ./repro                      # defaults: 80 pkgs × 250 files
cd ./repro && fallow health --hotspots --score --format json --quiet \
  | jq '(.health.health_score | {score, grade, penalties}), .health.vital_signs'

Expected at defaults: score in the C band (~65), 5 of 11 penalties at 0.0 (p90_complexity, maintainability, unit_size, coupling, plus dead_files / dead_exports since synthetic data has no deads), 2 saturated at 10.0 (unused_deps, circular_deps). Bumping --fat-fns-per-pkg past 5 silences hotspots and lifts the score into B.

generate-monorepo.mjs:
#!/usr/bin/env node
// Reproduces fallow's health_score scale-blindness pattern (5 silent + 2 saturated).
// Usage: node generate-monorepo.mjs <out-dir> [--packages=80] [--files-per-pkg=250]
//        [--fat-fns-per-pkg=5] [--cycles-per-pkg=6] [--unused-deps-per-pkg=3]
//        [--commits-per-fat-file=8]

import { mkdirSync, writeFileSync, existsSync, rmSync } from 'node:fs';
import { execSync } from 'node:child_process';
import { join } from 'node:path';

const args = Object.fromEntries(process.argv.slice(2).filter(a => a.startsWith('--'))
  .map(a => { const [k, v] = a.replace(/^--/, '').split('='); return [k, v ?? true]; }));
const outDir = process.argv.find((a, i) => i > 1 && !a.startsWith('--')) ?? './repro';
const PACKAGES            = Number(args.packages              ?? 80);
const FILES_PER_PKG       = Number(args['files-per-pkg']      ?? 250);
const FAT_FNS_PER_PKG     = Number(args['fat-fns-per-pkg']    ?? 5);
const CYCLES_PER_PKG      = Number(args['cycles-per-pkg']     ?? 6);
const UNUSED_DEPS_PER_PKG = Number(args['unused-deps-per-pkg']?? 3);
const COMMITS_PER_FAT     = Number(args['commits-per-fat-file'] ?? 8);

if (existsSync(outDir)) rmSync(outDir, { recursive: true, force: true });
mkdirSync(outDir, { recursive: true });

writeFileSync(join(outDir, 'package.json'), JSON.stringify({
  name: 'fallow-repro', private: true,
  workspaces: Array.from({ length: PACKAGES }, (_, i) => `packages/pkg-${i}`),
}, null, 2));
writeFileSync(join(outDir, 'tsconfig.json'), JSON.stringify({
  compilerOptions: { target: 'ES2022', module: 'ESNext', moduleResolution: 'bundler', strict: true, skipLibCheck: true },
}, null, 2));

// Trivial file = 1 trivial fn (1-CC). Drives p90_cyc mean, very_high_risk %, MI mean.
const trivial = (p, i) =>
  `// pkg-${p} v${i}\nexport function get_v${i}_${p}(): number { return ${i} + ${p}; }\nexport const v${i}_${p} = ${i * (p + 1)};\n`;

// Fat file = 1 nested-switch fn → high CC, > 60 LOC.
const fat = (p, idx) => {
  const branches = Array.from({ length: 12 }, (_, b) => `    case ${b}: { switch (mode) {
      case 'a': return ${b} * 2 + ${p}; case 'b': return ${b} + 1 - ${idx};
      case 'c': return ${b} - 1 * ${p}; case 'd': return ${b} ** 2 + ${idx};
      default: return ${b} + ${p}; } }`).join('\n');
  return `export function fatFn_${p}_${idx}(input: number, mode: 'a'|'b'|'c'|'d'): number {
  switch (input) {\n${branches}\n    default: return input + ${p};\n  }\n}\n`;
};

// Cycle pair = two intra-package files importing each other. One pair = one cycle.
const cycA = (p, i) => `import { b_${p}_${i} } from './cycle-${i}-b';\nexport const a_${p}_${i} = b_${p}_${i} + ${p};\n`;
const cycB = (p, i) => `import { a_${p}_${i} } from './cycle-${i}-a';\nexport const b_${p}_${i} = a_${p}_${i} + ${i};\n`;

const barrel = (p) => {
  const L = [];
  for (let i = 0; i < FILES_PER_PKG; i++)   L.push(`export * from './v${i}';`);
  for (let f = 0; f < FAT_FNS_PER_PKG; f++) L.push(`export * from './fat-${f}';`);
  for (let c = 0; c < CYCLES_PER_PKG; c++)  { L.push(`export * from './cycle-${c}-a';`); L.push(`export * from './cycle-${c}-b';`); }
  return L.join('\n') + '\n';
};

// Pool of public packages claimed as deps but never imported.
const POOL = ['lodash', 'rxjs', 'date-fns', 'uuid', 'chalk', 'yargs', 'minimist', 'zod', 'axios', 'commander'];

for (let p = 0; p < PACKAGES; p++) {
  const dir = join(outDir, 'packages', `pkg-${p}`);
  mkdirSync(join(dir, 'src'), { recursive: true });
  const devDeps = {};
  for (let u = 0; u < UNUSED_DEPS_PER_PKG; u++) devDeps[POOL[(p + u) % POOL.length]] = '*';
  writeFileSync(join(dir, 'package.json'), JSON.stringify({
    name: `pkg-${p}`, version: '0.0.0', main: './src/barrel.ts', types: './src/barrel.ts',
    devDependencies: devDeps,
  }, null, 2));
  for (let f = 0; f < FILES_PER_PKG;   f++) writeFileSync(join(dir, 'src', `v${f}.ts`),    trivial(p, f));
  for (let f = 0; f < FAT_FNS_PER_PKG; f++) writeFileSync(join(dir, 'src', `fat-${f}.ts`), fat(p, f));
  for (let c = 0; c < CYCLES_PER_PKG;  c++) {
    writeFileSync(join(dir, 'src', `cycle-${c}-a.ts`), cycA(p, c));
    writeFileSync(join(dir, 'src', `cycle-${c}-b.ts`), cycB(p, c));
  }
  writeFileSync(join(dir, 'src', 'barrel.ts'), barrel(p));
}

// Hotspots need git history: commit-burst on each fat file.
const sh = (cmd) => execSync(cmd, { cwd: outDir, stdio: ['ignore', 'ignore', 'inherit'] });
sh('git init -q -b main && git config user.email s@s && git config user.name s && git add . && git -c commit.gpgsign=false commit -q -m init');
let date = new Date('2024-01-01T00:00:00Z').getTime();
for (let p = 0; p < PACKAGES; p++) for (let f = 0; f < FAT_FNS_PER_PKG; f++) {
  const path = `packages/pkg-${p}/src/fat-${f}.ts`;
  for (let c = 0; c < COMMITS_PER_FAT; c++) {
    sh(`printf '\\n// tweak ${c}\\n' >> "${path}"`);
    sh(`git -c commit.gpgsign=false -c user.email=s@s -c user.name=s commit -q --allow-empty-message --date="${new Date(date).toISOString()}" -am tweak`);
    date += 6 * 60 * 60 * 1000 + Math.floor(Math.random() * 6 * 60 * 60 * 1000);
  }
}
console.log(`Done. cd ${outDir} && fallow health --hotspots --score --format json --quiet | jq '.health.health_score'`);

Knob → score-formula-input mapping:

| Knob | Drives |
| --- | --- |
| --packages | Workspace package count → unused_deps & circular_deps saturation |
| --files-per-pkg | Total file count → silences unit_size, maintainability, hotspots |
| --fat-fns-per-pkg | Fat-function tail (invisible to mean / p90 / fixed-percent) |
| --cycles-per-pkg | Intra-package cycle count → circular_dep_count |
| --unused-deps-per-pkg | unused_dep_count per package |
| --commits-per-fat-file | Hotspot churn distribution |

Optional: against a real codebase

cd <large-ts-monorepo>
git fetch --unshallow      # so hotspots have a real distribution (default --since 6m)
fallow health --hotspots --score --format json --quiet \
  | jq '.health.health_score.penalties, .health.vital_signs'

Look for: dimensions reporting 0.0 in .penalties paired with non-zero "bulk" inputs in .vital_signs (p90_cyclomatic > 0, maintainability_avg > 0, unit_size_profile.very_high_risk > 0, p95_fan_in > 0), plus unused_deps / circular_deps pegged at 10.

Expected behavior

The formula behaves exactly as written; that is the case for changing it. At large scale the calibration produces structurally false signal: not "wrong by a few points" but "5 of 11 dimensions cannot fire under any input distribution this codebase shape will produce", and "11 vs 1,000 unused deps score identically".

health_score should:

  1. Fire on a codebase containing thousands of fat functions, a measurable absolute count of files with MI < 70, and thousands of ranked hotspots.
  2. Differentiate small vs medium vs large vs catastrophic dep / cycle counts rather than collapsing them all to 10 pts.
  3. Produce different letter grades for repos with order-of-magnitude differences in bad-code volume.

Fallow version

fallow 2.62.0

Operating system

macOS

Configuration

default
