Commit 613336d

feat(API): reject writes when disk space drops below 512 MiB
SQLite returns SQLITE_FULL mid-transaction when the underlying filesystem runs out of space, which can corrupt the WAL and leave the V1/V2 databases inconsistent. Add a defensive guard that periodically inspects free space on the cadt/v1 and cadt/v2 data directories and:

- rejects POST/PUT/PATCH writes with HTTP 507 Insufficient Storage when free space drops below 512 MiB (sized for the existing 100 MB filestore upload limit + typical WAL growth during batch operations)
- emits a debounced log warning when free space drops below 1 GiB so operators have advance notice before writes are rejected
- always permits GET reads and DELETE so operators can free space without restarting the service

Status (severity, freeBytes, threshold) is exposed under the diskSpace field on /health, /v1/health, and /v2/health for monitoring. /health uses a non-blocking peek of the cached statfs result so liveness probes never wait on a slow filesystem (a synchronous statfs round trip can trip the default k8s livenessProbe timeout precisely when /health most needs to respond).

Severity transitions (ok->warn, warn->block, etc.) drive a single log line each. The earlier time-windowed debounce was vulnerable to /health scrape traffic refreshing the timestamp and silencing the operator-facing block alert when a real write first hit the threshold.

Mirrors the frsize ?? bsize byte math used by the existing /diagnostics endpoint so the two views agree on free space.

Thresholds are intentionally not user-configurable; the values are sized to the existing upload limits and lowering them risks the SQLITE_FULL the guard is designed to prevent.
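The threshold rules in the message can be sketched as a small pure function. This is a minimal illustration, not the code in `src/utils/disk-space.js` (which is not shown on this page): the function name `classifySeverity` is hypothetical, the comparison is assumed to be strict because the message says "drops below", and mapping a missing reading to `unknown` is an assumption.

```javascript
// Threshold values come from the commit message.
const MIB = 1024 * 1024;
const BLOCK_BYTES = 512 * MIB; // writes are rejected below this
const WARN_BYTES = 1024 * MIB; // 1 GiB: a warning is logged below this

// Hypothetical helper: map a free-space reading (in bytes) to a severity label.
function classifySeverity(freeBytes) {
  // Assumption: no usable reading is reported as 'unknown'.
  if (freeBytes === null || freeBytes === undefined) return 'unknown';
  // Assumption: strict comparison, so exactly 512 MiB free is not blocked.
  if (freeBytes < BLOCK_BYTES) return 'block';
  if (freeBytes < WARN_BYTES) return 'warn';
  return 'ok';
}
```

Under these assumptions a reading of exactly 512 MiB classifies as `warn` and a reading of exactly 1 GiB classifies as `ok`.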
1 parent 6e1636c commit 613336d

6 files changed

Lines changed: 668 additions & 0 deletions

README.md

Lines changed: 9 additions & 0 deletions
@@ -44,6 +44,15 @@ CADT and Chia system usage will depend on many factors, including how busy the b

 ARM and x86 systems are supported. While Windows, MacOS, and all versions of Linux are supported, Ubuntu Linux is the recommended operating system as it is used most in testing and our internal hosting.

+#### Disk space guard
+
+CADT defensively rejects write requests when the filesystem holding the V1/V2 SQLite databases drops below a low-space threshold, so a full disk cannot corrupt the database mid-transaction. Operators should monitor for this and free space (or expand the volume) before it triggers.
+
+* When free space falls below **1 GiB** (2³⁰ bytes), CADT logs a warning. Writes are still served.
+* When free space falls below **512 MiB** (2²⁰ × 512 bytes), CADT rejects `POST`, `PUT`, and `PATCH` requests with HTTP `507 Insufficient Storage` and logs an error. `GET` reads and `DELETE` requests are still served so operators can free space without restarting the service. Note that a very large `DELETE` (e.g., a cascading delete that touches many rows) writes to the SQLite WAL during the transaction; if the disk is critically low even an allowed `DELETE` may run out of space before the next checkpoint, so prefer freeing files outside the database first.
+* Log lines are emitted only on **severity transitions** (`ok` → `warn`, `warn` → `block`, recovery to `ok`, etc.), not on every observation, so monitoring scrapes against `/health` cannot drown out the alert when free space first drops below a threshold.
+* Current status is exposed under the `diskSpace` field on the `/health`, `/v1/health`, and `/v2/health` endpoints for monitoring. The field is `null` until the first probe completes, otherwise an object with `severity` (`ok` / `warn` / `block` / `unknown`), `freeBytes` (the lowest free-space figure observed across the V1/V2 data directories, or `null` when every probe failed), and the `blockBytes` / `warnBytes` thresholds. The 507 response body itself only contains `{message, error: "INSUFFICIENT_DISK_SPACE", success: false}`; exact byte counts are reserved for `/health` so anonymous callers (the disk-space gate runs before the `CADT_API_KEY` check) cannot enumerate operational state.
+
 ### Linux

 A binary file that can run on all Linux distributions on x86 hardware can be found for each tagged release named `cadt-linux-x64-<version>.zip`. This zip file will extract to the `cadt-linux-64` directory by default, where the `cadt` file can be executed to run the API.
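The README hunk above describes `freeBytes` as the lowest figure across the V1/V2 data directories, and the commit message names a `frsize ?? bsize` rule for the byte math. A minimal sketch of both follows. The helper names are hypothetical, and multiplying by `bavail` (blocks available to unprivileged processes) rather than `bfree` is an assumption, since the guard's source is not shown on this page.

```javascript
// Hypothetical helper: convert a statfs-style result into free bytes.
// Prefers the fragment size when the platform reports one and falls back
// to the block size, following the "frsize ?? bsize" rule.
function freeBytesFromStatfs(stats) {
  const unit = stats.frsize ?? stats.bsize;
  // Assumption: bavail is the block count of interest.
  // Number() also accepts the BigInt fields some statfs variants return.
  return Number(unit) * Number(stats.bavail);
}

// Hypothetical helper: lowest reading across directories.
// Failed probes are represented as null; if every probe failed, return null.
function lowestFreeBytes(readings) {
  const usable = readings.filter(
    (value) => typeof value === 'number' && Number.isFinite(value),
  );
  return usable.length === 0 ? null : Math.min(...usable);
}
```

Taking the minimum means the guard reacts to whichever data directory is closest to full, which matters when V1 and V2 live on different volumes.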

src/middleware.js

Lines changed: 62 additions & 0 deletions
@@ -21,6 +21,14 @@ import { OrganizationsV2 } from './models/v2/index.js';
 import { logger } from './config/logger.js';
 import { sendReadOnlyError } from './utils/read-only-response.js';
 import { getRateLimitRetryAfterSeconds } from './utils/rate-limit.js';
+import {
+  checkDiskSpace,
+  peekDiskSpaceStatus,
+  refreshDiskSpaceStatus,
+  logDiskSpaceStatus,
+  buildInsufficientDiskSpaceError,
+  isWriteRejectedByDiskGuard,
+} from './utils/disk-space.js';

 const { USE_SIMULATOR } = getConfig().APP;

@@ -183,6 +191,35 @@ app.use(async function (req, res, next) {
       return sendReadOnlyError(res);
     }

+    // Defensive disk-space guard. SQLite (the V1/V2 backing store) returns
+    // SQLITE_FULL mid-transaction when the underlying filesystem runs out
+    // of space, which can corrupt the WAL and leave the DB inconsistent.
+    // Reject POST/PUT/PATCH below the hardcoded BLOCK_BYTES threshold
+    // (see src/utils/disk-space.js) so reads stay available and operators
+    // can free space via DELETE without restarting. Runs after READ_ONLY
+    // (configuration wins over transient state) but before the wallet/
+    // datalayer RPC checks below so we don't pay an RPC cost on a request
+    // we already know we can't persist.
+    if (isWriteRejectedByDiskGuard(req.method)) {
+      let diskStatus = null;
+      try {
+        diskStatus = await checkDiskSpace();
+        logDiskSpaceStatus(diskStatus);
+      } catch (diskErr) {
+        // Fail open: a bug in the guard itself must not take down writes.
+        // computeStatus() swallows per-directory statfs errors and the
+        // dir-resolution helpers don't throw, so reaching this branch
+        // means something unexpected happened (e.g., getChiaRoot blew up
+        // because os.homedir() failed under a misconfigured PID-1).
+        logger.error(
+          `disk-space-guard: unexpected check failure: ${diskErr.message}`,
+        );
+      }
+      if (diskStatus && diskStatus.severity === 'block') {
+        return res.status(507).json(buildInsufficientDiskSpaceError());
+      }
+    }
+
     await assertChiaNetworkMatchInConfiguration();
     await assertDataLayerAvailable();
     if (req.method !== 'GET') {
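The hunk above calls `isWriteRejectedByDiskGuard` and `buildInsufficientDiskSpaceError`, whose bodies live in `src/utils/disk-space.js` and are not shown on this page. The stand-ins below sketch what such helpers might look like based only on the documented behaviour: `POST`, `PUT`, and `PATCH` are guarded, and the 507 body carries `message`, `error`, and `success`. The function names, the case-insensitive method match, and the message wording are all assumptions.

```javascript
// Methods the guard applies to, per the commit message and README.
const GUARDED_METHODS = new Set(['POST', 'PUT', 'PATCH']);

// Hypothetical stand-in: true when the method is subject to the guard.
// GET and DELETE fall through so operators can still read and free space.
function isGuardedWriteMethod(method) {
  return GUARDED_METHODS.has(String(method).toUpperCase());
}

// Hypothetical stand-in: the 507 response body. It deliberately carries no
// byte counts, matching the README's note that those are reserved for /health.
function insufficientDiskSpaceBody() {
  return {
    message: 'Insufficient disk space to accept writes.', // wording assumed
    error: 'INSUFFICIENT_DISK_SPACE',
    success: false,
  };
}
```

Keeping the body free of operational detail fits the ordering described in the README, where the gate runs before the API-key check and can therefore be reached by anonymous callers.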
@@ -507,9 +544,34 @@ app.use(async function (req, res, next) {
 });

 app.get('/health', (req, res) => {
+  // Surface disk-space status so monitoring can alert before writes get
+  // blocked. Use the non-blocking peek (cached value only) and kick off
+  // an async refresh in the background; synchronously awaiting statfs
+  // here can cause k8s liveness probes (default 1 s timeout) to fail
+  // when the filesystem is slow or wedged, which is exactly when /health
+  // most needs to stay responsive.
+  let diskSpace = null;
+  try {
+    const status = peekDiskSpaceStatus();
+    if (status) {
+      diskSpace = {
+        severity: status.severity,
+        freeBytes: status.freeBytes,
+        blockBytes: status.blockBytes,
+        warnBytes: status.warnBytes,
+      };
+      // Drive the debounced log from /health too so a mostly-idle CADT
+      // (no writes in flight) still surfaces warn/block transitions.
+      logDiskSpaceStatus(status);
+    }
+    refreshDiskSpaceStatus();
+  } catch (err) {
+    logger.debug(`disk-space-guard: /health peek failed: ${err.message}`);
+  }
   res.status(200).json({
     message: 'OK',
     timestamp: new Date().toISOString(),
+    diskSpace,
   });
 });
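The peek-then-refresh pattern used by the `/health` handler above can be sketched as a tiny cache. This is an illustration under stated assumptions, not the implementation behind `peekDiskSpaceStatus` and `refreshDiskSpaceStatus`: `createDiskSpaceCache` and `probe` are hypothetical names, and sharing a single in-flight promise is one possible way to keep repeated scrapes from stacking up filesystem calls.

```javascript
// Hypothetical cache factory. `probe` is any async function that returns a
// status object such as { severity, freeBytes, blockBytes, warnBytes }.
function createDiskSpaceCache(probe) {
  let cached = null; // last completed status; null before the first probe
  let inFlight = null; // pending refresh, shared by concurrent callers

  return {
    // Synchronous read of the cached value; never touches the filesystem.
    peek() {
      return cached;
    },
    // Starts a background refresh unless one is already running.
    // A failed probe keeps the previous cached value.
    refresh() {
      if (!inFlight) {
        inFlight = Promise.resolve()
          .then(probe)
          .then((status) => {
            cached = status;
          })
          .catch(() => {})
          .finally(() => {
            inFlight = null;
          });
      }
      return inFlight;
    },
  };
}
```

With this shape a handler can call `peek()` to build its response immediately and then call `refresh()` without awaiting it, so a slow filesystem delays only the next cached value, not the HTTP response.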

src/routes/v1/index.js

Lines changed: 27 additions & 0 deletions
@@ -3,6 +3,12 @@
 import express from 'express';
 import { getWalletHealthResponse } from '../wallet-health.js';
 import { getConfig } from '../../utils/config-loader';
+import {
+  peekDiskSpaceStatus,
+  refreshDiskSpaceStatus,
+  logDiskSpaceStatus,
+} from '../../utils/disk-space.js';
+import { logger } from '../../config/logger.js';
 const V1Router = express.Router();

 import {
@@ -19,9 +25,30 @@ import {
 } from './resources';

 V1Router.get('/health', (req, res) => {
+  // Non-blocking: use cached disk-space status only and trigger an async
+  // refresh. Synchronously awaiting statfs would risk timing out k8s
+  // liveness probes when the filesystem is slow. See bare /health in
+  // src/middleware.js for full rationale.
+  let diskSpace = null;
+  try {
+    const status = peekDiskSpaceStatus();
+    if (status) {
+      diskSpace = {
+        severity: status.severity,
+        freeBytes: status.freeBytes,
+        blockBytes: status.blockBytes,
+        warnBytes: status.warnBytes,
+      };
+      logDiskSpaceStatus(status);
+    }
+    refreshDiskSpaceStatus();
+  } catch (err) {
+    logger.debug(`disk-space-guard: /v1/health peek failed: ${err.message}`);
+  }
   res.status(200).json({
     message: 'V1 API is running',
     timestamp: new Date().toISOString(),
+    diskSpace,
   });
 });
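Because every health handler calls `logDiskSpaceStatus`, the logger has to stay quiet unless the severity actually changes. The sketch below shows one way to log only on transitions, as the commit message describes. `createTransitionLogger` is a hypothetical name, and starting from an assumed `ok` state (so a healthy startup logs nothing) is an assumption rather than documented behaviour.

```javascript
// Hypothetical factory. `log` is any function that accepts a string,
// e.g. a logger method.
function createTransitionLogger(log) {
  // Assumption: treat the process as 'ok' until a reading says otherwise.
  let lastSeverity = 'ok';

  // Returns true when a line was emitted, false when the call was a repeat.
  return function logOnTransition(status) {
    if (!status || status.severity === lastSeverity) return false;
    log(
      `disk-space-guard: ${lastSeverity} -> ${status.severity} ` +
        `(freeBytes=${status.freeBytes})`,
    );
    lastSeverity = status.severity;
    return true;
  };
}
```

Unlike a time-windowed debounce, this state has no timestamp for scrape traffic to refresh, so a burst of `/health` calls at the same severity cannot suppress the line for the next real change.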

src/routes/v2/index.js

Lines changed: 27 additions & 0 deletions
@@ -28,13 +28,40 @@ import { OrganizationsV2Router } from './resources/organizations-v2.js';
 import { AuditV2Router } from './resources/audit-v2.js';
 import { OfferV2Router } from './resources/offer-v2.js';
 import { FilestoreV2Router } from './resources/filestore-v2.js';
+import {
+  peekDiskSpaceStatus,
+  refreshDiskSpaceStatus,
+  logDiskSpaceStatus,
+} from '../../utils/disk-space.js';
+import { logger } from '../../config/logger.js';

 const V2Router = express.Router();

 V2Router.get('/health', (req, res) => {
+  // Non-blocking: use cached disk-space status only and trigger an async
+  // refresh. Synchronously awaiting statfs would risk timing out k8s
+  // liveness probes when the filesystem is slow. See bare /health in
+  // src/middleware.js for full rationale.
+  let diskSpace = null;
+  try {
+    const status = peekDiskSpaceStatus();
+    if (status) {
+      diskSpace = {
+        severity: status.severity,
+        freeBytes: status.freeBytes,
+        blockBytes: status.blockBytes,
+        warnBytes: status.warnBytes,
+      };
+      logDiskSpaceStatus(status);
+    }
+    refreshDiskSpaceStatus();
+  } catch (err) {
+    logger.debug(`disk-space-guard: /v2/health peek failed: ${err.message}`);
+  }
   res.status(200).json({
     message: 'V2 API is running',
     timestamp: new Date().toISOString(),
+    diskSpace,
   });
 });
