Nightingale on Rails: Bayesian Adaptive Psychiatric Screening
Inspiration
All members of the team personally struggle with some combination of mild depression, anxiety, and rumination, and we are hardly the only ones. It's trite to say that poor mental health has become an epidemic, and equally trite to point out that resources for better mental health are scarce. At UofT, for example, free mental health support does not mean prompt mental health support (we speak from experience). So we asked ourselves: what if there were a way to reduce the workload on psychiatrists while simultaneously making diagnosis more efficient?
What it does
Nightingale on Rails (NoR) is named after Florence Nightingale, a nurse and statistician famous for her realization that good data surrounding a disease is crucial to its treatment. Similarly, we identified that one way to help psychiatrists is to help them collect data more efficiently and effectively. After asking patients to complete a short questionnaire, we ask them to talk to a custom LLM that does not provide therapy but instead listens and asks probing questions aimed at shedding light on the specifics, context, and scope of the patient's mental health issues. Then, through Bayesian analysis, we correlate the responses into a set of possible mental health disorders that may be of interest and ship them all in one package to the doctor. This means, for example, that long interviews between the psychiatrist and the patient can be shortened, with the psychiatrist instead relying on summaries drawn from the transcript of the conversation between the custom LLM and the patient.
How It Works (Tech)
Free-text intake. The patient describes what's going on in their own words.
Broad screening (Stage 1). The AI selects a question to ask based on what the patient said. The system administers items that load across multiple psychiatric dimensions — the kind of questions that appear on every screening instrument because they're relevant to everything (sleep, energy, worry, concentration). Each response updates a Bayesian model over 8 psychiatric spectra simultaneously, with correlated conditions getting updated together. The system picks the most informative next question at each step using information-theoretic item selection and the AI chooses among the top candidates based on the conversation context. The AI also auto-scores items that the patient volunteers.
Targeted disambiguation (Stage 2). Once the broad shape is clear (e.g., "mostly internalizing, some compulsivity features"), the system switches to condition-specific items from specialized instruments (Y-BOCS for OCD, PCL-5 for PTSD, etc.) that disambiguate between conditions within flagged psychiatric spectra. Most patients need 20–30 total items to reach confident estimates across all relevant conditions.
Report and follow-up. The system generates a dimensional profile with calibrated uncertainty, flags conditions worth discussing with a provider, and notes what was and wasn't assessed. The patient can ask follow-up questions. The report is presented in two tabs:
- Tab 1: A list of conditions flagged with advice to take to a provider, with color-coded certainties (e.g., Depression: Likely).
- Tab 2: A hierarchical view of the model. The first layer shows the user's mean for each spectrum, described in magnitude (e.g., Internalizing: High); the layer below shows each condition (abbreviated, with tooltips) with its likelihood described in magnitude, including a grey "not screened" state.
How It Works (Psychiatry)
The short answer: Psychiatric conditions are often diagnosed by asking patients questions, and mental health conditions are highly clustered. We use a model that accounts for the strength of the relationship between each questionnaire item and each cluster (or "spectrum"), which lets us screen patients with fewer questions.
The long answer: see the Technical Model section below.
How we built it
Next.js + Tailwind CSS + shadcn/ui + Supabase (PostgreSQL + Auth). LLM integration built on the Anthropic platform.
Technical Model
Overview
A Bayesian model for adaptive psychiatric screening that uses validated questionnaire items to update beliefs about a patient's position across multiple psychiatric dimensions simultaneously.
The system administers items in two stages: broad screening items first (which load on many dimensions and have extensive published psychometric data), then narrow diagnostic items (which disambiguate between specific conditions the patient is most likely to have).
An LLM handles conversational administration of items, deduplication of overlapping items across instruments, and gap-filling for sparsely studied item loadings. The statistical backbone is deterministic linear algebra applied to parameters derived from published psychometric literature.
Model Definition
State Space
Prior: z ~ N(μ₀, Σ₀)
| Symbol | What it represents |
|---|---|
| z | Liability scores across all psychiatric dimensions — the thing we're trying to estimate |
| μ₀ | Prior mean liability for each dimension, encoding base rates for the target clinical population (not general population — psychiatric clinic base rates are much higher than community prevalence, and this matters enormously) |
| Σ₀ | Prior covariance matrix encoding all pairwise correlations between dimensions. Diagonal entries are prior uncertainty per dimension; off-diagonal entries are how correlated each pair of dimensions is |
| N(·,·) | Multivariate normal distribution — a probability cloud in n-dimensional space centered at μ₀ with shape and orientation determined by Σ₀ |
Σ₀ must be positive semidefinite — meaning all the pairwise correlations must be mutually consistent. If A correlates with B at 0.9 and B correlates with C at 0.9, then A must correlate with C fairly strongly too — you can't set it to 0. If correlations are assembled from separate studies they may violate this, so project to the nearest consistent matrix (a standard matrix operation).
Observation Model
yᵢ = hᵢᵀz + εᵢ, where εᵢ ~ N(0, σᵢ²)
| Symbol | What it represents |
|---|---|
| yᵢ | Observed response to item i (normalized to a common scale across instruments) |
| hᵢ | Loading vector for item i — how strongly item i indexes each dimension. Mostly zeros: a sleep disturbance item loads on internalizing and maybe somatoform, zeros elsewhere |
| hᵢᵀz | Dot product of loadings and true liabilities — the model's prediction of how this person would respond to item i if there were no noise. Each entry of hᵢ gets multiplied by the corresponding entry of z and summed |
| εᵢ | Random noise in the response |
| σᵢ² | Noise variance for item i. Floor is 1 − rᵢ where rᵢ is the item's test-retest reliability. Inflated upward (1.5–2×) to hedge against unmodeled residual correlations with previously administered items |
Bayesian Updates
Upon observing response yᵢ:
Kalman gain:
$$K = \frac{\Sigma h_i}{h_i^\top \Sigma h_i + \sigma_i^2}$$
| Symbol | What it represents |
|---|---|
| K | Kalman gain vector — how much to shift each dimension's estimate per unit of surprise |
| Σhᵢ (numerator) | Current uncertainty projected through the item's relevance profile. Large for dimensions you're uncertain about and the item loads on |
| hᵢᵀΣhᵢ + σᵢ² (denominator) | Total predicted variance of this item's response — how spread out you expect yᵢ to be given current uncertainty plus item noise. Normalizes the update so noisy items get downweighted |
Mean update:
$$\mu \leftarrow \mu + K(y_i - h_i^\top \mu)$$
| Symbol | What it represents |
|---|---|
| yᵢ − hᵢᵀμ | Surprise — difference between observed response and predicted response given current beliefs |
| K × surprise | The surprise distributed across all dimensions, proportional to both the item's loading profile and the current correlation structure. Correlated dimensions get dragged along |
Covariance update:
$$\Sigma \leftarrow \Sigma - K(h_i^\top \Sigma)$$
| Symbol | What it represents |
|---|---|
| K(hᵢᵀΣ) | The uncertainty removed by this observation. Largest for dimensions the item loads on heavily |
This covariance update doesn't depend on yᵢ — uncertainty shrinks the same amount regardless of the actual response. Only μ depends on the answer.
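The three update formulas above are the standard scalar-observation Kalman update and fit in a few lines. A minimal sketch (toy 2-dimensional example; the numbers are illustrative, not sourced parameters):

```python
import numpy as np

def bayes_update(mu, Sigma, h, y, sigma2):
    """One item-response update, matching the formulas above.
    mu: (n,) current mean; Sigma: (n,n) covariance;
    h: (n,) item loading vector; y: observed response; sigma2: item noise."""
    S = h @ Sigma @ h + sigma2                  # predicted response variance (denominator)
    K = Sigma @ h / S                           # Kalman gain
    mu_new = mu + K * (y - h @ mu)              # shift by gain x surprise
    Sigma_new = Sigma - np.outer(K, h @ Sigma)  # uncertainty removed by the observation
    return mu_new, Sigma_new

# Two correlated dimensions; the item loads only on dimension 0.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
h = np.array([1.0, 0.0])
mu, Sigma = bayes_update(mu, Sigma, h, y=1.5, sigma2=0.3)
# Dimension 1 moves too, dragged along by the 0.6 correlation.
```

Note that `Sigma_new` indeed never touches `y`, which is what makes the pre-administration item scoring in the next section possible.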
Two-Stage Item Selection
Stage 1: Broad Screening
Select item i that maximizes tr(Σ) − tr(Σ_new(i)).
| Symbol | What it represents |
|---|---|
| tr(Σ) | Trace — sum of all diagonal entries of Σ, i.e., total uncertainty across all dimensions |
| tr(Σ) − tr(Σ_new(i)) | Total uncertainty removed by item i. Computable before administering because the Σ update doesn't depend on the response |
Stage-one items have high loadings across multiple dimensions (high general psychopathology / p-factor loading). They are diagnostically nonspecific — they tell you that something is wrong and roughly where in the space of conditions it lives, but not precisely what. These items appear on many instruments and have the most extensive published cross-loading data.
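Because the covariance update ignores the response, the trace reduction for item i simplifies to ‖Σhᵢ‖² / (hᵢᵀΣhᵢ + σᵢ²), so the stage-one criterion can be scored in closed form before asking anything. A sketch, with a toy item format `(id, h, sigma2)` assumed for illustration:

```python
import numpy as np

def expected_trace_reduction(Sigma, h, sigma2):
    """tr(Sigma) - tr(Sigma_new) for a candidate item, computable in advance
    because the covariance update does not depend on the response."""
    Sh = Sigma @ h
    return (Sh @ Sh) / (h @ Sh + sigma2)

def pick_broad_item(Sigma, items):
    """items: iterable of (item_id, h, sigma2); returns the id that maximizes
    expected total variance reduction across all dimensions."""
    return max(items, key=lambda it: expected_trace_reduction(Sigma, it[1], it[2]))[0]

# With equal noise, a broad item loading on two dimensions beats a narrow one.
Sigma = np.eye(3)
items = [("broad", np.array([1.0, 1.0, 0.0]), 0.5),
         ("narrow", np.array([1.0, 0.0, 0.0]), 0.5)]
```

In the full system the AI then chooses among the top-scoring candidates based on conversational context, rather than always taking the argmax.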
Stage 2: Targeted Disambiguation
Select items whose hᵢ aligns with the eigenvectors of Σ corresponding to its largest eigenvalues.
| Symbol | What it represents |
|---|---|
| Eigenvalues of Σ | How much uncertainty exists along each independent axis of the remaining uncertainty ellipsoid. Large eigenvalue = a direction you're still very unsure about |
| Eigenvectors of Σ | The directions those axes point — e.g., "the OCD-vs-anxiety distinction" |
| "Aligns with" | The item's loading vector points in the same direction as your remaining uncertainty, so it maximally resolves it |
Stage-two items have narrow, strong loadings on one or two dimensions. They come from specialized instruments targeting specific conditions. They're the ones that disambiguate.
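One way to operationalize "aligns with" is to score each candidate item by the overlap of its (normalized) loading vector with the principal eigenvector of Σ. This is a simplified sketch of that idea (scoring against only the top eigenvector, with the same illustrative `(id, h, sigma2)` item format as above):

```python
import numpy as np

def pick_targeted_item(Sigma, items):
    """Pick the item whose loading direction best matches the largest
    remaining-uncertainty axis of Sigma."""
    vals, vecs = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
    top = vecs[:, -1]                       # direction of greatest uncertainty
    def score(item):
        _, h, _ = item
        return abs((h / np.linalg.norm(h)) @ top)
    return max(items, key=score)[0]

# Uncertainty concentrated on dimension 0: the item loading there wins.
Sigma = np.diag([2.0, 0.1])
items = [("ybocs_item", np.array([1.0, 0.0]), 0.3),
         ("phq_item", np.array([0.0, 1.0]), 0.3)]
```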
Stage Transition Criterion
Switch from stage one to stage two when the ratio of the largest to smallest eigenvalue of Σ exceeds a threshold.
- Ratio ≈ 1: equally uncertain about everything → keep asking broad questions
- Ratio >> 1: uncertainty concentrated in one direction → switch to targeted items
Or when broad-spectrum items' expected information gain drops below a floor.
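The eigenvalue-ratio trigger is effectively a condition-number check on Σ. A minimal sketch (the threshold value here is an illustrative placeholder, not a calibrated parameter):

```python
import numpy as np

def should_switch_to_stage_two(Sigma, ratio_threshold=4.0):
    """True when remaining uncertainty is concentrated along few directions,
    i.e., the largest-to-smallest eigenvalue ratio of Sigma exceeds the threshold."""
    vals = np.linalg.eigvalsh(Sigma)        # ascending order
    return vals[-1] / max(vals[0], 1e-12) > ratio_threshold
```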
Termination Criterion
Stop when all Σⱼⱼ (diagonal entries — per-dimension uncertainty) fall below a confidence threshold, or no remaining item offers meaningful variance reduction, or maximum items reached.
Output
Dimensional profile: μ (liability estimate per dimension) and diag(Σ) (remaining uncertainty per dimension).
Categorical diagnostic mapping:
$$P(z_j > \tau_j) = 1 - \Phi\left(\frac{\tau_j - \mu_j}{\sqrt{\Sigma_{jj}}}\right)$$
| Symbol | What it represents |
|---|---|
| τⱼ | Clinical threshold for dimension j (above which the condition is considered present) |
| μⱼ | Current liability estimate for dimension j |
| Σⱼⱼ | Remaining uncertainty for dimension j |
| (τⱼ − μⱼ) / √Σⱼⱼ | How many standard deviations the threshold sits above the current estimate |
| Φ | Cumulative normal distribution function — converts a z-score to a probability |
| 1 − Φ(·) | Probability that true liability exceeds the threshold, i.e., probability of meeting diagnostic criteria |
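The mapping needs only the standard normal CDF, which the Python standard library provides via `math.erf`:

```python
from math import erf, sqrt

def p_exceeds_threshold(mu_j, sigma_jj, tau_j):
    """P(z_j > tau_j) under the marginal posterior N(mu_j, sigma_jj):
    the probability of exceeding the clinical threshold for dimension j."""
    z = (tau_j - mu_j) / sqrt(sigma_jj)
    Phi = 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF
    return 1 - Phi

# Estimate one SD above the threshold with moderate remaining uncertainty:
# the condition is very likely present.
p = p_exceeds_threshold(mu_j=2.0, sigma_jj=0.25, tau_j=1.0)
```

These probabilities are what the report's magnitude labels (e.g., "Depression: Likely") are derived from.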
Parameter Sourcing
Correlation Matrix Σ₀
The most important parameter and the most available. Meta-analyses of comorbidity structure (especially those feeding into HiTOP validation) publish exactly this. The Ringwald et al. meta-analysis pooled 120,596 participants across 35 studies assessing 23 DSM diagnoses, estimated a meta-analytic correlation matrix, and found that five transdiagnostic dimensions fit well with published factor loadings for each diagnosis onto each spectrum. That correlation matrix and those loadings are the structural skeleton of Σ₀.
If modeling at the HiTOP spectrum level (6 dimensions), Σ₀ is a 6×6 matrix with 15 unique off-diagonal correlations — very manageable. If modeling at the condition level (say 15 conditions), it's a 15×15 matrix with 105 correlations, reconstructable from the published spectrum-level structure plus within-spectrum comorbidity data.
Base Rates μ₀
From epidemiological literature, adjusted for the target population. Community prevalence rates are the wrong prior for a screening tool — psychiatric clinic base rates are much higher and shift the entire model. Published survey data (e.g., SAMHSA national surveys) provide condition-level prevalence for various populations.
Loading Vectors h
Assembled by chaining two types of published data:
Within-instrument item-level factor loadings. Published for essentially every major instrument (PHQ-9, GAD-7, PCL-5, Y-BOCS, AUDIT, CAPE, etc.) across dozens of studies.
Instrument-to-dimension mappings. Published at the scale level in HiTOP alignment studies (e.g., Wendt et al. 2023 on mapping established scales onto HiTOP).
Chaining: if PHQ-9 item 1 ("little interest or pleasure") loads 0.8 on anhedonia within the PHQ-9, and the PHQ-9 depression construct loads 0.85 on the internalizing spectrum, multiply through for a crude but defensible item-level loading onto internalizing.
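The chaining step is just a product of the two published loadings, using the illustrative numbers from the example above:

```python
# PHQ-9 item 1 ("little interest or pleasure") -> PHQ-9 depression factor
item_on_construct = 0.80
# PHQ-9 depression factor -> internalizing spectrum
construct_on_spectrum = 0.85
# Crude but defensible item-level loading onto internalizing
item_on_internalizing = item_on_construct * construct_on_spectrum  # 0.68
```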
Most entries in each h vector are zero by construction. A Y-BOCS ritual item has no meaningful loading on thought disorder or antagonistic externalizing. You only need nonzero values for the 1–3 spectra each item plausibly touches. The sparsity drastically reduces the number of parameters to estimate.
Loading vectors can also be weighted by how commonly each questionnaire is used for each condition as a proxy for clinical validity. This captures the fact that the PHQ-9 is the primary instrument for depression (high weight) but is sometimes used to flag anxiety (lower weight) and is never used for psychosis (zero weight).
Noise Variances σᵢ²
The easiest parameters. Item-level test-retest reliability is published for all major instruments. Set σᵢ² = 1 − rᵢ as a floor, then inflate by a factor (1.5–2×) to hedge against unmodeled residual correlations between items.
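As a one-liner, using an illustrative inflation factor from the middle of the stated 1.5-2x range:

```python
def item_noise_variance(test_retest_r, inflation=1.75):
    """sigma_i^2 built from published test-retest reliability r_i:
    floor of (1 - r_i), inflated to hedge against unmodeled residual
    correlations with previously administered items."""
    return inflation * (1 - test_retest_r)
```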
Open Data Sources
- HiTOP-DAT: A 405-item battery across existing instruments covering most HiTOP traits and components, with community normative data. All instruments freely available with author permission.
- HiTOP-friendly measures: A curated list of instruments consistent with the HiTOP framework, many free to use.
- Published meta-analyses: The Ringwald et al. meta-analysis provides the factor structure; individual instrument validation studies provide item-level parameters.
- Bifactor studies: PHQ-9 × GAD-7 bifactor analyses (co-administered in thousands of studies) provide the cross-loading data needed for stage-one items.
- Many, many more.
Two-Stage Logic: Why It Works
Stage One: Broad Items First
Items like sleep disturbance, concentration problems, fatigue, and persistent worry are common across instruments because they load onto multiple dimensions. They're diagnostically promiscuous. Administering them first gives maximum information per question about the broad shape of someone's profile — is this mostly internalizing? Is there externalizing involvement? Thought disorder?
These items also have the best-studied cross-loadings precisely because they appear everywhere and have been factor-analyzed extensively. The cross-loading data problem is smallest for items where it matters most.
Stage Two: Narrow Items to Disambiguate
Once the broad shape is established (e.g., "mostly internalizing, probably not externalizing or thought disorder"), uncertainty is concentrated rather than diffuse. Specific items — compulsive rituals, flashback intrusions, grandiosity, voice-hearing — have narrow, strong loadings on one or two dimensions and barely touch anything else. Each one has high leverage because it's targeted at exactly where the remaining ambiguity is.
For these items, you don't need cross-loading data because they essentially load on one thing. Y-BOCS ritual items don't need a schizophrenia loading; you just set it to zero.
The Correlation Web Reduces Total Questions Needed
You never directly ask about most conditions — you infer them through the correlation structure. A few well-chosen broad questions plus a few well-chosen specific ones triangulate a position in a high-dimensional space.
Double-Counting Protection
The primary defense against double-counting is architectural: the item selection step naturally avoids redundancy. After administering several internalizing items, uncertainty about internalizing is already low. Another internalizing item's expected variance reduction is therefore low — the system won't select it because it wouldn't learn much. It will instead pick something targeting whichever dimension has the highest remaining uncertainty.
For content-overlapping items across instruments (e.g., a sleep item on both PHQ-9 and another instrument), the LLM flags duplicates before administration, and overlapping items are either merged or their σ² is inflated to reflect the shared content.
What Makes It Different
Unified cross-diagnostic model. Every item response updates every spectrum or condition simultaneously through empirical correlation structure. A high score on a depression item also partially updates anxiety and bipolar estimates — because those conditions co-occur in predictable ways.
Two-stage design that matches data availability to model complexity. Broad items have extensively published cross-loadings across spectra — so we use them first, where the data is strong. Narrow items load on one condition — so we use them second, where cross-loading data isn't needed. This sidesteps the fundamental problem of needing cross-instrument item correlations that don't exist in the literature.
AI assistant. The AI handles follow-up clarification and auto-scores symptoms mentioned in free text. The statistical model does 90% of the decision-making; the AI handles the 10% that requires conversational judgment.
Open and inspectable. Every parameter in the model (correlation matrices, item loadings, base rates, noise variances) is sourced from published psychometric literature with full provenance tracking. The model is open-source and can be audited.
Core Innovation
The key insight is that the psychometric data needed to build this system already exists — scattered across thousands of published validation studies, meta-analyses, and the HiTOP consortium's work on the empirical structure of psychopathology. What didn't exist was the architecture to stitch it together into a unified adaptive screening model, or the interface technology (LLMs) to administer it conversationally.
Challenges we ran into
Tuning the LLM to work alongside the Bayesian analysis in suggesting related mental health disorders was particularly difficult, as it tested not only our technical fluency but also our judgment about where statistical models properly apply.
Accomplishments that we're proud of
Every member of the team would happily use this the next time they seek mental health services, because it addresses the painful delays that all healthcare systems, especially Canada's, face. We are also pleased with NoR's potential in the real system: not only does it work well, but because it assists clinicians rather than diagnosing patients, integrating it into the existing system poses virtually no legal hurdles.
What we learned
We learned that mental health is a complex problem with many intertwining factors. With the proliferation of LLMs, it's tempting to land on one of the polarizing extremes regarding their use in mental health diagnosis. Instead, we took a middle stance that balances the limitations of LLMs against their benefits: the LLM's purpose is not to diagnose, but to gather and analyze large amounts of data and, to an extent, stand in for the psychiatrist's initial interview by conducting it virtually.
What's next for Nightingale on Rails
We plan to integrate NoR into existing systems, replacing today's paper-based questionnaires and reducing the need for long intake interviews.
AI Use
Almost all of our code was generated by AI. Thanks, Claude!
Code
GitHub Repository: https://github.com/leoduan0/genai-genesis-2026
See the repository for setup instructions and configuration options.
Built With
- claude
- next.js
- postgresql
- supabase
- tailwind