Fast robust estimation of location and scale in very small samples.
Classical estimators of location (mean) and scale (standard deviation) can be destroyed by a single outlier when the sample size is small. In experimental sciences---analytical chemistry, materials testing, clinical trials---measurements are often repeated only 3 to 8 times per condition, making this vulnerability a practical concern rather than a theoretical one.
Rousseeuw & Verboven (2002) developed M-estimators using a logistic psi function that remain well-behaved at these sample sizes. Their estimators decouple location and scale estimation to avoid the positive-feedback instability of Huber's Proposal 2, use the MAD as a fixed auxiliary scale, and fall back gracefully when the sample is too small even for iteration. The revss package by Avraham Adler provides a faithful pure-R implementation of these estimators.
robscale is a C++17/Rcpp reimplementation of the same three estimators---adm, robLoc, and robScale---with an API-compatible interface. It produces numerically identical results to revss (verified across 5,400 systematic comparisons) while running 6--33x faster, with the largest gains in the target regime of very small samples.
Beyond compiled execution, robscale departs from revss in three algorithmic choices: it replaces the scoring fixed-point iteration for location with true Newton--Raphson, it exploits the algebraic identity $2\,\mathrm{plogis}(u) - 1 = \tanh(u/2)$ to evaluate the psi function with a single vectorizable `tanh` call, and it uses branchless sorting networks ($n \leq 8$) and `std::nth_element` (beyond) for median computation instead of a full sort.
```r
install.packages("robscale")

# Development version:
# install.packages("remotes")
# remotes::install_github("davdittrich/robscale")
```

```r
library(robscale)

x <- c(2.0, 3.1, 2.7, 2.9, 3.3)  # clean measurements
mean(x)      # 2.8000
sd(x)        # 0.5000
robLoc(x)    # 2.8471
robScale(x)  # 0.3837

x[5] <- 100  # recording error
mean(x)      # 22.1400 -- destroyed
sd(x)        # 43.5270 -- destroyed
robLoc(x)    # 2.9184  -- stable
robScale(x)  # 0.4729  -- stable
```

The classical estimators are pulled toward the outlier. The robust estimators barely move.
| Function | Purpose | Key arguments |
|---|---|---|
| `adm(x)` | Average distance to the median | `center`, `constant` |
| `robLoc(x)` | Robust M-estimate of location | `scale` |
| `robScale(x)` | Robust M-estimate of scale | `loc`, `implbound` |

All three functions accept `na.rm` (default `FALSE`).
Computes the mean absolute deviation from the median, scaled by a consistency constant for asymptotic normality under the Gaussian model:

$$\text{ADM}(x) = \sqrt{\pi/2} \cdot \frac{1}{n} \sum_{i=1}^{n} \left| x_i - \text{med}(x) \right|$$

When `center` is supplied, it replaces the median.
The ADM is not itself robust against outliers (explosion breakdown: a single large outlier inflates it without bound), but it cannot implode to zero on nearly-tied data, which is why robScale uses it as an implosion fallback.
```r
adm(c(1, 2, 3, 5, 7, 8))
adm(c(1, 2, 3, 5, 7, 8), constant = 1)  # without consistency correction
```

M-estimator for location using Newton--Raphson iteration with the logistic psi function (Rousseeuw & Verboven 2002, Eq. 21). Starting value: the median; auxiliary scale: the MAD (or a user-supplied `scale`). See Algorithmic innovations for the iteration formula.
Fallback logic: When `scale` is unknown and $n < 4$, or when `scale` is known and $n < 3$, the function returns `median(x)` without iteration (likewise when the auxiliary scale is zero).

Providing a known scale lowers the minimum sample size from 4 to 3 because the MAD (which is unreliable at such tiny samples) no longer has to be estimated from the data.
```r
robLoc(c(1, 2, 3, 5, 7, 8))
robLoc(c(1, 2, 3), scale = 1.5)  # known scale enables n = 3
```

```mermaid
flowchart TD
  A[Set minobs: 3 if scale known, 4 if unknown] --> B[med = median x]
  B --> C{n < minobs?}
  C -- Yes --> D([Return med])
  C -- No --> E{Scale known?}
  E -- Yes --> F[s = scale]
  E -- No --> G[s = MAD x]
  F --> H{s = 0?}
  G --> H
  H -- Yes --> I([Return med])
  H -- No --> J[t = med]
  J --> K["psi_i = tanh( (x_i - t) / 2s )"]
  K --> L["sum_psi = Sum psi_i, sum_dpsi = Sum (1 - psi_i^2)"]
  L --> M["v = 2s * sum_psi / sum_dpsi; t += v"]
  M --> N{"abs(v) <= tol?"}
  N -- No --> K
  N -- Yes --> O([Return t])
```
```r
robScale(x, loc = NULL, implbound = 1e-4, na.rm = FALSE, maxit = 80L, tol = sqrt(.Machine$double.eps))
```
M-estimator for scale using multiplicative iteration with the rho function---the square of the logistic psi (Rousseeuw & Verboven 2002, Eq. 27):

$$s_{k+1} = s_k \cdot \sqrt{\frac{2}{n} \sum_{i=1}^{n} \tanh^2\!\left(\frac{x_i - t}{2 c\, s_k}\right)},$$

where $t$ is the fixed auxiliary location and $c$ is the tuning constant (see Key constants); the iteration stops when the update factor is within `tol` of 1.

Fallback logic: When $n$ is below the minimum sample size:

- If $\text{MAD}(x) \leq$ `implbound`: return `adm(x)` (implosion fallback)
- Otherwise: return `MAD(x)`
Providing a known loc centers the data at that value and uses the
median-distance-to-zero ($1.4826 \cdot \text{median}(|x_i - \mu|)$) as the
initial scale, lowering the minimum sample size from 4 to 3. The flowchart
below shows the full control flow including the known/unknown branching,
implosion fallback, and multiplicative iteration loop.
```r
robScale(c(1, 2, 3, 5, 7, 8))
robScale(c(1, 2, 3, 5, 7, 8), loc = 5)  # known location
```

```mermaid
flowchart TD
  A{Location known?}
  A -- Yes --> B1[w = x - loc]
  B1 --> B2["s = 1.4826 * median( abs(w) )"]
  B2 --> B3[t = 0, minobs = 3]
  A -- No --> C1[med = median x]
  C1 --> C2[s = MAD x, t = med]
  C2 --> C3[minobs = 4]
  B3 --> D{n < minobs?}
  C3 --> D
  D -- Yes --> E{"MAD <= implbound?"}
  E -- Yes --> F([Return ADM x])
  E -- No --> G([Return MAD x])
  D -- No --> H{s = 0?}
  H -- Yes --> I([Return 0])
  H -- No --> J["psi_i = tanh( (x_i - t) / (2cs) )"]
  J --> K["sum_rho = Sum psi_i^2"]
  K --> L["v = sqrt( 2 * sum_rho / n ), s *= v"]
  L --> M{"abs(v - 1) <= tol?"}
  M -- No --> J
  M -- Yes --> N([Return s])
```
Both robscale and revss implement the estimators of Rousseeuw & Verboven
(2002). They solve the same estimating equations and produce the same numerical
results. The two packages differ in how the computation is carried out. This
section describes the algorithmic choices in robscale that diverge from the
revss implementation, and their performance consequences.
The logistic psi function central to both estimators is

$$\psi(x) = \frac{e^x - 1}{e^x + 1} = \tanh(x/2).$$
revss evaluates this as 2 * plogis(u) - 1, which calls R's plogis
(the logistic CDF, $1/(1 + e^{-x})$). The computation requires one call to
exp() followed by two arithmetic operations, plus the overhead of R's
vectorised dispatch, intermediate vector allocation, and garbage-collection
pressure.
The algebraic identity

$$2\,\mathrm{plogis}(u) - 1 = \frac{2}{1 + e^{-u}} - 1 = \frac{e^u - 1}{e^u + 1} = \tanh(u/2)$$

is immediate from the definition of the hyperbolic tangent. robscale exploits this identity to reduce the psi evaluation to a single `tanh` call. This is not merely a cosmetic rewrite:
- **Branch elimination.** A direct implementation of $(e^x - 1)/(e^x + 1)$ overflows for large $|x|$, requiring a sign-based branch to keep intermediate values bounded. The `tanh` function handles this internally with a single code path.
- **Platform vectorization.** Standard library `tanh` implementations are available in vectorized form on major platforms. On macOS, robscale calls Apple Accelerate's `vvtanh`, which processes the entire argument array in a single SIMD pass. On x86_64 Linux with GCC or Clang, `#pragma omp simd` hints allow the compiler to generate SIMD-vectorized `tanh` without requiring `-ffast-math` or the OpenMP runtime library. On Windows, the scalar `std::tanh` fallback still benefits from branch elimination.
revss iterates the location estimator using the scoring fixed-point iteration (Rousseeuw & Verboven 2002, Eq. 21):

$$t_{k+1} = t_k + \frac{s}{n\,b} \sum_{i=1}^{n} \psi\!\left(\frac{x_i - t_k}{s}\right),$$

where $s$ is the fixed auxiliary scale and $b$ is the expected derivative of $\psi$ under the Gaussian model (see Key constants), used in place of the sample derivative.
robscale instead applies Newton--Raphson to the estimating equation

$$g(t) = \sum_{i=1}^{n} \tanh\!\left(\frac{x_i - t}{2s}\right) = 0,$$

where the exact derivative $g'(t) = -\frac{1}{2s}\sum_i \left(1 - \psi_i^2\right)$, with $\psi_i = \tanh\!\left(\frac{x_i - t}{2s}\right)$, yields the update $t_{k+1} = t_k + 2s \cdot \sum_i \psi_i \,/\, \sum_i (1 - \psi_i^2)$.
Newton--Raphson achieves quadratic convergence near the solution: the number of correct digits approximately doubles per iteration. The practical effect is a reduction from 4--8 iterations (scoring) to 3 iterations (Newton--Raphson) for reaching the same tolerance of $\sqrt{\epsilon_{\text{mach}}} \approx 1.49 \times 10^{-8}$:
| $n$ | Scoring iterations | Newton--Raphson iterations |
|---|---|---|
| 4 | 7 | 3 |
| 5 | 8 | 3 |
| 8 | 7 | 3 |
| 20 | 6 | 3 |
| 100 | 5 | 3 |
At small $n$, where each iteration is a meaningful fraction of total runtime, this reduction contributes directly to the observed small-sample speedups.
Both the median and the MAD require computing a quantile---the median of the data, and the median of the absolute deviations. revss uses R's `median()` and `mad()`, which call `sort()` internally: an $O(n \log n)$ full sort to extract a single order statistic.

robscale uses a tiered selection strategy. For $n \leq 8$, it uses branchless sorting networks---fixed compare-exchange sequences that compile to conditional moves at `-O2`---with the minimum number of comparisons for each $n$:
| $n$ | Comparators |
|---|---|
| 3 | 3 |
| 4 | 5 |
| 5 | 9 |
| 6 | 12 |
| 7 | 16 |
| 8 | 19 |
For $n > 8$, robscale uses `std::nth_element` (introselect), which provides expected linear-time selection without sorting the whole array.
Each estimator requires working arrays: a copy of the input (for destructive
selection), absolute deviations (for MAD), and a temporary buffer (for bulk
tanh arguments). revss allocates these as R vectors, incurring R's SEXPREC
header overhead and adding garbage-collection pressure.
robscale uses a stack-allocated arena of 512 doubles per segment. For inputs that fit, no heap allocation occurs at all; larger inputs fall back to `new[]`/`delete[]`.
The constants are `constexpr`, allowing the compiler to
replace divisions in the iteration loop with multiplications. On ARM64, this
avoids the ~10-cycle fdiv instruction in favour of a ~3-cycle fmul.
Values that are constant across iterations---inv_s = 1.0 / s,
half_inv_s = 0.5 / s, inv_n = 1.0 / n---are computed once before the loop.
The revss implementation recomputes (x - t) / s as a fresh R vector each
iteration, traversing the interpreter for every vectorised operation.
Platform: R 4.5.2, Apple clang 17.0.0, macOS (Darwin 25.3.0), Apple Silicon (aarch64). Median of 10,000 microbenchmark iterations per cell; times in microseconds ($\mu$s).
| $n$ | Function | revss ($\mu$s) | robscale ($\mu$s) | Speedup |
|---|---|---|---|---|
| 3 | adm | 8.86 | 0.86 | 10x |
| 3 | robLoc | 21.77 | 1.56 | 14x |
| 3 | robScale | 48.50 | 1.68 | 29x |
| 4 | adm | 10.37 | 0.86 | 12x |
| 4 | robLoc | 56.05 | 1.72 | 33x |
| 4 | robScale | 93.28 | 3.03 | 31x |
| 5 | adm | 8.53 | 0.82 | 10x |
| 5 | robLoc | 43.95 | 1.64 | 27x |
| 5 | robScale | 95.94 | 3.20 | 30x |
| 8 | adm | 10.33 | 0.86 | 12x |
| 8 | robLoc | 36.24 | 1.68 | 22x |
| 8 | robScale | 79.62 | 2.87 | 28x |
| 20 | adm | 10.58 | 0.90 | 12x |
| 20 | robLoc | 37.56 | 1.80 | 21x |
| 20 | robScale | 90.53 | 3.94 | 23x |
| 100 | adm | 11.07 | 1.03 | 11x |
| 100 | robLoc | 53.18 | 2.58 | 21x |
| 100 | robScale | 113.41 | 8.61 | 13x |
| 500 | adm | 14.74 | 1.76 | 8x |
| 500 | robLoc | 103.14 | 6.19 | 17x |
| 500 | robScale | 288.64 | 35.10 | 8x |
| 1000 | adm | 18.16 | 3.16 | 6x |
| 1000 | robLoc | 163.14 | 10.82 | 15x |
| 1000 | robScale | 484.78 | 65.56 | 7x |
Interpretation. Speedups are largest in the target regime of very small $n$, where the robLoc speedup exceeds what compiled execution alone would predict because the Newton--Raphson iteration converges in 3 steps where the scoring iteration requires 4--8 (see Newton--Raphson iteration). The adm function has no iteration loop, so its speedup (6--12x) reflects only the compiled loop and median selection. The robLoc and robScale advantages narrow at large $n$, where vectorised tanh vs interpreted plogis() becomes the dominant factor and both implementations scale as $O(n)$.
The revss timings include R function-call dispatch, vectorised plogis()
allocation, and garbage-collection pressure. robscale eliminates all three.
The test suite (inst/tinytest/test_cross_check.R) compares robscale and
revss across 5,400 randomly generated inputs:
- Sample sizes $n = 3, 4, \ldots, 20$
- 100 replicates per $n$ (uniform on $[-100, 100]$, seed 42)
- All three functions: `adm`, `robLoc`, `robScale`
- Tolerance: $\sqrt{\epsilon_{\text{mach}}} \approx 1.49 \times 10^{-8}$
All 5,400 comparisons pass. The Newton--Raphson iteration converges to the same fixed point as the scoring iteration---it solves the same estimating equation---so the results differ only by rounding at the level of the convergence tolerance.
The estimators follow Rousseeuw & Verboven (2002). A brief summary of the key definitions:
Logistic psi function (Eq. 23):

$$\psi(x) = \frac{e^x - 1}{e^x + 1} = \tanh(x/2)$$

Bounded in $(-1, 1)$, which caps the influence any single observation can exert.
Decoupled estimation. Location and scale are estimated separately with a fixed auxiliary estimate, breaking the positive-feedback loop of Huber's Proposal 2. robLoc fixes the scale at the MAD (or a user-supplied `scale`); robScale fixes the location at the median (or a user-supplied `loc`).
Rho function for scale (Eq. 26):

$$\rho(x) = \psi\!\left(\frac{x}{c}\right)^{2} = \tanh^2\!\left(\frac{x}{2c}\right),$$

where $c$ is the tuning constant that makes the scale estimator consistent at the Gaussian model.
Key constants (full double precision):

| Symbol | Value | Definition |
|---|---|---|
| $b$ | 0.413241928283814 | $E[\psi'(Z)]$, $Z \sim N(0,1)$ (scoring normalisation) |
| $c$ | 0.37394112142347236 | Solution to $E\!\left[\tanh^2\!\left(Z/(2c)\right)\right] = \tfrac{1}{2}$ (scale consistency) |
| $\sqrt{\pi/2}$ | 1.2533141373155001 | ADM consistency constant |
| $1/\Phi^{-1}(3/4)$ | 1.4826 | MAD consistency constant |
This package reimplements the algorithms from the
revss package by Avraham Adler.
The API is intentionally identical: adm(), robLoc(), and robScale() accept
the same arguments and return the same values. Code that uses revss can switch
to robscale by changing only the library() call.
Users who do not need compiled performance---or who prefer a dependency-free
pure-R package---should use revss directly. It is mature, well-tested, and
available on CRAN.
Rousseeuw, P.J. and Verboven, S. (2002). Robust estimation in very small samples. Computational Statistics & Data Analysis, 40(4), 741--758. doi:10.1016/S0167-9473(02)00078-6
Floyd, R.W. and Rivest, R.L. (1975). Expected time bounds for selection. Communications of the ACM, 18(3), 165--172. doi:10.1145/360680.360691
Nair, K.R. (1947). A Note on the Mean Deviation from the Median. Biometrika, 34(3/4), 360--362. doi:10.2307/2332448
Dennis Alexis Valin Dittrich (ORCID)
MIT. Copyright 2026 Dennis Alexis Valin Dittrich.