robscale


Fast robust estimation of location and scale in very small samples.

Overview

Classical estimators of location (mean) and scale (standard deviation) can be destroyed by a single outlier when the sample size is small. In experimental sciences---analytical chemistry, materials testing, clinical trials---measurements are often repeated only 3 to 8 times per condition, making this vulnerability a practical concern rather than a theoretical one.

Rousseeuw & Verboven (2002) developed M-estimators using a logistic psi function that remain well-behaved at these sample sizes. Their estimators decouple location and scale estimation to avoid the positive-feedback instability of Huber's Proposal 2, use the MAD as a fixed auxiliary scale, and fall back gracefully when the sample is too small even for iteration. The revss package by Avraham Adler provides a faithful pure-R implementation of these estimators.

robscale is a C++17/Rcpp reimplementation of the same three estimators---adm, robLoc, and robScale---with an API-compatible interface. It produces numerically identical results to revss (verified across 5,400 systematic comparisons) while running 6--33x faster, with the largest gains in the target regime of $n \leq 20$.

Beyond compiled execution, robscale departs from revss in three algorithmic choices: it replaces the scoring fixed-point iteration for location with true Newton--Raphson, it exploits the algebraic identity $\psi_{\log}(x) = \tanh(x/2)$ to enable platform-vectorized transcendental evaluation, and it uses $O(n)$ selection (sorting networks for $n \leq 8$, std::nth_element beyond) for median computation instead of $O(n \log n)$ sorting.

Installation

install.packages("robscale")

# Development version:
# install.packages("remotes")
# remotes::install_github("davdittrich/robscale")

Motivating example

library(robscale)

x <- c(2.0, 3.1, 2.7, 2.9, 3.3)   # clean measurements

mean(x)                              # 2.8000
sd(x)                                # 0.5000
robLoc(x)                            # 2.8471
robScale(x)                          # 0.3837

x[5] <- 100                         # recording error

mean(x)                              # 22.1400  -- destroyed
sd(x)                                # 43.5270  -- destroyed
robLoc(x)                            # 2.9184   -- stable
robScale(x)                          # 0.4729   -- stable

The classical estimators are pulled toward the outlier. The robust estimators barely move.

API reference

| Function | Purpose | Key arguments |
|---|---|---|
| adm(x) | Average distance to the median | center, constant |
| robLoc(x) | Robust M-estimate of location | scale |
| robScale(x) | Robust M-estimate of scale | loc, implbound |

All three functions accept na.rm (default FALSE).

adm(x, center, constant = 1.2533141373155001, na.rm = FALSE)

Computes the mean absolute deviation from the median, scaled by a consistency constant for asymptotic normality under the Gaussian model:

$$\text{ADM}(x) = C \cdot \frac{1}{n}\sum_{i=1}^{n} |x_i - \text{med}(x)|$$

where $C = \sqrt{\pi/2} \approx 1.2533$ (Nair, 1947). When center is supplied, it replaces the median.

The ADM is not itself robust against outliers (explosion breakdown $1/n$), but it is highly resistant to implosion (implosion breakdown $(n{-}1)/n$). It serves as the fallback scale estimator when the MAD collapses to zero.

adm(c(1, 2, 3, 5, 7, 8))
adm(c(1, 2, 3, 5, 7, 8), constant = 1)   # without consistency correction
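For illustration, the computation can be sketched in standalone C++ (a hypothetical adm_sketch function, simplified relative to the package internals, which use arena buffers and sorting networks):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of ADM(x) = C * mean(|x_i - median(x)|), with
// C = sqrt(pi/2) for consistency under the Gaussian model.
double adm_sketch(std::vector<double> x,
                  double constant = 1.2533141373155001) {
    const std::size_t n = x.size();
    // Median via nth_element: O(n) selection, destructive on the copy.
    std::nth_element(x.begin(), x.begin() + n / 2, x.end());
    double med = x[n / 2];
    if (n % 2 == 0) {
        // Lower middle element is the max of the left partition.
        double lo = *std::max_element(x.begin(), x.begin() + n / 2);
        med = 0.5 * (lo + med);
    }
    double sum = 0.0;
    for (double xi : x) sum += std::fabs(xi - med);
    return constant * sum / static_cast<double>(n);
}
```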

robLoc(x, scale = NULL, na.rm = FALSE, maxit = 80L, tol = sqrt(.Machine$double.eps))

M-estimator for location using Newton--Raphson iteration with the logistic psi function (Rousseeuw & Verboven 2002, Eq. 21). Starting value: $T^{(0)} = \text{median}(x)$. Auxiliary scale: $S = \text{MAD}(x)$ (or the user-supplied scale). See Algorithmic innovations for the iteration formula.

Fallback logic: When scale is unknown and $n < 4$, or when scale is known and $n < 3$, the function returns median(x) without iteration. Providing a known scale lowers the minimum sample size from 4 to 3 because the MAD (which is unreliable at $n = 3$) is no longer needed. The flowchart below shows the full control flow including fallbacks and the Newton--Raphson loop.

robLoc(c(1, 2, 3, 5, 7, 8))
robLoc(c(1, 2, 3), scale = 1.5)   # known scale enables n = 3
flowchart TD
    A[Set minobs: 3 if scale known, 4 if unknown] --> B[med = median x]
    B --> C{n < minobs?}
    C -- Yes --> D([Return med])
    C -- No --> E{Scale known?}
    E -- Yes --> F[s = scale]
    E -- No --> G[s = MAD x]
    F --> H{s = 0?}
    G --> H
    H -- Yes --> I([Return med])
    H -- No --> J[t = med]
    J --> K["psi_i = tanh( (x_i - t) / 2s )"]
    K --> L["sum_psi = Sum psi_i, sum_dpsi = Sum (1 - psi_i^2)"]
    L --> M["t += 2s * sum_psi / sum_dpsi"]
    M --> N{"abs(v) <= tol?"}
    N -- No --> K
    N -- Yes --> O([Return t])

robScale(x, loc = NULL, implbound = 1e-4, na.rm = FALSE, maxit = 80L, tol = sqrt(.Machine$double.eps))

M-estimator for scale using multiplicative iteration with the rho function---the square of the logistic psi (Rousseeuw & Verboven 2002, Eq. 27):

$$S^{(k)} = S^{(k-1)} \cdot \sqrt{2 \cdot \frac{1}{n}\sum \psi_{\log}^2\!\left(\frac{x_i - T}{c \cdot S^{(k-1)}}\right)}$$

where $c = 0.37394112142347236$ and $T = \text{median}(x)$ is held fixed. Starting value: $S^{(0)} = \text{MAD}(x)$.

Fallback logic: When $n$ is below the minimum for iteration (4 for unknown location, 3 for known), the function checks for MAD implosion:

  • If $\text{MAD}(x) \leq$ implbound: return adm(x) (implosion fallback)
  • Otherwise: return MAD(x)

Providing a known loc centers the data at that value and uses $1.4826 \cdot \text{median}(|x_i - \mu|)$ (the MAD computed about the known location) as the initial scale, lowering the minimum sample size from 4 to 3. The flowchart below shows the full control flow including the known/unknown branching, implosion fallback, and multiplicative iteration loop.

robScale(c(1, 2, 3, 5, 7, 8))
robScale(c(1, 2, 3, 5, 7, 8), loc = 5)   # known location
flowchart TD
    A{Location known?}
    A -- Yes --> B1[w = x - loc]
    B1 --> B2["s = 1.4826 * median( abs(w) )"]
    B2 --> B3[t = 0, minobs = 3]
    A -- No --> C1[med = median x]
    C1 --> C2[s = MAD x, t = med]
    C2 --> C3[minobs = 4]
    B3 --> D{n < minobs?}
    C3 --> D
    D -- Yes --> E{"MAD <= implbound?"}
    E -- Yes --> F([Return ADM x])
    E -- No --> G([Return MAD x])
    D -- No --> H{s = 0?}
    H -- Yes --> I([Return 0])
    H -- No --> J["psi_i = tanh( (x_i - t) / (2cs) )"]
    J --> K["sum_rho = Sum psi_i^2"]
    K --> L["v = sqrt( 2 * sum_rho / n ), s *= v"]
    L --> M{"abs(v - 1) <= tol?"}
    M -- No --> J
    M -- Yes --> N([Return s])
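The multiplicative iteration at the heart of robScale can be sketched in standalone C++ (a simplified hypothetical rob_scale_iter; the caller supplies the location and starting scale, whereas the package handles MAD computation and fallbacks internally):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Multiplicative scale iteration (Rousseeuw & Verboven 2002, Eq. 27):
// s <- s * sqrt(2/n * sum(psi^2((x_i - t) / (c * s)))), with
// psi(u) = tanh(u/2) and c the 50%-breakdown rho constant.
double rob_scale_iter(const std::vector<double>& x, double t, double s0,
                      double tol = 1.49e-8, int maxit = 80) {
    const double c = 0.37394112142347236;
    const std::size_t n = x.size();
    double s = s0;
    for (int k = 0; k < maxit; ++k) {
        const double half_inv_cs = 0.5 / (c * s);  // hoisted invariant
        double sum_rho = 0.0;
        for (double xi : x) {
            double psi = std::tanh((xi - t) * half_inv_cs);
            sum_rho += psi * psi;
        }
        double v = std::sqrt(2.0 * sum_rho / static_cast<double>(n));
        s *= v;
        if (std::fabs(v - 1.0) <= tol) break;  // fixed point reached
    }
    return s;
}
```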

Algorithmic innovations

Both robscale and revss implement the estimators of Rousseeuw & Verboven (2002). They solve the same estimating equations and produce the same numerical results. The two packages differ in how the computation is carried out. This section describes the algorithmic choices in robscale that diverge from the revss implementation, and their performance consequences.

1. The tanh identity for the logistic psi function

The logistic psi function central to both estimators is:

$$\psi_{\log}(x) = \frac{e^x - 1}{e^x + 1}$$

revss evaluates this as 2 * plogis(u) - 1, which calls R's plogis (the logistic CDF, $1/(1 + e^{-x})$). The computation requires one call to exp() followed by two arithmetic operations, plus the overhead of R's vectorised dispatch, intermediate vector allocation, and garbage-collection pressure.

The algebraic identity

$$\psi_{\log}(x) = \tanh(x/2)$$

is immediate from the definition of the hyperbolic tangent. robscale exploits this identity to reduce $\psi_{\log}$ to a single tanh call. This is not merely a cosmetic rewrite:

  • Branch elimination. A direct implementation of $(e^x - 1)/(e^x + 1)$ overflows for large $|x|$, requiring a sign-based branch to keep intermediate values bounded. The tanh function handles this internally with a single code path.

  • Platform vectorization. Standard library tanh implementations are available in vectorized form on major platforms. On macOS, robscale calls Apple Accelerate's vvtanh, which processes the entire argument array in a single SIMD pass. On x86_64 Linux with GCC or Clang, #pragma omp simd hints allow the compiler to generate SIMD-vectorized tanh without requiring -ffast-math or the OpenMP runtime library. On Windows, the scalar std::tanh fallback still benefits from branch elimination.
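The branch-elimination point can be made concrete with two equivalent scalar implementations (illustrative only; the package additionally dispatches to vectorized tanh where available):

```cpp
#include <cmath>

// Naive evaluation of psi_log(x) = (e^x - 1)/(e^x + 1).
// exp(x) overflows for large x, so a sign branch keeps intermediates bounded.
double psi_naive(double x) {
    if (x > 0) {
        double e = std::exp(-x);        // bounded in (0, 1]
        return (1.0 - e) / (1.0 + e);
    }
    double e = std::exp(x);
    return (e - 1.0) / (e + 1.0);
}

// Identity-based evaluation: one call, no branch, no overflow.
double psi_tanh(double x) { return std::tanh(0.5 * x); }
```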

2. Newton--Raphson iteration for location

revss iterates the location estimator using the scoring fixed-point iteration (Rousseeuw & Verboven 2002, Eq. 21):

$$T^{(k+1)} = T^{(k)} + S \cdot \frac{\frac{1}{n}\sum \psi_{\log}\!\left(\frac{x_i - T^{(k)}}{S}\right)}{\alpha}$$

where $\alpha = \int \psi_{\log}'(u)\,d\Phi(u) \approx 0.4132$ is the normalization constant. This is a fixed-point iteration with linear convergence: each step reduces the error by a constant factor.

robscale instead applies Newton--Raphson to the estimating equation $\sum \psi_{\log}((x_i - T)/S) = 0$, yielding:

$$T^{(k+1)} = T^{(k)} + \frac{2S \sum \psi\!\left(\frac{x_i - T^{(k)}}{2S}\right)}{\sum \left[1 - \psi^2\!\left(\frac{x_i - T^{(k)}}{2S}\right)\right]}$$

where $\psi(\cdot) = \tanh(\cdot)$. This follows from observing that the derivative of the logistic psi satisfies $\psi_{\log}'(x) = 1 - \psi_{\log}^2(x) = 1 - \tanh^2(x/2)$. Since $\tanh$ values have already been computed for the numerator, the denominator requires only squaring and subtraction---no additional transcendental function calls.
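The update formula is the standard Newton--Raphson step applied to this equation. Writing $f(T) = \sum_i \tanh\!\left(\frac{x_i - T}{2S}\right)$ and using $(\tanh u)' = 1 - \tanh^2 u$:

$$f'(T) = -\frac{1}{2S}\sum_i \left[1 - \tanh^2\!\left(\frac{x_i - T}{2S}\right)\right]$$

$$T^{(k+1)} = T^{(k)} - \frac{f(T^{(k)})}{f'(T^{(k)})} = T^{(k)} + \frac{2S \sum_i \tanh\!\left(\frac{x_i - T^{(k)}}{2S}\right)}{\sum_i \left[1 - \tanh^2\!\left(\frac{x_i - T^{(k)}}{2S}\right)\right]}$$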

Newton--Raphson achieves quadratic convergence near the solution: the number of correct digits approximately doubles per iteration. The practical effect is a reduction from 4--8 iterations (scoring) to 3 iterations (Newton--Raphson) for reaching the same tolerance of $\sqrt{\epsilon_{\text{mach}}} \approx 1.49 \times 10^{-8}$:

| $n$ | Scoring iterations | Newton--Raphson iterations |
|---|---|---|
| 4 | 7 | 3 |
| 5 | 8 | 3 |
| 8 | 7 | 3 |
| 20 | 6 | 3 |
| 100 | 5 | 3 |

At small $n$, the scoring iteration count is higher because the starting value (the median) can be far from the M-estimate in units of the auxiliary scale. Newton--Raphson absorbs this gap in fewer steps.
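A minimal sketch of the Newton--Raphson location loop in standalone C++ (a hypothetical rob_loc_newton with the starting value and scale supplied by the caller; the package adds fallbacks, arena buffers, and vectorized tanh):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Newton-Raphson for the estimating equation
// sum_i tanh((x_i - T) / (2S)) = 0, with fixed auxiliary scale S.
double rob_loc_newton(const std::vector<double>& x, double t0, double s,
                      double tol = 1.49e-8, int maxit = 80) {
    double t = t0;
    const double half_inv_s = 0.5 / s;  // hoisted loop invariant
    for (int k = 0; k < maxit; ++k) {
        double sum_psi = 0.0, sum_dpsi = 0.0;
        for (double xi : x) {
            double psi = std::tanh((xi - t) * half_inv_s);
            sum_psi  += psi;              // numerator terms
            sum_dpsi += 1.0 - psi * psi;  // derivative, reusing tanh values
        }
        double step = 2.0 * s * sum_psi / sum_dpsi;
        t += step;
        if (std::fabs(step) <= tol) break;  // converged
    }
    return t;
}
```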

3. $O(n)$ median selection

Both the median and the MAD require computing a quantile---the median of the data, and the median of the absolute deviations. revss uses R's median() and mad(), which call sort() internally: an $O(n \log n)$ operation.

robscale uses a tiered $O(n)$ selection strategy for the median. For even $n$, after selecting the $k$th order statistic, a single linear scan over the upper partition locates the $(k{+}1)$th element needed for averaging.

For $n \leq 8$---the core target regime---the selection step uses optimal sorting networks (Knuth, TAOCP Vol. 3, Sec. 5.3.4). These are conditional compare-and-swap sequences---typically compiled to branchless machine code at -O2---with the minimum number of comparisons for each $n$:

| $n$ | Comparators |
|---|---|
| 3 | 3 |
| 4 | 5 |
| 5 | 9 |
| 6 | 12 |
| 7 | 16 |
| 8 | 19 |
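For illustration, here is a median-of-5 using one of the optimal 9-comparator networks (a hypothetical standalone median5; the package generates a network per $n$ and returns the middle element without a full sort where possible):

```cpp
#include <algorithm>

// Conditional compare-and-swap; compilers typically lower this to
// branchless min/max instructions at -O2.
inline void cas(double& a, double& b) {
    if (a > b) std::swap(a, b);
}

// 9-comparator sorting network for n = 5 (Knuth, TAOCP Vol. 3, Sec. 5.3.4
// lists 9 as the minimum); after the network runs, a2 holds the median.
double median5(double a0, double a1, double a2, double a3, double a4) {
    cas(a0, a1); cas(a3, a4); cas(a2, a4);
    cas(a2, a3); cas(a0, a3); cas(a0, a2);
    cas(a1, a4); cas(a1, a3); cas(a1, a2);
    return a2;  // middle element of the sorted quintuple
}
```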

For $9 \leq n < 600$, the code delegates to std::nth_element (introselect), which provides $O(n)$ worst-case selection with median-of-three pivot selection. For $n \geq 600$, the Floyd--Rivest algorithm (Floyd & Rivest, 1975) applies a statistical narrowing step that reduces the active window to $O(n^{2/3})$ elements before partitioning---a constant-factor improvement that amortizes the overhead of its log/exp/sqrt computation only at scale.

4. Arena allocation on the stack

Each estimator requires working arrays: a copy of the input (for destructive selection), absolute deviations (for MAD), and a temporary buffer (for bulk tanh arguments). revss allocates these as R vectors, incurring R's SEXPREC header overhead and adding garbage-collection pressure.

robscale uses a stack-allocated arena of 512 doubles per segment. For $n \leq 512$---which covers the target regime and far beyond---there is zero heap allocation. For $n > 512$, the code falls back to new[]/delete[].
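The stack-first allocation pattern can be sketched as follows (a simplified hypothetical ScratchBuffer; the package's arena manages several such segments per call):

```cpp
#include <cstddef>

// Stack-first working buffer: a fixed 512-double stack segment for small n,
// heap fallback beyond. No heap traffic at all in the target regime.
class ScratchBuffer {
public:
    explicit ScratchBuffer(std::size_t n)
        : heap_(n > kStackCap ? new double[n] : nullptr) {}
    ~ScratchBuffer() { delete[] heap_; }
    ScratchBuffer(const ScratchBuffer&) = delete;
    ScratchBuffer& operator=(const ScratchBuffer&) = delete;
    double* data() { return heap_ ? heap_ : stack_; }
private:
    static constexpr std::size_t kStackCap = 512;
    double stack_[kStackCap];   // used when n <= 512
    double* heap_;              // used when n > 512
};
```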

5. Compile-time reciprocal constants

The constants $1/\alpha = 1/0.413241928283814$ and $1/c = 1/0.37394112142347236$ are declared constexpr, allowing the compiler to replace divisions in the iteration loop with multiplications. On ARM64, this avoids the ~10-cycle fdiv instruction in favour of a ~3-cycle fmul.

6. Loop-invariant hoisting

Values that are constant across iterations---inv_s = 1.0 / s, half_inv_s = 0.5 / s, inv_n = 1.0 / n---are computed once before the loop. The revss implementation recomputes (x - t) / s as a fresh R vector each iteration, traversing the interpreter for every vectorised operation.
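Both micro-optimizations show up together in the inner loop; a hedged sketch (illustrative function name and loop body, with the constants taken from the document):

```cpp
#include <cmath>
#include <cstddef>

// Compile-time reciprocal: the division 1/c is folded by the compiler,
// so the loop body contains only multiplications.
constexpr double kC    = 0.37394112142347236;  // scale rho constant
constexpr double kInvC = 1.0 / kC;

// One pass of sum(psi^2((x_i - t) / (c * s))) with psi(u) = tanh(u/2).
double sum_rho_pass(const double* x, std::size_t n, double t, double s) {
    const double half_inv_cs = 0.5 * kInvC / s;  // hoisted: computed once
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double psi = std::tanh((x[i] - t) * half_inv_cs);  // fmul, no fdiv
        sum += psi * psi;
    }
    return sum;
}
```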

Benchmarks

Platform: R 4.5.2, Apple clang 17.0.0, macOS (Darwin 25.3.0), Apple Silicon (aarch64). Median of 10,000 microbenchmark iterations per cell ($n \leq 100$); 2,000 iterations for $n > 100$.

| $n$ | Function | revss ($\mu$s) | robscale ($\mu$s) | Speedup |
|---|---|---|---|---|
| 3 | adm | 8.86 | 0.86 | 10x |
| 3 | robLoc | 21.77 | 1.56 | 14x |
| 3 | robScale | 48.50 | 1.68 | 29x |
| 4 | adm | 10.37 | 0.86 | 12x |
| 4 | robLoc | 56.05 | 1.72 | 33x |
| 4 | robScale | 93.28 | 3.03 | 31x |
| 5 | adm | 8.53 | 0.82 | 10x |
| 5 | robLoc | 43.95 | 1.64 | 27x |
| 5 | robScale | 95.94 | 3.20 | 30x |
| 8 | adm | 10.33 | 0.86 | 12x |
| 8 | robLoc | 36.24 | 1.68 | 22x |
| 8 | robScale | 79.62 | 2.87 | 28x |
| 20 | adm | 10.58 | 0.90 | 12x |
| 20 | robLoc | 37.56 | 1.80 | 21x |
| 20 | robScale | 90.53 | 3.94 | 23x |
| 100 | adm | 11.07 | 1.03 | 11x |
| 100 | robLoc | 53.18 | 2.58 | 21x |
| 100 | robScale | 113.41 | 8.61 | 13x |
| 500 | adm | 14.74 | 1.76 | 8x |
| 500 | robLoc | 103.14 | 6.19 | 17x |
| 500 | robScale | 288.64 | 35.10 | 8x |
| 1000 | adm | 18.16 | 3.16 | 6x |
| 1000 | robLoc | 163.14 | 10.82 | 15x |
| 1000 | robScale | 484.78 | 65.56 | 7x |

Interpretation. Speedups are largest in the target regime ($n \leq 20$): 10--33x for individual functions. The robLoc speedup exceeds what compiled execution alone would predict because the Newton--Raphson iteration converges in 3 steps where the scoring iteration requires 4--8 (see Newton--Raphson iteration).

At larger $n$, the advantage narrows. The adm function has no iteration loop, so its speedup (6--12x) reflects only the compiled loop and $O(n)$ median selection. The robScale narrowing at large $n$ (7--8x) is expected: the iteration count is the same in both implementations (multiplicative iteration is retained), so the per-iteration cost of vectorized tanh vs interpreted plogis() becomes the dominant factor, and both scale as $O(n)$.

The revss timings include R function-call dispatch, vectorised plogis() allocation, and garbage-collection pressure. robscale eliminates all three.

Numerical equivalence

The test suite (inst/tinytest/test_cross_check.R) compares robscale and revss across 5,400 randomly generated inputs:

  • Sample sizes $n = 3, 4, \ldots, 20$
  • 100 replicates per $n$ (uniform on $[-100, 100]$, seed 42)
  • All three functions: adm, robLoc, robScale
  • Tolerance: $\sqrt{\epsilon_{\text{mach}}} \approx 1.49 \times 10^{-8}$

All 5,400 comparisons pass. The Newton--Raphson iteration converges to the same fixed point as the scoring iteration---it solves the same estimating equation---so the results differ only by rounding at the level of the convergence tolerance.

Mathematical background

The estimators follow Rousseeuw & Verboven (2002). A brief summary of the key definitions:

Logistic psi function (Eq. 23):

$$\psi_{\log}(x) = \frac{e^x - 1}{e^x + 1} = \tanh(x/2)$$

Bounded in $(-1, 1)$, smooth ($C^\infty$), strictly monotone. Boundedness provides robustness; smoothness avoids the corner artifacts of Huber's psi at small $n$.

Decoupled estimation. Location and scale are estimated separately with a fixed auxiliary estimate, breaking the positive-feedback loop of Huber's Proposal 2. robLoc fixes scale at $\text{MAD}(x)$; robScale fixes location at $\text{median}(x)$.

Rho function for scale (Eq. 26):

$$\rho_{\log}(x) = \psi_{\log}^2(x / c)$$

where $c = 0.37394112142347236$ is the constant that yields a 50% breakdown point.

Key constants (full double precision):

| Symbol | Value | Definition |
|---|---|---|
| $\alpha$ | 0.413241928283814 | $\int \psi_{\log}'(u)\,d\Phi(u)$; scoring normalization constant |
| $c$ | 0.37394112142347236 | Solution to $\int \rho_{\log}(u)\,d\Phi(u) = 0.5$; scale rho constant |
| $C_{\text{ADM}}$ | 1.2533141373155001 | $\sqrt{\pi/2}$; ADM consistency constant |
| $C_{\text{MAD}}$ | 1.4826 | $1/\Phi^{-1}(3/4)$; MAD consistency constant |

Relation to revss

This package reimplements the algorithms from the revss package by Avraham Adler. The API is intentionally identical: adm(), robLoc(), and robScale() accept the same arguments and return the same values. Code that uses revss can switch to robscale by changing only the library() call.

Users who do not need compiled performance---or who prefer a dependency-free pure-R package---should use revss directly. It is mature, well-tested, and available on CRAN.

References

Rousseeuw, P.J. and Verboven, S. (2002). Robust estimation in very small samples. Computational Statistics & Data Analysis, 40(4), 741--758. doi:10.1016/S0167-9473(02)00078-6

Floyd, R.W. and Rivest, R.L. (1975). Expected time bounds for selection. Communications of the ACM, 18(3), 165--172. doi:10.1145/360680.360691

Nair, K.R. (1947). A Note on the Mean Deviation from the Median. Biometrika, 34(3/4), 360--362. doi:10.2307/2332448

Author

Dennis Alexis Valin Dittrich (ORCID)

License

MIT. Copyright 2026 Dennis Alexis Valin Dittrich.
