Crisis stress · three regimes

How honest is our 90% CI in a crisis?

Three numbers tell you whether a probabilistic forecast is honest: the information coefficient (does the score predict return?), the conformal coverage rate (does the 90% CI actually contain reality 90% of the time?), and a permutation-test p-value that puts the IC in context against random shuffles. We run all three on the production engine across the three crisis windows below: a slow rate-hike grind, a flash crash, and a leverage-driven financial collapse. Different regimes, different results.

2022-01-04 · as-of

Rate-hike bear market

S&P 500 closed at all-time high 4796.56 the day before. Fed pivot to rate hikes started Jan 4.

~280 calendar days (about 195 trading days); S&P fell 25% to the Oct 12, 2022 trough

Tickers tested: 26

Composite IC: 0.123

30d CI coverage: 65.4%

Permutation p: 0.481

Bearish hit rate: 6/7 (86%)

Bullish hit rate: 2/5 (40%)

A slow, grinding bear. The engine's bearish calls were well-calibrated (6 of 7 hit); its bullish calls were over-confident (2 of 5 hit), so the engine fought the regime. Permutation p = 0.48: with n = 26 the IC is not statistically significant, and the full 374-ticker universe is needed to settle it.
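The two hit rates above are simple counts. A minimal sketch of how they could be computed, assuming a call counts as a hit when the sign of the realised forward return matches the direction of the call. The `Call` type, the `hitRate` helper, and the toy returns are hypothetical and are not taken from the engine; the toy data is only shaped to reproduce the 6/7 and 2/5 counts.

```typescript
// Hypothetical row shape, not the engine's real schema.
type Call = { direction: "bullish" | "bearish" | "neutral"; forwardReturn: number };

// Assumption: a bearish call hits when the forward return is negative,
// a bullish call hits when it is positive. Neutral calls are ignored.
function hitRate(
  calls: Call[],
  direction: "bullish" | "bearish"
): { hits: number; total: number; rate: number } {
  const subset = calls.filter((c) => c.direction === direction);
  const hits = subset.filter((c) =>
    direction === "bullish" ? c.forwardReturn > 0 : c.forwardReturn < 0
  ).length;
  return { hits, total: subset.length, rate: subset.length > 0 ? hits / subset.length : NaN };
}

// Made-up forward returns: 7 bearish calls (6 fell), 5 bullish calls (2 rose).
const bearishCalls = [-0.12, -0.08, -0.2, -0.05, -0.15, -0.03, 0.04].map(
  (r): Call => ({ direction: "bearish", forwardReturn: r })
);
const bullishCalls = [0.06, 0.02, -0.1, -0.07, -0.18].map(
  (r): Call => ({ direction: "bullish", forwardReturn: r })
);
const calls: Call[] = bearishCalls.concat(bullishCalls);

const bearish = hitRate(calls, "bearish");
const bullish = hitRate(calls, "bullish");
console.log(bearish, bullish);
```

Splitting by direction is the point of the exercise: a single pooled hit rate would hide the asymmetry between the two kinds of call.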

2020-02-19 · as-of

COVID flash crash

S&P 500 closed at all-time high 3386.15. Bear market started the next day (Feb 20). Trough at 2237.40 on Mar 23, 2020.

23 trading days to trough, fastest 30%+ drawdown on record

Tickers tested

Composite IC: -0.045

30d CI coverage: 13.6%

Permutation p: n/a

The engine had near-zero predictive power on this 23-trading-day exogenous shock: IC = -0.045 means scores were essentially uncorrelated with realised forward returns. The 90% CI covered only 14% of outcomes. Honest reading: factor models built for normal market dynamics are not designed to predict pandemic-driven flash crashes. This is the failure mode investors should know about, and we publish it rather than hide it.
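The coverage figure is the share of realised outcomes that landed inside their forecast interval. A minimal sketch of that count, assuming each forecast carries a lower bound, an upper bound, and the return that was later realised. The `IntervalForecast` type, the `empiricalCoverage` helper, and the toy numbers are illustrative assumptions, not the engine's data.

```typescript
// Hypothetical record: a forecast interval and the return that was later realised.
type IntervalForecast = { lower: number; upper: number; realised: number };

// Fraction of realised outcomes that fell inside their forecast interval.
function empiricalCoverage(forecasts: IntervalForecast[]): number {
  if (forecasts.length === 0) return NaN;
  const covered = forecasts.filter(
    (f) => f.realised >= f.lower && f.realised <= f.upper
  ).length;
  return covered / forecasts.length;
}

// Toy crash-like window: every interval is +/-5% around zero, most outcomes are far below it.
const toyForecasts: IntervalForecast[] = [-0.32, -0.28, -0.04, -0.41, 0.03, -0.19, -0.25, -0.36].map(
  (realised) => ({ lower: -0.05, upper: 0.05, realised })
);

const coverageRate = empiricalCoverage(toyForecasts);
console.log(`coverage = ${(coverageRate * 100).toFixed(1)}% against a 90% target`);
```

In the toy window only two of eight outcomes fall inside the band, which is the same pattern as the crisis result: intervals sized for calm markets miss most of a crash.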

2008-09-12 · as-of

Lehman / financial crisis

Friday before Lehman Brothers bankruptcy weekend. S&P 1251.70 → 752.44 by Nov 20.

~50 trading days to interim trough; full bear ran to Mar 9, 2009 (-57% peak-to-trough)

Tickers tested: 19

Composite IC: 0.261

30d CI coverage: 15.8%

Permutation p: n/a

IC = 0.261 is publication-grade: the engine ranked tickers strongly even during the avalanche. But the 90% CI covered only 16% of outcomes. Honest reading: in a leverage-driven liquidation, our prediction intervals were far too narrow; rank order survived, magnitude estimates did not. Mondrian conformal calibration (post-2026-05-16) widens halfwidths when residuals expand, and this is exactly the regime that loop is meant to fix. Tickers: 19 of the modern universe with sufficient pre-2008 history (META, TSLA, AVGO, ABBV were not yet listed).
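To make the Mondrian idea concrete, here is a minimal sketch of group-conditional split conformal calibration: absolute residuals are bucketed by a category label and each bucket gets its own halfwidth, so a bucket with larger residuals automatically gets a wider interval. The grouping by a "calm" / "stressed" label, the type and function names, and the toy residuals are all assumptions for illustration; how the production loop actually defines its categories is not stated here.

```typescript
// Hypothetical calibration record: a group label (the Mondrian category) and
// the absolute residual |realised - predicted| seen for one past forecast.
type Residual = { group: string; absResidual: number };

// Split-conformal halfwidth at the given coverage level (e.g. 0.9):
// the ceil((n + 1) * level)-th smallest absolute residual.
// Returns Infinity when the group is too small to support that level.
function conformalHalfwidth(absResiduals: number[], level: number): number {
  const n = absResiduals.length;
  // The small epsilon guards against floating-point overshoot in (n + 1) * level.
  const rank = Math.ceil((n + 1) * level - 1e-9);
  if (n === 0 || rank > n) return Infinity;
  const sorted = absResiduals.slice().sort((a, b) => a - b);
  return sorted[rank - 1];
}

// Mondrian step: calibrate each group separately.
function mondrianHalfwidths(calibration: Residual[], level: number): Record<string, number> {
  const byGroup: Record<string, number[]> = {};
  for (let i = 0; i < calibration.length; i++) {
    const { group, absResidual } = calibration[i];
    if (!byGroup[group]) byGroup[group] = [];
    byGroup[group].push(absResidual);
  }
  const halfwidths: Record<string, number> = {};
  Object.keys(byGroup).forEach((group) => {
    halfwidths[group] = conformalHalfwidth(byGroup[group], level);
  });
  return halfwidths;
}

// Toy calibration set: nine residuals per group, the stressed ones five times larger.
const calmResiduals = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09];
const calibrationSet: Residual[] = calmResiduals
  .map((r) => ({ group: "calm", absResidual: r }))
  .concat(calmResiduals.map((r) => ({ group: "stressed", absResidual: r * 5 })));

const halfwidthByGroup = mondrianHalfwidths(calibrationSet, 0.9);
console.log(halfwidthByGroup);
```

A single pooled quantile would sit somewhere between the two groups, too wide in calm markets and too narrow in stressed ones; calibrating per group is what lets the interval widen only where the residuals have expanded.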

How to read these numbers

IC measures whether higher composite score correlates with higher realised return. Hedge-fund grade ≈ 0.05; 0.10+ is rare; retail tools rarely break 0.02.
Coverage rate tells you whether the CI is honest. A 90% CI should contain reality 90% of the time. Coverage below 90% means the intervals are over-confident (too narrow); coverage above 90% means they are under-confident (too wide).
Permutation p-value answers a sample-size question: shuffle the labels 1,000 times and see how often a random pairing produces an IC at least as large as ours. p < 0.05 = significant; p > 0.05 = could be sample-size noise.
Bearish vs bullish hit rate separates the two failure modes. Well-calibrated bearish calls plus over-confident bullish calls in a bear market mean the engine is fighting the regime, not the data.
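The IC and its permutation p-value can be sketched in a few lines. This is an illustration under stated assumptions, not the code in the test suite: it assumes the IC is a Spearman rank correlation between score and forward return (the page does not say which correlation is used), it uses a one-sided test, and it adds one to the numerator and denominator so the p-value is never exactly zero. The seeded generator, the function names, and the deliberately perfect toy data are all hypothetical.

```typescript
// Average ranks (1-based); tied values share the mean of the positions they span.
function ranks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const out: number[] = new Array(values.length);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const avg = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) out[order[k].i] = avg;
    start = end + 1;
  }
  return out;
}

function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((s, v) => s + v, 0) / n;
  const my = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) * (x[i] - mx);
    syy += (y[i] - my) * (y[i] - my);
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Assumption: IC = Spearman rank correlation of score vs realised forward return.
function rankIC(scores: number[], returns: number[]): number {
  return pearson(ranks(scores), ranks(returns));
}

// Small seeded LCG so the shuffles are reproducible (illustrative only).
function makeRng(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// One-sided permutation p-value: share of shuffles whose IC is at least the observed IC.
function permutationP(scores: number[], returns: number[], shuffles: number, rng: () => number): number {
  const observed = rankIC(scores, returns);
  const shuffled = returns.slice();
  let atLeast = 0;
  for (let s = 0; s < shuffles; s++) {
    // Fisher-Yates shuffle of the return labels.
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(rng() * (i + 1));
      const tmp = shuffled[i];
      shuffled[i] = shuffled[j];
      shuffled[j] = tmp;
    }
    if (rankIC(scores, shuffled) >= observed) atLeast++;
  }
  // The +1 counts the observed pairing itself and keeps p strictly above zero.
  return (atLeast + 1) / (shuffles + 1);
}

// Toy data in which score order matches return order exactly, so the rank IC is 1.
const toyScores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6];
const toyReturns = [0.12, -0.2, -0.03, 0.05, -0.1, 0.08, -0.15, 0.01];
const toyIC = rankIC(toyScores, toyReturns);
const toyP = permutationP(toyScores, toyReturns, 1000, makeRng(42));
console.log({ toyIC, toyP });
```

The shuffle breaks any real link between score and return while keeping both distributions intact, so the p-value reflects only how easily chance produces an IC of that size at the given sample size.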

Why three crises matter

A model that does well on one crisis window may have been lucky or overfit to that regime. The three windows here are structurally different: 2022 was a slow grind driven by interest-rate policy; 2020 was a 33-calendar-day (23-trading-day) flash collapse driven by an exogenous health shock; 2008 was a 6-month avalanche driven by leverage unwinding. A model that produces honest coverage across all three has captured something general about equity dynamics. A model that fails one of them has a known weakness, and we publish it.

What is coming next

Full 374-ticker production universe — current 26-ticker subset has standard error 0.20 on the IC, which makes IC = 0.12 a 0.6σ result. With 374 tickers the SE drops to ~0.05 and the same IC becomes 2.4σ → p<0.02.
13-factor production composite — current synthetic 2-factor blend (microstructure + sector momentum) is roughly half the production engine. Live blend typically compounds IC by another 30-50% on the same data.
Reliability diagram + Brier score + CRPS — proper scoring rules accompany these numbers on /track-record from 2026-05-16 onwards (when forward returns from live signals start filling in).
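The standard-error arithmetic in the first item above can be reproduced with a short sketch. It assumes the standard error of a correlation under the null is roughly 1 / sqrt(n - 1), which matches the quoted 0.20 at n = 26 and about 0.05 at n = 374, and it converts the resulting z-score to a two-sided p-value with a normal approximation. The function names and the choice of the Abramowitz-Stegun erf approximation are illustrative assumptions.

```typescript
// Assumption: SE of a correlation under the null is about 1 / sqrt(n - 1).
function icStandardError(n: number): number {
  return 1 / Math.sqrt(n - 1);
}

// Abramowitz-Stegun 7.1.26 approximation to erf(x) for x >= 0
// (absolute error below about 1.5e-7).
function erfApprox(x: number): number {
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 +
      t * (-0.284496736 +
        t * (1.421413741 +
          t * (-1.453152027 + t * 1.061405429))));
  return 1 - poly * Math.exp(-x * x);
}

// Two-sided p-value for a z-score under a normal approximation.
function twoSidedP(z: number): number {
  return 1 - erfApprox(Math.abs(z) / Math.SQRT2);
}

const observedIc = 0.123; // composite IC from the 2022 window
const icSummary = [26, 374].map((n) => {
  const se = icStandardError(n);
  const z = observedIc / se;
  return { n, se, z, p: twoSidedP(z) };
});

icSummary.forEach((row) =>
  console.log(`n=${row.n} SE=${row.se.toFixed(3)} z=${row.z.toFixed(2)} p=${row.p.toFixed(3)}`)
);
```

The IC itself is held fixed in this calculation; whether it stays at 0.123 on the larger universe is an empirical question the sketch does not answer.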


Test sources: src/lib/data/__tests__/crisis-stress-{2022,2020,2008}.test.ts + permutation-test.test.ts. Reconstructors: src/lib/data/historical-reconstructor.ts.