Documentation
How to read a strategy report, interpret the robustness score, navigate the failure-mode taxonomy, and understand the limits of these results.
For the underlying architecture (case catalog, simulator, calibration), see the Methodology page.
Quick Start
The site tests trading strategies against a curated catalog of 63 stress cases — 31 historical market episodes (Lehman 2008, COVID 2020, Luna 2022, etc.) plus 32 synthetic stress probes (50 Monte Carlo replicas each) — and reports per-regime performance via a 10-axis failure-mode taxonomy. As of May 2026 V_RECOVERY is treated as a diagnostic path pattern and decomposed into SHARP_CRASH and TREND_UP for score aggregation, so the score uses 9 effective buckets.
The fastest way to get oriented:
- Open a strategy report. Browse the pre-computed reports listed on the home page, or open one directly via its URL — for example /s/sma-200-crossover or /s/rsi-30-70.
- Read the headline numbers. The robustness score (0–100) and Worst-Case Drawdown at the 95th percentile (WCDD95) summarize the strategy across all evaluated cases.
- Inspect the failure-mode breakdown. The Scenario Dashboard groups the 10 failure modes into 6 buckets (Rising, Sideways, Calm, Volatile, Falling, Crash). Selecting a card pins a representative case in the Inspector below.
- Review the per-scenario activity. The Per-Scenario Activity table lists every case the strategy was tested against — empirical anchors plus synthetic probes — including the number of trades, the realized return, and the maximum drawdown.
To run a custom strategy: enter a plain-English description in the Strategy Lab. The parser converts it to structured JSON; the same engine that powers the catalog reports evaluates it against the case catalog. Supported indicators and operators are listed in the Strategy Input Guide.
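The parser's exact JSON schema is not documented on this page; the sketch below is a hypothetical illustration of what a structured strategy might look like, using the crossover example from the supported-indicator family. The field names (`entry`, `exit`, `op`, `indicator`, `period`) are assumptions, not the engine's real schema.

```python
# Hypothetical illustration only: the real parser schema is documented in
# the Strategy Input Guide. A plain-English rule such as
#   "Buy when the 50-day SMA crosses above the 200-day SMA; sell on the reverse"
# might plausibly be structured like this:
strategy = {
    "name": "sma-50-200-crossover",  # illustrative slug
    "entry": {"op": "crosses_above",
              "left":  {"indicator": "SMA", "period": 50},
              "right": {"indicator": "SMA", "period": 200}},
    "exit":  {"op": "crosses_below",
              "left":  {"indicator": "SMA", "period": 50},
              "right": {"indicator": "SMA", "period": 200}},
}

# The longest lookback determines the warmup the engine must cover
# before the strategy can emit a signal.
max_lookback = max(
    side[key]["period"]
    for side in (strategy["entry"], strategy["exit"])
    for key in ("left", "right")
)
print(max_lookback)  # 200
```

A strategy like this one would trigger the long-lookback handling described under "Indicator Warmup & State Initialization" below.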
Reading a Strategy Report
Every strategy report follows the same layout. From top to bottom:
Headline metrics
The robustness score (0–100), Worst-Case Drawdown (WCDD95), and case-coverage summary. If the strategy uses long-lookback indicators (e.g. SMA-200) and some synthetic cases were skipped due to insufficient tradeable duration, this is reported here transparently.
Scenario Dashboard
Six bucket cards — Rising, Sideways, Calm, Volatile, Falling, Crash — each aggregating one or more failure modes. Each card shows the average return, the ±1σ range across cases in the bucket, and the sample size. Clicking a card pins its primary failure mode in the Inspector below.
Per-Scenario Activity
A table listing every case in the catalog the strategy was tested against, split into Empirical Anchors (one row per case) and Synthetic Stress Probes (one row per profile, with median and quartile range across the 50 replicas). For each case: trade count, total return, maximum drawdown.
Scenario Inspector
Equity curve and trade markers for a representative case in the currently selected failure mode. The dashboard cards (above) also toggle the Inspector between best, median, and worst examples per regime, drawn from the engine output.
Case Studies
Per-asset performance against the empirical and synthetic case set, including a Buy & Hold reference for direct comparison.
Indicator Warmup & State Initialization
Many trading indicators depend on prior price history. A 200-day simple moving average, for example, is undefined for the first 199 bars and reflects a full lookback window only from bar 200 onward. In a stress case of 126–252 trading days, an indicator launched from an empty state would produce no actionable signal until most of the case has already elapsed.
Synthetic stress cases on this platform therefore begin with a 250-day antecedent pre-period followed by a short transition window before the stress regime starts. The pre-period exists so that long-lookback indicators have stable values when the stress phase begins. Performance attribution — trades, returns, drawdowns — is computed only from the stress phase onward; the pre-period is used solely to initialize indicator state.
The antecedent regime itself is not the stress regime. It is drawn from an asset-class-typical distribution (e.g. for equity assets: predominantly bull-trend or sideways-low-vol, with a smaller share of weak-bear) so that indicators enter the stress phase from a representative, rather than artificial, prior state.
Antecedent regimes are deterministically assigned per simulation seed, ensuring reproducibility. Results may depend on the assumed antecedent distribution, which defines the initial market state prior to stress onset; this is a modeling choice and should be considered when interpreting results. Cases where the warmup requirement still exceeds the case duration (e.g. very long lookback windows on short cases) are excluded transparently and shown in the report header.
Empirical historical cases (Lehman 2008, COVID 2020, etc.) carry their natural pre-period from the surrounding real market data and do not require this construction.
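The warmup mechanics described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the drift and volatility numbers are invented, and the real simulator is agent-based, not a Gaussian random walk. The point shown is only the bookkeeping: the SMA is computed over the full series, but attribution starts at the stress-phase boundary.

```python
import random

random.seed(7)

PRE, STRESS = 250, 126  # antecedent pre-period and stress phase, in bars
prices = [100.0]
for t in range(PRE + STRESS - 1):
    # Invented dynamics: mild bull antecedent, then a stressed down-drift.
    drift = 0.0004 if t < PRE else -0.002
    prices.append(prices[-1] * (1 + drift + random.gauss(0, 0.01)))

def sma(series, n, t):
    """SMA over the n bars ending at index t; None while still warming up."""
    return None if t < n - 1 else sum(series[t - n + 1 : t + 1]) / n

# Thanks to the 250-bar pre-period, SMA-200 is already defined at stress onset:
assert sma(prices, 200, PRE) is not None

# Attribution uses only the stress phase: return measured from bar PRE onward.
stress_return = prices[-1] / prices[PRE] - 1
```

Without the pre-period, the same SMA-200 would be `None` for the first 199 bars of a 126-bar case, i.e. for the entire case.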
Understanding the Score
The robustness score (0–100) aggregates the strategy’s behavior across all evaluated cases, weighted by failure-mode coverage. It is a relative measure of consistency — higher reflects more uniform behavior across regimes; lower reflects substantial regime-dependent variability.
The score is not a risk measure, a return forecast, or a recommendation. It cannot be compared in absolute terms across different case catalogs.
| Score Range | Label | Interpretation |
|---|---|---|
| 80–100 | High Robustness | The strategy showed relatively consistent behavior across the evaluated stress regimes in this dataset. Variability remains across specific failure modes — the per-regime breakdown identifies where outcomes diverged. |
| 65–79 | Moderate Robustness | The strategy showed moderate stability across most stress cases, with weaknesses in specific environments. The failure-mode breakdown indicates which regimes drove the variability. |
| 45–64 | Variable Behavior | The strategy shows substantial differences depending on the market regime. Some regimes produced positive outcomes, others showed deeper drawdowns. Interpretation should focus on the per-regime breakdown rather than the aggregate score. |
| 0–44 | High Regime Sensitivity | The strategy shows substantial sensitivity to several stress regimes in the dataset. The aggregated score is dominated by regimes with poor outcomes; the per-regime breakdown identifies which. |
The aggregated score should be read alongside the per-failure-mode breakdown — two strategies with the same score can have very different regime profiles.
Understanding Drawdown
WCDD95 is the 95th-percentile worst-case drawdown — in 95% of observed scenarios the strategy’s drawdown did not exceed this level. It represents downside exposure under the tested stress conditions.
Does not imply future performance. The metric summarizes what was observed in the case catalog; future drawdowns may differ depending on the actual market path.
How to read WCDD95
- Below 15%: even tail-case drawdowns remained shallow in the dataset.
- 15–35%: notable downside exposure; common for trend-following and equity-bias strategies in stress regimes.
- Above 35%: substantial downside in tail cases. Typical of strategies tested against deep-crash empirical anchors (Lehman, Dotcom, Luna).
The full drawdown distribution (median, P25, P75, P95) is shown in the report’s “More details” expandable section.
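For concreteness, here is a minimal sketch of how a 95th-percentile drawdown summary can be computed. The site's exact convention (interpolation method, per-case vs. per-replica pooling) is not specified on this page; this sketch uses the nearest-rank method on pooled per-case maximum drawdowns, and the input numbers are invented.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a fraction."""
    peak, worst = equity[0], 0.0
    for x in equity:
        peak = max(peak, x)
        worst = max(worst, (peak - x) / peak)
    return worst

def wcdd95(drawdowns):
    """Nearest-rank 95th percentile of per-case max drawdowns."""
    ranked = sorted(drawdowns)
    return ranked[min(len(ranked) - 1, int(0.95 * len(ranked)))]

# Invented per-case max drawdowns for illustration:
dds = [0.05, 0.08, 0.12, 0.18, 0.22, 0.31, 0.09, 0.14, 0.27, 0.40]
print(f"WCDD95 = {wcdd95(dds):.0%}")  # WCDD95 = 40%
```

Reading it back: in 95% of these invented cases the drawdown stayed at or below the reported level, which mirrors how the headline metric should be read.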
Failure Modes in Practice
Strategies typically don’t fail uniformly across markets — they have specific regime sensitivities. The site uses a 10-axis failure-mode taxonomy:
- TREND_UP: Persistent upward drift
- TREND_DOWN: Persistent downward drift
- SIDEWAYS: Range-bound, no resolution
- VOL_EXPANSION: Sustained volatility elevation
- VOL_COMPRESSION: Sustained volatility suppression
- SHARP_CRASH: Accelerated downside dislocation
- SLOW_BEAR: Gradual cumulative decline
- V_RECOVERY: Drawdown plus rapid retracement (diagnostic path pattern; decomposed into SHARP_CRASH + TREND_UP for scoring)
- WHIPSAW: Repeated false signals
- LIQUIDITY_STRESS: Microstructure breakdown

Each case in the catalog is tagged with one or more failure modes — Lehman 2008, for example, contributes to TREND_DOWN, SHARP_CRASH, and LIQUIDITY_STRESS simultaneously. The robustness score aggregates over all tagged contributions.
Operational definitions (drawdown thresholds, time scales, vol multipliers) and historical anchors per failure mode are documented on the Methodology page.
Reading the per-regime breakdown
In the Scenario Dashboard, the per-regime average and range identify which regimes the strategy handled stably and which produced the largest losses. A strategy with a high aggregate score but a deeply negative bucket has a regime-specific weakness — the score alone does not surface this.
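The per-bucket statistics on the dashboard cards (average return, ±1σ range, sample size) are ordinary summary statistics. The sketch below shows the arithmetic with invented numbers; the bucket names mirror the Scenario Dashboard, but nothing else here comes from real engine output.

```python
from statistics import mean, stdev

# Invented per-case returns grouped into two dashboard buckets:
bucket_returns = {
    "Crash":    [-0.18, -0.25, -0.31, -0.12],
    "Sideways": [0.01, -0.03, 0.02, -0.01],
}

for bucket, rets in bucket_returns.items():
    mu, sigma = mean(rets), stdev(rets)  # card shows mean and a ±1σ band
    print(f"{bucket:8s} avg {mu:+.1%}  "
          f"range {mu - sigma:+.1%} to {mu + sigma:+.1%}  (n={len(rets)})")
```

A deeply negative bucket mean with a tight band is the clearest signature of a regime-specific weakness that the aggregate score hides.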
Comparing Strategies
Two strategies tested on the same asset see the same case catalog and the same failure-mode taxonomy. Their robustness scores and per-regime breakdowns are directly comparable within this dataset.
Differences between strategies reflect different sensitivities to the specific cases and configurations tested. Results are conditional on the selected cases and configurations.
What comparisons surface
- Different regime profiles: a trend-following score of 70 looks different from a mean-reversion score of 70 once the failure-mode breakdown is examined.
- Indicator-family effects: SMA-based vs. ATR-based trend detection produce different WHIPSAW sensitivities even at similar aggregate scores.
- Hybrid filters: combinations such as “RSI with SMA-200 trend filter” can show distinct regime profiles relative to either component alone.
Cross-asset comparisons within a single strategy are valid but should account for asset-specific volatility baselines (e.g. equity vs. crypto). The site does not produce a single “best strategy” ranking — analysis is per-regime and per-asset.
Interpretation Boundaries
Explicit limits on what the results can and cannot say:
1. Results are conditional on the selected cases. The case catalog covers a curated set of historical and synthetic stress regimes — different cases would produce different scores.
2. Synthetic stress probes are model-generated approximations, not statistical replicas of any real instrument. They are designed to isolate specific failure modes that historical data did not deliver cleanly.
3. Volume-based indicators and conditions are not currently supported. OHLCV inputs include volume columns, but indicators such as OBV, VWAP, MFI, and CMF are not yet implemented in the engine — strategies referencing volume parse correctly, but volume-dependent signals do not fire.
4. No forward-looking claims are made. The score reflects observed behavior in the tested cases; it is not a prediction of future performance.
5. Strategy comparisons are valid within this dataset only. Differences between strategies reflect different sensitivities to the specific cases and configurations tested.
6. Long-lookback strategies (e.g. SMA-200) are evaluated against synthetic cases via a 250-day antecedent pre-period that initializes indicator state before the stress phase begins. Cases where the warm-up requirement still exceeds the available pre-period plus case duration are excluded transparently in the report header.
7. The robustness score is a relative measure of consistency across regimes — it does not measure absolute return potential or risk-adjusted return.
Regulatory framing: Does not constitute investment advice, a recommendation, or a forecast. Results are based on model-driven simulations under simplified assumptions.
FAQ
What does the strategy score actually measure?
The score aggregates the strategy’s behavior across the case catalog (63 historical and synthetic stress cases — 31 empirical anchors + 32 synthetic probes), weighted by failure-mode coverage. It is a relative measure of consistency — a higher score reflects more uniform behavior across regimes; a lower score reflects substantial regime-dependent variability. The score is not a forecast or a recommendation.
How should I interpret WCDD95?
WCDD95 (Worst-Case Drawdown at the 95th percentile) means that in 95% of observed scenarios, the strategy’s drawdown did not exceed this level. It represents downside exposure under the tested stress conditions. It does not imply that future drawdowns will stay below this level — it summarizes what was observed in the case catalog.
Why does my strategy perform well in some cases but fail in others?
Different cases stress different aspects of a strategy. A trend-following strategy will typically perform well in TREND_UP and TREND_DOWN cases but show weakness in SIDEWAYS or WHIPSAW regimes. The per-failure-mode breakdown in the report identifies which regimes drove the result.
What are failure modes in trading strategies?
Failure modes are categories of market behavior that stress strategies in distinct ways. The site uses a 9-axis failure-mode taxonomy (TREND_UP, TREND_DOWN, SIDEWAYS, VOL_EXPANSION, VOL_COMPRESSION, SHARP_CRASH, SLOW_BEAR, WHIPSAW, LIQUIDITY_STRESS), plus V-Recovery as a diagnostic path pattern (composite of SHARP_CRASH down-leg and TREND_UP recovery, not a separate score bucket since 2026-05). Each historical or synthetic case is classified ex-post against the failure-mode operational definitions, allowing strategy performance to be aggregated by regime type. Detailed definitions are available on the methodology page.
Does a high score mean a strategy is safe?
No. A high score reflects relative consistency across the tested stress cases — it does not measure absolute risk, capital adequacy, or suitability for any specific situation. The case catalog is a curated set of stress regimes, not an exhaustive sample of all possible market conditions.
How are synthetic stress cases generated?
Synthetic cases are produced by a hybrid agent-based market simulator with belief-field dynamics and observer agents calibrated against real CFTC Commitment-of-Traders data per asset. Each case enforces a specific stress regime (e.g. controlled whipsaw, sustained low-vol grind) over 126–252 trading days, preceded by a 250-day antecedent pre-period that initializes indicator state. The antecedent regime is drawn deterministically from the simulation seed using an asset-class-typical distribution; it is not the stress regime. Each case has 50 Monte Carlo replicas. Performance attribution begins only after the antecedent + transition window — the pre-period itself is excluded from trade counts and returns.
How does the platform handle long-lookback indicators like SMA-200?
Synthetic stress cases include a 250-day antecedent pre-period before the stress regime begins. This pre-period initializes the state of long-lookback indicators (SMA-200, EMA-200, ATR-200, etc.) so that they have stable values when the stress phase starts. Performance attribution is computed only from the stress phase onward; the pre-period is used solely to initialize indicator state. The antecedent regime is drawn from an asset-class-typical distribution (e.g. predominantly bull-trend or sideways-low-vol for equity, with a smaller share of weak-bear) so that indicators enter the stress phase from a representative prior state. Results may depend on the assumed antecedent regime distribution; this is a modeling choice and is documented in the methodology.
What is the difference between historical and synthetic cases?
Historical cases (31 in the catalog) are real OHLC slices from documented market episodes — Lehman 2008, COVID 2020, Luna 2022, Volmageddon 2018, etc. They carry built-in realism but are limited to events that actually occurred. Synthetic cases (32 profile-asset combinations, 50 replicas each) are model-generated stress probes for failure modes that real history did not deliver cleanly — including sharp-crash setups, vol-expansion setups, liquidity-stress setups, v-recovery setups, controlled whipsaw, low-vol grind, slow-stagflation, and two distinct slow-decline hardness levels (mid-cycle bear correction and adversarial no-rebound). Both categories contribute to the aggregated score; off-band conformance for individual probes is disclosed on the verifiability page.
Can I use these results to predict future performance?
No. The reports describe how the strategy behaved in the tested historical and synthetic cases. They do not constitute forecasts, predictions, or recommendations. Real market outcomes may differ significantly from simulated results.
Why does my strategy fail in sideways markets?
Trend-following strategies typically underperform in SIDEWAYS and WHIPSAW regimes because directional signals (moving-average crossovers, breakouts) generate frequent false confirmations within a bounded range. Mean-reversion strategies typically perform better in these regimes. The per-failure-mode breakdown identifies how the specific strategy responded.
What does volatility expansion mean in this context?
VOL_EXPANSION refers to a market regime where realized volatility is elevated and persistent — operationalized as median realized vol ≥ 1.5× the asset’s normal-window vol AND at least two distinct (non-overlapping) 30-day windows with vol ≥ 1.5× baseline (persistence requirement). Peak/spike behavior (≥ 3× baseline) is reported descriptively but does not gate the classification. The persistence requirement is what differentiates VOL_EXPANSION from a single SHARP_CRASH window. Historical examples include Tech Bear 2022 and the 2020 COVID post-crash period; the February 2018 Vol Shock is now classified as SHARP_CRASH after anchor re-validation. Strategies relying on fixed volatility thresholds (e.g. fixed-percentage stop-losses) are typically more sensitive to this regime.
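The two-part gate above (elevated median vol plus at least two non-overlapping elevated windows) can be sketched directly. This is an illustration of the stated rule, not the engine's implementation; window slicing and the volatility estimator (population std of daily returns) are simplifying assumptions.

```python
from statistics import median, pstdev

def realized_vol(returns):
    """Per-window realized volatility (population std of daily returns)."""
    return pstdev(returns)

def is_vol_expansion(returns, baseline_vol, window=30, mult=1.5):
    """Sketch of the VOL_EXPANSION gate: median window vol >= mult x baseline
    AND at least two non-overlapping elevated windows (persistence)."""
    # Step by `window` so the windows never overlap.
    windows = [returns[i:i + window]
               for i in range(0, len(returns) - window + 1, window)]
    vols = [realized_vol(w) for w in windows]
    elevated = sum(v >= mult * baseline_vol for v in vols)
    return median(vols) >= mult * baseline_vol and elevated >= 2

# Invented return paths: a persistently hot regime vs. a calm one.
hot = [0.02, -0.02] * 30    # ~2x baseline vol for 60 days
calm = [0.01, -0.01] * 30   # at baseline vol
print(is_vol_expansion(hot, baseline_vol=0.01))   # True
print(is_vol_expansion(calm, baseline_vol=0.01))  # False
```

Note how a single elevated window would fail the `elevated >= 2` condition, which is exactly the distinction from a lone SHARP_CRASH window.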
How many simulations are used per stress case?
Empirical historical cases run once each (the realism comes built-in from the actual OHLC sequence). Synthetic stress probes run 50 Monte Carlo replicas per profile per asset, aggregated to median performance with a 25th–75th percentile band. A typical strategy report aggregates 200+ case runs total.
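The replica aggregation (median with a 25th–75th percentile band) is standard quartile arithmetic. The sketch below uses invented replica returns; only the aggregation shape matches the report.

```python
import random
from statistics import quantiles

# Invented: 50 replica returns for one synthetic probe on one asset.
random.seed(42)
replica_returns = [random.gauss(-0.05, 0.08) for _ in range(50)]

# quantiles(n=4) returns the three quartile cut points Q1, Q2 (median), Q3.
p25, p50, p75 = quantiles(replica_returns, n=4)
print(f"median {p50:+.1%}, IQR [{p25:+.1%}, {p75:+.1%}]")
```

This is the per-row summary shown for each synthetic probe in the Per-Scenario Activity table.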
Are the results comparable across strategies?
Yes, within the same case catalog. Two strategies tested on the same asset see the same case set and the same failure-mode taxonomy, so their robustness scores and per-regime breakdowns are directly comparable. Cross-asset comparisons are valid within the same strategy but should account for asset-specific volatility baselines.
Technical Notes
JSON API (read-only)
- GET /api/v1/strategies/ — list of pre-computed strategies
- GET /api/v1/strategies/{slug} — full strategy report (JSON)
- GET /api/v1/strategies/{slug}/cases?asset={asset} — per-asset case results
- GET /api/v1/strategies/{slug}/cases/{asset}/{case_id} — single case detail
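A small helper for building these request paths, assuming the endpoint shapes listed above. The slugs come from this page's examples; the asset identifier "BTC" is a hypothetical placeholder, and the host is omitted.

```python
from urllib.parse import quote, urlencode

BASE = "/api/v1/strategies"  # path prefix from the endpoint list above

def report_url(slug):
    """Full strategy report (JSON)."""
    return f"{BASE}/{quote(slug)}"

def cases_url(slug, asset):
    """Per-asset case results for a strategy."""
    return f"{BASE}/{quote(slug)}/cases?{urlencode({'asset': asset})}"

print(report_url("sma-200-crossover"))
# /api/v1/strategies/sma-200-crossover
print(cases_url("rsi-30-70", "BTC"))
# /api/v1/strategies/rsi-30-70/cases?asset=BTC
```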
AI / Crawler Endpoints
- /llms.txt — site overview for LLM crawlers (architecture, taxonomy, query examples)
- /sitemap.xml — full URL index for search engines
Schema.org Markup
Each Methodology failure-mode block is published as DefinedTerm JSON-LD. This page embeds FAQPage and HowTo markup for the FAQ and Quick Start sections respectively, supporting structured AI extraction.
Related Pages
- Methodology — failure-mode definitions, simulator architecture, calibration, scope of claims
- Strategy Input Guide — parser syntax: supported indicators, operators, examples
- Demo Tutorial — three-step walkthrough with a worked example