Time-Travel Contamination: The Silent Killer of Backtest Results
Introduction: The Uncomfortable Truth About Backtests
Imagine you're evaluating a new AI trading system. The backtest shows 50% annualized returns over 10 years. The Sharpe ratio is pristine at 3.2. Drawdowns are tiny—never exceeding 5%. You're convinced. You allocate $10 million.
Three months later, the strategy loses 12%. Six months in, it's down 28%. By year-end, you've lost $2.8 million.
What happened?
The backtest was lying to you. Not intentionally, but systematically. Your AI saw tomorrow's market data while making today's decisions.
This is time-travel contamination, and according to 2025 research, it affects up to 90% of published "profitable" trading backtests. It's the silent killer that separates successful traders from spectacularly failed ones.
This report examines what time-travel contamination is, why it's so pervasive, how to detect it, and most importantly, how to build systems that avoid it.
Part 1: Understanding the Problem
What Is Time-Travel Contamination?
Time-travel contamination (also called "data leakage" or "look-ahead bias") occurs when a model or strategy uses information in the backtest that wouldn't have been available at the time of the actual trade decision.
Simple example:
On January 15, 2023, you decide to buy Apple stock.
But your model "looks at" the stock price on January 16, 2023 (the next day).
It sees the price went up, so it retroactively places the buy order.
In reality, you can't use tomorrow's data to make today's decision.
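To make this concrete, here is a minimal sketch (Python with pandas, made-up prices) of how direct look-ahead bias typically sneaks into code: the "signal" is built with a negative shift, so it quietly conditions on tomorrow's close.

```python
import pandas as pd

# Hypothetical daily closing prices (illustrative only).
prices = pd.Series(
    [150.0, 152.0, 149.5, 153.0, 155.0],
    index=pd.date_range("2023-01-13", periods=5, freq="B"),
)

# Tomorrow's return, seen from today. Usable for scoring a decision, never for making one.
next_day_return = prices.shift(-1) / prices - 1

# CONTAMINATED: the signal peeks at tomorrow's close via shift(-1).
contaminated_signal = (next_day_return > 0).astype(int)

# CLEAN: the signal only uses information available up to today's close.
past_return = prices / prices.shift(1) - 1
clean_signal = (past_return > 0).astype(int)

def strategy_return(signal: pd.Series) -> float:
    """Average next-day return earned by acting on the signal at today's close."""
    return (signal * next_day_return).dropna().mean()

print("contaminated:", strategy_return(contaminated_signal))  # buys exactly the up days
print("clean:       ", strategy_return(clean_signal))
```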
More subtle example (the real problem):
An LLM-based trading system is trained on financial data up to December 2024. In January 2025, you backtest it on the "out-of-sample" period of January 2023–December 2024.
Here's the problem: the LLM was trained on data that includes January 2023–December 2024. When you ask it "should I buy TSLA on January 15, 2023?", the LLM's training includes the answer: it knows what TSLA actually did in 2023. This isn't reasoning; it's memorization.
Why It's So Pervasive
Time-travel contamination is endemic in AI/ML research because:
1) Convenient Data Slicing
Researchers split data chronologically: train on 2010-2020, test on 2021-2024. But if the test period overlaps with the model's training corpus, contamination exists.
2) Unknown Training Cutoffs
For proprietary models (GPT-4, Claude, Gemini), researchers don't know exactly where the training-data cutoff falls. A model whose training data runs through March 2024 may already have seen the entire testing period.
3) LLM Memorization
Large language models memorize massive amounts of training data. An LLM trained on internet data saw financial news, stock prices, and predictions for the period you're testing. When asked "predict TSLA in January 2023," it can recall that TSLA rose 65% that year from its training data.
4) Multiple Testing Bias
Researchers test 100 strategy variations on the same historical data. By random chance, some will look "profitable." If they then test these profitable variations on "fresh" data that's actually old, contamination multiplies.
5) Subtle API Leakage
A researcher uses an API to fetch "historical" news. But the API includes analyst predictions made in 2022 about 2023 performance—essentially embedding the future into the historical data.
The Magnitude of the Problem
Research in 2025-2026 reveals the scope:
Survey finding (2025):
Up to 90% of published "profitable" backtests in trading research suffer from overfitting, look-ahead bias, or data leakage.
StockBench finding (October 2025):
When researchers applied strict temporal filters to eliminate data contamination, LLM trading agent performance dropped dramatically:
- GPT-4 based agent: Reported 45% return with contaminated data → 3.2% with clean data
- Some agents: Claims of 50%+ returns with contamination → barely positive returns without it
LLM Contamination Study (December 2025):
Researchers developed statistical tests to detect lookahead bias in LLM forecasts. Finding: Many state-of-the-art models unknowingly benefited from training data that overlapped with "test" periods. For models predicting stock returns, contamination likelihood was extremely high.
The implication: if you're reading a research paper claiming >20% annual backtested returns, there's statistically an 80-90% chance the results are contaminated and real performance would be substantially lower.
Part 2: The Five Types of Time-Travel Contamination
Type 1: Direct Look-Ahead Bias
Definition: The model directly uses future information to make current decisions.
Example: Using tomorrow's price to make today's buy decision.
Prevalence: Surprisingly common in amateur code. Less common in professional implementations, but still happens due to careless data indexing.
Detection: Walk-forward testing. If you train on 2020-2022 and test on 2023-2024, and results are too good, suspect look-ahead bias.
Type 2: Training Data Overlap (LLM Problem)
Definition: The model's training data includes the "test" period, so it's not actually testing—it's memorizing.
Prevalence: Extremely common with LLM-based systems.
Real impact: A study of GPT-4 forecasting stock returns showed that when researchers limited the test period to dates after the model's training cutoff, performance dropped by 85-90%. When they tested on dates the model had been trained on, performance was superhuman.
Type 3: Survivorship Bias
Definition: Testing only on assets that survive the period, ignoring ones that failed or were delisted.
Prevalence: Very common in retail and even professional implementations.
Real cost: A study by Dimensional Fund Advisors found that including delisted securities reduced backtested returns by 1.5-3% annually. Over a 30-year period, the compounding effect is massive.
Type 4: P-Hacking and Data Snooping
Definition: Testing so many strategy variations that some will inevitably be "profitable" by pure chance.
Prevalence: Extremely common. With enough parameter combinations, researchers will find profitable strategies that don't generalize.
The math: If you test 100 strategies on 5 years of data, and there's a 5% chance each will appear profitable by chance, you'll find ~5 "profitable" strategies just from randomness.
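A quick Monte Carlo sketch (illustrative, not drawn from any cited study) makes that arithmetic tangible: generate pure-noise strategies with zero true edge and count how many clear a 5%-level "profitability" test by luck alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_strategies, n_days = 100, 5 * 252          # 100 variations over ~5 years of daily data

# Every "strategy" is pure noise: zero true edge, 1% daily volatility.
daily_returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))

# t-statistic of the mean daily return for each strategy.
t_stats = daily_returns.mean(axis=1) / (daily_returns.std(axis=1) / np.sqrt(n_days))

# One-sided 5% threshold: roughly 5 of the 100 zero-edge strategies will pass by chance.
lucky = int((t_stats > 1.645).sum())
print(f"{lucky} of {n_strategies} pure-noise strategies look 'profitable'")
```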
Type 5: Information Leakage Through Proxies
Definition: The model uses information that's correlated with future outcomes but shouldn't be available.
Prevalence: Common with financial data APIs and news feeds.
Example: A model uses "analyst upgrades" from a financial API. But that API includes analyst upgrades dated in 2022 that were about 2023 expectations. The model "predicts" 2023 returns using analyst predictions about 2023, essentially looking at the future.
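A defensive habit that catches this class of leak is to filter every auxiliary record by its publication timestamp before the model is allowed to see it. A minimal sketch, with hypothetical field names:

```python
from datetime import datetime

def point_in_time_view(records, decision_time):
    """Keep only records actually published on or before the decision time.

    Each record is assumed to carry a 'published_at' timestamp; anything
    stamped later is future information and must be dropped.
    """
    return [r for r in records if r["published_at"] <= decision_time]

analyst_notes = [
    {"published_at": datetime(2022, 11, 3), "text": "Q4 guidance raised"},
    {"published_at": datetime(2023, 2, 10), "text": "FY2023 estimates revised up"},  # future!
]

visible = point_in_time_view(analyst_notes, decision_time=datetime(2023, 1, 15))
print(len(visible))  # 1 -- the February note is not usable on January 15, 2023
```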
Part 3: How to Detect Time-Travel Contamination
Red Flag #1: Backtests That Are Too Good
If your backtest shows:
- Annual returns > 40% with Sharpe > 2.0 in normal markets
- Sharpe ratio > 3.0 (institutional quality traders rarely exceed 2.0)
- Drawdowns that never exceed 5% (real markets have bigger dislocations)
- Consistent profitability across all market regimes
Your backtest is likely contaminated. Real trading has volatility, luck, and drawdowns. If the backtest is too smooth and perfect, contamination is the most likely explanation.
Red Flag #2: Performance Drops in Live Trading
The clearest sign: the backtest shows 30% returns, but live trading shows 5% returns (or losses). This almost always means (1) overfitting, (2) look-ahead bias, or (3) survivorship bias. If the gap is more than 10x, contamination is virtually certain.
Red Flag #3: Model Predicts Information It Shouldn't Know
For LLM-based systems, run this test:
Without internet/training access: Ask the model: "Based only on January 1, 2023 data, predict the S&P 500 return for Q1 2023."
Then check: does the model's prediction match what actually happened in Q1 2023? If it matches with high accuracy, the model has access to 2023 data through its training (contamination). If its calls are no better than chance (roughly 50/50 on direction), it's making genuine predictions.
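One way to score that check quantitatively (a sketch with made-up numbers, not a prescribed protocol) is to compute the directional hit rate of the model's calls against realized returns; rates far above 50% on dates inside the training corpus point to memorization rather than forecasting.

```python
import numpy as np

def directional_hit_rate(predicted_up, realized_returns):
    """Fraction of calls where the predicted direction matched the realized sign."""
    predicted_up = np.asarray(predicted_up, dtype=bool)
    realized_up = np.asarray(realized_returns) > 0
    return float((predicted_up == realized_up).mean())

# Hypothetical model calls and realized returns for dates inside the training window.
calls = [True, True, False, True, True, False, True, True]
realized = [0.012, 0.004, -0.007, 0.020, 0.015, -0.010, 0.006, 0.030]

print(f"hit rate: {directional_hit_rate(calls, realized):.0%}")  # far above 50% => suspect memorization
```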
Red Flag #4: Training Cutoff Overlap
Question to ask: When was my model trained, and when is my backtest period? If the model's training data extends into or beyond the backtest period, contamination is certain.
Detection Tool: StockBench Framework
In 2025, researchers created StockBench, a framework specifically designed to detect and eliminate contamination:
- Strict temporal splits: Training data is completely separate from test data with no overlap
- Recent windows: Uses fresh data from 2025 to ensure models haven't seen it
- Contamination auditing: Tests whether model outputs correlate with memorized vs. reasoned knowledge
- Regime sensitivity: Tests performance in bull markets, bear markets, and volatile periods separately
StockBench is open-source and represents the gold standard for contamination-aware evaluation.
Part 4: How to Prevent Time-Travel Contamination
Best Practice #1: Strict Chronological Separation
Train/Test Split by Date:
Training Set: January 1, 2015 - December 31, 2022
Validation Set: January 1, 2023 - December 31, 2023
Test Set: January 1, 2024 - Present (or future dates)
Key rule: No data from validation/test set touches the training process.
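A sketch of that split in code, assuming a daily DataFrame indexed by date (the cutoff dates mirror the example above):

```python
import pandas as pd

def chronological_split(df, train_end="2022-12-31", val_end="2023-12-31"):
    """Split a date-indexed DataFrame into train/validation/test by hard cutoffs."""
    train = df.loc[:train_end]
    validation = df.loc[pd.Timestamp(train_end) + pd.Timedelta(days=1): val_end]
    test = df.loc[pd.Timestamp(val_end) + pd.Timedelta(days=1):]
    # Hard guarantee: the three windows are disjoint and strictly ordered in time.
    assert train.index.max() < validation.index.min() < test.index.min()
    return train, validation, test

# Hypothetical daily data covering 2015 through mid-2024.
dates = pd.date_range("2015-01-01", "2024-06-30", freq="B")
df = pd.DataFrame({"close": range(len(dates))}, index=dates)
train, validation, test = chronological_split(df)
```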
Best Practice #2: Walk-Forward Testing
Instead of training once and testing once, continuously retrain: Day 1: Train on 2015-2022, test on Jan 1, 2023. Day 2: Train on 2015-2022 plus Jan 1, 2023, test on Jan 2, 2023... and so on, rolling forward. This simulates real trading, where you continuously retrain as new data arrives. If your strategy survives walk-forward testing, it's less likely to be contaminated.
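A minimal expanding-window walk-forward loop, assuming user-supplied `fit` and `predict` callables (the toy "model" here is just the trailing mean return):

```python
import numpy as np
import pandas as pd

def walk_forward(df, first_test_date, fit, predict):
    """Refit on all data strictly before each test date, then predict that date."""
    predictions = {}
    for test_date in df.loc[first_test_date:].index:
        history = df.loc[:test_date].iloc[:-1]   # everything strictly before test_date
        model = fit(history)
        predictions[test_date] = predict(model, test_date)
    return pd.Series(predictions)

# Toy example on random daily returns.
rng = np.random.default_rng(1)
dates = pd.date_range("2022-01-03", periods=300, freq="B")
returns = pd.DataFrame({"ret": rng.normal(0.0, 0.01, size=300)}, index=dates)

preds = walk_forward(
    returns,
    first_test_date="2023-01-02",
    fit=lambda hist: hist["ret"].mean(),                 # "training" = trailing mean
    predict=lambda model, date: 1 if model > 0 else -1,  # long if trailing mean is positive
)
print(preds.head())
```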
Best Practice #3: For LLM-Based Systems, Verify Training Cutoffs
Document: Exact LLM model used (e.g., "GPT-4 with training cutoff April 2024"), Backtest date range (e.g., "January 2023 - December 2023"). If test period overlaps with training cutoff, contamination is present.
Solution: Use only test data after the model's training cutoff, or use smaller open-source models you can retrain on clean data.
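A small guard that encodes this rule is worth keeping in any evaluation harness (a sketch with hypothetical dates):

```python
from datetime import date

def assert_clean_backtest_window(training_cutoff: date, backtest_start: date) -> None:
    """Refuse to evaluate on any window that begins on or before the model's training cutoff."""
    if backtest_start <= training_cutoff:
        raise ValueError(
            f"Backtest starts {backtest_start}, but training data runs through "
            f"{training_cutoff}: the test period is contaminated."
        )

assert_clean_backtest_window(date(2024, 4, 30), date(2024, 5, 1))    # OK: test starts after cutoff
# assert_clean_backtest_window(date(2024, 4, 30), date(2023, 1, 1)) # raises: contaminated window
```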
Best Practice #4: Use Realistic Data
Include: Delisted securities (don't ignore failures), Actual historical constituents of indices (not current ones), Real transaction costs, slippage, market impact, Fee structures matching your deployment environment.
Resist the temptation to: Use only "good" assets (survivorship bias), Assume zero transaction costs, Test under perfect market conditions. With realistic costs, many "profitable" backtests become barely profitable or unprofitable.
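As a rough illustration of how much frictions matter, here is a sketch that deducts per-trade costs from a gross return (the basis-point figures are placeholders; calibrate them to your venue and order sizes):

```python
def net_return(gross_return, turnover, commission_bps=1.0, slippage_bps=5.0, spread_bps=2.0):
    """Deduct trading frictions, quoted in basis points of traded notional, from a gross return.

    `turnover` is the fraction of the portfolio traded during the period.
    """
    cost = turnover * (commission_bps + slippage_bps + spread_bps) / 10_000
    return gross_return - cost

# A strategy earning 0.10% per day gross, turning over the whole book daily:
print(f"{net_return(0.0010, turnover=1.0):.4%}")   # 0.0200% -- frictions claw back 8 of the 10 bps
```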
Best Practice #5: Implement Statistical Tests for Overfitting
Key metrics:
- Sharpe Ratio Decline: If in-sample Sharpe is 3.0 but out-of-sample drops to 0.5, you have overfitting.
- Parameter Sensitivity: If slightly adjusting parameters (47-day to 48-day moving average) causes returns to drop 50%, you're overfitting.
- Regime Performance: Test separately on bull, bear, and sideways markets. If the strategy works only in one regime, it's likely curve-fit to that regime's noise.
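A tiny helper for the first two checks, using illustrative numbers that mirror the examples above:

```python
def sharpe_decay(in_sample_sharpe, out_of_sample_sharpe):
    """Fraction of the in-sample Sharpe that survives out of sample; values near 0 signal overfitting."""
    return out_of_sample_sharpe / in_sample_sharpe

def parameter_sensitivity(returns_by_param):
    """Relative spread of returns across neighbouring parameter choices; a large spread suggests curve-fitting."""
    values = list(returns_by_param.values())
    return (max(values) - min(values)) / max(values)

print(round(sharpe_decay(3.0, 0.5), 2))              # 0.17 -> 83% of the edge evaporated
print(parameter_sensitivity({47: 0.30, 48: 0.15}))   # 0.5  -> returns halve for a one-day tweak
```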
Best Practice #6: Independent Validation
Have someone else: Implement your strategy from scratch (not from your code), Test on completely different time periods, Verify results match yours. If they can't reproduce your backtest, contamination or implementation bias is likely.
Part 5: Case Studies in Contamination
Case Study 1: The "Earnings Surprise" Algorithm
The claim: A machine learning model predicts stock returns based on earnings surprises with 65% directional accuracy.
The contamination: The model was trained on financial databases that include analyst consensus estimates made after the earnings surprise. So when predicting "will stock go up after earnings surprise?", the model was using data from the future (analysts' updated estimates post-earnings).
The reality: When tested on earnings with no analyst data leakage, accuracy dropped to 52% (barely better than random).
Cost: A fund deployed this system, lost money for 6 months before pulling it.
Case Study 2: The GPT-4 Stock Predictor
The claim: GPT-4 can predict stock returns better than 95% of fund managers.
The result: 68% accuracy—exceptional.
The contamination: The researchers then tested whether GPT-4 could predict returns for dates after its training cutoff (dates it definitely wasn't trained on). Accuracy: 49.5% (random).
The finding: GPT-4 wasn't predicting; it was remembering. Its training data included 2023 financial news and outcomes.
The lesson: For LLMs, always test on dates after the known training cutoff.
Case Study 3: The Survivorship Disaster
The backtest: A value-investing algorithm showed 18% annual returns backtested on the S&P 500.
The deployment: Fund allocated $50M. Real returns were 6%.
The contamination: Backtest used the current S&P 500 constituents. But many companies in the current index didn't exist in the 1995-2015 backtest period. Historical "failures" and delisted companies weren't included.
When the fund tested on the actual historical constituents (including failures), returns dropped to 8%. Delisting and failures matter.
Cost: $2M in unrealized losses vs. realistic expectations.
Part 6: The Tokalpha Labs Approach
At Tokalpha Labs, we've made contamination-free evaluation a core design principle:
Principle 1: Strict Temporal Isolation
Every model we build operates on the assumption: no future data exists in the training set. Training data has a hard cutoff date. Test data comes entirely after that date. No cross-contamination through APIs or proxies.
Principle 2: Continuous Retraining Simulation
We don't backtest once; we simulate continuous retraining: Day 1: Train on data through Day 1, predict Day 2. Day 2: Retrain on data through Day 2, predict Day 3... continuing through the entire period. This simulates real deployment where you continuously improve the model. Results are more realistic than single train-once-test-once approaches.
Principle 3: Regime-Aware Testing
We test performance separately: Bull markets: Do systems maintain alpha in rallies? Bear markets: Do systems preserve capital in downturns? Volatile periods: Can systems handle sudden dislocations? If a system works great in bull markets but fails in bear markets, we flag it. Most backtests hide this regime dependence.
Principle 4: Realistic Cost Simulation
Our backtests include: Actual transaction costs for the institution size, Slippage based on real order book data, Market impact (large orders move prices), Bid-ask spreads matching the asset class. These costs often reduce backtest returns by 5-15%. Better to know this upfront than discover it in live trading.
Conclusion: The Uncomfortable Truth
Time-travel contamination is the reason up to 90% of published trading backtests are lies.
Not intentional lies—systematic ones. Researchers genuinely believe in their results. But the systems are designed (usually unknowingly) to see the future, and of course a system that sees the future makes money.
The good news: once you understand contamination, you can build systems that avoid it.
The key principles:
- Strict temporal separation (no overlap between training and test periods)
- Walk-forward testing (continuous retraining, not one-time training)
- Realistic costs (include slippage, fees, market impact)
- Regime testing (verify performance across market conditions)
- Independent validation (have others reproduce your results)
If a backtest claims > 20% annual returns, ask these questions:
- Is the test period after the model's training cutoff?
- Does performance hold in bear markets?
- Are results reproduced independently?
- Are realistic costs included?
- Have you tested on delisted/failed securities?
If you can't answer "yes" to all five, the results are probably contaminated.
About Tokalpha Labs
Tokalpha Labs is building a seven-stage infrastructure for autonomous trading systems, designed specifically to eliminate data contamination at every stage.
From Stage 1 (Feature Engineering) through Stage 7 (Governance), our systems maintain strict temporal isolation, continuous retraining, and rigorous contamination detection.
We're not interested in publishing papers with inflated backtest results. We're interested in building systems that actually work in live markets.
Learn more: Collaborate with us
References
- ScienceDirect. (2024). "Backtest Overfitting in the Machine Learning Era: A Comparison of Out-of-Sample Testing Approaches."
- AQR Capital Management. (2024). "Can Machines Time Markets? The Virtue of Complexity in Return Prediction."
- Emergent Mind. (2025). "StockBench: LLM Agents in Real Markets—A Contamination-Free Benchmark."
- arXiv. (2025). "Can LLM Agents Trade Stocks Profitably In Real-world Markets?"
- Pick My Trade. (2026). "Backtest Bias Prevention: Essential Guide for Algo Traders in 2026."
- arXiv. (2025). "A Test of Lookahead Bias in LLM Forecasts."
- MarketCalls. (2025). "Understanding Look-Ahead Bias and How to Avoid It in Trading Strategies."
- arXiv. (2023). "Time Travel in LLMs: Tracing Data Contamination in Large Language Models."
- Journal of Knowledge Learning and Science Technology. (2023). "Algorithmic Trading Strategies: Real-Time Data Analytics with Machine Learning."
- Reddit/Algo Trading. (2026). "Is Overfitting the #1 Reason Most Backtested Strategies Fail Live?"
- LinkedIn/Srikanth Manchimchetty. (2025). "StockBench: A New Benchmark for AI in Stock Trading."
- arXiv. (2024). "Consistent Time Travel for Realistic Interactions with Historical Data."