From First Principles: From AI’s Underlying Logic to AI Trading
I. The Underlying Logic of Large Models
Before delving into AI trading, it’s essential to clarify the computational essence of large models.
Many people treat large language models (LLMs) as black boxes, assuming they “understand” language and can “think” through problems. In reality, when dissected, they operate on a set of vector operations.
Core Idea: Represent Everything with Vectors
Humans use words and grammar to convey meaning. Machines, however, only recognize numbers.
The first step for large models is to map discrete tokens (which can be words or subwords) to a continuous vector space. Each token corresponds to a high-dimensional vector, typically 4096 dimensions or more.
“Today” → [0.12, -0.45, 0.78, 0.23, …] (4096 numbers)
“Weather” → [0.34, -0.12, 0.56, 0.89, …] (4096 numbers)
This mapping is learned through an embedding table. During training, semantically similar words are mapped to nearby positions in the vector space. The vectors for “king” and “queen” are close, while those for “king” and “apple” are far apart.
These relationships aren’t manually defined; the model learns them independently from massive amounts of text.
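As a minimal sketch of this idea, here is the similarity computation in numpy, with made-up 3-dimensional vectors standing in for learned 4096-dimensional embeddings (the `embedding_table` values are purely illustrative):

```python
import numpy as np

# Toy stand-in for a learned embedding table; in a real model this is a
# (vocab_size, hidden_dim) matrix with hidden_dim in the thousands.
embedding_table = {
    "king":  np.array([0.8, 0.1, 0.6]),
    "queen": np.array([0.7, 0.2, 0.6]),
    "apple": np.array([-0.5, 0.9, -0.1]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: close to 1 means similar direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embedding_table["king"], embedding_table["queen"]))  # high
print(cosine_similarity(embedding_table["king"], embedding_table["apple"]))  # low
```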
Core Calculation: Vector Similarity
With vector representations, the next question is: How does the model understand relationships between words?
The Transformer’s answer is the Self-Attention mechanism.
For each position in a sequence, the model asks: Which other positions in the sequence should I pay attention to?
Specific calculations:
Q = X · W_q (Query: What I’m looking for)
K = X · W_k (Key: What I can provide)
V = X · W_v (Value: My actual content)
Attention(Q, K, V) = softmax(Q · Kᵀ / √d) · V
Q · Kᵀ is a dot product operation. The more similar two vectors are (i.e., they point in roughly the same direction), the larger the dot product. This calculates how well the Query at the current position matches the Keys of other positions.
Softmax normalizes these match scores into a probability distribution. This distribution is then used to compute a weighted sum of the Values.
The result: Each position aggregates information from relevant positions in the sequence, with relevance determined by vector similarity.
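The whole mechanism fits in a few lines. A minimal single-head sketch in numpy, with random weights and without masking or multi-head splitting:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X: np.ndarray, W_q: np.ndarray, W_k: np.ndarray, W_v: np.ndarray) -> np.ndarray:
    """Single-head self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project into queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # dot-product similarity, scaled
    weights = softmax(scores, axis=-1)       # one probability distribution per position
    return weights @ V                       # similarity-weighted sum of values

# Tiny example: 4 tokens, model width 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```

A real Transformer runs this with many heads in parallel and a causal mask, but the core operation is exactly this: dot products, softmax, weighted sum.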
Stacking and Non-Linearity
A single layer of attention has limited expressive power. Transformers stack multiple attention layers, followed by a feed-forward network (FFN) after each layer:
FFN(x) = activation(x · W₁) · W₂
The activation function is non-linear (e.g., ReLU, GELU, SwiGLU). Without non-linearity, multiple linear transformations would be equivalent to a single layer, negating the model’s depth.
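A sketch of that block, assuming a GELU activation and random weights (the 4x expansion factor is a common convention, not a requirement):

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU, a common non-linear activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """Position-wise feed-forward block: expand, apply non-linearity, project back."""
    return gelu(x @ W1) @ W2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 positions, model width 8
W1 = rng.normal(size=(8, 32))      # expansion (often ~4x the model width)
W2 = rng.normal(size=(32, 8))      # projection back to model width
print(ffn(x, W1, W2).shape)        # (4, 8)
```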
Models like DeepSeek and Qwen use a Mixture of Experts (MoE) architecture: Not all parameters participate in every calculation; instead, computations are dynamically routed to subsets of expert networks. This is an efficiency optimization that doesn’t change the core computational logic.
Output: Probability Distribution
After processing through N layers, the final layer’s vectors are multiplied by a vocabulary matrix to generate scores for each possible token. These scores are normalized into a probability distribution using softmax.
Sampling or selecting the highest probability yields the output token. This token is then added to the input sequence, and the process repeats—this is autoregressive generation.
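A sketch of that loop, assuming a placeholder `transformer` callable that returns the last position’s hidden vector and an `unembedding` matrix of shape (hidden_dim, vocab_size); both names are hypothetical, not any specific model’s API:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def generate(tokens, transformer, unembedding, steps=10, temperature=1.0, rng=None):
    """Autoregressive generation: score the vocabulary, sample, append, repeat."""
    rng = rng or np.random.default_rng()
    for _ in range(steps):
        h = transformer(tokens)                     # hidden vector of the last position
        logits = h @ unembedding                    # one score per vocabulary token
        probs = softmax(logits / temperature)       # normalize into a distribution
        next_token = int(rng.choice(len(probs), p=probs))  # sample (or take argmax)
        tokens = tokens + [next_token]              # extend the sequence and loop
    return tokens
```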
Differences Between Models
- DeepSeek: MoE architecture with Multi-head Latent Attention to compress the KV cache, reducing inference costs.
- Qwen: Dual product lines (dense + MoE) with SwiGLU activation.
- Claude: Architecture not publicly disclosed; presumed to be an optimized dense Transformer.
- Gemini: Natively multimodal, with shared attention across images, audio, and text.
- GPT-4: Rumored to use MoE, with 8 experts of roughly 220B parameters each.
While architectural details vary, the underlying logic is consistent:
- Map inputs to a vector space.
- Calculate vector similarity via dot products.
- Aggregate information using similarity weights.
- Stack multiple layers with non-linear transformations.
- Output a probability distribution.
There is no “understanding” or “thinking”—just geometric operations in a high-dimensional space.
Can this logic be applied to financial markets? Let’s explore.
II. Do Large Models “Predict” or “Recognize Patterns”?
Understanding the above computational process raises a key question:
What exactly are large models doing?
On the surface, they “predict the next word.” Given “Today’s weather is,” the probability of outputting “nice” is highest. This looks like prediction.
But breaking down the process, what they’re actually doing is:
- Encoding the current context into vectors.
- Retrieving patterns in the parameter space that best match these vectors.
- Outputting the probability distribution corresponding to these patterns.
They aren’t “predicting what will happen in the future”—they’re “identifying which pattern in the training data the current input most closely resembles.”
What’s the difference?
Prediction involves inferring unknown events. Recognition involves matching known patterns.
Large models work because natural language has strong statistical regularities. Given sufficient context, the entropy (uncertainty) of the next word is low. After “The People’s Republic of,” the next word is almost certainly “China.” Models can output this with high confidence.
Applying This to Financial Markets
A natural extension of this logic to financial markets is:
Encode market data into vectors, retrieve historically similar patterns, and output a probability distribution of subsequent trends.
This sounds feasible, but there’s a fundamental issue:
The statistical properties of a financial market’s “next move” are entirely different from natural language’s “next word.”
Language vs. Markets: Differences in Statistical Regularity
Natural language has extremely strong statistical regularities.
Given enough context, the next word is highly predictable. Models can output results with high confidence.
Short-term financial market trends approximate a random walk.
Given any technical indicators, fundamental data, or on-chain data, the probability that the next candlestick rises or falls is close to 50:50. Signals are drowned out by noise.
Pratas et al. (2023) tested LSTMs on BTC volatility prediction: Models produced smoother curves but failed to capture large spikes. They learned the weak pattern of “mean reversion” but were powerless against critical extreme events.
Non-Stationarity
Language’s statistical regularities are relatively stable. The meaning of “apple” hasn’t changed much in a century.
Market structures evolve continuously. Patterns effective in 2021 may fail in 2024. Regulatory environments, participant structures, and liquidity distributions shift. Patterns learned from historical data face an ever-changing distribution.
Adversarial Nature
Language generation has no counterparty. If you predict the next word is “eat,” no one will deliberately make it “fly.”
Financial markets are zero-sum games. Any identified profitable pattern will be arbitraged away as capital pours in. Markets actively counter all attempts to exploit patterns.
Conclusion
Large models excel at pattern recognition, not prediction.
In language, pattern recognition outputs seem like prediction because linguistic patterns are stable and strong enough.
In financial markets, using the same approach to “predict price movements” will fail. Short-term price direction patterns are too weak, unstable, and vulnerable to adversarial forces.
This doesn’t mean pattern recognition has no value in finance. The question is: What patterns should we recognize?
III. A Different Question: Regime Identification
Predicting price movements has too low a signal-to-noise ratio, but a related question has a much higher signal-to-noise ratio:
What state is the market in right now?
Markets are not homogeneous; they switch between regimes:
- Low-volatility consolidation: Narrow range, unclear direction
- High-volatility consolidation: Sharp swings without a trend
- Unilateral uptrend: Sustained gains with shallow pullbacks
- Unilateral downtrend: Sustained declines with weak rebounds
- Liquidity crisis: Sharp drops with large-scale liquidations
Regimes are persistent: trend phases can last days or weeks, and so can consolidation phases. Autocorrelation at these time scales is far stronger than in single-candlestick price movements.
Hamilton’s (1989) regime-switching model pioneered this field. Wang et al. (2020) used a hidden Markov model (HMM) to identify bull/bear states in U.S. stocks, effectively avoiding large drawdowns during the 2008 financial crisis and the 2020 COVID crash.
Their alpha didn’t come from predicting ups and downs but from reducing exposure during high-risk regimes.
IV. Technical Path: Market State Embedding
Adapting the LLM framework:
LLM: token → vector → similarity calculation → output distribution
Here: market state → vector → similarity calculation → regime classification
Encoder
Goal: Compress high-dimensional heterogeneous features into low-dimensional dense vectors. Constraint: Vectors for similar regimes should be close, while those for different regimes should be far apart.
A useful reference is TS2Vec (Yue et al., AAAI 2022), a state-of-the-art temporal representation learning method that performs strongly across 150+ UCR/UEA datasets.
Its core is hierarchical contrastive learning: contrastive losses applied at multiple time scales, learning both timestamp-level and instance-level representations.
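For illustration only, a rough stand-in for such an encoder in PyTorch; this is not the TS2Vec reference implementation, and the `MarketStateEncoder` name, layer sizes, and dilated-convolution stack are assumptions for the sketch:

```python
import torch
import torch.nn as nn

class MarketStateEncoder(nn.Module):
    """Illustrative encoder: maps a window of market features
    (batch, window_len, n_features) to one dense state vector (batch, embed_dim)."""

    def __init__(self, n_features: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 128, kernel_size=3, padding=1, dilation=1),
            nn.GELU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=2, dilation=2),  # dilation widens the receptive field
            nn.GELU(),
            nn.Conv1d(128, embed_dim, kernel_size=3, padding=4, dilation=4),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.net(x.transpose(1, 2))   # (batch, embed_dim, window_len)
        return h.mean(dim=-1)             # pool over time -> instance-level embedding

encoder = MarketStateEncoder(n_features=32)
window = torch.randn(8, 240, 32)          # 8 windows of 240 bars x 32 features
print(encoder(window).shape)              # torch.Size([8, 64])
```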
Training
Contrastive learning defines positive and negative samples:
- Positive samples: Two moments with similar subsequent trends
- Negative samples: Two moments with different subsequent trends
We can also reference SoftCLT (ICLR 2024), which uses continuous similarity instead of hard labels.
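A minimal sketch of the training objective, using a plain InfoNCE loss rather than TS2Vec’s hierarchical loss or SoftCLT’s soft assignments; constructing the positive pairs (moments whose subsequent trends are similar) is assumed to happen upstream:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE loss over a batch of embedding pairs.

    anchor, positive: (batch, embed_dim) embeddings of moment pairs with similar
    subsequent trends; for each anchor, the other rows in the batch act as negatives.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature                     # cosine similarities as logits
    targets = torch.arange(a.size(0), device=a.device) # the i-th positive belongs to the i-th anchor
    return F.cross_entropy(logits, targets)            # pull positives together, push negatives apart
```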
Output
- Clustering: Apply KMeans/GMM to historical embeddings to get K clusters, then manually interpret regime meanings.
- Retrieval: Retrieve the top-K historically similar moments to the current embedding and tally the regime distribution.
Retrieval offers stronger interpretability, as it outputs specific historical analogies.
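A sketch of the retrieval variant, assuming we already have embeddings for historical windows and a regime label per window (e.g., from the clustering step); the function and variable names are illustrative:

```python
import numpy as np
from collections import Counter

def retrieve_regime(current: np.ndarray, history: np.ndarray, labels: list[str], k: int = 50) -> Counter:
    """Nearest-neighbour regime lookup.

    current: (embed_dim,) embedding of the latest window
    history: (n_samples, embed_dim) embeddings of past windows
    labels:  regime label assigned to each historical window
    Returns the regime distribution among the top-k most similar historical moments.
    """
    hist = history / np.linalg.norm(history, axis=1, keepdims=True)
    cur = current / np.linalg.norm(current)
    sims = hist @ cur                     # cosine similarity to every past window
    top_k = np.argsort(-sims)[:k]         # indices of the k closest historical analogies
    return Counter(labels[i] for i in top_k)

# e.g. Counter({'low_vol_consolidation': 31, 'uptrend': 12, 'high_vol_consolidation': 7})
```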
V. Comparison with Traditional Methods
vs. HMM
HMM assumes observations follow a specific distribution (usually Gaussian) and requires predefining the number of states.
Neural networks make no distributional assumptions, handle high-dimensional inputs, and discover naturally occurring regime structures in data.
vs. Technical Indicators
Indicators like ADX, RSI, and Bollinger Bands each capture only one dimension. They struggle to model multi-factor interactions, and thresholds are manually set.
End-to-end learning automatically discovers feature combinations, with data-driven thresholds.
VI. Application Scenarios
The value of regime identification lies in strategy selection and risk control.
Strategy Matching
- Low-volatility consolidation → Grid strategies
- Trending markets → Trend-following strategies
- High-volatility consolidation → Reduce position size
- Liquidity crisis → Flat positions
Risk Management
Wang et al.’s research shows that excess returns from regime-switching strategies mainly come from reducing exposure during adverse regimes.
Rule: Upon identifying a high-risk regime, cut positions by half or liquidate. The goal isn’t to capture every move but to avoid systemic risks.
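A minimal sketch that ties the strategy matching table and this risk rule together; the regime names, position factors, and `PLAYBOOK` mapping are hypothetical, not NoFx’s actual configuration:

```python
# Hypothetical regime -> playbook mapping; names and position factors are illustrative only.
PLAYBOOK = {
    "low_vol_consolidation":  {"strategy": "grid",            "position_factor": 1.0},
    "trend":                  {"strategy": "trend_following", "position_factor": 1.0},
    "high_vol_consolidation": {"strategy": "trend_following", "position_factor": 0.5},
    "liquidity_crisis":       {"strategy": "flat",            "position_factor": 0.0},
}

def route(regime: str, base_position: float) -> tuple[str, float]:
    """Pick a strategy and scale exposure based on the identified regime."""
    entry = PLAYBOOK.get(regime, {"strategy": "flat", "position_factor": 0.0})  # unknown regime -> stand aside
    return entry["strategy"], base_position * entry["position_factor"]

print(route("high_vol_consolidation", base_position=10_000))  # ('trend_following', 5000.0)
```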
VII. NoFx: AI Trading Infrastructure
The above covers methodology. Implementation requires infrastructure.
NoFx isn’t a product that “lets LLMs predict price movements.” It’s positioned as the infrastructure layer for AI trading.
Data Layer
Cryptocurrency market data is highly fragmented. CEX APIs have varying formats, on-chain data requires custom parsing, and derivatives data is scattered across sources.
NoFx’s first task: Normalize heterogeneous data and unify access interfaces.
Price data:
- Multi-timeframe candlestick OHLCV: 1m / 3m / 5m / 15m / 30m / 1h / 2h / 4h / 6h / 8h / 12h / 1d / 3d / 1w / 1M
- Tick-level trades
- Volume-Weighted Average Price (VWAP)
- Price change percentages: 1m / 5m / 15m / 30m / 1h / 4h / 24h
Volume data:
- Raw volume and its moving averages (MA)
- Cumulative Volume Delta (CVD): cumulative aggressive buy volume – cumulative aggressive sell volume
- CVD multi-timeframe: 5m / 15m / 1h / 4h / 24h
- Taker buy/sell volume
- Volume anomaly detection (multiples relative to MA)
- Volume-price divergence indicators
Open Interest (OI) data:
- OI absolute value
- OI changes: 1h / 4h / 24h
- OI change rates: 1h / 4h / 24h
- OI-weighted price
- Long/short position ratio
- Large trader position percentage
- Leverage distribution statistics
Funding rate data:
- Current funding rate
- Predicted funding rate
- Historical funding rate sequences
- Cumulative funding: 24h / 7d / 30d
Liquidation data:
- Long liquidation volume (USD)
- Short liquidation volume (USD)
- Long/short liquidation ratio
- Large liquidation events (single event > $100K)
- Liquidation heatmaps (price range distribution)
- Cumulative liquidations: 1h / 4h / 24h
Capital flow data:
- Institutional net inflow (futures)
- Institutional net inflow (spot)
- Retail net inflow (futures)
- Retail net inflow (spot)
- Large buy/sell orders (configurable thresholds)
- Exchange net inflow/outflow
- Whale address anomalies
Order book data:
- Bid/ask prices and volumes at the top of the book
- Spread
- Depth snapshots: ±0.1% / ±0.5% / ±1% / ±2%
- Bid/ask imbalance
- Large order detection
- Order book slope
Technical indicators:
- EMA: 7 / 13 / 21 / 55 / 100 / 200
- SMA: 20 / 50 / 100 / 200
- MACD: standard + custom parameters
- RSI: 6 / 14 / 21
- Bollinger Bands: 20-period, 2x standard deviation
- ATR: 14-period
- ADX / DMI
- Stochastic RSI
- OBV (On-Balance Volume)
- Ichimoku Cloud
Volatility data:
- Realized volatility: 1h / 4h / 24h / 7d
- ATR percentage
- Bollinger Band width
- Price range (High – Low)
The data update service and API are now live:
Example call:
GET /api/quant-data?symbol=BTCUSDT
Response:
{
  "netflow": {
    "institution": {"future": 1200000, "spot": -500000},
    "personal": {"future": -800000, "spot": 200000}
  },
  "oi": {
    "current": 450000000,
    "delta": {"1h": 1.2, "4h": 3.5, "24h": -2.1}
  },
  "price_change": {"1h": 0.8, "4h": 2.1, "24h": -1.5},
  "cvd": {"5m": 150000, "1h": 890000, "4h": -2100000},
  "funding_rate": 0.0001,
  "liquidation": {"long": 1500000, "short": 800000}
}
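For illustration, a minimal client-side call against the endpoint above, assuming the service is reachable at your own deployment’s base URL (the `BASE_URL` value is a placeholder):

```python
import requests

BASE_URL = "https://your-nofx-deployment.example"  # placeholder; point at your own deployment

resp = requests.get(f"{BASE_URL}/api/quant-data", params={"symbol": "BTCUSDT"}, timeout=10)
resp.raise_for_status()
data = resp.json()

# Read a few of the fields shown in the response above
oi_change_4h = data["oi"]["delta"]["4h"]
institutional_futures_flow = data["netflow"]["institution"]["future"]
print(oi_change_4h, institutional_futures_flow)
```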
Execution Layer
Exchange API differences aren’t limited to data. Parameters for limit orders, market orders, stop-loss orders, position precision, and leverage configurations vary across platforms.
NoFx abstracts a unified execution interface, currently supporting any exchange market. Strategy layers don’t need to worry about the underlying exchange.
Decision Layer
Above the data and execution layers, we provide an AI decision framework:
Market data → Feature engineering → AI inference → Risk control filtering → Execution
Supported inference engines: DeepSeek, Claude, GPT, Gemini, Qwen.
AI here isn’t predicting price movements but making structured decisions:
- Multi-dimensional market state analysis
- Candidate asset screening and ranking
- Position management and risk assessment
- Entry/exit condition judgment
Outputs are structured JSON: decisions, confidence levels, and chains of thought. Complete context for each decision is fully recorded.
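As an illustration only, a hypothetical decision record with this shape; the field names are assumptions, not NoFx’s exact schema:

```python
# Hypothetical decision record (field names are illustrative, not NoFx's exact schema)
decision = {
    "symbol": "BTCUSDT",
    "regime": "high_vol_consolidation",
    "action": "reduce_position",
    "target_position_pct": 0.5,
    "confidence": 0.62,
    "stop_loss_pct": 0.03,
    "reasoning": "OI rising while CVD diverges from price; funding elevated; "
                 "treating the move as exhaustion rather than continuation.",
    "inputs_snapshot_id": "2024-06-01T12:00:00Z",  # link back to the fully recorded context
}
```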
Regime Integration
Regime identification functions as a module in the decision framework:
Market data → Regime identification → Strategy routing → AI decision → Risk control → Execution
AI receives the current regime judgment as context when making specific decisions. Decision aggressiveness, position limits, and stop-loss widths adjust automatically across regimes.
Why Build This?
The bottleneck in AI trading isn’t models—it’s engineering.
A functional system requires: stable data flows, low-latency execution, robust risk control, traceable logs, and flexible strategy configuration. Without infrastructure, even the strongest models are useless.
Most AI trading products on the market are black boxes. Users don’t know what the AI is “thinking,” why it opens a position, or the risk control logic. When issues arise, debugging is impossible.
NoFx’s design principles: Transparency and controllability.
- Complete chain-of-thought logs
- Configurable risk control (stop-loss, position limits, leverage restrictions)
- Open-source, deployable by users
- Web UI for parameter tuning, no coding required
AI Trading Layer
Long-term goal: A standardized layer for AI trading.
Three pillars:
First, engineering heritage from traditional quant. Order management, risk engines, backtesting frameworks, execution algorithms—decades of quant trading expertise. Without these, AI is a castle in the air.
Second, cutting-edge AI inference capabilities. LLMs excel at structured analysis, multi-factor judgment, and natural language interaction—tasks beyond traditional rule engines. But AI needs proper constraints; it shouldn’t be left to “predict markets” freely.
Third, rigorous mathematical frameworks. Regime identification, vector similarity, contrastive learning—verifiable, explainable methods. Rejecting the mysticism of “buy because the AI said so.”
Democratization
Technology is just a tool. NoFx’s ultimate goal: Democratizing AI + quant trading.
Current reality: Quant trading has extremely high barriers. You need coding skills, financial knowledge, data engineering expertise, risk management understanding, and access to institutional-grade data and execution channels. Retail traders are excluded.
NoFx offers a visual AI trading orchestration system:
- No coding required. Strategy logic, risk rules, and AI parameters are configured via a web UI.
- No quant expertise needed. Pre-built strategy templates cover common scenarios; just adjust parameters.
- No need to build infrastructure. Data, execution, risk control, and logs are provided by the platform.
- Full transparency. Every AI decision’s inputs, reasoning, and outputs are auditable.
Someone with no quant experience should be able to configure their own AI strategy in 5 minutes and understand how it works.
This isn’t reducing professionalism. It’s packaging expertise into tools usable by everyone.
Excel let everyone do data analysis without learning SQL. Figma let everyone design without learning Photoshop.
NoFx lets everyone orchestrate AI trading strategies without becoming a quant engineer.
Open-source is inevitable. Infrastructure must be trustworthy and auditable.
VIII. Limitations
- Overfitting: Models may merely memorize historical patterns with questionable generalization. Out-of-sample validation and rolling backtests are essential.
- Regime drift: As market structures evolve, historical regime features may become obsolete. Continuous monitoring and regular retraining are needed.
- Recognition delay: Regime switches are inevitably identified with lag. A trade-off exists between sensitivity and false-positive rates.
This isn’t a predictive holy grail. Its value lies in structured market state descriptions to aid strategy selection and risk control.
IX. About Me
This work starts from first principles.
Instead of chasing the hype around “AI trading,” I first asked: What is the computational essence of AI? Can this essence be applied to financial scenarios? If so, what problems should it solve?
The answer: Vector similarity calculations can be transferred, but the goal shouldn’t be predicting price movements—it should be regime identification.
Another starting point is user-centricity.
No matter how advanced the technology, its impact is limited if only professional quant teams can use it. I want to build something that even traders with no coding skills can use.
Professionalism and usability aren’t contradictory. Professionalism lies in the underlying architecture and methodology; usability lies in product interaction. Making complex things simple is far harder than making simple things complex.
User feedback validates this direction:
- 9,500+ GitHub stars in two months
- 1,800+ net new KYC first-time traders brought to Binance in one and a half months
- Over 90,000 active users
- Sustained growth in trading volume
These numbers show real market demand: Retail traders want professional-grade AI trading capabilities without spending months learning quant programming.
NoFx is packaging institutional-grade data, execution, and risk control into products usable by everyone. Technology should serve the masses, not just a few.
Open-source is a user-centric choice. Users need to see what the code does, deploy it themselves, and modify it to their needs. Black-box products can’t build trust in finance.
Once the regime identification module is validated, it will also be open-sourced and integrated.
NoFx official website:
GitHub:
References
Temporal Representation Learning:
[1] Yue, Z., et al. (2022). TS2Vec: Towards Universal Representation of Time Series. AAAI 2022.
[2] Soft Contrastive Learning for Time Series. ICLR 2024.
[3] Niroshan, G., et al. (2025). TS2Vec-Ensemble. arXiv.
Regime Detection:
[4] Wang, M., Lin, Y.H., & Mikhelson, I. (2020). Regime-Switching Factor Investing with Hidden Markov Models. Journal of Risk and Financial Management.
[5] Hamilton, J.D. (1989). A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle. Econometrica.
[6] Yuan, Y., & Mitra, G. (2019). Market Regime Identification Using Hidden Markov Models. SSRN.
Cryptocurrency Prediction:
[7] Pratas, T.E., et al. (2023). Forecasting Bitcoin Volatility: Exploring the Potential of Deep Learning. Eurasian Economic Review.
[8] Omole, O., & Enke, D. (2024). Deep Learning for Bitcoin Price Direction Prediction. Financial Innovation.
[9] Huang, Z.C., et al. (2024). Forecasting Bitcoin Volatility Using Machine Learning Techniques. Journal of International Financial Markets.

