From First Principles: From AI’s Underlying Logic to AI Trading
I. The Underlying Logic of Large Models
Before delving into AI trading, it’s essential to clarify the computational essence of large models.
Many people treat large language models (LLMs) as black boxes, assuming they “understand” language and can “think” through problems. In reality, when dissected, they operate on a set of vector operations.
Core Idea: Represent Everything with Vectors
Humans use words and grammar to convey meaning. Machines, however, only recognize numbers.
The first step for large models is to map discrete tokens (which can be words or subwords) to a continuous vector space. Each token corresponds to a high-dimensional vector, typically 4096 dimensions or more.
“Today” → [0.12, -0.45, 0.78, 0.23, …] (4096 numbers)
“Weather” → [0.34, -0.12, 0.56, 0.89, …] (4096 numbers)
This mapping is learned through an embedding table. During training, semantically similar words are mapped to nearby positions in the vector space. The vectors for “king” and “queen” are close, while those for “king” and “apple” are far apart.
These relationships aren’t manually defined; the model learns them independently from massive amounts of text.
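As a minimal sketch of this idea, here is the similarity computation in numpy, with made-up 3-dimensional vectors standing in for learned 4096-dimensional embeddings (the `embedding_table` values are purely illustrative):

```python
import numpy as np

# Toy stand-in for a learned embedding table; in a real model this is a
# (vocab_size, hidden_dim) matrix with hidden_dim in the thousands.
embedding_table = {
    "king":  np.array([0.8, 0.1, 0.6]),
    "queen": np.array([0.7, 0.2, 0.6]),
    "apple": np.array([-0.5, 0.9, -0.1]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: close to 1 means similar direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embedding_table["king"], embedding_table["queen"]))  # high
print(cosine_similarity(embedding_table["king"], embedding_table["apple"]))  # low
```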
Core Calculation: Vector Similarity
With vector representations, the next question is: How does the model understand relationships between words?
The Transformer’s answer is the Self-Attention mechanism.
For each position in a sequence, the model asks: Which other positions in the sequence should I pay attention to?
Specific calculations:
Q = X · W_q (Query: What I’m looking for)
K = X · W_k (Key: What I can provide)
V = X · W_v (Value: My actual content)
Attention(Q, K, V) = softmax(Q · Kᵀ / √d) · V
Q · Kᵀ is a dot product operation. The more similar two vectors are (i.e., they point in roughly the same direction), the larger the dot product. This calculates how well the Query at the current position matches the Keys of other positions.
Softmax normalizes these match scores into a probability distribution. This distribution is then used to compute a weighted sum of the Values.
The result: Each position aggregates information from relevant positions in the sequence, with relevance determined by vector similarity.
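The whole mechanism fits in a few lines. A minimal single-head sketch in numpy, with random weights and without masking or multi-head splitting:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X: np.ndarray, W_q: np.ndarray, W_k: np.ndarray, W_v: np.ndarray) -> np.ndarray:
    """Single-head self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project into queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # dot-product similarity, scaled
    weights = softmax(scores, axis=-1)       # one probability distribution per position
    return weights @ V                       # similarity-weighted sum of values

# Tiny example: 4 tokens, model width 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```

A real Transformer runs this with many heads in parallel and a causal mask, but the core operation is exactly this: dot products, softmax, weighted sum.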
Stacking and Non-Linearity
A single layer of attention has limited expressive power. Transformers stack multiple attention layers, followed by a feed-forward network (FFN) after each layer:
FFN(x) = activation(x · W₁) · W₂
The activation function is non-linear (e.g., ReLU, GELU, SwiGLU). Without non-linearity, multiple linear transformations would be equivalent to a single layer, negating the model’s depth.
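A sketch of that block, assuming a GELU activation and random weights (the 4x expansion factor is a common convention, not a requirement):

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU, a common non-linear activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """Position-wise feed-forward block: expand, apply non-linearity, project back."""
    return gelu(x @ W1) @ W2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 positions, model width 8
W1 = rng.normal(size=(8, 32))      # expansion (often ~4x the model width)
W2 = rng.normal(size=(32, 8))      # projection back to model width
print(ffn(x, W1, W2).shape)        # (4, 8)
```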
Models like DeepSeek and Qwen use a Mixture of Experts (MoE) architecture: Not all parameters participate in every calculation; instead, computations are dynamically routed to subsets of expert networks. This is an efficiency optimization that doesn’t change the core computational logic.
Output: Probability Distribution
After processing through N layers, the final layer’s vectors are multiplied by a vocabulary matrix to generate scores for each possible token. These scores are normalized into a probability distribution using softmax.
Sampling or selecting the highest probability yields the output token. This token is then added to the input sequence, and the process repeats—this is autoregressive generation.
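A sketch of that loop, assuming a placeholder `transformer` callable that returns the last position’s hidden vector and an `unembedding` matrix of shape (hidden_dim, vocab_size); both names are hypothetical, not any specific model’s API:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def generate(tokens, transformer, unembedding, steps=10, temperature=1.0, rng=None):
    """Autoregressive generation: score the vocabulary, sample, append, repeat."""
    rng = rng or np.random.default_rng()
    for _ in range(steps):
        h = transformer(tokens)                     # hidden vector of the last position
        logits = h @ unembedding                    # one score per vocabulary token
        probs = softmax(logits / temperature)       # normalize into a distribution
        next_token = int(rng.choice(len(probs), p=probs))  # sample (or take argmax)
        tokens = tokens + [next_token]              # extend the sequence and loop
    return tokens
```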
Differences Between Models
- DeepSeek: MoE architecture with Multi-head Latent Attention to compress the KV cache, reducing inference costs.
- Qwen: Dual product lines (dense + MoE) with SwiGLU activation.
- Claude: Architecture not publicly disclosed; presumed to be an optimized dense Transformer.
- Gemini: Natively multimodal, with shared attention across images, audio, and text.
- GPT-4: Rumored to use MoE, with 8 experts of roughly 220B parameters each.
While architectural details vary, the underlying logic is consistent:
- Map inputs to a vector space.
- Calculate vector similarity via dot products.
- Aggregate information using similarity weights.
- Stack multiple layers with non-linear transformations.
- Output a probability distribution.
There is no “understanding” or “thinking”—just geometric operations in a high-dimensional space.
Can this logic be applied to financial markets? Let’s explore.
II. Do Large Models “Predict” or “Recognize Patterns”?
Understanding the above computational process raises a key question:
What exactly are large models doing?
On the surface, they “predict the next word.” Given “Today’s weather is,” the probability of outputting “nice” is highest. This looks like prediction.
But breaking down the process, what they’re actually doing is:
- Encoding the current context into vectors.
- Retrieving patterns in the parameter space that best match these vectors.
- Outputting the probability distribution corresponding to these patterns.
They aren’t “predicting what will happen in the future”—they’re “identifying which pattern in the training data the current input most closely resembles.”
What’s the difference?
Prediction involves inferring unknown events. Recognition involves matching known patterns.
Large models work because natural language has strong statistical regularities. Given sufficient context, the entropy (uncertainty) of the next word is low. After “The People’s Republic of,” the next word is almost certainly “China.” Models can output this with high confidence.
Applying This to Financial Markets
A natural extension of this logic to financial markets is:
Encode market data into vectors, retrieve historically similar patterns, and output a probability distribution of subsequent trends.
This sounds feasible, but there’s a fundamental issue:
The statistical properties of a financial market’s “next move” are entirely different from natural language’s “next word.”
Language vs. Markets: Differences in Statistical Regularity
Natural language has extremely strong statistical regularities.
Given enough context, the next word is highly predictable. Models can output results with high confidence.
Short-term financial market trends approximate a random walk.
Given any technical indicators, fundamental data, or on-chain data, the probability that the next candlestick rises or falls is close to 50:50. Signals are drowned out by noise.
Pratas et al. (2023) tested LSTMs on BTC volatility prediction: Models produced smoother curves but failed to capture large spikes. They learned the weak pattern of “mean reversion” but were powerless against critical extreme events.
Non-Stationarity
Language’s statistical regularities are relatively stable. The meaning of “apple” hasn’t changed much in a century.
Market structures evolve continuously. Patterns effective in 2021 may fail in 2024. Regulatory environments, participant structures, and liquidity distributions shift. Patterns learned from historical data face an ever-changing distribution.
Adversarial Nature
Language generation has no counterparty. If you predict the next word is “eat,” no one will deliberately make it “fly.”
Financial markets are zero-sum games. Any identified profitable pattern will be arbitraged away as capital pours in. Markets actively counter all attempts to exploit patterns.
Conclusion
Large models excel at pattern recognition, not prediction.
In language, pattern recognition outputs seem like prediction because linguistic patterns are stable and strong enough.
In financial markets, using the same approach to “predict price movements” will fail. Short-term price direction patterns are too weak, unstable, and vulnerable to adversarial forces.
This doesn’t mean pattern recognition has no value in finance. The question is: What patterns should we recognize?
III. A Different Question: Regime Identification
Predicting price movements has too low a signal-to-noise ratio, but a related question has a much higher signal-to-noise ratio:
What state is the market in right now?
Markets are not homogeneous; they switch between regimes:
- Low-volatility consolidation: Narrow range, unclear direction
- High-volatility consolidation: Sharp swings without a trend
- Unilateral uptrend: Sustained gains with shallow pullbacks
- Unilateral downtrend: Sustained declines with weak rebounds
- Liquidity crisis: Sharp drops with large-scale liquidations
Regimes are persistent: trend phases can last days or weeks, and so can consolidation phases. Autocorrelation at these time scales is far stronger than in single-candlestick price movements.
Hamilton’s (1989) regime-switching model pioneered this field. Wang et al. (2020) used a hidden Markov model (HMM) to identify bull/bear states in U.S. stocks, effectively avoiding large drawdowns during the 2008 financial crisis and the 2020 COVID crash.
Their alpha didn’t come from predicting ups and downs but from reducing exposure during high-risk regimes.
IV. Technical Path: Market State Embedding
Adapting the LLM framework:
LLM: token → vector → similarity calculation → output distribution
Here: market state → vector → similarity calculation → regime classification
Encoder
Goal: Compress high-dimensional heterogeneous features into low-dimensional dense vectors. Constraint: Vectors for similar regimes should be close, while those for different regimes should be far apart.
A useful reference is TS2Vec (Yue et al., AAAI 2022), a state-of-the-art temporal representation learning method that performs strongly across 150+ UCR/UEA datasets.
Its core is hierarchical contrastive learning: contrastive losses applied at multiple time scales, learning both timestamp-level and instance-level representations.
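For illustration only, a rough stand-in for such an encoder in PyTorch; this is not the TS2Vec reference implementation, and the `MarketStateEncoder` name, layer sizes, and dilated-convolution stack are assumptions for the sketch:

```python
import torch
import torch.nn as nn

class MarketStateEncoder(nn.Module):
    """Illustrative encoder: maps a window of market features
    (batch, window_len, n_features) to one dense state vector (batch, embed_dim)."""

    def __init__(self, n_features: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 128, kernel_size=3, padding=1, dilation=1),
            nn.GELU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=2, dilation=2),  # dilation widens the receptive field
            nn.GELU(),
            nn.Conv1d(128, embed_dim, kernel_size=3, padding=4, dilation=4),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.net(x.transpose(1, 2))   # (batch, embed_dim, window_len)
        return h.mean(dim=-1)             # pool over time -> instance-level embedding

encoder = MarketStateEncoder(n_features=32)
window = torch.randn(8, 240, 32)          # 8 windows of 240 bars x 32 features
print(encoder(window).shape)              # torch.Size([8, 64])
```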
Training
Contrastive learning defines positive and negative samples:
- Positive samples: Two moments with similar subsequent trends
- Negative samples: Two moments with different subsequent trends
We can also reference SoftCLT (ICLR 2024), which uses continuous similarity instead of hard labels.
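A minimal sketch of the training objective, using a plain InfoNCE loss rather than TS2Vec’s hierarchical loss or SoftCLT’s soft assignments; constructing the positive pairs (moments whose subsequent trends are similar) is assumed to happen upstream:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE loss over a batch of embedding pairs.

    anchor, positive: (batch, embed_dim) embeddings of moment pairs with similar
    subsequent trends; for each anchor, the other rows in the batch act as negatives.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature                     # cosine similarities as logits
    targets = torch.arange(a.size(0), device=a.device) # the i-th positive belongs to the i-th anchor
    return F.cross_entropy(logits, targets)            # pull positives together, push negatives apart
```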
Output
- Clustering: Apply KMeans/GMM to historical embeddings to get K clusters, then manually interpret regime meanings.
- Retrieval: Retrieve the top-K historically similar moments to the current embedding and tally the regime distribution.
Retrieval offers stronger interpretability, as it outputs specific historical analogies.
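A sketch of the retrieval variant, assuming we already have embeddings for historical windows and a regime label per window (e.g., from the clustering step); the function and variable names are illustrative:

```python
import numpy as np
from collections import Counter

def retrieve_regime(current: np.ndarray, history: np.ndarray, labels: list[str], k: int = 50) -> Counter:
    """Nearest-neighbour regime lookup.

    current: (embed_dim,) embedding of the latest window
    history: (n_samples, embed_dim) embeddings of past windows
    labels:  regime label assigned to each historical window
    Returns the regime distribution among the top-k most similar historical moments.
    """
    hist = history / np.linalg.norm(history, axis=1, keepdims=True)
    cur = current / np.linalg.norm(current)
    sims = hist @ cur                     # cosine similarity to every past window
    top_k = np.argsort(-sims)[:k]         # indices of the k closest historical analogies
    return Counter(labels[i] for i in top_k)

# e.g. Counter({'low_vol_consolidation': 31, 'uptrend': 12, 'high_vol_consolidation': 7})
```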
V. Comparison with Traditional Methods
vs. HMM
HMM assumes observations follow a specific distribution (usually Gaussian) and requires predefining the number of states.
Neural networks make no distributional assumptions, handle high-dimensional inputs, and discover naturally occurring regime structures in data.
vs. Technical Indicators
Indicators like ADX, RSI, and Bollinger Bands each capture only one dimension. They struggle to model multi-factor interactions, and thresholds are manually set.
End-to-end learning automatically discovers feature combinations, with data-driven thresholds.
VI. Application Scenarios
The value of regime identification lies in strategy selection and risk control.
Strategy Matching
- Low-volatility consolidation → Grid strategies
- Trending markets → Trend-following strategies
- High-volatility consolidation → Reduce position size
- Liquidity crisis → Flat positions
Risk Management
Wang et al.’s research shows that excess returns from regime-switching strategies mainly come from reducing exposure during adverse regimes.
Rule: Upon identifying a high-risk regime, cut positions by half or liquidate. The goal isn’t to capture every move but to avoid systemic risks.
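A minimal sketch that ties the strategy matching table and this risk rule together; the regime names, position factors, and `PLAYBOOK` mapping are hypothetical, not NoFx’s actual configuration:

```python
# Hypothetical regime -> playbook mapping; names and position factors are illustrative only.
PLAYBOOK = {
    "low_vol_consolidation":  {"strategy": "grid",            "position_factor": 1.0},
    "trend":                  {"strategy": "trend_following", "position_factor": 1.0},
    "high_vol_consolidation": {"strategy": "trend_following", "position_factor": 0.5},
    "liquidity_crisis":       {"strategy": "flat",            "position_factor": 0.0},
}

def route(regime: str, base_position: float) -> tuple[str, float]:
    """Pick a strategy and scale exposure based on the identified regime."""
    entry = PLAYBOOK.get(regime, {"strategy": "flat", "position_factor": 0.0})  # unknown regime -> stand aside
    return entry["strategy"], base_position * entry["position_factor"]

print(route("high_vol_consolidation", base_position=10_000))  # ('trend_following', 5000.0)
```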
VII. NoFx: AI Trading Infrastructure
The above covers methodology. Implementation requires infrastructure.
NoFx isn’t a product that “lets LLMs predict price movements.” It’s positioned as the infrastructure layer for AI trading.
Data Layer
Cryptocurrency market data is highly fragmented. CEX APIs have varying formats, on-chain data requires custom parsing, and derivatives data is scattered across sources.
NoFx’s first task: Normalize heterogeneous data and unify access interfaces.
Price data:
- Multi-timeframe candlestick OHLCV: 1m / 3m / 5m / 15m / 30m / 1h / 2h / 4h / 6h / 8h / 12h / 1d / 3d / 1w / 1M
- Tick-level trades
- Volume-Weighted Average Price (VWAP)
- Price change percentages: 1m / 5m / 15m / 30m / 1h / 4h / 24h
Volume data:
- Raw volume and its moving averages (MA)
- Cumulative Volume Delta (CVD): cumulative aggressive buy volume – cumulative aggressive sell volume
- CVD multi-timeframe: 5m / 15m / 1h / 4h / 24h
- Taker buy/sell volume
- Volume anomaly detection (multiples relative to MA)
- Volume-price divergence indicators
Open Interest (OI) data:
- OI absolute value
- OI changes: 1h / 4h / 24h
- OI change rates: 1h / 4h / 24h
- OI-weighted price
- Long/short position ratio
- Large trader position percentage
- Leverage distribution statistics
Funding rate data:
- Current funding rate
- Predicted funding rate
- Historical funding rate sequences
- Cumulative funding: 24h / 7d / 30d
Liquidation data:
- Long liquidation volume (USD)
- Short liquidation volume (USD)
- Long/short liquidation ratio
- Large liquidation events (single event > $100K)
- Liquidation heatmaps (price range distribution)
- Cumulative liquidations: 1h / 4h / 24h
Capital flow data:
- Institutional net inflow (futures)
- Institutional net inflow (spot)
- Retail net inflow (futures)
- Retail net inflow (spot)
- Large buy/sell orders (configurable thresholds)
- Exchange net inflow/outflow
- Whale address anomalies
Order book data:
- Bid/ask prices and volumes at the top of the book
- Spread
- Depth snapshots: ±0.1% / ±0.5% / ±1% / ±2%
- Bid/ask imbalance
- Large order detection
- Order book slope
Technical indicators:
- EMA: 7 / 13 / 21 / 55 / 100 / 200
- SMA: 20 / 50 / 100 / 200
- MACD: standard + custom parameters
- RSI: 6 / 14 / 21
- Bollinger Bands: 20-period, 2x standard deviation
- ATR: 14-period
- ADX / DMI
- Stochastic RSI
- OBV (On-Balance Volume)
- Ichimoku Cloud
Volatility data:
- Realized volatility: 1h / 4h / 24h / 7d
- ATR percentage
- Bollinger Band width
- Price range (High – Low)
The data update service and API are now live:
Example call:
GET /api/quant-data?symbol=BTCUSDT
Response:
{
  "netflow": {
    "institution": {"future": 1200000, "spot": -500000},
    "personal": {"future": -800000, "spot": 200000}
  },
  "oi": {
    "current": 450000000,
    "delta": {"1h": 1.2, "4h": 3.5, "24h": -2.1}
  },
  "price_change": {"1h": 0.8, "4h": 2.1, "24h": -1.5},
  "cvd": {"5m": 150000, "1h": 890000, "4h": -2100000},
  "funding_rate": 0.0001,
  "liquidation": {"long": 1500000, "short": 800000}
}
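For illustration, a minimal client-side call against the endpoint above, assuming the service is reachable at your own deployment’s base URL (the `BASE_URL` value is a placeholder):

```python
import requests

BASE_URL = "https://your-nofx-deployment.example"  # placeholder; point at your own deployment

resp = requests.get(f"{BASE_URL}/api/quant-data", params={"symbol": "BTCUSDT"}, timeout=10)
resp.raise_for_status()
data = resp.json()

# Read a few of the fields shown in the response above
oi_change_4h = data["oi"]["delta"]["4h"]
institutional_futures_flow = data["netflow"]["institution"]["future"]
print(oi_change_4h, institutional_futures_flow)
```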
Execution Layer
Exchange API differences aren’t limited to data. Parameters for limit orders, market orders, stop-loss orders, position precision, and leverage configurations vary across platforms.
NoFx abstracts a unified execution interface, currently supporting any exchange market. Strategy layers don’t need to worry about the underlying exchange.
Decision Layer
Above the data and execution layers, we provide an AI decision framework:
Market data → Feature engineering → AI inference → Risk control filtering → Execution
Supported inference engines: DeepSeek, Claude, GPT, Gemini, Qwen.
AI here isn’t predicting price movements but making structured decisions:
- Multi-dimensional market state analysis
- Candidate asset screening and ranking
- Position management and risk assessment
- Entry/exit condition judgment
Outputs are structured JSON: decisions, confidence levels, and chains of thought. Complete context for each decision is fully recorded.
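As an illustration only, a hypothetical decision record with this shape; the field names are assumptions, not NoFx’s exact schema:

```python
# Hypothetical decision record (field names are illustrative, not NoFx's exact schema)
decision = {
    "symbol": "BTCUSDT",
    "regime": "high_vol_consolidation",
    "action": "reduce_position",
    "target_position_pct": 0.5,
    "confidence": 0.62,
    "stop_loss_pct": 0.03,
    "reasoning": "OI rising while CVD diverges from price; funding elevated; "
                 "treating the move as exhaustion rather than continuation.",
    "inputs_snapshot_id": "2024-06-01T12:00:00Z",  # link back to the fully recorded context
}
```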
Regime Integration
Regime identification functions as a module in the decision framework:
Market data → Regime identification → Strategy routing → AI decision → Risk control → Execution
AI receives the current regime judgment as context when making specific decisions. Decision aggressiveness, position limits, and stop-loss widths adjust automatically across regimes.
Why Build This?
The bottleneck in AI trading isn’t models—it’s engineering.
A functional system requires: stable data flows, low-latency execution, robust risk control, traceable logs, and flexible strategy configuration. Without infrastructure, even the strongest models are useless.
Most AI trading products on the market are black boxes. Users don’t know what the AI is “thinking,” why it opens a position, or the risk control logic. When issues arise, debugging is impossible.
NoFx’s design principles: Transparency and controllability.
- Complete chain-of-thought logs
- Configurable risk control (stop-loss, position limits, leverage restrictions)
- Open-source, deployable by users
- Web UI for parameter tuning, no coding required
AI Trading Layer
Long-term goal: A standardized layer for AI trading.
Three pillars:
First, engineering heritage from traditional quant. Order management, risk engines, backtesting frameworks, execution algorithms—decades of quant trading expertise. Without these, AI is a castle in the air.
Second, cutting-edge AI inference capabilities. LLMs excel at structured analysis, multi-factor judgment, and natural language interaction—tasks beyond traditional rule engines. But AI needs proper constraints; it shouldn’t be left to “predict markets” freely.
Third, rigorous mathematical frameworks. Regime identification, vector similarity, contrastive learning—verifiable, explainable methods. Rejecting the mysticism of “buy because the AI said so.”
Democratization
Technology is just a tool. NoFx’s ultimate goal: Democratizing AI + quant trading.
Current reality: Quant trading has extremely high barriers. You need coding skills, financial knowledge, data engineering expertise, risk management understanding, and access to institutional-grade data and execution channels. Retail traders are excluded.
NoFx offers a visual AI trading orchestration system:
- No coding required. Strategy logic, risk rules, and AI parameters are configured via a web UI.
- No quant expertise needed. Pre-built strategy templates cover common scenarios; just adjust parameters.
- No need to build infrastructure. Data, execution, risk control, and logs are provided by the platform.
- Full transparency. Every AI decision’s inputs, reasoning, and outputs are auditable.
Someone with no quant experience should be able to configure their own AI strategy in 5 minutes and understand how it works.
This isn’t reducing professionalism. It’s packaging expertise into tools usable by everyone.
Excel let everyone do data analysis without learning SQL. Figma let everyone design without learning Photoshop.
NoFx lets everyone orchestrate AI trading strategies without becoming a quant engineer.
Open-source is inevitable. Infrastructure must be trustworthy and auditable.
VIII. Limitations
- Overfitting: Models may merely memorize historical patterns with questionable generalization. Out-of-sample validation and rolling backtests are essential.
- Regime drift: As market structures evolve, historical regime features may become obsolete. Continuous monitoring and regular retraining are needed.
- Recognition delay: Regime switches are inevitably identified with lag. A trade-off exists between sensitivity and false-positive rates.
This isn’t a predictive holy grail. Its value lies in structured market state descriptions to aid strategy selection and risk control.
IX. About Me
This work starts from first principles.
Instead of chasing the hype around “AI trading,” I first asked: What is the computational essence of AI? Can this essence be applied to financial scenarios? If so, what problems should it solve?
The answer: Vector similarity calculations can be transferred, but the goal shouldn’t be predicting price movements—it should be regime identification.
Another starting point is user-centricity.
No matter how advanced the technology, its impact is limited if only professional quant teams can use it. I want to build something that even traders with no coding skills can use.
Professionalism and usability aren’t contradictory. Professionalism lies in the underlying architecture and methodology; usability lies in product interaction. Making complex things simple is far harder than making simple things complex.
User feedback validates this direction:
- 9,500+ GitHub stars in two months
- 1,800+ net new KYC first-time traders brought to Binance in one and a half months
- Over 90,000 active users
- Sustained growth in trading volume
These numbers show real market demand: Retail traders want professional-grade AI trading capabilities without spending months learning quant programming.
NoFx is packaging institutional-grade data, execution, and risk control into products usable by everyone. Technology should serve the masses, not just a few.
Open-source is a user-centric choice. Users need to see what the code does, deploy it themselves, and modify it to their needs. Black-box products can’t build trust in finance.
Once the regime identification module is validated, it will also be open-sourced and integrated.
NoFx official website:
GitHub:
References
Temporal Representation Learning:
[1] Yue, Z., et al. (2022). TS2Vec: Towards Universal Representation of Time Series. AAAI 2022.
[2] Soft Contrastive Learning for Time Series. ICLR 2024.
[3] Niroshan, G., et al. (2025). TS2Vec-Ensemble. arXiv.
Regime Detection:
[4] Wang, M., Lin, Y.H., & Mikhelson, I. (2020). Regime-Switching Factor Investing with Hidden Markov Models. Journal of Risk and Financial Management.
[5] Hamilton, J.D. (1989). A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle. Econometrica.
[6] Yuan, Y., & Mitra, G. (2019). Market Regime Identification Using Hidden Markov Models. SSRN.
Cryptocurrency Prediction:
[7] Pratas, T.E., et al. (2023). Forecasting Bitcoin Volatility: Exploring the Potential of Deep Learning. Eurasian Economic Review.
[8] Omole, O., & Enke, D. (2024). Deep Learning for Bitcoin Price Direction Prediction. Financial Innovation.
[9] Huang, Z.C., et al. (2024). Forecasting Bitcoin Volatility Using Machine Learning Techniques. Journal of International Financial Markets.

