Introduction: The Challenge of Modern Information Retrieval

In today’s digital landscape, finding relevant information efficiently has become increasingly complex. Traditional search engines face a fundamental challenge known as the “vocabulary mismatch problem” – where user queries contain keywords that don’t appear in relevant documents. This gap between what users search for and what documents contain leads to frustrating search experiences and missed information.
Information Retrieval (IR) systems serve as the backbone of search engines and Retrieval-Augmented Generation (RAG) models. For decades, bag-of-words models like BM25 have dominated the field due to their speed and efficiency. These systems rely on term-specific statistics and efficient index structures like Block-Max WAND to deliver fast results. However, their strict keyword matching approach limits their effectiveness when users express their needs using different terminology than what appears in relevant documents.

Traditional Solutions and Their Limitations

Query Rewriting Techniques

The most common approach to address vocabulary mismatch has been query rewriting. Early methods added keywords extracted from documents retrieved by the original query. While seemingly logical, this approach often leads to “query drift” – where the rewritten query moves away from the user’s original intent, especially when initial retrieved documents aren’t relevant.
Recent advances have leveraged Large Language Models (LLMs) to improve query rewriting quality. These methods show promise but face significant challenges:

  • Small LLMs (under 4B parameters) demonstrate degraded performance
  • Large LLMs require complex prompts and multiple sampling attempts, reducing efficiency
  • Cost considerations make large-scale implementation impractical for many applications

Neural Retrieval Models

Modern alternatives include Transformer-based neural IR models like dual encoders (dense and sparse), and late-interaction models such as ColBERTv2. While these approaches show improved effectiveness, they come with substantial drawbacks:

  1. Storage Requirements: Dense indexes on MS MARCO (8.8M documents) require 13 GiB compared to BM25’s 0.67 GiB
  2. Rebuilding Costs: When models are retrained, entire indexes must be rebuilt – impractical for large collections
  3. Computational Overhead: Neural models require significant processing power for both indexing and retrieval

Generative Retrieval Approaches

Generative Retrieval (GR) models attempt to internalize indexes within model parameters, avoiding traditional indexing entirely. These models directly predict document identifiers based on user queries. However, they face their own limitations:

  • Arbitrary Document Identifiers: Models using clustered or learned identifiers show poor generalization, especially with large document collections
  • Metadata-Based Approaches: Using document metadata (URLs, titles, keywords) offers better interpretability but remains limited by the quality and completeness of metadata

QueStER: A New Paradigm in Information Retrieval

Core Innovation

QueStER (Query Specification for Generative Retrieval) introduces a fundamental shift in how we approach generative retrieval. Instead of mapping queries to document metadata, QueStER generates search specifications that can be processed by established search technologies.
The key insight is simple yet powerful: generative models should generate search specifications rather than document identifiers. In its current implementation, QueStER generates keyword queries processed by BM25, but the framework can extend to structured specifications used in major search libraries like Lucene.

Three-Fold Advantage

This approach offers three significant benefits:

  1. Leverages Optimized Search Technologies: By generating specifications for established search engines, QueStER taps into decades of optimization in retrieval algorithms and query languages
  2. Eliminates Index Rebuilding: When the underlying neural network evolves, existing indexes remain functional
  3. Enhances Explainability: Users can analyze generated queries, crucial for applications in law, medicine, and patents where transparency is essential

Technical Architecture and Implementation

System Overview

QueStER operates through a sophisticated pipeline that combines language model generation with traditional information retrieval:

  1. Query Generation: An LLM generates multiple candidate query specifications
  2. Retrieval Execution: Efficient index-based bag-of-words IR models process these specifications
  3. Quality Assessment: Top-k retrieved results are evaluated using a cross-encoder reference
  4. Reward Calculation: Expected nDCG (SoftNDCG) is computed from these assessments
  5. Policy Optimization: Rewards drive updates to improve future query generation
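To make these five steps concrete, here is a minimal Python sketch of one training iteration. The component callables (policy sampling, BM25 search, cross-encoder labeling, SoftNDCG, GRPO update) are passed in as placeholders because the paper's exact interfaces are not given here; this is an illustrative outline, not the authors' implementation.

```python
# Hypothetical sketch of one QueStER training step (not the authors' code).
# Every component is injected as a callable, since only the pipeline structure is known.
from typing import Callable, List, Sequence


def quester_train_step(
    user_query: str,
    sample_rewrite: Callable[[str], str],                    # step 1: policy LLM sampling
    bm25_search: Callable[[str, int], List[tuple]],          # step 2: [(doc_id, bm25_score), ...]
    ce_relevance: Callable[[str, Sequence[str]], List[float]],        # step 3: cross-encoder labels
    soft_ndcg: Callable[[Sequence[float], Sequence[float]], float],   # step 4: expected nDCG reward
    grpo_update: Callable[[str, List[str], List[float]], None],       # step 5: policy optimization
    m: int = 8,
    k: int = 100,
) -> List[float]:
    """Run the five pipeline steps for one user query and return the group rewards."""
    # 1. Query Generation: sample m candidate keyword specifications.
    candidates = [sample_rewrite(user_query) for _ in range(m)]

    rewards = []
    for spec in candidates:
        # 2. Retrieval Execution: run the specification against the existing BM25 index.
        hits = bm25_search(spec, k)
        doc_ids = [doc_id for doc_id, _ in hits]
        scores = [score for _, score in hits]

        # 3. Quality Assessment: the reference cross-encoder judges retrieved documents
        #    against the *original* user query.
        relevance = ce_relevance(user_query, doc_ids)

        # 4. Reward Calculation: SoftNDCG from retrieval scores and CE-derived relevance.
        rewards.append(soft_ndcg(scores, relevance))

    # 5. Policy Optimization: group-relative advantages drive the GRPO update.
    grpo_update(user_query, candidates, rewards)
    return rewards
```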

Problem Formulation

The core challenge is to generate, from an initial user query q, a query specification q′ ~ pθ(· | q) that leads to both efficient and effective retrieval. Since quality can only be evaluated through IR metrics, the problem is framed as reinforcement learning where:

  • Policy: The generation process pθ
  • Reward: An IR metric reflecting query quality
  • Goal: Learn a policy that generates higher-reward candidates

Policy Optimization with GRPO

QueStER uses Group-Relative Policy Optimization (GRPO) to train the rewriting policy. For each query, the system:

  1. Samples a group of m candidates {q_i}_{i=1}^{m}
  2. Calculates associated rewards {r_i}
  3. Computes group-relative advantages a_i = r_i − r̄ (where r̄ is the group mean)
  4. Applies clipped policy-gradient updates

Key implementation details include:

  • Reward standardization within each group
  • KL weight set to β=0 (no explicit KL penalty) to encourage exploration
  • Temperature τ=1.2 during training, τ=0 during inference for deterministic output
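As a small numerical illustration of the advantage computation in step 3, with the group standardization noted above, the snippet below computes standardized group-relative advantages for a toy group of rewards; the reward values are invented for illustration.

```python
import numpy as np

# Toy rewards for a group of m = 4 sampled rewrites of one query (illustrative values only).
rewards = np.array([0.42, 0.55, 0.38, 0.61])

# Group-relative advantages: subtract the group mean ...
advantages = rewards - rewards.mean()

# ... and, per the standardization mentioned above, divide by the group standard deviation.
advantages = advantages / (rewards.std() + 1e-8)

print(advantages)  # positive for above-average rewrites, negative otherwise
```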

Prompt Engineering

After testing 50+ prompt candidates, researchers identified a minimal instruction that performs well zero-shot:

Generate relevant single-word keywords to improve retrieval performance. 
Only output unique keywords, separated by commas. 
[QUERY]: {query} 
[KEYWORDS]:

This minimal prompt yielded the best nDCG@10 among the candidates tested while keeping the instruction clear and consistent.
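For illustration, the prompt can be applied with greedy decoding roughly as follows. The Hugging Face checkpoint id "Qwen/Qwen3-4B" and the generation settings are assumptions, and the trained QueStER adapter is not wired in here; this only shows the prompt format and the τ=0 inference mode.

```python
# Minimal sketch of applying the prompt with greedy decoding via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = (
    "Generate relevant single-word keywords to improve retrieval performance.\n"
    "Only output unique keywords, separated by commas.\n"
    "[QUERY]: {query}\n"
    "[KEYWORDS]:"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")          # assumed checkpoint id
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", device_map="auto")

inputs = tokenizer(PROMPT.format(query="veggie chicken"), return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy, i.e. tau = 0
keywords = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(keywords)  # e.g. "chicken, vegetable, veggie, recipe, ..."
```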

Reward Mechanism: SoftNDCG

Challenge with Traditional Metrics

Traditional nDCG presents limitations for reinforcement learning in IR contexts. When relevant documents are ranked after non-relevant ones, we need a reward that considers the magnitude of score differences, not just ranking order.

SoftRank Solution

QueStER employs SoftRank, which computes expected nDCG based on the assumption that search engine scores follow a normal distribution:
E(nDCG@k) = (1 / IDCG@k) · E( Σ_{i=1}^{k} relevance(d_i) / log₂(1 + K_i) )
Where:

  • relevance(d_i) is the relevance of document d_i
  • K_i is a random variable representing the rank of d_i
  • IDCG@k is the ideal DCG@k, used for normalization
  • Scores are modeled as S(q, d) ~ N(s_e(q, d), ν·I), where s_e(q, d) is the score returned by the search engine

Document Ranking Probability

The probability that document d_i ranks before d_j is given by:
p(d_i ≻ d_j) = σ((s_i – s_j)/ν)
Where σ is the sigmoid function and ν is the standard deviation. This formulation allows:

  • ν→0: Behavior approaches traditional nDCG
  • ν→∞: All documents have equal ranking probability
  • Moderate ν: Provides balance between stability and discriminability
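The snippet below sketches one way to compute this expected nDCG: the pairwise sigmoid probabilities above define, for each document, a Poisson-binomial distribution over its rank, from which the expected discounted gain is accumulated and normalized. This is a plausible reading of SoftRank under the stated assumptions, not necessarily the paper's exact implementation.

```python
import numpy as np


def soft_ndcg(scores, relevance, k=10, nu=0.5):
    """Expected nDCG@k under Gaussian score noise (SoftRank-style sketch).

    scores    : retrieval scores s_i returned by the search engine (e.g. BM25)
    relevance : graded relevance labels (e.g. cross-encoder derived) for the same docs
    nu        : standard deviation controlling the smoothness of the ranking
    """
    scores = np.asarray(scores, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    n = len(scores)

    # Pairwise probability that d_j outranks d_i: sigmoid((s_j - s_i) / nu).
    diff = np.clip((scores[None, :] - scores[:, None]) / nu, -50, 50)  # clip to avoid exp overflow
    p_beats = 1.0 / (1.0 + np.exp(-diff))           # p_beats[i, j] = p(d_j beats d_i)

    expected_dcg = 0.0
    for i in range(n):
        # Rank of d_i = 1 + number of other documents that beat it (Poisson-binomial).
        rank_dist = np.array([1.0])                 # P(0 others beat d_i) = 1 initially
        for j in range(n):
            if j == i:
                continue
            p = p_beats[i, j]
            rank_dist = np.append(rank_dist, 0.0) * (1 - p) + np.append(0.0, rank_dist) * p
        ranks = np.arange(1, n + 1)
        # Only ranks <= k contribute to nDCG@k.
        discount = np.where(ranks <= k, 1.0 / np.log2(1.0 + ranks), 0.0)
        expected_dcg += relevance[i] * float(rank_dist @ discount)

    # Normalize by the ideal DCG@k (documents sorted by relevance, deterministic ranks).
    ideal = np.sort(relevance)[::-1][:k]
    idcg = float(np.sum(ideal / np.log2(2 + np.arange(len(ideal)))))
    return expected_dcg / idcg if idcg > 0 else 0.0
```

With ν small the sigmoid saturates and the value approaches ordinary nDCG; with ν large all rank permutations become nearly equally likely, matching the behavior described in the list above.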

Cross-Encoder Distillation

MS MARCO dataset evaluations rely on user clicks, which are incomplete and contain many false negatives. To address this, QueStER uses distillation – leveraging powerful cross-encoders to indicate document relevance, providing more reliable training signals.
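This summary does not specify which cross-encoder serves as the reference, so the sketch below uses a publicly available MS MARCO cross-encoder from the sentence-transformers library as a stand-in, just to show how soft relevance labels for retrieved passages might be obtained.

```python
# Sketch of deriving soft relevance labels with an off-the-shelf cross-encoder.
# The specific checkpoint is an assumption; the paper's reference CE may differ.
from sentence_transformers import CrossEncoder

ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "veggie chicken"
retrieved_passages = [
    "Stuffed chicken breast with a vegetable and cheese mixture ...",
    "How to change a car tire in five steps ...",
]

# One score per (query, passage) pair; higher means more relevant.
soft_labels = ce.predict([(query, p) for p in retrieved_passages])
print(soft_labels)
```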

Experimental Results and Performance Analysis

Experimental Setup

Training Dataset: MS MARCO v1 passage dataset (8.8M passages)
Evaluation Sets:

  • In-domain: MS MARCO dev set (6,980 queries), TREC DL 2019/2020
  • Out-of-domain: BEIR dataset

Implementation Details:

  • Models: Qwen3 variants (0.6B, 1.7B, 4B parameters)
  • Training: 96,000 randomly sampled queries
  • Hardware: NVIDIA RTX A6000 48GB GPUs
  • Cost: $150–200 total
  • Training time: ~2 days for the 4B model on a single GPU

Performance Comparison

In-Domain Results

On MS MARCO Dev and TREC DL’19/’20, QueStER demonstrates significant improvements:

Model        RR@10 (Dev)  R@1K (Dev)  nDCG@10 (DL19)  R@1K (DL19)  nDCG@10 (DL20)  R@1K (DL20)
BM25         18.4         85.3        50.6            73.9         48.0            72.3
Qwen3-base    9.1         80.5        27.1            52.0         23.9            47.5
QueStER      22.4         92.1        63.1            82.1         60.8            81.5
QueStER outperforms BM25 by +4.0 RR@10 on the MS MARCO dev set and by more than 12 points nDCG@10 on TREC DL'19/'20, while also improving R@1K across all three sets.

Out-of-Domain Performance

On BEIR datasets, QueStER shows exceptional generalization:

Method        Avg. nDCG@10
BM25          37.3
ANCE          36.8
SPLADEv2      46.6
ColBERTv2     47.1
HyDE          48.8
LameR         50.8
MuGI (GPT4)   51.0
QueStER       45.5
While some methods using large proprietary LLMs show slightly higher averages, QueStER achieves competitive performance with significantly better efficiency.

Efficiency Analysis

QueStER demonstrates an optimal balance between effectiveness and efficiency:

  • BM25: 16.3 ms/query (highest efficiency, lowest effectiveness)
  • QueStER: 28 ms/query (strong balance)
  • Neural Models: >100 ms/query (higher effectiveness but much slower)
  • LLM-based Methods: Variable but generally slower due to API calls and multiple sampling

In the efficiency-effectiveness trade-off plot from the paper, where bubble size represents model parameters in billions, QueStER achieves favorable performance with a compact 4B model.

Qualitative Analysis

Analysis of keyword overlap distributions reveals that QueStER-generated queries align better with relevant document vocabulary than original queries, both in-domain and out-of-domain.
Example Query Transformation:

  • Original: “veggie chicken”
  • Generated: “chicken, vegetable, veggie, recipe, dish, salad, stuffed, healthy, mixture, substitute, meal”
  • Result: 300% increase in keyword coverage, 47% improvement in relevant document recall

Ablation Studies and Parameter Analysis

Model Size Impact

Performance scales with model size:

  • 0.6B: 42.5 average nDCG@10
  • 1.7B: 43.3 average nDCG@10
  • 4B: 45.5 average nDCG@10

While larger models show better performance, the 1.7B variant offers a favorable efficiency-effectiveness trade-off for resource-constrained environments.

Supervised Fine-Tuning Effects

Experiments with supervised fine-tuning (SFT) as a warm-start showed:

  • SFT alone improves over the backbone model but doesn’t reach BM25 performance
  • SFT + GRPO improves but doesn’t outperform GRPO-only training
  • SFT appears to narrow the rewrite space and reduce exploration

Reward Signal Comparison

Cross-encoder derived labels consistently outperform hard click-based labels:

  • Hard click-based labels (no CE): roughly 1 point lower nDCG@10
  • Soft labels without CE: Similar degradation
  • CE-derived labels: Best performance across configurations

SoftRank Parameter Optimization

The standard deviation ν in SoftRank significantly impacts performance:

  • ν=0.05: 43.9 average nDCG@10 (too unstable)
  • ν=0.5: 45.5 average nDCG@10 (optimal)
  • ν=1.0: 43.7 average nDCG@10 (over-smoothed)

Moderate variance provides the best balance between stability and discriminability.

KL Weight Effects

KL-divergence weight β in GRPO shows:

  • β=0.0: 45.5 average nDCG@10 (best performance)
  • β=0.01: 43.5 average nDCG@10
  • β=0.05: 41.0 average nDCG@10

Higher β values overly constrain the policy, limiting exploration and reward optimization.

Evaluation Pool Size

Cut-off value k for E(nDCG@k) affects performance:

  • k=100: 44.8 average nDCG@10
  • k=10,000: 45.5 average nDCG@10

Larger evaluation pools provide better training signals despite assuming unlabeled documents are irrelevant.

Practical Implementation Guide

Model Configuration

For optimal performance, use the following configuration:

Component             Setting                          Rationale
Base Model            Qwen3-4B                         Best efficiency-effectiveness balance
Fine-tuning           LoRA (rank=40, α=40)             Parameter-efficient adaptation
Sampling Temperature  Training: 1.2, Inference: 0.0    Exploration during training, determinism during inference
Batch Size            320 (20 micro-steps)             Stable gradient updates
Optimizer             AdamW (lr=5e-6)                  Proven effectiveness for LLM fine-tuning
Training Data         96,000 queries                   Sufficient for convergence without overfitting
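As an illustration of how these settings might map onto common tooling, here is a sketch using Hugging Face PEFT and PyTorch. The target_modules choice, the bfloat16 dtype, and the per-micro-step split of the batch are assumptions not stated in the table.

```python
# Sketch mapping the configuration table onto PEFT + PyTorch (assumed details marked below).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=40,                # LoRA rank from the table
    lora_alpha=40,       # alpha from the table
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # learning rate from the table

# Effective batch of 320 queries accumulated over 20 micro-steps (assumed split: 16 per step).
GLOBAL_BATCH, MICRO_STEPS = 320, 20
micro_batch = GLOBAL_BATCH // MICRO_STEPS
```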

Deployment Considerations

  1. Hardware Requirements: Single NVIDIA RTX A6000 (48GB) sufficient for training and inference
  2. Inference Optimization: Use greedy decoding (τ=0) for consistent, reproducible results
  3. Batch Processing: Batch size of 256 for efficient inference (6,980 queries processed in ~5 minutes)
  4. Index Compatibility: Works with existing BM25 indexes without modification
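Point 4 is worth showing concretely: a generated keyword specification can be sent to an unchanged BM25 index as-is. The sketch below assumes Pyserini's prebuilt MS MARCO v1 passage index as the existing backend; any Lucene/BM25 index would work the same way.

```python
# Sketch of running a generated keyword specification against an unchanged BM25 index,
# here Pyserini's prebuilt MS MARCO v1 passage index (an assumption about the backend).
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")

generated_spec = "chicken, vegetable, veggie, recipe, dish, salad, stuffed, healthy"
bm25_query = generated_spec.replace(",", " ")     # comma-separated keywords -> bag of words

hits = searcher.search(bm25_query, k=1000)
for hit in hits[:3]:
    print(hit.docid, round(hit.score, 2))
```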

Integration Steps

  1. Prepare Training Data: Sample queries from target domain
  2. Configure Cross-Encoder: Use pre-trained CE for relevance assessment
  3. Set Up GRPO Training: Configure parameters per above recommendations
  4. Validate Performance: Test on held-out queries before deployment
  5. Monitor Output: Regularly check generated query quality
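For step 4 of this list, held-out validation can be scripted with a standard IR evaluation package; the sketch below uses ir_measures with toy qrels and run data purely for illustration.

```python
# Sketch of validating rewrites on held-out queries using ir_measures.
# The qrels/run contents are toy values for illustration only.
import ir_measures
from ir_measures import parse_measure

qrels = {"q1": {"d3": 1, "d7": 1}}                 # held-out relevance judgments
run = {"q1": {"d3": 12.4, "d1": 9.8, "d7": 7.1}}   # BM25 scores for the rewritten query

metric = parse_measure("nDCG@10")
print(ir_measures.calc_aggregate([metric], qrels, run))
```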

Applications and Use Cases

Professional Search Systems

QueStER excels in professional domains where precision and explainability are crucial:

  • Legal Research: Generates precise legal terminology queries
  • Medical Literature: Expands medical queries with relevant terminology
  • Patent Search: Handles technical vocabulary variations effectively

Enterprise Knowledge Management

Organizations can implement QueStER for:

  • Internal Document Search: Improves employee information access
  • Customer Support: Enhances FAQ and knowledge base retrieval
  • Research Portals: Facilitates academic and scientific literature search

E-commerce and Product Discovery

Retail applications benefit from:

  • Product Search: Handles varied customer terminology
  • Recommendation Systems: Improves product discovery
  • Inventory Management: Enhances internal product database searches

Limitations and Considerations

Data Contamination

Public models such as Qwen may have been pretrained on data that overlaps with MS MARCO or BEIR. While this concern affects all comparative studies, measuring contamination precisely remains challenging.

Query Language Complexity

Current implementation focuses on keyword generation. Future work should explore:

  • Boolean query operators (AND, OR, NOT)
  • Phrase queries and proximity specifications
  • Hybrid dense-sparse retrieval backends

Domain Adaptation

While QueStER shows strong out-of-domain performance, extreme domain shifts may require additional fine-tuning or domain-specific adaptation strategies.

Future Directions

Structured Query Languages

Expanding beyond keyword generation to support complex query specifications will enhance expressiveness and precision. Integration with search libraries like Lucene could enable sophisticated query construction.

Hybrid Retrieval Architectures

Combining dense and sparse retrieval approaches could leverage the strengths of both methods, potentially improving both effectiveness and efficiency.

Multi-Modal Retrieval

Extending the framework to handle image, video, and audio content would broaden applicability across diverse media types.

Real-Time Adaptation

Developing mechanisms for continuous learning and adaptation could help systems evolve with changing language patterns and user behaviors.

Conclusion

QueStER represents a significant advancement in information retrieval, demonstrating that small language models can achieve competitive performance with large proprietary systems while maintaining superior efficiency. By reframing generative retrieval as query specification generation, QueStER bridges the gap between traditional and neural approaches, offering a practical solution for real-world deployment.
The method’s strengths lie in:

  • Efficiency: 28ms/query with 4B parameter model
  • Effectiveness: +4.0 RR@10 over BM25 on MS MARCO dev and +8.2 average nDCG@10 over BM25 on BEIR
  • Explainability: Human-readable query specifications
  • Scalability: No index rebuilding required when models update
  • Accessibility: Runs on single GPU, reducing deployment barriers

As information retrieval continues to evolve, QueStER provides a foundation for more efficient, effective, and explainable search systems. Its balance of performance and practicality makes it suitable for immediate deployment across various domains while offering a pathway for future advancements in generative retrieval technology.
The approach demonstrates that with thoughtful architecture and training strategies, small language models can compete with much larger systems, opening possibilities for more accessible and sustainable information retrieval solutions.