Introduction: The Challenge of Modern Information Retrieval

In today’s digital landscape, finding relevant information efficiently has become increasingly complex. Traditional search engines face a fundamental challenge known as the “vocabulary mismatch problem” – where user queries contain keywords that don’t appear in relevant documents. This gap between what users search for and what documents contain leads to frustrating search experiences and missed information.
Information Retrieval (IR) systems serve as the backbone of search engines and Retrieval-Augmented Generation (RAG) models. For decades, bag-of-words models like BM25 have dominated the field due to their speed and efficiency. These systems rely on term-specific statistics and efficient index structures like Block-Max WAND to deliver fast results. However, their strict keyword matching approach limits their effectiveness when users express their needs using different terminology than what appears in relevant documents.

Traditional Solutions and Their Limitations

Query Rewriting Techniques

The most common approach to address vocabulary mismatch has been query rewriting. Early methods added keywords extracted from documents retrieved by the original query. While seemingly logical, this approach often leads to “query drift” – where the rewritten query moves away from the user’s original intent, especially when initial retrieved documents aren’t relevant.
Recent advances have leveraged Large Language Models (LLMs) to improve query rewriting quality. These methods show promise but face significant challenges:

  • Small LLMs (under 4B parameters) demonstrate degraded performance
  • Large LLMs require complex prompts and multiple sampling attempts, reducing efficiency
  • Cost considerations make large-scale implementation impractical for many applications

Neural Retrieval Models

Modern alternatives include Transformer-based neural IR models like dual encoders (dense and sparse), and late-interaction models such as ColBERTv2. While these approaches show improved effectiveness, they come with substantial drawbacks:

  1. Storage Requirements: Dense indexes on MS MARCO (8.8M documents) require 13 GiB compared to BM25’s 0.67 GiB
  2. Rebuilding Costs: When models are retrained, entire indexes must be rebuilt – impractical for large collections
  3. Computational Overhead: Neural models require significant processing power for both indexing and retrieval

Generative Retrieval Approaches

Generative Retrieval (GR) models attempt to internalize indexes within model parameters, avoiding traditional indexing entirely. These models directly predict document identifiers based on user queries. However, they face their own limitations:

  • Arbitrary Document Identifiers: Models using clustered or learned identifiers show poor generalization, especially with large document collections
  • Metadata-Based Approaches: Using document metadata (URLs, titles, keywords) offers better interpretability but remains limited by the quality and completeness of metadata

QueStER: A New Paradigm in Information Retrieval

Core Innovation

QueStER (Query Specification for Generative Retrieval) introduces a fundamental shift in how we approach generative retrieval. Instead of mapping queries to document metadata, QueStER generates search specifications that can be processed by established search technologies.
The key insight is simple yet powerful: generative models should generate search specifications rather than document identifiers. In its current implementation, QueStER generates keyword queries processed by BM25, but the framework can extend to structured specifications used in major search libraries like Lucene.

Three-Fold Advantage

This approach offers three significant benefits:

  1. Leverages Optimized Search Technologies: By generating specifications for established search engines, QueStER taps into decades of optimization in retrieval algorithms and query languages
  2. Eliminates Index Rebuilding: When the underlying neural network evolves, existing indexes remain functional
  3. Enhances Explainability: Users can analyze generated queries, crucial for applications in law, medicine, and patents where transparency is essential

Technical Architecture and Implementation

System Overview

QueStER operates through a sophisticated pipeline that combines language model generation with traditional information retrieval:

  1. Query Generation: An LLM generates multiple candidate query specifications
  2. Retrieval Execution: Efficient index-based bag-of-words IR models process these specifications
  3. Quality Assessment: Top-k retrieved results are evaluated using a cross-encoder reference
  4. Reward Calculation: Expected nDCG (SoftNDCG) is computed from these assessments
  5. Policy Optimization: Rewards drive updates to improve future query generation
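To make these five steps concrete, here is a minimal Python sketch of one training iteration. The component callables (policy sampling, BM25 search, cross-encoder labeling, SoftNDCG, GRPO update) are passed in as placeholders because the paper's exact interfaces are not given here; this is an illustrative outline, not the authors' implementation.

```python
# Hypothetical sketch of one QueStER training step (not the authors' code).
# Every component is injected as a callable, since only the pipeline structure is known.
from typing import Callable, List, Sequence


def quester_train_step(
    user_query: str,
    sample_rewrite: Callable[[str], str],                    # step 1: policy LLM sampling
    bm25_search: Callable[[str, int], List[tuple]],          # step 2: [(doc_id, bm25_score), ...]
    ce_relevance: Callable[[str, Sequence[str]], List[float]],        # step 3: cross-encoder labels
    soft_ndcg: Callable[[Sequence[float], Sequence[float]], float],   # step 4: expected nDCG reward
    grpo_update: Callable[[str, List[str], List[float]], None],       # step 5: policy optimization
    m: int = 8,
    k: int = 100,
) -> List[float]:
    """Run the five pipeline steps for one user query and return the group rewards."""
    # 1. Query Generation: sample m candidate keyword specifications.
    candidates = [sample_rewrite(user_query) for _ in range(m)]

    rewards = []
    for spec in candidates:
        # 2. Retrieval Execution: run the specification against the existing BM25 index.
        hits = bm25_search(spec, k)
        doc_ids = [doc_id for doc_id, _ in hits]
        scores = [score for _, score in hits]

        # 3. Quality Assessment: the reference cross-encoder judges retrieved documents
        #    against the *original* user query.
        relevance = ce_relevance(user_query, doc_ids)

        # 4. Reward Calculation: SoftNDCG from retrieval scores and CE-derived relevance.
        rewards.append(soft_ndcg(scores, relevance))

    # 5. Policy Optimization: group-relative advantages drive the GRPO update.
    grpo_update(user_query, candidates, rewards)
    return rewards
```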

Problem Formulation

The core challenge is to generate, from an initial user query q, a query specification q′ ~ pθ(· | q) that leads to both efficient and effective retrieval. Since quality can only be evaluated through IR metrics, the problem is framed as reinforcement learning where:

  • Policy: The generation process pθ
  • Reward: An IR metric reflecting query quality
  • Goal: Learn a policy that generates higher-reward candidates

Policy Optimization with GRPO

QueStER uses Group-Relative Policy Optimization (GRPO) to train the rewriting policy. For each query, the system:

  1. Samples a group of m candidates {q_i}_{i=1}^{m}
  2. Calculates associated rewards {r_i}
  3. Computes group-relative advantages a_i = r_i − r̄ (where r̄ is the group mean)
  4. Applies clipped policy-gradient updates

Key implementation details include:

  • Reward standardization within each group
  • KL weight set to β=0 (no explicit KL penalty) to encourage exploration
  • Temperature τ=1.2 during training, τ=0 during inference for deterministic output
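As a small numerical illustration of the advantage computation in step 3, with the group standardization noted above, the snippet below computes standardized group-relative advantages for a toy group of rewards; the reward values are invented for illustration.

```python
import numpy as np

# Toy rewards for a group of m = 4 sampled rewrites of one query (illustrative values only).
rewards = np.array([0.42, 0.55, 0.38, 0.61])

# Group-relative advantages: subtract the group mean ...
advantages = rewards - rewards.mean()

# ... and, per the standardization mentioned above, divide by the group standard deviation.
advantages = advantages / (rewards.std() + 1e-8)

print(advantages)  # positive for above-average rewrites, negative otherwise
```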

Prompt Engineering

After testing 50+ prompt candidates, researchers identified a minimal instruction that performs well zero-shot:

Generate relevant single-word keywords to improve retrieval performance. 
Only output unique keywords, separated by commas. 
[QUERY]: {query} 
[KEYWORDS]:

This minimal prompt yielded the best nDCG@10 among the candidates tested while keeping the instruction clear and consistent.
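For illustration, the prompt can be applied with greedy decoding roughly as follows. The Hugging Face checkpoint id "Qwen/Qwen3-4B" and the generation settings are assumptions, and the trained QueStER adapter is not wired in here; this only shows the prompt format and the τ=0 inference mode.

```python
# Minimal sketch of applying the prompt with greedy decoding via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = (
    "Generate relevant single-word keywords to improve retrieval performance.\n"
    "Only output unique keywords, separated by commas.\n"
    "[QUERY]: {query}\n"
    "[KEYWORDS]:"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")          # assumed checkpoint id
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", device_map="auto")

inputs = tokenizer(PROMPT.format(query="veggie chicken"), return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy, i.e. tau = 0
keywords = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(keywords)  # e.g. "chicken, vegetable, veggie, recipe, ..."
```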

Reward Mechanism: SoftNDCG

Challenge with Traditional Metrics

Traditional nDCG presents limitations for reinforcement learning in IR contexts. When relevant documents are ranked after non-relevant ones, we need a reward that considers the magnitude of score differences, not just ranking order.

SoftRank Solution

QueStER employs SoftRank, which computes expected nDCG based on the assumption that search engine scores follow a normal distribution:
E(nDCG@k) = (1 / IDCG@k) · E( Σ_{i=1}^{k} relevance(d_i) / log₂(1 + K_i) )
Where:

  • relevance(d_i) is the relevance of document d_i
  • K_i is a random variable representing the rank of d_i
  • IDCG@k is the ideal DCG@k, used for normalization
  • Scores are modeled as S(q, d) ~ N(s_e(q, d), ν·I), where s_e(q, d) is the score returned by the search engine

Document Ranking Probability

The probability that document d_i ranks before d_j is given by:
p(d_i ≻ d_j) = σ((s_i – s_j)/ν)
Where σ is the sigmoid function and ν is the standard deviation. This formulation allows:

  • ν→0: Behavior approaches traditional nDCG
  • ν→∞: All documents have equal ranking probability
  • Moderate ν: Provides balance between stability and discriminability
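The snippet below sketches one way to compute this expected nDCG: the pairwise sigmoid probabilities above define, for each document, a Poisson-binomial distribution over its rank, from which the expected discounted gain is accumulated and normalized. This is a plausible reading of SoftRank under the stated assumptions, not necessarily the paper's exact implementation.

```python
import numpy as np


def soft_ndcg(scores, relevance, k=10, nu=0.5):
    """Expected nDCG@k under Gaussian score noise (SoftRank-style sketch).

    scores    : retrieval scores s_i returned by the search engine (e.g. BM25)
    relevance : graded relevance labels (e.g. cross-encoder derived) for the same docs
    nu        : standard deviation controlling the smoothness of the ranking
    """
    scores = np.asarray(scores, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    n = len(scores)

    # Pairwise probability that d_j outranks d_i: sigmoid((s_j - s_i) / nu).
    diff = np.clip((scores[None, :] - scores[:, None]) / nu, -50, 50)  # clip to avoid exp overflow
    p_beats = 1.0 / (1.0 + np.exp(-diff))           # p_beats[i, j] = p(d_j beats d_i)

    expected_dcg = 0.0
    for i in range(n):
        # Rank of d_i = 1 + number of other documents that beat it (Poisson-binomial).
        rank_dist = np.array([1.0])                 # P(0 others beat d_i) = 1 initially
        for j in range(n):
            if j == i:
                continue
            p = p_beats[i, j]
            rank_dist = np.append(rank_dist, 0.0) * (1 - p) + np.append(0.0, rank_dist) * p
        ranks = np.arange(1, n + 1)
        # Only ranks <= k contribute to nDCG@k.
        discount = np.where(ranks <= k, 1.0 / np.log2(1.0 + ranks), 0.0)
        expected_dcg += relevance[i] * float(rank_dist @ discount)

    # Normalize by the ideal DCG@k (documents sorted by relevance, deterministic ranks).
    ideal = np.sort(relevance)[::-1][:k]
    idcg = float(np.sum(ideal / np.log2(2 + np.arange(len(ideal)))))
    return expected_dcg / idcg if idcg > 0 else 0.0
```

With ν small the sigmoid saturates and the value approaches ordinary nDCG; with ν large all rank permutations become nearly equally likely, matching the behavior described in the list above.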

Cross-Encoder Distillation

MS MARCO dataset evaluations rely on user clicks, which are incomplete and contain many false negatives. To address this, QueStER uses distillation – leveraging powerful cross-encoders to indicate document relevance, providing more reliable training signals.
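This summary does not specify which cross-encoder serves as the reference, so the sketch below uses a publicly available MS MARCO cross-encoder from the sentence-transformers library as a stand-in, just to show how soft relevance labels for retrieved passages might be obtained.

```python
# Sketch of deriving soft relevance labels with an off-the-shelf cross-encoder.
# The specific checkpoint is an assumption; the paper's reference CE may differ.
from sentence_transformers import CrossEncoder

ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "veggie chicken"
retrieved_passages = [
    "Stuffed chicken breast with a vegetable and cheese mixture ...",
    "How to change a car tire in five steps ...",
]

# One score per (query, passage) pair; higher means more relevant.
soft_labels = ce.predict([(query, p) for p in retrieved_passages])
print(soft_labels)
```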

Experimental Results and Performance Analysis

Experimental Setup

Training Dataset: MS MARCO v1 passage dataset (8.8M passages)
Evaluation Sets:

  • In-domain: MS MARCO dev set (6,980 queries), TREC DL 2019/2020
  • Out-of-domain: BEIR dataset

Implementation Details:

  • Models: Qwen3 variants (0.6B, 1.7B, 4B parameters)
  • Training: 96,000 randomly sampled queries
  • Hardware: NVIDIA RTX A6000 48GB GPUs
  • Cost: $150–200 total
  • Training time: ~2 days for the 4B model on a single GPU

Performance Comparison

In-Domain Results

On MS MARCO Dev and TREC DL’19/’20, QueStER demonstrates significant improvements:

Model        RR@10 (Dev)  R@1K (Dev)  nDCG@10 (DL19)  R@1K (DL19)  nDCG@10 (DL20)  R@1K (DL20)
BM25         18.4         85.3        50.6            73.9         48.0            72.3
Qwen3-base    9.1         80.5        27.1            52.0         23.9            47.5
QueStER      22.4         92.1        63.1            82.1         60.8            81.5
QueStER outperforms BM25 by +4.0 RR@10 on the MS MARCO dev set and by more than 12 points nDCG@10 on TREC DL'19/'20, while also improving R@1K across all three sets.

Out-of-Domain Performance

On BEIR datasets, QueStER shows exceptional generalization:

Method        Avg. nDCG@10
BM25          37.3
ANCE          36.8
SPLADEv2      46.6
ColBERTv2     47.1
HyDE          48.8
LameR         50.8
MuGI (GPT4)   51.0
QueStER       45.5
While some methods using large proprietary LLMs show slightly higher averages, QueStER achieves competitive performance with significantly better efficiency.

Efficiency Analysis

QueStER demonstrates an optimal balance between effectiveness and efficiency:

  • BM25: 16.3 ms/query (highest efficiency, lowest effectiveness)
  • QueStER: 28 ms/query (strong balance)
  • Neural Models: >100 ms/query (higher effectiveness but much slower)
  • LLM-based Methods: Variable but generally slower due to API calls and multiple sampling

In the efficiency-effectiveness trade-off plot from the paper, where bubble size represents model parameters in billions, QueStER achieves favorable performance with a compact 4B model.

Qualitative Analysis

Analysis of keyword overlap distributions reveals that QueStER-generated queries align better with relevant document vocabulary than original queries, both in-domain and out-of-domain.
Example Query Transformation:

  • Original: “veggie chicken”
  • Generated: “chicken, vegetable, veggie, recipe, dish, salad, stuffed, healthy, mixture, substitute, meal”
  • Result: 300% increase in keyword coverage, 47% improvement in relevant document recall

Ablation Studies and Parameter Analysis

Model Size Impact

Performance scales with model size:

  • 0.6B: 42.5 average nDCG@10
  • 1.7B: 43.3 average nDCG@10
  • 4B: 45.5 average nDCG@10

While larger models show better performance, the 1.7B variant offers a favorable efficiency-effectiveness trade-off for resource-constrained environments.

Supervised Fine-Tuning Effects

Experiments with supervised fine-tuning (SFT) as a warm-start showed:

  • SFT alone improves over the backbone model but doesn’t reach BM25 performance
  • SFT + GRPO improves but doesn’t outperform GRPO-only training
  • SFT appears to narrow the rewrite space and reduce exploration

Reward Signal Comparison

Cross-encoder derived labels consistently outperform hard click-based labels:

  • Hard click-based labels (no CE): roughly 1 point lower nDCG@10
  • Soft labels without CE: Similar degradation
  • CE-derived labels: Best performance across configurations

SoftRank Parameter Optimization

The standard deviation ν in SoftRank significantly impacts performance:

  • ν=0.05: 43.9 average nDCG@10 (too unstable)
  • ν=0.5: 45.5 average nDCG@10 (optimal)
  • ν=1.0: 43.7 average nDCG@10 (over-smoothed)

Moderate variance provides the best balance between stability and discriminability.

KL Weight Effects

KL-divergence weight β in GRPO shows:

  • β=0.0: 45.5 average nDCG@10 (best performance)
  • β=0.01: 43.5 average nDCG@10
  • β=0.05: 41.0 average nDCG@10

Higher β values overly constrain the policy, limiting exploration and reward optimization.

Evaluation Pool Size

Cut-off value k for E(nDCG@k) affects performance:

  • k=100: 44.8 average nDCG@10
  • k=10,000: 45.5 average nDCG@10

Larger evaluation pools provide better training signals despite assuming unlabeled documents are irrelevant.

Practical Implementation Guide

Model Configuration

For optimal performance, use the following configuration:

Component             Setting                          Rationale
Base Model            Qwen3-4B                         Best efficiency-effectiveness balance
Fine-tuning           LoRA (rank=40, α=40)             Parameter-efficient adaptation
Sampling Temperature  Training: 1.2, Inference: 0.0    Exploration during training, determinism during inference
Batch Size            320 (20 micro-steps)             Stable gradient updates
Optimizer             AdamW (lr=5e-6)                  Proven effectiveness for LLM fine-tuning
Training Data         96,000 queries                   Sufficient for convergence without overfitting
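As an illustration of how these settings might map onto common tooling, here is a sketch using Hugging Face PEFT and PyTorch. The target_modules choice, the bfloat16 dtype, and the per-micro-step split of the batch are assumptions not stated in the table.

```python
# Sketch mapping the configuration table onto PEFT + PyTorch (assumed details marked below).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=40,                # LoRA rank from the table
    lora_alpha=40,       # alpha from the table
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # learning rate from the table

# Effective batch of 320 queries accumulated over 20 micro-steps (assumed split: 16 per step).
GLOBAL_BATCH, MICRO_STEPS = 320, 20
micro_batch = GLOBAL_BATCH // MICRO_STEPS
```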

Deployment Considerations

  1. Hardware Requirements: Single NVIDIA RTX A6000 (48GB) sufficient for training and inference
  2. Inference Optimization: Use greedy decoding (τ=0) for consistent, reproducible results
  3. Batch Processing: Batch size of 256 for efficient inference (6,980 queries processed in ~5 minutes)
  4. Index Compatibility: Works with existing BM25 indexes without modification
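Point 4 is worth showing concretely: a generated keyword specification can be sent to an unchanged BM25 index as-is. The sketch below assumes Pyserini's prebuilt MS MARCO v1 passage index as the existing backend; any Lucene/BM25 index would work the same way.

```python
# Sketch of running a generated keyword specification against an unchanged BM25 index,
# here Pyserini's prebuilt MS MARCO v1 passage index (an assumption about the backend).
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")

generated_spec = "chicken, vegetable, veggie, recipe, dish, salad, stuffed, healthy"
bm25_query = generated_spec.replace(",", " ")     # comma-separated keywords -> bag of words

hits = searcher.search(bm25_query, k=1000)
for hit in hits[:3]:
    print(hit.docid, round(hit.score, 2))
```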

Integration Steps

  1. Prepare Training Data: Sample queries from target domain
  2. Configure Cross-Encoder: Use pre-trained CE for relevance assessment
  3. Set Up GRPO Training: Configure parameters per above recommendations
  4. Validate Performance: Test on held-out queries before deployment
  5. Monitor Output: Regularly check generated query quality
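For step 4 of this list, held-out validation can be scripted with a standard IR evaluation package; the sketch below uses ir_measures with toy qrels and run data purely for illustration.

```python
# Sketch of validating rewrites on held-out queries using ir_measures.
# The qrels/run contents are toy values for illustration only.
import ir_measures
from ir_measures import parse_measure

qrels = {"q1": {"d3": 1, "d7": 1}}                 # held-out relevance judgments
run = {"q1": {"d3": 12.4, "d1": 9.8, "d7": 7.1}}   # BM25 scores for the rewritten query

metric = parse_measure("nDCG@10")
print(ir_measures.calc_aggregate([metric], qrels, run))
```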

Applications and Use Cases

Professional Search Systems

QueStER excels in professional domains where precision and explainability are crucial:

  • Legal Research: Generates precise legal terminology queries
  • Medical Literature: Expands medical queries with relevant terminology
  • Patent Search: Handles technical vocabulary variations effectively

Enterprise Knowledge Management

Organizations can implement QueStER for:

  • Internal Document Search: Improves employee information access
  • Customer Support: Enhances FAQ and knowledge base retrieval
  • Research Portals: Facilitates academic and scientific literature search

E-commerce and Product Discovery

Retail applications benefit from:

  • Product Search: Handles varied customer terminology
  • Recommendation Systems: Improves product discovery
  • Inventory Management: Enhances internal product database searches

Limitations and Considerations

Data Contamination

Public models such as Qwen may have been pretrained on data that overlaps with MS MARCO or BEIR. While this concern affects all comparative studies, measuring contamination precisely remains challenging.

Query Language Complexity

Current implementation focuses on keyword generation. Future work should explore:

  • Boolean query operators (AND, OR, NOT)
  • Phrase queries and proximity specifications
  • Hybrid dense-sparse retrieval backends

Domain Adaptation

While QueStER shows strong out-of-domain performance, extreme domain shifts may require additional fine-tuning or domain-specific adaptation strategies.

Future Directions

Structured Query Languages

Expanding beyond keyword generation to support complex query specifications will enhance expressiveness and precision. Integration with search libraries like Lucene could enable sophisticated query construction.

Hybrid Retrieval Architectures

Combining dense and sparse retrieval approaches could leverage the strengths of both methods, potentially improving both effectiveness and efficiency.

Multi-Modal Retrieval

Extending the framework to handle image, video, and audio content would broaden applicability across diverse media types.

Real-Time Adaptation

Developing mechanisms for continuous learning and adaptation could help systems evolve with changing language patterns and user behaviors.

Conclusion

QueStER represents a significant advancement in information retrieval, demonstrating that small language models can achieve competitive performance with large proprietary systems while maintaining superior efficiency. By reframing generative retrieval as query specification generation, QueStER bridges the gap between traditional and neural approaches, offering a practical solution for real-world deployment.
The method’s strengths lie in:

  • Efficiency: 28ms/query with 4B parameter model
  • Effectiveness: +4.0 RR@10 over BM25 on MS MARCO dev and +8.2 average nDCG@10 over BM25 on BEIR
  • Explainability: Human-readable query specifications
  • Scalability: No index rebuilding required when models update
  • Accessibility: Runs on single GPU, reducing deployment barriers

As information retrieval continues to evolve, QueStER provides a foundation for more efficient, effective, and explainable search systems. Its balance of performance and practicality makes it suitable for immediate deployment across various domains while offering a pathway for future advancements in generative retrieval technology.
The approach demonstrates that with thoughtful architecture and training strategies, small language models can compete with much larger systems, opening possibilities for more accessible and sustainable information retrieval solutions.