Evidence-Based Text Generation with Large Language Models: A Systematic Study of Citations, Attributions, and Quotations
In the digital age, large language models (LLMs) have become increasingly widespread—powering everything from customer service chatbots to content creation tools. These models are reshaping how humans process and generate text, but their growing popularity has brought a critical concern to the forefront: How can we trust the information they produce? When an LLM generates an analysis report, an academic review, or a key piece of information, how do we verify that the content is supported by solid evidence? And how can we trace the sources behind its conclusions?
To address these questions, the field of “evidence-based text generation for large language models” has emerged. Its core goal is to link LLM outputs with supporting evidence, ensuring the content is traceable and verifiable. However, the field has long suffered from inconsistencies: terminology varies (some studies call the technology “citation generation,” others “attribution text generation”), evaluation methods are fragmented, and there is no unified benchmark for comparison. These issues have made it difficult to integrate and advance research across the board.
Against this backdrop, a systematic analysis of 134 relevant research papers was conducted. This study not only proposes a unified classification framework but also compiles over 300 evaluation metrics, focusing specifically on how citations, attributions, and quotations can enable evidence-based text generation. This blog post will break down the study’s methodology, key findings, and core datasets, providing clear guidance for researchers and practitioners in the field.
I. Why Does Evidence-Based Text Generation Matter?
Beneath the “creativity” of large language models lies a significant flaw: they often produce content that seems plausible but contradicts facts—a phenomenon known as “hallucination.” For example, when writing an article on climate change, an LLM might fabricate research data or misquote expert opinions, and readers would have no easy way to detect these errors. This “evidence-free output” can have serious consequences in fields like academic research, journalism, and medical consulting.
Evidence-based text generation was developed to solve this problem. It requires LLMs to explicitly link generated content with supporting evidence sources—such as stating, “According to a 2023 study published in Nature…” or directly quoting core arguments from a research paper. This approach allows readers to verify information authenticity and makes LLM outputs more credible.
For years, however, research in this field has been fragmented. Some studies refer to this technology as “citation generation,” while others use “attribution text generation.” When evaluating model performance, some researchers rely on human scoring, while others use machine metrics—making results hard to compare. Even the definition of “evidence” varies: some studies consider entire documents as evidence, while others focus on specific sentences.
The value of this systematic study lies in its ability to establish a unified “language system” and “evaluation framework” for the field by synthesizing 134 core papers. This provides a consistent foundation for future research to build upon.
II. How Was the Study Conducted?
To systematically analyze a research field, two key steps are required: defining how to find relevant studies and establishing criteria to filter valuable research. This study used rigorous methods to ensure it covered core literature in the field and classified it scientifically.
1. Literature Search: Targeting Core Papers Precisely
The research team first identified keyword combinations that would focus on large language models while covering the core concept of “evidence linking.” The final search string used was:
(“large language model” OR “llm”) AND (“citation” OR “attribution” OR “quote”)
The logic behind this combination is clear: “large language model” or “llm” narrows the focus to the target technology, while “citation,” “attribution,” and “quote” lock in the core methods of “evidence linking.”
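As a concrete illustration, this boolean filter could be applied locally to paper records. This is a minimal sketch, not the study's actual search tooling; the record fields and the word-boundary handling for "llm" are assumptions.

```python
import re

# Illustrative paper records; the fields are assumptions, not the study's schema.
papers = [
    {"title": "Citation Generation with Large Language Models", "abstract": ""},
    {"title": "Sentiment Analysis with Transformers", "abstract": ""},
]

def matches_query(paper):
    """Mirrors the search string:
    ("large language model" OR "llm") AND ("citation" OR "attribution" OR "quote")
    A word-boundary regex keeps "llm" from matching inside longer words."""
    text = (paper["title"] + " " + paper["abstract"]).lower()
    has_model = "large language model" in text or re.search(r"\bllm\b", text)
    has_evidence = any(t in text for t in ("citation", "attribution", "quote"))
    return bool(has_model and has_evidence)

hits = [p["title"] for p in papers if matches_query(p)]
```

Only the first record satisfies both halves of the conjunction, so it alone survives the filter.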
To ensure comprehensiveness, the team searched nine major academic databases spanning computer science, artificial intelligence, and linguistics.
(Table: number of papers retrieved from each database; data source: literature search results by the research team, February 2025)
The results show that ACL Anthology (a leading database in computational linguistics) and arXiv (a preprint platform) contributed the most papers—reflecting that research in this field is primarily concentrated in natural language processing and artificial intelligence.
2. Screening Criteria: Focusing on Core Research
After the initial search, the team identified 805 unique papers. To keep the study focused, they established three strict inclusion criteria—only papers that met all three were included in the final analysis:
- Criterion 1: The study must focus on natural language text generation using large language models. This excluded papers that only discussed traditional machine learning models or non-text-generation tasks.
- Criterion 2: The study must proactively integrate citations of evidence sources during text generation. For example, papers that only discussed the accuracy of LLM outputs without addressing how to link evidence were excluded.
- Criterion 3: The paper must be written in English and available electronically in full. This ensured the research team could fully interpret the content without language barriers or difficulties accessing the literature.
The screening process was conducted independently by two researchers. They first made preliminary judgments based on titles and abstracts, and reviewed full texts when necessary. Any disagreements were resolved through discussion, and 134 papers that met all criteria were finally selected.
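The dual-reviewer step can be pictured as an intersection-plus-discussion rule. This is a sketch with invented paper IDs and decisions, not the team's actual workflow.

```python
# Invented screening decisions for illustration: True = "include".
reviewer_a = {"p1": True, "p2": False, "p3": True}
reviewer_b = {"p1": True, "p2": False, "p3": False}

# Papers both reviewers accept are included outright;
# disagreements are flagged for discussion.
included = [p for p in reviewer_a if reviewer_a[p] and reviewer_b[p]]
to_discuss = [p for p in reviewer_a if reviewer_a[p] != reviewer_b[p]]
```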
3. Paper Classification: Organizing by “Contribution Type”
To streamline the analysis, the research team categorized the 134 papers into six types based on their contribution type, each corresponding to a different research direction, including "Approach," "Application," and "Resource."
This classification helps researchers quickly locate the literature they need: those looking for technical methods can focus on the "Approach" category, while those seeking datasets can turn to the "Resource" category.
III. Core Datasets: Detailed Annotation Information for 134 Papers
To enable other researchers to directly reuse the results of this study, the team compiled three datasets covering paper metadata, evaluation metrics, and dataset information. These datasets are stored in CSV format with clear structures for easy querying and analysis.
1. publications.csv: Paper Metadata and Classification Details
This dataset contains basic information and detailed annotations for all 134 papers, with 24 fields covering almost all core dimensions needed to understand a paper. Below are explanations of key fields:
- Title, Abstract, Year, Authors, Url: Basic paper details, allowing researchers to quickly locate and access the original text.
- Annotator: Records which researcher completed the annotation (annotator information was anonymized to protect privacy), ensuring traceability.
- Contribution Type: One of the six categories mentioned earlier ("Approach," "Application," "Resource," etc.), enabling direct filtering of specific research types.
- Citation Term: The term used in the paper to describe "evidence linking" (attribution, citation, or quote), reflecting the diversity of terminology in the field.
- Task Name: The specific task defined in the paper (e.g., "citation generation," "attribution text generation"), helping readers understand the study's goals.
- Citation Modality: The type of evidence being cited (text, graphs, tables, images, etc.). For example, some studies focus on citing text paragraphs, while others cite tabular data.
- Evidence Level: The granularity of the citation, ranging from "entire document" down to "sentence," "word," or even "table cell." Finer granularity means more specific evidence but also greater technical difficulty.
- Citation Style: How evidence is presented to users, such as inline citations (e.g., "[1]"), narrative citations (e.g., "Smith et al. (2023) noted"), or direct quotes (exact text from sources).
- Citation Visibility: Whether citations appear in the final output or only serve as an intermediate step during model generation (invisible to users). For example, some tools display citations explicitly in generated articles, while others use citations only to improve content accuracy without showing them.
- Prompting: The prompt engineering methods used in the paper, such as zero-shot, few-shot, or chain-of-thought prompting. Prompting strategies are core techniques for LLM applications and directly impact generation performance.
- Pre-training / Fine-tuning: Whether the LLM requires pre-training or fine-tuning, and the specific method (e.g., supervised fine-tuning, reinforcement learning). This affects the implementation cost of the method: fine-tuning requires more data and computing resources, while zero-shot prompting is more lightweight.
- Task: The specific scenario addressed in the paper, such as question answering, summarization, or grounded text generation.
With these fields, researchers can quickly filter papers that match their needs. For example, if someone is looking for “research on question answering tasks that uses few-shot prompting, requires no fine-tuning, and generates inline citations,” they can simply filter the corresponding fields.
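That example query can be expressed as a straightforward filter over the CSV. The rows and exact field values below are invented for illustration; only the column names follow the field list above.

```python
import csv
import io

# A tiny in-memory stand-in for publications.csv; the rows are invented.
CSV_TEXT = """Title,Task,Prompting,Pre-training / Fine-tuning,Citation Style
Paper A,Question Answering,few-shot,none,inline
Paper B,Summarization,zero-shot,supervised fine-tuning,quote
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# QA papers using few-shot prompting, no fine-tuning, inline citations.
selected = [
    r["Title"] for r in rows
    if r["Task"] == "Question Answering"
    and r["Prompting"] == "few-shot"
    and r["Pre-training / Fine-tuning"] == "none"
    and r["Citation Style"] == "inline"
]
```

The same pattern extends to any combination of the 24 fields; for larger analyses, loading the CSV into a dataframe library would make such filters one-liners.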
2. evaluation.csv: A Compilation of Evaluation Metrics and Frameworks
Evaluation is the “measuring stick” of research, but metrics in this field have long been inconsistent. This dataset compiles all evaluation metrics and frameworks extracted from the 134 papers, with 7 fields:
- Metric Name, Metric Abbreviation: For example, the full name of "BLEU" is "Bilingual Evaluation Understudy."
- Framework: If the metric belongs to a specific evaluation framework (e.g., the "ROUGE framework"), it is noted here to avoid redundant analysis.
- Evaluation Method: How the metric is calculated, such as "human evaluation," "lexical overlap," or "LLM-as-a-judge."
- Evaluation Dimension: The specific aspect the metric measures, such as "attribution accuracy," "citation relevance," or "language fluency."
- Description: A brief explanation of the metric, mostly taken directly from the original paper to ensure accuracy.
- Source: A link to the paper that proposed the metric, allowing researchers to trace its design logic.
For example, “BLEU” is a common lexical overlap metric used to evaluate the similarity between generated text and reference text—it is often used to measure language fluency. “Human evaluation,” by contrast, is more subjective but comprehensive, as it can assess complex dimensions like citation reasonableness.
The value of this dataset lies in helping researchers clearly identify “which metrics are suitable for evaluating which dimensions,” avoiding blind selection. For instance, if evaluating “whether citations accurately point to evidence,” researchers may need to use “retrieval-based” metrics rather than metrics focused solely on language fluency.
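To make the "lexical overlap" idea concrete, here is a deliberately simplified unigram-precision sketch. Real BLEU additionally uses higher-order n-grams and a brevity penalty; this shows only the core overlap computation.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: the overlap idea behind BLEU, simplified.
    Each candidate word is credited at most as often as it appears in the
    reference (the "clipping" step)."""
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand) if cand else 0.0

score = unigram_precision(
    "the study cites a fabricated nature paper",
    "the study cites a 2023 paper in nature",
)
```

Six of the seven candidate words appear in the reference ("fabricated" does not), so the score is 6/7. Note how this metric rewards surface overlap only: it cannot tell whether a citation actually supports the claim, which is why attribution-specific metrics exist.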
3. datasets.csv: Related Datasets and Benchmark Test Sets
Data is the foundation for training and evaluating models. This dataset compiles all datasets and benchmark test sets mentioned in the 134 papers, with 4 fields:
- Dataset, Benchmark: If the dataset belongs to a specific benchmark test set (e.g., "FEVER"), it is noted here.
- Dataset Task: The tasks the dataset can be used for, such as question answering, summarization, or citation generation.
- Source: A link to the dataset or the paper introducing it, allowing researchers to access it directly.
For example, “FEVER” is a well-known fact-checking benchmark that includes a large number of claims requiring verification and corresponding evidence—it is often used to train models’ ability to link evidence. “PubMedQA,” on the other hand, focuses on medical question answering and is suitable for evaluating evidence citation in professional fields.
This dataset helps researchers quickly find data sources relevant to their research direction, avoiding redundant work.
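A "find data for my task" lookup amounts to grouping rows by the Dataset Task field. The sample rows below are invented and only mirror the four columns described above.

```python
from collections import defaultdict

# Invented sample rows mirroring the datasets.csv columns described above.
rows = [
    {"Dataset": "FEVER", "Benchmark": "FEVER",
     "Dataset Task": "fact checking", "Source": ""},
    {"Dataset": "PubMedQA", "Benchmark": "",
     "Dataset Task": "question answering", "Source": ""},
]

# Index datasets by task so a researcher can query their scenario directly.
by_task = defaultdict(list)
for r in rows:
    by_task[r["Dataset Task"]].append(r["Dataset"])
```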
IV. The Study’s Value and Future Directions
The significance of this systematic study extends beyond organizing existing research—it also establishes a “common language” for the field. By unifying terminology, organizing evaluation methods, and compiling datasets, it helps researchers avoid redundant work and focus on real challenges.
From a practical perspective, evidence-based text generation technology has broad application prospects:
- In academic writing, it can automatically add accurate citations to papers;
- In journalism, it can make AI-generated content traceable, reducing misinformation;
- In customer service, it can ensure AI responses are based on corporate knowledge bases, avoiding false promises.
Of course, the field still faces many challenges: How to balance “evidence accuracy” and “text fluency”? How to handle evidence citation in multilingual scenarios? How to design more efficient evaluation metrics that reduce reliance on human input? These questions require further research to address.
For researchers and practitioners, these three datasets serve as an important starting point. By analyzing existing methods, evaluation metrics, and data sources, they can more quickly identify innovation directions. For those new to the field, the datasets also provide a way to quickly understand the overall landscape.
V. License Information
All datasets from this study are available under the terms of the accompanying LICENSE file. If you wish to use or modify the data, please comply with the relevant terms to ensure legal and ethical use.
This study reflects an important shift in large language models—from “generating content” to “generating credible content.” In the future, as technology matures, “every sentence backed by evidence” may become a basic requirement for AI text generation. This study lays a critical foundation for this progress, and its findings will continue to guide research and practice in the field of evidence-based text generation.