
Multilingual Confidence in LLMs: Uncovering Language Bias and the Native-Tone Solution

Understanding Multilingual Confidence in Large Language Models: Challenges and Solutions

The Reliability Problem in AI Text Generation

Large Language Models (LLMs) like GPT and Llama have revolutionized how we interact with technology. These systems can answer questions, write essays, and even create code. However, they occasionally generate hallucinations – content that sounds plausible but is factually incorrect or entirely fabricated.

Imagine asking an LLM about the capital of France and getting “Lyon” instead of “Paris”. While obvious in this case, such errors become problematic in critical applications like medical advice or legal documents. This is where confidence estimation becomes crucial – it helps determine how trustworthy an AI’s response might be.

Why Multilingual Confidence Matters

Most LLM research focuses on English. But with over 7,000 languages worldwide, understanding how these models perform across different languages is essential. The MlingConf study addresses this gap by examining:

  1. How confident are LLMs in non-English responses?
  2. Do confidence patterns differ between general knowledge and culture-specific questions?
  3. Can we improve reliability for language-specific tasks?

The MlingConf Benchmark: A New Standard

Researchers created a comprehensive dataset called MlingConf to evaluate multilingual confidence. This dataset includes:

Language-Agnostic Tasks (LA)

These test general knowledge across languages:

  • TriviaQA: Factual questions (e.g., “Who invented the telephone?”)
  • GSM8K: Math word problems (e.g., “If John buys 3 apples…”)
  • CommonsenseQA: Everyday reasoning (e.g., “Why do people use umbrellas?”)
  • SciQ: Science questions (e.g., “What gas do plants release?”)

Language-Specific Tasks (LS)

These focus on culturally contextual knowledge:

  • LSQA: Questions about traditions, geography, and history specific to:
    • UK/US English
    • Chinese
    • Japanese
    • French
    • Thai

Table 1: Example LSQA Questions

| Language | Question Example |
| --- | --- |
| English | “What is the highest mountain in the UK?” |
| Chinese | “Which Japanese city is famous for deer?” |
| Japanese | “What traditional food is associated with Japanese New Year?” |

How the Study Was Conducted

1. Data Preparation

  • Translation: The original English questions were translated into the four other languages using GPT-4
  • Quality Control:
    • Automated checks for translation accuracy (a sketch follows this list)
    • Expert linguists validated 50 random samples per language
    • The final dataset contained 1,238-1,857 samples per language
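To make the quality-control step concrete, here is a minimal sketch of one common way to automate a translation-accuracy check: round-trip (back-) translation. This is an illustration, not necessarily the authors’ exact procedure; the `translate` helper is a hypothetical stand-in for a GPT-4 or other machine-translation call.

```python
from difflib import SequenceMatcher

def translate(text: str, target_lang: str) -> str:
    """Hypothetical stand-in for a GPT-4 (or other MT) translation call."""
    raise NotImplementedError("Wire this up to your translation backend.")

def round_trip_score(english_question: str, target_lang: str) -> float:
    """Similarity (0-1) between the original question and its back-translation."""
    forward = translate(english_question, target_lang)
    back = translate(forward, "English")
    return SequenceMatcher(None, english_question.lower(), back.lower()).ratio()

# Questions scoring below a chosen threshold (e.g. 0.8) would be flagged
# for the expert linguists to review.
```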

2. Tested Models

  • GPT-3.5-Turbo (commercial model)
  • Llama-3.1-8B-Instruct (open-source model)

3. Confidence Estimation Methods

Three approaches were compared:

| Method | How It Works | Example |
| --- | --- | --- |
| Probability-based | Calculates the average probability of each word in the answer | “The Eiffel Tower is in Paris” gets 0.92 confidence |
| p(True) | Asks the model “Is this answer correct?” | Model responds “True” with 85% probability |
| Self-verbalized | Prompts the model to state its confidence numerically | “I am 90% confident in this answer” |
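The sketch below shows how each of these scores can be computed in practice. It assumes you already have per-token log-probabilities and the model’s verification reply from your LLM client; the prompt template and function names are illustrative, not the paper’s exact implementation.

```python
import math
import re

# 1. Probability-based: average per-token probability of the generated answer.
def sequence_confidence(token_logprobs):
    """token_logprobs: log-probabilities of the answer tokens (assumed available)."""
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

# 2. p(True): ask the model to verify its own answer and use the probability
#    it assigns to "True" as the confidence score.
P_TRUE_TEMPLATE = (
    "Question: {question}\n"
    "Proposed answer: {answer}\n"
    "Is the proposed answer correct? Reply True or False: "
)

def p_true_confidence(true_false_logprobs):
    """true_false_logprobs: {'True': logp, 'False': logp} from the verification prompt."""
    p_true = math.exp(true_false_logprobs["True"])
    p_false = math.exp(true_false_logprobs["False"])
    return p_true / (p_true + p_false)  # normalize over the two options

# 3. Self-verbalized: prompt the model to state a confidence and parse it.
def parse_verbalized_confidence(model_reply):
    """Extracts e.g. 'I am 90% confident' -> 0.90; returns None if absent."""
    match = re.search(r"(\d{1,3})\s*%", model_reply)
    return int(match.group(1)) / 100 if match else None
```

The study then measures how well these confidence scores line up with actual accuracy, i.e. how well calibrated each method is.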

Key Findings: Language Dominance

1. Language-Agnostic Tasks (General Knowledge)

  • English Dominance: Models showed higher confidence and accuracy when answering in English
  • Math Tasks: Language had minimal impact (math symbols are universal)
  • Low-Resource Languages: Thai performed worse than high-resource languages (Chinese, Japanese, French)

Graph 1: Accuracy Comparison Across Languages (Source: Original Figure 3)

2. Language-Specific Tasks (Cultural Knowledge)

  • Native-Tone Advantage: Using the question’s language improved performance
  • Example: Japanese cultural questions answered in Japanese had 79.46% accuracy vs. 44.64% in English

The Native-Tone Prompting Strategy

Researchers developed a two-step approach to improve reliability:

1. Language Identification

First prompt the model to determine the question’s cultural context:

Question: "What traditional clothing do Japanese people wear during festivals?"
Identify language context: Japanese

2. Native Language Response

Generate the answer in the identified language:

Answer in Japanese: "祭りで着る伝統的な服装は浴衣です" (“The traditional clothing worn at festivals is the yukata”) (Confidence: 92%)

This NTP (Native-Tone Prompting) strategy:

  • Increased accuracy by 10-15% on cultural questions
  • Reduced calibration errors (confidence more accurately matched accuracy)
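A minimal sketch of this two-step flow is shown below. The `ask_llm` helper is a hypothetical stand-in for whatever chat-completion client you use, and the prompt wording is illustrative rather than the paper’s exact prompts.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call."""
    raise NotImplementedError("Wire this up to your LLM client.")

def native_tone_answer(question: str) -> str:
    # Step 1: identify the language/cultural context of the question.
    lang = ask_llm(
        "Which language is most closely associated with the cultural context "
        f"of this question? Reply with the language name only.\n\n{question}"
    ).strip()

    # Step 2: answer in the identified language and request a
    # self-verbalized confidence score.
    return ask_llm(
        f"Answer the following question in {lang}, then state your confidence "
        f"as a percentage.\n\nQuestion: {question}"
    )
```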

Extended Analysis

1. Other Confidence Methods

  • Paraphrasing Questions: Had minimal impact on confidence scores
  • Multiple Sampling: High sampling randomness reduced reliability (see the sketch after this list)
  • Chain-of-Thought: Improved reasoning but didn’t significantly affect confidence
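For reference, sampling-based confidence typically works by drawing several answers at a non-zero temperature and treating the agreement rate as the confidence. The sketch below illustrates the idea; `sample_llm` is a hypothetical helper, and exact-match voting is a simplification of how agreement can be measured.

```python
from collections import Counter

def sample_llm(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for a temperature-controlled chat-completion call."""
    raise NotImplementedError

def consistency_confidence(question: str, k: int = 10, temperature: float = 0.7):
    """Sample k answers; return the majority answer and its agreement rate."""
    answers = [sample_llm(question, temperature).strip() for _ in range(k)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / k
```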

2. Additional Languages

Testing on Korean, Arabic, German, Indonesian, and Italian confirmed:

  • English dominance in general knowledge
  • NTP benefits for culturally specific questions

Practical Implications

This research helps:

  • Developers build more reliable multilingual AI systems
  • Users understand when to trust AI responses
  • Researchers create better confidence estimation methods

FAQs

Q: What’s the difference between “language-agnostic” and “language-specific” tasks?

A: Language-agnostic tasks test general knowledge (math, science) where language shouldn’t matter. Language-specific tasks involve culturally contextual knowledge (traditions, geography).

Q: How does NTP actually work?

A: The model first identifies the question’s cultural context, then generates the answer in that language, leveraging its training data for that specific language.

Q: Why was Thai performance lower?

A: Most likely because Thai has less training data available (it is a lower-resource language) compared to high-resource languages like Chinese.

Conclusion

The MlingConf study reveals important patterns in multilingual AI reliability. While English currently shows advantages in general knowledge tasks, culturally specific questions benefit from native-language prompting. As AI systems become more globally used, understanding these nuances will be crucial for building trustworthy multilingual applications.
