Generate High-Quality Questions from Text — Practical Guide
What this tool does
This project generates multiple, diverse, human-readable questions from input text. It supports a range of large language model backends and providers. You feed the tool a dataset or a local file that contains text. The tool calls a model to create a set number of questions for every input item. Optionally, the tool can also generate answers for those questions. The final output is written as JSON Lines files. These files are ready for use in training, content creation, assessment generation, or dataset augmentation.
Quick start — minimal runnable example
Follow these steps to run a simple example on your machine.
# 1. Create a virtual environment and activate it
python3 -m venv .venv && source .venv/bin/activate
# 2. Install dependencies
pip install -r requirements.txt
# 3. Export an API key for the chosen provider
export OPENROUTER_API_KEY=your_api_key_here
# 4. Run the main script on a sample dataset
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
--provider openrouter \
--model qwen/qwen3-235b-a22b-2507 \
--output-dir ./data/questions_openrouter \
--start-index 0 \
--end-index 10 \
--num-questions 5 \
--text-column text \
--verbose
The example above shows a complete, minimal workflow. Replace the dataset, provider, model, and API key with values you control.
Environment and prerequisites
- Python 3.9 or later. The code uses modern type hints that require at least Python 3.9.
- Dependencies installed from requirements.txt. The listed packages include aiohttp, datasets, and tqdm.
- An API key for the provider you plan to use. Export the key as an environment variable using the naming conventions described below.
Supported input sources
The tool accepts two main types of input:
- Hugging Face datasets identified by org/dataset. The code loads these with datasets.load_dataset.
- Local files, in these formats:
  - JSON Lines files with .jsonl or .json extensions. Each line should contain a JSON object.
  - Parquet files.

The default text column name is text. Change this with the --text-column option when your file uses a different field name.
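For a quick local test, you can create a small JSON Lines input file yourself. This sketch uses the default text column name; the file name and record contents are illustrative:

```python
import json

# Three sample records using the default "text" column.
records = [
    {"text": "Photosynthesis converts light energy into chemical energy."},
    {"text": "The TCP handshake uses SYN, SYN-ACK, and ACK packets."},
    {"text": "Bayes' theorem relates conditional probabilities."},
]

with open("sample_input.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Verify each line parses back to an object with a "text" field
with open("sample_input.jsonl", encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]
print(len(lines), "records written")
```

Pass the resulting path as the first argument to src/main.py.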
Command line usage and important options
Run the script like this:
python3 src/main.py <dataset_or_jsonl_path> \
--provider <provider> \
--model <model_name> \
--output-dir <dir>
Key options you will use frequently:
Option | Meaning
---|---
--text-column TEXT | Name of the field containing text. Default is text.
--num-questions INT | Number of questions to generate per input. Default is 3.
--max-tokens INT | Maximum tokens per request. Default is 4096.
--provider-url URL | Use with --provider other to point to a custom API base URL.
--num-workers INT | Number of concurrent worker processes. Default is 1.
--shuffle | Shuffle the dataset before processing.
--max-items INT | Process only up to this number of items.
--start-index / --end-index | Slice the dataset. Start is inclusive, end is exclusive. Indexing is zero-based.
--dataset-split SPLIT | For Hugging Face datasets, select the split. Default is train.
--sleep-between-requests S | Seconds to sleep between API requests. Use to avoid rate limits.
--sleep-between-items S | Seconds to sleep between items. Use for throttling when processing large batches.
--style STYLE | Specify one or more generation styles. The program picks one per item.
--no-style | Generate questions without style directives. Use for neutral output.
--styles-file FILE | Load style definitions from a file with one style per line. Lines starting with # are comments.
--with-answer | Also request answers for the generated questions. Output will include an output field.
--answer-provider PROVIDER | Use a different provider for answer generation. Defaults to the provider used for questions.
--answer-model MODEL | Use a different model when generating answers. Defaults to the model used for questions.
--answer-single-request | Send all questions in a single request for answer generation. This may improve efficiency at the cost of error isolation.
--verbose / --debug | Increase logging verbosity for troubleshooting.
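The --start-index and --end-index options follow the same half-open convention as Python slicing, which makes a handy mental model:

```python
items = list(range(10))  # imagine 10 dataset items, indices 0..9

# --start-index 0 --end-index 10 selects indices 0 through 9
assert items[0:10] == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# --start-index 0 --end-index 5 selects the first five items only;
# the end index itself is excluded
selected = items[0:5]
print(selected)  # [0, 1, 2, 3, 4]
```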
Provider support and authentication conventions
The README lists many supported providers. Use the --provider option with one of the supported names. Common provider values include openai, anthropic, openrouter, qwen, groq, gemini, and other. The other provider allows you to point the tool at a custom OpenAI-compatible API by supplying --provider-url.
Environment variables for API keys follow a consistent naming pattern: the provider name in uppercase, followed by _API_KEY. Examples:

- OPENAI_API_KEY for OpenAI
- ANTHROPIC_API_KEY for Anthropic
- OPENROUTER_API_KEY for OpenRouter
- GROQ_API_KEY for Groq
- QWEN_API_KEY for Qwen
- OTHER_API_KEY for custom endpoints

Ollama typically runs locally at http://localhost:11434, so you may not need an environment variable when using it.
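The naming convention above reduces to a one-line rule. This sketch (a hypothetical helper, not part of the tool) derives the expected variable name for each provider and checks whether it is set:

```python
import os

def api_key_env_var(provider: str) -> str:
    """Map a provider name to its API key environment variable.

    Follows the convention described above: uppercase provider
    name plus the _API_KEY suffix.
    """
    return provider.upper() + "_API_KEY"

for provider in ["openai", "anthropic", "openrouter", "groq", "qwen", "other"]:
    var = api_key_env_var(provider)
    status = "set" if os.environ.get(var) else "missing"
    print(f"{provider:<12} -> {var} ({status})")
```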
Style control for generated questions
The tool provides flexible style control. By default it randomly selects from a built-in library of more than 35 styles. These styles include academic, creative, casual, and more.
You can control style in three ways:
- Use --style to set a single style string or a comma-separated list. The program will choose one style per item.
- Use --styles-file to load a custom list of styles from a text file. Each line is one style; use # to add comments.
- Use --no-style to generate questions without additional style instructions.
Style selection affects phrasing and tone. Use a consistent style when you want uniform question wording. Use multiple styles to create variety.
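A styles file is plain text: one style per line, with # marking comment lines. This sketch writes such a file (the file name and style strings are illustrative) and previews the styles the tool would load:

```python
styles = """\
# Custom styles for question generation
# Lines starting with # are ignored.
formal and academic
casual and conversational
exam-style with a single correct answer
"""

with open("my_styles.txt", "w", encoding="utf-8") as f:
    f.write(styles)

# Preview the non-comment, non-blank styles
loaded = [line for line in styles.splitlines()
          if line.strip() and not line.lstrip().startswith("#")]
print(loaded)
```

Pass the file to the tool with --styles-file my_styles.txt.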
Generating answers alongside questions
When --with-answer
is set, the tool will generate answers for the questions it produces. Answer generation has configurable behavior:
- You can delegate answers to a different provider or model by using --answer-provider and --answer-model.
- The --answer-single-request option sends all questions in one request to reduce the number of API calls. This may save time but can reduce error isolation.
- On failure, the output field for that question will be set to the string error, and an answer_error field will contain more details about what went wrong. Use these fields to programmatically detect problems and trigger retries if needed.
Example command that generates answers with a different provider:
export OPENROUTER_API_KEY=your_openrouter_key
export ANTHROPIC_API_KEY=your_anthropic_key
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
--provider openrouter \
--model qwen/qwen3-235b-a22b-2507 \
--answer-provider anthropic \
--answer-model claude-3-haiku-20240307 \
--output-dir ./data/qa_multi_provider \
--start-index 0 \
--end-index 5 \
--num-questions 2 \
--with-answer \
--verbose
Output format details
Results are saved in JSON Lines files. The file name includes a timestamp and relevant parameters for easy traceability. Each line is a JSON object describing one generated question and related metadata.
A typical record contains these fields:
- input: The generated question text.
- source_text: The original text from which the question was derived.
- question_index: The index of this question within the set for the input.
- total_questions: Number of questions generated for this input.
- metadata: Metadata about the original item, including the original item index and the text column name.
- generation_settings: Settings used to produce this question, including provider, model, style, number of requested questions, and the actual number generated.
- timestamp: ISO 8601 timestamp for when the question was generated.
Example success record:
{
"input": "What practical applications benefit most from question generation using LLMs?",
"source_text": "...original text...",
"question_index": 1,
"total_questions": 5,
"metadata": { "original_item_index": 0, "text_column": "text" },
"generation_settings": {
"provider": "openrouter",
"model": "qwen/qwen3-235b-a22b-2507",
"style": "formal and academic",
"num_questions_requested": 5,
"num_questions_generated": 5,
"max_tokens": 4096
},
"timestamp": "2025-08-17T12:34:56.789012"
}
When answers are generated, the record also contains an output field that holds the answer text. If an answer fails, output will equal "error" and the answer_error field will hold diagnostic information.
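Those two fields make failures easy to detect programmatically. A minimal sketch (the demo records and error message are illustrative) that collects the records needing a retry:

```python
import json

def find_failed_answers(lines):
    """Yield parsed records whose answer generation failed.

    Per the format above, a failure sets output to the string
    "error" and records details in answer_error.
    """
    for line in lines:
        record = json.loads(line)
        if record.get("output") == "error":
            yield record

# Demo on two inline records: one success, one failure
demo = [
    json.dumps({"input": "Q1?", "output": "An answer."}),
    json.dumps({"input": "Q2?", "output": "error",
                "answer_error": "rate limit exceeded"}),
]
failed = list(find_failed_answers(demo))
print(len(failed), "failed record(s)")
```

In practice you would pass the lines of an output JSONL file instead of the inline demo records.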
Concurrency and rate control
Two main knobs help you manage throughput and provider limits:
- --num-workers controls concurrency. Increase it to process more items in parallel.
- --sleep-between-requests and --sleep-between-items let you insert pauses to avoid rate limiting. Use them when you see throttling errors.
Tune these values based on the API limits of your provider and the model’s reliability.
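A rough starting point for --sleep-between-requests can be derived from the provider's documented rate limit; the limit and worker count below are illustrative:

```python
# Suppose the provider allows 30 requests per minute (illustrative).
requests_per_minute = 30
workers = 2  # value passed to --num-workers

# Each worker should wait long enough that the combined request
# rate from all workers stays under the limit.
sleep_between_requests = 60.0 * workers / requests_per_minute
print(f"--sleep-between-requests {sleep_between_requests:.1f}")
```

Treat the result as a floor and add headroom if you still see throttling errors.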
Example command set
Basic example with a single provider:
export OPENROUTER_API_KEY=your_api_key_here
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
--provider openrouter \
--model qwen/qwen3-235b-a22b-2507 \
--output-dir ./data/questions_openrouter \
--start-index 0 \
--end-index 10 \
--num-questions 5 \
--text-column text \
--verbose
Generate questions with answers:
export OPENROUTER_API_KEY=your_api_key_here
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
--provider openrouter \
--model qwen/qwen3-235b-a22b-2507 \
--output-dir ./data/qa_openrouter \
--start-index 0 \
--end-index 10 \
--num-questions 3 \
--with-answer
Multi-provider example where questions and answers use different backends is shown above in the answer generation section.
Practical recommendations
Follow these pragmatic tips when using the tool in real tasks:
- Use --num-workers to scale throughput. Balance concurrency with the provider's rate limits.
- For large datasets, process data in slices using --start-index and --end-index. This avoids loading everything into memory and simplifies retries.
- Use --styles-file to maintain consistent style lists across runs. Put one style per line and comment lines with a hash sign.
- Check the JSON Lines output for answer_error fields to detect answer failures. Retry only failed items instead of reprocessing the entire batch.
- Save outputs to separate directories for each run. Timestamps in file names make audits and rollbacks straightforward.
- When you need deterministic output, specify a single style and keep models and provider settings consistent across runs.
These practices help maintain reliability and make it easier to track the provenance of generated data.
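The slicing recommendation above can be automated. This sketch prints one command per slice; the dataset size, slice size, and placeholders are illustrative:

```python
# Generate one command per 100-item slice of a 1000-item dataset.
total_items = 1000   # illustrative dataset size
slice_size = 100

commands = []
for start in range(0, total_items, slice_size):
    end = min(start + slice_size, total_items)
    commands.append(
        "python3 src/main.py <dataset_or_jsonl_path> "
        "--provider openrouter --model <model> "
        f"--output-dir ./data/out_{start}_{end} "
        f"--start-index {start} --end-index {end}"
    )

print(len(commands), "slice commands")
print(commands[0])
```

Writing each slice to its own output directory keeps retries of a failed slice independent of the rest.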
Troubleshooting checklist
If you run into problems, use this checklist to diagnose common failures:
- Confirm dependencies are installed with pip install -r requirements.txt.
- Verify the Python version is 3.9 or later with python3 --version.
- Confirm the environment variable for the provider API key is exported and correct.
- Check provider and model name spelling.
- Ensure the output directory exists and is writable.
- If you see rate-limit errors, reduce concurrency and add delays using --sleep-between-requests.
- Inspect JSONL files for output: "error" and answer_error fields to identify failed items.
How-to steps: structured mini guide
Follow these steps to get a predictable, repeatable run.
1. Create and activate a Python virtual environment.
2. Install dependencies from requirements.txt.
3. Export the API key for the provider you plan to use.
4. Run src/main.py with the dataset path, provider, model, output directory, and other options you need.
5. Inspect the generated JSONL files in the output directory. Use tools such as jq, Python scripts, or text editors to parse lines and validate entries.
6. If you requested answers, scan for answer_error fields and retry failed items as necessary.
A compact command summary:
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
export OPENROUTER_API_KEY=your_api_key_here
python3 src/main.py <dataset_or_jsonl_path> --provider openrouter --model <model> --output-dir ./data/out --num-questions 5
FAQ
Do I have to use Python 3.9?
Yes. The code uses type features that require Python 3.9 or newer.
How do I use a local JSONL file?
Pass the file path as the first argument to src/main.py. Ensure the records include a text field, and specify the field with --text-column if it is not named text.
Can I use different providers for questions and answers?
Yes. Use --answer-provider and --answer-model to delegate answers to a different backend.
What file format is the output?
Output is JSON Lines. Each line is one JSON object representing a generated question and optional answer.
How do I handle failures in answer generation?
Look for output fields set to "error" and examine answer_error for details. Retry only those items.
Structured data snippets ready for embedding
You can embed structured data blocks to describe the page content. Below are JSON-LD examples ready to include in a web page.
FAQ schema JSON-LD:
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "Do I have to use Python 3.9?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Yes. The code uses type features that require Python 3.9 or newer."
}
},
{
"@type": "Question",
"name": "How do I use a local JSONL file?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Pass the file path as the first argument to src/main.py and specify the text column if needed."
}
},
{
"@type": "Question",
"name": "Can questions and answers use different providers?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Yes. Use --answer-provider and --answer-model to delegate answers to a different backend."
}
}
]
}
HowTo schema JSON-LD for the minimal run:
{
"@context": "https://schema.org",
"@type": "HowTo",
"name": "Minimal workflow to generate questions from text",
"step": [
{
"@type": "HowToStep",
"name": "Create a virtual environment",
"text": "python3 -m venv .venv && source .venv/bin/activate"
},
{
"@type": "HowToStep",
"name": "Install dependencies",
"text": "pip install -r requirements.txt"
},
{
"@type": "HowToStep",
"name": "Set API key",
"text": "export OPENROUTER_API_KEY=your_api_key_here"
},
{
"@type": "HowToStep",
"name": "Run the generator",
"text": "python3 src/main.py <dataset_or_jsonl_path> --provider openrouter --model <model> --output-dir ./data/out --num-questions 5"
}
]
}