Generate High-Quality Questions from Text — Practical Guide
What this tool does
This project generates multiple, diverse, human-readable questions from input text. It supports a range of large language model backends and providers. You feed the tool a dataset or a local file that contains text. The tool calls a model to create a set number of questions for every input item. Optionally, the tool can also generate answers for those questions. The final output is written as JSON Lines files. These files are ready for use in training, content creation, assessment generation, or dataset augmentation.
Quick start — minimal runnable example
Follow these steps to run a simple example on your machine.
# 1. Create a virtual environment and activate it
python3 -m venv .venv && source .venv/bin/activate
# 2. Install dependencies
pip install -r requirements.txt
# 3. Export an API key for the chosen provider
export OPENROUTER_API_KEY=your_api_key_here
# 4. Run the main script on a sample dataset
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
--provider openrouter \
--model qwen/qwen3-235b-a22b-2507 \
--output-dir ./data/questions_openrouter \
--start-index 0 \
--end-index 10 \
--num-questions 5 \
--text-column text \
--verbose
The example above shows a complete, minimal workflow. Replace the dataset, provider, model, and API key with values you control.
Environment and prerequisites
- Python 3.9 or later. The code uses modern type hints that require at least Python 3.9.
- Dependencies installed from requirements.txt. The listed packages include aiohttp, datasets, and tqdm.
- An API key for the provider you plan to use. Export the key as an environment variable using the naming conventions described below.
Supported input sources
The tool accepts two main types of input:
- Hugging Face datasets identified by org/dataset. The code loads these with datasets.load_dataset.
- Local files, in these formats:
  - JSON Lines files with .jsonl or .json extensions. Each line should contain a JSON object.
  - Parquet files.

The default text column name is text. Change this with the --text-column option when your file uses a different field name.
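For a quick local test, you can create a small JSON Lines input file yourself. This sketch uses the default text column name; the file name and record contents are illustrative:

```python
import json

# Three sample records using the default "text" column.
records = [
    {"text": "Photosynthesis converts light energy into chemical energy."},
    {"text": "The TCP handshake uses SYN, SYN-ACK, and ACK packets."},
    {"text": "Bayes' theorem relates conditional probabilities."},
]

with open("sample_input.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Verify each line parses back to an object with a "text" field
with open("sample_input.jsonl", encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]
print(len(lines), "records written")
```

Pass the resulting path as the first argument to src/main.py.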
Command line usage and important options
Run the script like this:
python3 src/main.py <dataset_or_jsonl_path> \
--provider <provider> \
--model <model_name> \
--output-dir <dir>
Key options you will use frequently:
Option | Meaning
---|---
--text-column TEXT | Name of the field containing text. Default is text.
--num-questions INT | Number of questions to generate per input. Default is 3.
--max-tokens INT | Maximum tokens per request. Default is 4096.
--provider-url URL | Use with --provider other to point to a custom API base URL.
--num-workers INT | Number of concurrent worker processes. Default is 1.
--shuffle | Shuffle the dataset before processing.
--max-items INT | Process only up to this number of items.
--start-index / --end-index | Slice the dataset. Start is inclusive, end is exclusive. Indexing is zero-based.
--dataset-split SPLIT | For Hugging Face datasets, select the split. Default is train.
--sleep-between-requests S | Seconds to sleep between API requests. Use to avoid rate limits.
--sleep-between-items S | Seconds to sleep between items. Use for throttling when processing large batches.
--style STYLE | Specify one or more generation styles. The program picks one per item.
--no-style | Generate questions without style directives. Use for neutral output.
--styles-file FILE | Load style definitions from a file with one style per line. Lines starting with # are comments.
--with-answer | Also request answers for the generated questions. Output will include an output field.
--answer-provider PROVIDER | Use a different provider for answer generation. Defaults to the provider used for questions.
--answer-model MODEL | Use a different model when generating answers. Defaults to the model used for questions.
--answer-single-request | Send all questions in a single request for answer generation. This may improve efficiency at the cost of error isolation.
--verbose / --debug | Increase logging verbosity for troubleshooting.
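The --start-index and --end-index options follow the same half-open convention as Python slicing, which makes a handy mental model:

```python
items = list(range(10))  # imagine 10 dataset items, indices 0..9

# --start-index 0 --end-index 10 selects indices 0 through 9
assert items[0:10] == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# --start-index 0 --end-index 5 selects the first five items only;
# the end index itself is excluded
selected = items[0:5]
print(selected)  # [0, 1, 2, 3, 4]
```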
Provider support and authentication conventions
The README lists many supported providers. Use the --provider option with one of the supported names. Common provider values include openai, anthropic, openrouter, qwen, groq, gemini, and other. The other provider allows you to point the tool at a custom OpenAI-compatible API by supplying --provider-url.
Environment variables for API keys follow a consistent naming pattern: the provider name in uppercase, followed by _API_KEY. Examples:

- OPENAI_API_KEY for OpenAI
- ANTHROPIC_API_KEY for Anthropic
- OPENROUTER_API_KEY for OpenRouter
- GROQ_API_KEY for Groq
- QWEN_API_KEY for Qwen
- OTHER_API_KEY for custom endpoints

Ollama typically runs locally at http://localhost:11434, so you may not need an environment variable when using it.
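The naming convention above reduces to a one-line rule. This sketch (a hypothetical helper, not part of the tool) derives the expected variable name for each provider and checks whether it is set:

```python
import os

def api_key_env_var(provider: str) -> str:
    """Map a provider name to its API key environment variable.

    Follows the convention described above: uppercase provider
    name plus the _API_KEY suffix.
    """
    return provider.upper() + "_API_KEY"

for provider in ["openai", "anthropic", "openrouter", "groq", "qwen", "other"]:
    var = api_key_env_var(provider)
    status = "set" if os.environ.get(var) else "missing"
    print(f"{provider:<12} -> {var} ({status})")
```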
Style control for generated questions
The tool provides flexible style control. By default it randomly selects from a built-in library of more than 35 styles. These styles include academic, creative, casual, and more.
You can control style in three ways:
- Use --style to set a single style string or a comma-separated list. The program will choose one style per item.
- Use --styles-file to load a custom list of styles from a text file. Each line is one style; use # to add comments.
- Use --no-style to generate questions without additional style instructions.
Style selection affects phrasing and tone. Use a consistent style when you want uniform question wording. Use multiple styles to create variety.
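A styles file is plain text: one style per line, with # marking comment lines. This sketch writes such a file (the file name and style strings are illustrative) and previews the styles the tool would load:

```python
styles = """\
# Custom styles for question generation
# Lines starting with # are ignored.
formal and academic
casual and conversational
exam-style with a single correct answer
"""

with open("my_styles.txt", "w", encoding="utf-8") as f:
    f.write(styles)

# Preview the non-comment, non-blank styles
loaded = [line for line in styles.splitlines()
          if line.strip() and not line.lstrip().startswith("#")]
print(loaded)
```

Pass the file to the tool with --styles-file my_styles.txt.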
Generating answers alongside questions
When --with-answer
is set, the tool will generate answers for the questions it produces. Answer generation has configurable behavior:
- You can delegate answers to a different provider or model by using --answer-provider and --answer-model.
- The --answer-single-request option sends all questions in one request to reduce the number of API calls. This may save time but can reduce error isolation.
- On failure, the output field for that question will be set to the string error, and an answer_error field will contain more details about what went wrong. Use these fields to programmatically detect problems and trigger retries if needed.
Example command that generates answers with a different provider:
export OPENROUTER_API_KEY=your_openrouter_key
export ANTHROPIC_API_KEY=your_anthropic_key
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
--provider openrouter \
--model qwen/qwen3-235b-a22b-2507 \
--answer-provider anthropic \
--answer-model claude-3-haiku-20240307 \
--output-dir ./data/qa_multi_provider \
--start-index 0 \
--end-index 5 \
--num-questions 2 \
--with-answer \
--verbose
Output format details
Results are saved in JSON Lines files. The file name includes a timestamp and relevant parameters for easy traceability. Each line is a JSON object describing one generated question and related metadata.
A typical record contains these fields:
- input: The generated question text.
- source_text: The original text from which the question was derived.
- question_index: The index of this question within the set for the input.
- total_questions: Number of questions generated for this input.
- metadata: Metadata about the original item, including the original item index and the text column name.
- generation_settings: Settings used to produce this question, including provider, model, style, number of requested questions, and the actual number generated.
- timestamp: ISO 8601 timestamp for when the question was generated.
Example success record:
{
"input": "What practical applications benefit most from question generation using LLMs?",
"source_text": "...original text...",
"question_index": 1,
"total_questions": 5,
"metadata": { "original_item_index": 0, "text_column": "text" },
"generation_settings": {
"provider": "openrouter",
"model": "qwen/qwen3-235b-a22b-2507",
"style": "formal and academic",
"num_questions_requested": 5,
"num_questions_generated": 5,
"max_tokens": 4096
},
"timestamp": "2025-08-17T12:34:56.789012"
}
When answers are generated, the record also contains an output field that holds the answer text. If an answer fails, output will equal "error" and the answer_error field will hold diagnostic information.
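Those two fields make failures easy to detect programmatically. A minimal sketch (the demo records and error message are illustrative) that collects the records needing a retry:

```python
import json

def find_failed_answers(lines):
    """Yield parsed records whose answer generation failed.

    Per the format above, a failure sets output to the string
    "error" and records details in answer_error.
    """
    for line in lines:
        record = json.loads(line)
        if record.get("output") == "error":
            yield record

# Demo on two inline records: one success, one failure
demo = [
    json.dumps({"input": "Q1?", "output": "An answer."}),
    json.dumps({"input": "Q2?", "output": "error",
                "answer_error": "rate limit exceeded"}),
]
failed = list(find_failed_answers(demo))
print(len(failed), "failed record(s)")
```

In practice you would pass the lines of an output JSONL file instead of the inline demo records.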
Concurrency and rate control
Two main knobs help you manage throughput and provider limits:
- --num-workers controls concurrency. Increase it to process more items in parallel.
- --sleep-between-requests and --sleep-between-items let you insert pauses to avoid rate limiting. Use them when you see throttling errors.
Tune these values based on the API limits of your provider and the model’s reliability.
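A rough starting point for --sleep-between-requests can be derived from the provider's documented rate limit; the limit and worker count below are illustrative:

```python
# Suppose the provider allows 30 requests per minute (illustrative).
requests_per_minute = 30
workers = 2  # value passed to --num-workers

# Each worker should wait long enough that the combined request
# rate from all workers stays under the limit.
sleep_between_requests = 60.0 * workers / requests_per_minute
print(f"--sleep-between-requests {sleep_between_requests:.1f}")
```

Treat the result as a floor and add headroom if you still see throttling errors.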
Example command set
Basic example with a single provider:
export OPENROUTER_API_KEY=your_api_key_here
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
--provider openrouter \
--model qwen/qwen3-235b-a22b-2507 \
--output-dir ./data/questions_openrouter \
--start-index 0 \
--end-index 10 \
--num-questions 5 \
--text-column text \
--verbose
Generate questions with answers:
export OPENROUTER_API_KEY=your_api_key_here
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
--provider openrouter \
--model qwen/qwen3-235b-a22b-2507 \
--output-dir ./data/qa_openrouter \
--start-index 0 \
--end-index 10 \
--num-questions 3 \
--with-answer
Multi-provider example where questions and answers use different backends is shown above in the answer generation section.
Practical recommendations
Follow these pragmatic tips when using the tool in real tasks:
- Use --num-workers to scale throughput. Balance concurrency with the provider's rate limits.
- For large datasets, process data in slices using --start-index and --end-index. This avoids loading everything into memory and simplifies retries.
- Use --styles-file to maintain consistent style lists across runs. Put one style per line and comment lines with a hash sign.
- Check the JSON Lines output for answer_error fields to detect answer failures. Retry only failed items instead of reprocessing the entire batch.
- Save outputs to separate directories for each run. Timestamps in file names make audits and rollbacks straightforward.
- When you need deterministic output, specify a single style and keep models and provider settings consistent across runs.
These practices help maintain reliability and make it easier to track the provenance of generated data.
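The slicing recommendation above can be automated. This sketch prints one command per slice; the dataset size, slice size, and placeholders are illustrative:

```python
# Generate one command per 100-item slice of a 1000-item dataset.
total_items = 1000   # illustrative dataset size
slice_size = 100

commands = []
for start in range(0, total_items, slice_size):
    end = min(start + slice_size, total_items)
    commands.append(
        "python3 src/main.py <dataset_or_jsonl_path> "
        "--provider openrouter --model <model> "
        f"--output-dir ./data/out_{start}_{end} "
        f"--start-index {start} --end-index {end}"
    )

print(len(commands), "slice commands")
print(commands[0])
```

Writing each slice to its own output directory keeps retries of a failed slice independent of the rest.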
Troubleshooting checklist
If you run into problems, use this checklist to diagnose common failures:
- Confirm dependencies are installed with pip install -r requirements.txt.
- Verify the Python version is 3.9 or later with python3 --version.
- Confirm the environment variable for the provider API key is exported and correct.
- Check provider and model name spelling.
- Ensure the output directory exists and is writable.
- If you see rate-limit errors, reduce concurrency and add delays using --sleep-between-requests.
- Inspect JSONL files for output: "error" and answer_error fields to identify failed items.
How-to steps: structured mini guide
Follow these steps to get a predictable, repeatable run.
1. Create and activate a Python virtual environment.
2. Install dependencies from requirements.txt.
3. Export the API key for the provider you plan to use.
4. Run src/main.py with the dataset path, provider, model, output directory, and other options you need.
5. Inspect the generated JSONL files in the output directory. Use tools such as jq, Python scripts, or text editors to parse lines and validate entries.
6. If you requested answers, scan for answer_error fields and retry failed items as necessary.
A compact command summary:
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
export OPENROUTER_API_KEY=your_api_key_here
python3 src/main.py <dataset_or_jsonl_path> --provider openrouter --model <model> --output-dir ./data/out --num-questions 5
FAQ
Do I have to use Python 3.9?
Yes. The code uses type features that require Python 3.9 or newer.
How do I use a local JSONL file?
Pass the file path as the first argument to src/main.py. Ensure the records include a text field, and specify the field with --text-column if it is not named text.
Can I use different providers for questions and answers?
Yes. Use --answer-provider and --answer-model to delegate answers to a different backend.
What file format is the output?
Output is JSON Lines. Each line is one JSON object representing a generated question and optional answer.
How do I handle failures in answer generation?
Look for output fields set to "error" and examine answer_error for details. Retry only those items.
Structured data snippets ready for embedding
You can embed structured data blocks to describe the page content. Below are JSON-LD examples ready to include in a web page.
FAQ schema JSON-LD:
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "Do I have to use Python 3.9?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Yes. The code uses type features that require Python 3.9 or newer."
}
},
{
"@type": "Question",
"name": "How do I use a local JSONL file?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Pass the file path as the first argument to src/main.py and specify the text column if needed."
}
},
{
"@type": "Question",
"name": "Can questions and answers use different providers?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Yes. Use --answer-provider and --answer-model to delegate answers to a different backend."
}
}
]
}
HowTo schema JSON-LD for the minimal run:
{
"@context": "https://schema.org",
"@type": "HowTo",
"name": "Minimal workflow to generate questions from text",
"step": [
{
"@type": "HowToStep",
"name": "Create a virtual environment",
"text": "python3 -m venv .venv && source .venv/bin/activate"
},
{
"@type": "HowToStep",
"name": "Install dependencies",
"text": "pip install -r requirements.txt"
},
{
"@type": "HowToStep",
"name": "Set API key",
"text": "export OPENROUTER_API_KEY=your_api_key_here"
},
{
"@type": "HowToStep",
"name": "Run the generator",
"text": "python3 src/main.py <dataset_or_jsonl_path> --provider openrouter --model <model> --output-dir ./data/out --num-questions 5"
}
]
}