Turn Any Article into a High-Quality Question Bank with AI
“I have hundreds of journal papers and need solid questions for model training—fast.”
“Can my laptop handle it? Do I need a GPU?”
“Which provider is cheapest if I only want to test ten samples?”
If any of those questions sound familiar, this guide will give you exact, copy-paste answers.
We break this small open-source tool down into plain steps, show working examples, and keep everything grounded in the original README: no external links, no fluff.
1. What the Tool Actually Does
Feed it any text, get N questions back.

- Data source: Hugging Face datasets or local `.jsonl`, `.json`, or `.parquet` files.
- Engine: 10+ large language model providers, including OpenAI, Anthropic, Gemini, local Ollama, and more.
- Tone control: 30+ built-in styles (academic, humorous, critical, kid-friendly…), chosen at random or forced by you.
- Output: one JSONL file per run. Each line carries the original text, the generated question, and full metadata for downstream cleaning.
2. Environment Setup in Three Commands
| Step | Command | Why it matters |
|---|---|---|
| 1. Optional virtual environment | `python3 -m venv .venv && source .venv/bin/activate` | Keeps dependencies tidy |
| 2. Install packages | `pip install -r requirements.txt` | Only three libraries: `aiohttp`, `datasets`, `tqdm` |
| 3. Export the API key | `export OPENROUTER_API_KEY=YOUR_KEY` | Use whichever provider you like; see the table below |
No GPU is required—computation happens in the cloud, and your laptop only sends HTTP requests.
3. Run Your First 10 Lines in 30 Seconds
Copy-paste the block and hit Enter:
```shell
export OPENROUTER_API_KEY=YOUR_KEY
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/demo \
  --start-index 0 \
  --end-index 10 \
  --num-questions 5 \
  --text-column text \
  --verbose
```
When it stops, `./data/demo` contains a file like `questions_2025-08-18_14-22-03_hindawi-journals_openrouter_qwen.jsonl`.
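To check the run worked without opening the file by hand, you can read the first record back; the generated question lives in its `input` field (see the output anatomy section). A minimal sketch with a helper name of our own choosing:

```python
import json

def peek_first_question(path):
    """Return the generated question (the "input" field) from the
    first record of a run's JSONL output file."""
    with open(path, encoding="utf-8") as f:
        record = json.loads(f.readline())
    return record["input"]
```

Call it as `peek_first_question("./data/demo/questions_....jsonl")` with your actual timestamped filename.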
4. Every Flag Explained in One Table
| Flag | Example Value | Purpose | Common Pitfall |
|---|---|---|---|
| `<dataset_or_path>` | `mkurman/hindawi-journals-2007-2023` or `./myfile.jsonl` | Where to read text | Local path must end with `.jsonl`, `.json`, or `.parquet` |
| `--provider` | `openrouter` | Which LLM gateway | See provider list |
| `--model` | `qwen/qwen3-235b-a22b-2507` | Exact model name | Must match provider catalog |
| `--output-dir` | `./results` | Where to drop JSONL | Old files stay; new ones are time-stamped |
| `--num-questions` | `3` | Questions per text | Too high may hit rate limits |
| `--start-index` / `--end-index` | `0` / `100` | Slice dataset | Zero-based, end exclusive |
| `--text-column` | `text` | Column holding text | Change if your file uses `content` |
| `--style` | `"formal and academic"` | Force tone | One string or comma-separated list |
| `--no-style` | — | Neutral questions | Overrides any style flag |
| `--styles-file` | `./my_styles.txt` | Load styles from file | One style per line; `#` comments |
| `--num-workers` | `4` | Parallel calls | Combine with `--sleep-between-requests` |
| `--sleep-between-requests` | `0.5` | Seconds between calls | Prevents HTTP 429 |
| `--max-items` | `1000` | Global cap | Alternative to `--end-index` |
5. Provider & Key Cheat Sheet
| Provider | Environment Variable | Notes |
|---|---|---|
| OpenAI | `OPENAI_API_KEY` | Works with official or proxy endpoints |
| Anthropic | `ANTHROPIC_API_KEY` | Claude family |
| OpenRouter | `OPENROUTER_API_KEY` | One key → many models |
| Groq | `GROQ_API_KEY` | Very fast inference |
| Together | `TOGETHER_API_KEY` | Strong open-source catalog |
| Cerebras | `CEREBRAS_API_KEY` | New chips, beta pricing |
| Gemini | `GEMINI_API_KEY` | Export as env var even if docs mention a query param |
| Qwen (official) | `QWEN_API_KEY` | Tongyi Qianwen |
| Qwen (DeepInfra) | `QWEN_DEEPINFRA_API_KEY` | Alternative endpoint |
| Kimi (Moonshot) | `KIMI_API_KEY` | Chinese-first |
| Z.ai | `Z_AI_API_KEY` | Dots/dashes replaced with underscores |
| Featherless | `FEATHERLESS_API_KEY` | Budget friendly |
| Chutes | `CHUTES_API_KEY` | Same as above |
| Hugging Face | `HUGGINGFACE_API_KEY` | Needed only for private datasets |
| Ollama | None | Assumes `http://localhost:11434` |
6. Three Ways to Feed the Tool
- Hugging Face Hub: use `org/dataset`; the default split is `train`; add `--dataset-split test` if needed.
- Local JSONL: one JSON object per line:

  ```json
  {"text": "Large Language Models excel at generating diverse questions."}
  {"text": "Neural networks learn complex patterns from data."}
  ```

- Local Parquet: the same `--text-column` rule applies.
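If your raw texts live in plain files or a Python list, a few lines can produce the one-object-per-line JSONL the tool expects. A sketch of our own (not part of the tool); the `text_column` parameter mirrors the `--text-column` flag:

```python
import json

def texts_to_jsonl(texts, path, text_column="text"):
    """Write one JSON object per line, with each text stored under
    the column name the tool will read via --text-column."""
    with open(path, "w", encoding="utf-8") as f:
        for t in texts:
            f.write(json.dumps({text_column: t}, ensure_ascii=False) + "\n")
```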
7. Tone System Deep Dive
7.1 Default Behavior
No extra flags → tool picks one of 35+ styles per text.
Examples:
- formal and academic
- funny and entertaining
- practical and application-focused
- thought-provoking and philosophical
7.2 Single Tone
`--style "formal and academic"`
7.3 Random Pool
`--style "casual and conversational,funny and humorous,concise and direct"`
Each text gets one style from the pool.
7.4 File-Based Styles
Create `my_styles.txt`:

```
# Lines starting with # are ignored
formal and academic
creative and imaginative
simple and straightforward
```

Then:

`--styles-file ./my_styles.txt`
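Before a long run, you may want to sanity-check which styles the file actually contains. The rules above (one style per line, `#` comments ignored) can be mirrored in a short sketch; this is our own approximation, not the tool's actual parser:

```python
def load_styles(path):
    """Parse a styles file: one style per line;
    blank lines and lines starting with '#' are skipped."""
    styles = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                styles.append(line)
    return styles
```

Printing `load_styles("./my_styles.txt")` shows exactly which tones will enter the random pool.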
8. Output Anatomy
Successful record:
```json
{
  "input": "What are the latest advances in text summarization?",
  "source_text": "Original paragraph...",
  "question_index": 1,
  "total_questions": 5,
  "metadata": { "original_item_index": 0, "text_column": "text" },
  "generation_settings": {
    "provider": "openrouter",
    "model": "qwen/qwen3-235b-a22b-2507",
    "style": "formal and academic",
    "num_questions_requested": 5,
    "num_questions_generated": 5,
    "max_tokens": 4096
  },
  "timestamp": "2025-08-17T12:34:56.789012"
}
```
Failure record:
```json
{
  "error": "Rate limit exceeded",
  "source_text": "...",
  "metadata": { ... }
}
```
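Because successes and failures land in the same file, a small post-processing pass can separate them, for example to retry the failed texts later. A sketch assuming only the record shapes shown above (failures carry an `error` key):

```python
import json

def split_records(path):
    """Separate successful generations from failure records
    (any record carrying an 'error' key)."""
    ok, failed = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            (failed if "error" in rec else ok).append(rec)
    return ok, failed
```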
9. Quick-Fire FAQ
Q1: I only want 5 samples—shortest command?
```shell
--max-items 5
# or
--start-index 0 --end-index 5
```
Q2: How do I stay inside free tier limits?
`--sleep-between-requests 1.0 --num-workers 1`
Q3: Local Ollama workflow?
- Start Ollama and pull a model:

  ```shell
  ollama run hf.co/lmstudio-community/Qwen3-4B-Instruct-2507-GGUF:Q4_K_M
  ```

- Run the script:

  ```shell
  python3 src/main.py ./my.jsonl \
    --provider ollama \
    --model hf.co/lmstudio-community/Qwen3-4B-Instruct-2507-GGUF:Q4_K_M \
    --output-dir ./ollama_out
  ```

No key is required.
Q4: Output files are huge—any rollover?
Timestamp is automatic; just schedule daily runs.
Q5: How custom can styles get?
Any short English phrase the model understands: `--style "like a curious five-year-old"`.
10. Real-World Mini-Project
Turn 100 news articles into interview questions.
```shell
python3 src/main.py ./news.jsonl \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./interview_q \
  --style "critical thinking for job interview" \
  --num-questions 3 \
  --max-items 100 \
  --sleep-between-requests 0.5
```
Open `interview_q/questions_2025-08-18_xxx.jsonl`, remove empty lines, and you’re done.
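The empty-line cleanup can be scripted rather than done by hand. A minimal sketch that copies the file and drops blank lines:

```python
def drop_empty_lines(src, dst):
    """Copy a JSONL file to a new path, skipping blank lines,
    as a simple cleaning pass before downstream use."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():
                fout.write(line)
```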
11. Troubleshooting Checklist
| Error Message | Root Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: aiohttp` | Missing dependencies | `pip install -r requirements.txt` |
| `401 Unauthorized` | Wrong or unset key | Re-export the correct variable |
| `429 Rate limit` | Too many requests | Increase `--sleep-between-requests` |
| Empty output | Index out of range | Check dataset length |
| `Connection refused` | Ollama offline | Ensure `http://localhost:11434` responds |
12. Next Steps
You now know how to:
- Install the tool in under five minutes
- Generate questions from any public or local dataset
- Tune tone, rate limits, and concurrency
- Read the output and fix common errors

What you can do next:

- Feed the question bank into a RAG pipeline.
- Maintain a company style file for customer-support or tech-interview scenarios.
- Load the JSONL into Excel or a database for human quality review.
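For the database route, Python's built-in `sqlite3` module is enough to build a simple review table. A sketch assuming the record fields described in the output anatomy section (the table name and column choices are ours):

```python
import json
import sqlite3

def jsonl_to_sqlite(jsonl_path, db_path):
    """Load successful question records into a SQLite table
    for spreadsheet-style human review; failure records are skipped."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS questions (question TEXT, source_text TEXT, style TEXT)"
    )
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if "error" in rec:
                continue  # keep only successful generations
            conn.execute(
                "INSERT INTO questions VALUES (?, ?, ?)",
                (
                    rec.get("input"),
                    rec.get("source_text"),
                    rec.get("generation_settings", {}).get("style"),
                ),
            )
    conn.commit()
    conn.close()
```

The resulting `.db` file opens in any SQLite browser, or can be exported to CSV for Excel.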
Happy experimenting!