Turn Any Article into a High-Quality Question Bank with AI
“I have hundreds of journal papers and need solid questions for model training—fast.”
“Can my laptop handle it? Do I need a GPU?”
“Which provider is cheapest if I only want to test ten samples?”
If any of those questions sound familiar, this guide will give you exact, copy-paste answers.
We break this small open-source tool down into plain steps, show working examples, and keep everything inside the original README—no external links, no fluff.
1. What the Tool Actually Does
Feed it any text, get N questions back.
- Data source: Hugging Face datasets or local `.jsonl`, `.json`, `.parquet` files.
- Engine: 10+ large language model providers—OpenAI, Anthropic, Gemini, local Ollama, and more.
- Tone control: 30+ built-in styles (academic, humorous, critical, kid-friendly…), chosen at random or forced by you.
- Output: one JSONL file per run; each line carries the original text, the generated question, and full metadata for downstream cleaning.
2. Environment Setup in Three Commands
No GPU is required—computation happens in the cloud, and your laptop only sends HTTP requests.
3. Run Your First 10 Lines in 30 Seconds
Copy-paste the block and hit Enter:
```bash
export OPENROUTER_API_KEY=YOUR_KEY
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/demo \
  --start-index 0 \
  --end-index 10 \
  --num-questions 5 \
  --text-column text \
  --verbose
```
When it stops, `./data/demo` contains a file like `questions_2025-08-18_14-22-03_hindawi-journals_openrouter_qwen.jsonl`.
4. Every Flag Explained in One Table
5. Provider & Key Cheat Sheet
6. Three Ways to Feed the Tool
- Hugging Face Hub: use `org/dataset`; the default split is `train`; add `--dataset-split test` if needed.
- Local JSONL: one JSON object per line:

```json
{"text": "Large Language Models excel at generating diverse questions."}
{"text": "Neural networks learn complex patterns from data."}
```

- Local Parquet: the same `--text-column` rule applies.
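If your texts start out as plain Python strings, building a JSONL input file like the one above takes only the standard library (a minimal sketch; `my.jsonl` and the example texts are placeholders):

```python
import json

# Example texts to convert into a JSONL input file (one JSON object per line).
texts = [
    "Large Language Models excel at generating diverse questions.",
    "Neural networks learn complex patterns from data.",
]

with open("my.jsonl", "w", encoding="utf-8") as f:
    for t in texts:
        # The key must match whatever you pass via --text-column.
        f.write(json.dumps({"text": t}) + "\n")
```

The resulting file can be fed to the tool directly as `python3 src/main.py ./my.jsonl …`.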
7. Tone System Deep Dive
7.1 Default Behavior
No extra flags → tool picks one of 35+ styles per text.
Examples:
- formal and academic
- funny and entertaining
- practical and application-focused
- thought-provoking and philosophical
7.2 Single Tone
```bash
--style "formal and academic"
```
7.3 Random Pool
```bash
--style "casual and conversational,funny and humorous,concise and direct"
```
Each text gets one style from the pool.
7.4 File-Based Styles
Create `my_styles.txt`:

```
# Lines starting with # are ignored
formal and academic
creative and imaginative
simple and straightforward
```
Then:
```bash
--styles-file ./my_styles.txt
```
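To make the pool semantics concrete, here is a small sketch (not the tool's own code) of how such a file can be parsed — blank lines and `#` comments dropped, then one style drawn at random per text:

```python
import random

def load_styles(path):
    """Return the styles from a file, skipping blanks and '#' comment lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")]

# Recreate the my_styles.txt example from above.
with open("my_styles.txt", "w", encoding="utf-8") as f:
    f.write("# Lines starting with # are ignored\n"
            "formal and academic\n"
            "creative and imaginative\n"
            "simple and straightforward\n")

styles = load_styles("my_styles.txt")
texts = ["first paragraph", "second paragraph"]
# One random style per text, mirroring the random-pool behavior.
assignments = {t: random.choice(styles) for t in texts}
```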
8. Output Anatomy
Successful record:
```json
{
  "input": "What are the latest advances in text summarization?",
  "source_text": "Original paragraph...",
  "question_index": 1,
  "total_questions": 5,
  "metadata": { "original_item_index": 0, "text_column": "text" },
  "generation_settings": {
    "provider": "openrouter",
    "model": "qwen/qwen3-235b-a22b-2507",
    "style": "formal and academic",
    "num_questions_requested": 5,
    "num_questions_generated": 5,
    "max_tokens": 4096
  },
  "timestamp": "2025-08-17T12:34:56.789012"
}
```
Failure record:
```json
{
  "error": "Rate limit exceeded",
  "source_text": "...",
  "metadata": { ... }
}
```
9. Quick-Fire FAQ
Q1: I only want 5 samples—shortest command?
```bash
--max-items 5
# or
--start-index 0 --end-index 5
```
Q2: How do I stay inside free tier limits?
```bash
--sleep-between-requests 1.0 --num-workers 1
```
Q3: Local Ollama workflow?
- Start Ollama and pull a model:

```bash
ollama run hf.co/lmstudio-community/Qwen3-4B-Instruct-2507-GGUF:Q4_K_M
```

- Run the script:

```bash
python3 src/main.py ./my.jsonl \
  --provider ollama \
  --model hf.co/lmstudio-community/Qwen3-4B-Instruct-2507-GGUF:Q4_K_M \
  --output-dir ./ollama_out
```
No key required.
Q4: Output files are huge—any rollover?
Each run's filename is timestamped automatically, so files never overwrite each other—just schedule daily runs.
Q5: How custom can styles get?
Any short English phrase the model understands, e.g. `--style "like a curious five-year-old"`.
10. Real-World Mini-Project
Turn 100 news articles into interview questions.
```bash
python3 src/main.py ./news.jsonl \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./interview_q \
  --style "critical thinking for job interview" \
  --num-questions 3 \
  --max-items 100 \
  --sleep-between-requests 0.5
```
Open `interview_q/questions_2025-08-18_xxx.jsonl`, remove empty lines, and you’re done.
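That clean-up step can itself be scripted — this sketch drops blank lines and exact-duplicate questions while copying to a new file (both file names are placeholders):

```python
import json

def clean(in_path, out_path):
    """Copy a run's output, dropping blank lines and duplicate questions."""
    seen, kept = set(), 0
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # drop empty lines
            rec = json.loads(line)
            q = rec.get("input")
            if q in seen:
                continue  # drop repeated questions
            seen.add(q)
            dst.write(json.dumps(rec) + "\n")
            kept += 1
    return kept

# Demo: two distinct questions plus a blank line and a duplicate.
with open("raw.jsonl", "w", encoding="utf-8") as f:
    f.write('{"input": "Q1?"}\n\n{"input": "Q1?"}\n{"input": "Q2?"}\n')

kept = clean("raw.jsonl", "clean.jsonl")
```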
11. Troubleshooting Checklist
12. Next Steps
You now know how to:
- Install the tool in under five minutes
- Generate questions from any public or local dataset
- Tune tone, rate limits, and concurrency
- Read the output and fix common errors
What you can do next:
- Feed the question bank into a RAG pipeline.
- Maintain a company style file for customer-support or tech-interview scenarios.
- Load the JSONL into Excel or a database for human quality review.
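For the spreadsheet route, the standard library is enough to flatten each record into a row (a sketch; pick whichever fields you want reviewers to see):

```python
import csv
import json

def jsonl_to_csv(in_path, out_path):
    """Flatten question records into a CSV for human review."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["question", "style", "source_text"])
        for line in src:
            if not line.strip():
                continue
            rec = json.loads(line)
            writer.writerow([
                rec.get("input", ""),
                rec.get("generation_settings", {}).get("style", ""),
                rec.get("source_text", ""),
            ])

# Demo with a single record shaped like the anatomy in section 8.
with open("q.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({
        "input": "A question?",
        "source_text": "Original paragraph...",
        "generation_settings": {"style": "formal and academic"},
    }) + "\n")

jsonl_to_csv("q.jsonl", "review.csv")
```

The resulting `review.csv` opens directly in Excel or imports into any database.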
Happy experimenting!