Give Every Post a Voice: A Step-by-Step Guide to bskyScribe, the Open-Source Media-Description Bot for Bluesky

Imagine scrolling Bluesky on the train.
You see a 45-second video, but the creator left no caption.
A friend shares an infographic, yet the text is too small to read.
For users with low vision, hearing loss, or simply a broken headphone jack, these posts are locked doors.
bskyScribe is a small, friendly key.
It waits in the background, listens for a mention, and then automatically writes a short, human-readable summary—under 250 characters—so that everyone can join the conversation.

This guide walks you through the entire journey: cloning the code, getting your free API keys, testing on one post, and finally letting the bot run 24/7 in the cloud.
No prior AI experience is required; if you can open a terminal and run pip install, you are ready.


1. Why a “media-description bot” matters

Situation | Without description | With bskyScribe
Blind or low-vision user | “There’s an image here” | “Three corgis chase a frisbee across a sunny lawn.”
Commuter on mute | Video autoplays silently | Comment reads: “Speaker demos a silent red keyboard switch for 10 s.”
Multilingual community | Japanese chart is unreadable | Auto-summary: “Chart shows 7 % phone shipment growth in 2024.”

bskyScribe does not try to dazzle; it equalizes access, one reply at a time.
Behind the scenes, Google Gemini supplies the brain, the AT Protocol supplies the ears and mouth, and a few dozen lines of Python keep everything polite and fast.


2. The 30-second overview

Five-step loop: notification → fetch → analyze → trim → reply
  1. Listen – subscribes to Bluesky notifications.
  2. Fetch – downloads any image, audio, or video into memory (no disk clutter).
  3. Analyze – asks Gemini: What’s going on here?
  4. Trim – compresses the answer to < 250 characters.
  5. Reply – posts the summary as a threaded reply and logs the event.
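
The five steps above can be sketched as a single function. This is a minimal sketch, not bskyScribe's actual code: `Mention`, `handle_mention`, and the injected `fetch`, `analyze`, and `reply` callables are hypothetical names. Step 1 (listening) is assumed to happen in the caller, which passes each mention in.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_CHARS = 250  # limit quoted in the steps above


@dataclass
class Mention:
    """Hypothetical shape of one mention notification."""
    post_uri: str
    author: str


def handle_mention(
    mention: Mention,
    fetch: Callable[[str], bytes],
    analyze: Callable[[bytes], str],
    reply: Callable[[str, str], None],
    log: Callable[[str], None] = print,
) -> Optional[str]:
    """Run fetch -> analyze -> trim -> reply for one mention."""
    media = fetch(mention.post_uri)          # step 2: bytes kept in memory
    if not media:
        log(f"No media on {mention.post_uri}, skipping")
        return None
    summary = analyze(media).strip()         # step 3: model output
    if len(summary) > MAX_CHARS:             # step 4: hard cap on length
        summary = summary[: MAX_CHARS - 1].rstrip() + "…"
    reply(mention.post_uri, summary)         # step 5: threaded reply
    log(f"Replied to @{mention.author}, length {len(summary)} chars")
    return summary


if __name__ == "__main__":
    posted = []
    handle_mention(
        Mention("at://example.bsky.social/app.bsky.feed.post/example", "example.bsky.social"),
        fetch=lambda uri: b"fake image bytes",
        analyze=lambda media: "A latte with leaf art sits next to an open notebook.",
        reply=lambda uri, text: posted.append((uri, text)),
    )
```

Injecting the network calls as plain functions keeps the loop testable offline; in a real bot they would wrap the Bluesky and Gemini clients.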

3. What you need before you start

Item | How to get it
Python 3.9 or newer | python.org or Anaconda
Google Gemini API key | Google AI Studio (free tier)
Bluesky account + app password | Bluesky Settings → App Passwords
Any computer with Internet access | Local laptop, Raspberry Pi, or cloud VM

Tip: An app password is not your login password. It is a 16-character code, shown as xxxx-xxxx-xxxx-xxxx, that you can revoke at any time, which makes it ideal for bots.


4. Installation in three steps

4.1 Clone the repository

git clone https://github.com/your-org/bskyScribe.git
cd bskyScribe

4.2 Create an isolated environment

python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

Why isolate?
Think of it as giving the project its own bookshelf—no mixed-up libraries when you upgrade something else.

4.3 Install dependencies

pip install -r requirements.txt

A long list of “Successfully installed …” means you are ready to configure.


5. Two-minute setup: keys and passwords

5.1 Copy the template

cp .env.example .env

5.2 Edit the file

GEMINI_API_KEY=your_gemini_api_key
BLUESKY_USERNAME=your_handle.bsky.social
BLUESKY_PASSWORD=xxxx-xxxx-xxxx-xxxx

Keep secrets out of source code; .env is ignored by git for safety.
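
Before the first run it helps to confirm that .env actually contains all three keys. The sketch below uses only the standard library; the project itself may load .env with a package such as python-dotenv. `parse_env` and `check_config` are hypothetical helpers, and the app-password check assumes the four-groups-of-four format that Bluesky displays.

```python
import re
from pathlib import Path

REQUIRED_KEYS = ("GEMINI_API_KEY", "BLUESKY_USERNAME", "BLUESKY_PASSWORD")
# Assumed app-password format: four groups of four letters/digits.
APP_PASSWORD_RE = re.compile(r"^[A-Za-z0-9]{4}(-[A-Za-z0-9]{4}){3}$")


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_config(values: dict) -> list:
    """Return human-readable problems; an empty list means the config looks usable."""
    problems = [f"{key} is missing" for key in REQUIRED_KEYS if not values.get(key)]
    password = values.get("BLUESKY_PASSWORD", "")
    if password and not APP_PASSWORD_RE.match(password):
        problems.append(
            "BLUESKY_PASSWORD does not look like an app password (xxxx-xxxx-xxxx-xxxx)"
        )
    return problems


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        for problem in check_config(parse_env(env_path.read_text())):
            print("config:", problem)
```

Catching a login password pasted in place of an app password here is cheaper than discovering it from an authentication error on the first mention.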


6. Two ways to run the bot

6.1 One-off transcription (great for testing)

from bots.transcriptionBot import MediaProcessingBot

bot = MediaProcessingBot()               # reads .env automatically
post_url = "https://bsky.app/profile/naomi.dev/post/3jx..."
result = bot.transcribe_post(post_url)   # takes 3–15 s locally
print(bot.format_transcription_reply(result))

Typical terminal output:

{"response": "A latte with leaf art sits next to an open notebook in soft morning light."}

Ready? Post it:

bot.post_transcription_reply(post_url)
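
`transcribe_post` takes a bsky.app web link, while AT Protocol APIs such as getPostThread address posts by `at://` URI. Here is a minimal sketch of that conversion, assuming the usual https://bsky.app/profile/<handle>/post/<rkey> link shape. `post_url_to_at_uri` is a hypothetical helper, not part of bskyScribe's documented API.

```python
from urllib.parse import urlparse


def post_url_to_at_uri(post_url: str) -> str:
    """Convert https://bsky.app/profile/<actor>/post/<rkey> into an at:// URI."""
    parsed = urlparse(post_url)
    parts = [p for p in parsed.path.split("/") if p]
    if (
        parsed.netloc != "bsky.app"
        or len(parts) != 4
        or parts[0] != "profile"
        or parts[2] != "post"
    ):
        raise ValueError(f"not a bsky.app post URL: {post_url!r}")
    actor, rkey = parts[1], parts[3]
    # Posts live in the app.bsky.feed.post collection of the author's repository.
    return f"at://{actor}/app.bsky.feed.post/{rkey}"


if __name__ == "__main__":
    print(post_url_to_at_uri("https://bsky.app/profile/naomi.dev/post/example-rkey"))
```

Some endpoints expect the DID form (at://<did>/app.bsky.feed.post/<rkey>). In that case a real client would first resolve the handle, for example with com.atproto.identity.resolveHandle.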

6.2 Always-on daemon (public service)

from daemon import Scribe

scribe = Scribe()
scribe.monitor_mentions()   # runs forever, Ctrl+C to stop
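
`monitor_mentions` runs forever; its internals are not shown here. Below is a minimal polling sketch of such a loop. It assumes a hypothetical `fetch_mentions` callable that returns the URIs of posts where the bot was mentioned; a real version might filter app.bsky.notification.listNotifications results by the reason "mention". It answers each URI once and exits cleanly on Ctrl+C.

```python
import time
from typing import Callable, Iterable, Optional, Set


def monitor(
    fetch_mentions: Callable[[], Iterable[str]],
    handle: Callable[[str], None],
    poll_seconds: float = 15.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Poll for mention URIs, handle each one once, and stop cleanly on Ctrl+C."""
    seen: Set[str] = set()
    polls = 0
    try:
        while max_polls is None or polls < max_polls:
            for uri in fetch_mentions():
                if uri in seen:
                    continue                  # already answered on an earlier poll
                seen.add(uri)
                try:
                    handle(uri)
                except Exception as exc:      # one bad post must not kill the daemon
                    print(f"failed on {uri}: {exc}")
            polls += 1
            sleep(poll_seconds)
    except KeyboardInterrupt:
        print("stopping on Ctrl+C")
    return seen
```

Marking a URI as seen before handling it means a failed post is not retried. A production daemon would also persist `seen` so that a restart does not re-answer old mentions. The `max_polls` and `sleep` parameters exist only so the loop can be tested without waiting.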

7. Reading the JSON response

Field | Example | Purpose
thinking | “Audio is clear English, 12 s, topic keyboard” | Debug log, can be hidden
request_type | SUMMARIZE | Helps front-end pick an icon
media_type | VIDEO | Affects TTS voice choice
response_character_count | 245 | Safety check
response | “Speaker demos silent red switch…” | The text the user sees
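
The following sketch shows how a caller might parse this JSON and apply the safety check, assuming the model returns exactly these field names. `ScribeResponse` and `parse_response` are hypothetical. Optional fields default to empty values because the terminal output in section 6.1 shows only `response`, and the measured length of `response` is trusted over the model's own count.

```python
import json
from dataclasses import dataclass

MAX_CHARS = 250


@dataclass
class ScribeResponse:
    thinking: str
    request_type: str
    media_type: str
    response_character_count: int
    response: str


def parse_response(raw: str) -> ScribeResponse:
    """Parse the model's JSON and enforce the length check from the table."""
    data = json.loads(raw)
    result = ScribeResponse(
        thinking=data.get("thinking", ""),
        request_type=data.get("request_type", ""),
        media_type=data.get("media_type", ""),
        response_character_count=int(data.get("response_character_count", 0)),
        response=data["response"],               # KeyError if the model omitted it
    )
    actual = len(result.response)
    if actual != result.response_character_count:
        # The model's self-reported count is advisory; keep the measured length.
        result.response_character_count = actual
    if actual > MAX_CHARS:
        raise ValueError(f"response is {actual} chars, over the {MAX_CHARS} limit")
    return result
```

Raising on an over-long `response` lets the caller decide whether to trim or re-ask the model instead of silently posting a truncated reply.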

8. Project anatomy: open the lid

bskyScribe/
├── clients/
│   ├── bluesky.py          # Talks to Bluesky (AT Protocol)
│   └── gemini.py           # Talks to Google Gemini
├── bots/
│   └── transcriptionBot.py # Glues the two together
├── prompt/
│   └── prompt.txt          # Plain-text instructions for Gemini
├── daemon.py               # 24/7 background loop
├── render.yaml             # Cloud deployment recipe
├── Procfile                # Heroku/Render start command
├── requirements.txt        # Python packages
└── .env                    # Your private keys
  • bluesky.py uses subscribeRepos for notifications and getPostThread to fetch media.
  • gemini.py converts images to base64, uploads audio to a short-lived URL, then POSTs to https://generativelanguage.googleapis.com/v1beta/models/gemini-pro-vision:generateContent.
  • transcriptionBot.py decides describe vs. read text vs. summarize speech.

    • High-density text in an image → OCR
    • Otherwise → visual description
    • Audio or video → transcript → summary
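
The three routing rules above, plus the image request that gemini.py sends, can be sketched as follows. The names `choose_task` and `build_image_request` are hypothetical, as is `text_density`, assumed here to be the fraction of the image covered by text and computed elsewhere, along with its 0.3 threshold. The JSON body follows the shape of Gemini's REST generateContent request, with the image sent inline as base64. Actually POSTing it with your API key is left out.

```python
import base64
from typing import Optional

TEXT_DENSITY_THRESHOLD = 0.3  # hypothetical cut-off; tune on real posts


def choose_task(media_type: str, text_density: Optional[float] = None) -> str:
    """Mirror the three rules above: OCR, visual description, or transcript + summary."""
    kind = media_type.upper()
    if kind in ("AUDIO", "VIDEO"):
        return "TRANSCRIBE_AND_SUMMARIZE"
    if kind == "IMAGE" and text_density is not None and text_density >= TEXT_DENSITY_THRESHOLD:
        return "OCR"
    if kind == "IMAGE":
        return "DESCRIBE"
    raise ValueError(f"unsupported media type: {media_type}")


def build_image_request(prompt: str, image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build a generateContent JSON body with the image sent inline as base64."""
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }
```

Inline base64 suits small images; larger audio and video are better uploaded separately, which matches the short-lived URL approach described for gemini.py.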

9. The prompt: teaching the AI to be brief

prompt.txt is only ten lines long, yet it sets the tone. Its core rules:

- Answer in English, max 250 characters.  
- One-sentence overview, then 2–3 key details.  
- If text dominates the image, transcribe it.  
- Do not guess personal names.  
- Replace graphic violence with “Content may be disturbing.”

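Tip: Explicit character limits work better than “be concise.” The model usually stays close to the limit, but it cannot count characters exactly, which is why the bot still trims the answer (step 4 in section 2) before posting.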
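
Step 4 of the loop (trim) can be sketched as a cut at a word boundary. `trim_to_limit` is a hypothetical helper, not the project's own function. Python's `len` counts code points, which is never fewer than the graphemes Bluesky counts against its 300-grapheme post limit, so a reply of 250 code points always fits.

```python
def trim_to_limit(text: str, limit: int = 250, ellipsis: str = "…") -> str:
    """Cut text to at most `limit` characters, preferring a word boundary."""
    text = " ".join(text.split())            # collapse newlines and repeated spaces
    if len(text) <= limit:
        return text
    cut = text[: limit - len(ellipsis)]
    space = cut.rfind(" ")
    if space > limit // 2:                   # back up only if most of the text survives
        cut = cut[:space]
    return cut.rstrip(" ,;:-") + ellipsis
```

Backing up to the last space avoids replies that end mid-word, while the `limit // 2` guard stops one very long token from wiping out most of the summary.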


10. Deployment: from laptop to cloud in ten minutes

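10.1 Render

  1. Log in with GitHub at render.com.
  2. New → Background Worker → pick your repo. Background workers need a paid instance type (the smallest, Starter, has 512 MB RAM); Render's free instance type does not cover them.
  3. Paste your .env lines into the “Environment” tab.
  4. Deploy. The first build takes 2–3 min; later pushes to the repo redeploy automatically.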

10.2 Watch the logs

In the Render dashboard:

INFO:root:👂 Mention detected from @naomi.dev
INFO:root:✅ Reply posted, length 238 chars

That is your heartbeat.
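
Lines like these come from Python's standard `logging` module. With `logging.basicConfig`, the default format is `%(levelname)s:%(name)s:%(message)s`, which produces the `INFO:root:` prefix. The following minimal sketch emits the same two heartbeat lines. `configure_logging` and `log_reply` are hypothetical names, and the project may configure logging differently.

```python
import logging
import sys
from typing import TextIO


def configure_logging(stream: TextIO = sys.stderr) -> None:
    """Send INFO-level logs to `stream` using logging's default format.

    force=True (Python 3.8+) replaces any handlers installed earlier.
    """
    logging.basicConfig(level=logging.INFO, stream=stream, force=True)


def log_reply(author: str, reply_text: str) -> None:
    """Emit the two heartbeat lines for one handled mention."""
    logging.info("👂 Mention detected from @%s", author)
    logging.info("✅ Reply posted, length %d chars", len(reply_text))


if __name__ == "__main__":
    configure_logging()
    log_reply("naomi.dev", "x" * 238)
```

Render's log view shows whatever the process writes to stdout or stderr, so the default stderr stream needs no extra setup.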


11. Cost and performance snapshot

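Resource | Quota | Typical usage
Render worker (Starter instance) | 512 MB RAM | ~90 MB
Google Gemini (free tier) | 60 requests/min | 1 per post
Outbound traffic | 100 GB/mo | <5 MB per media file

At 500 posts per day the bot makes about 0.35 Gemini requests per minute (500 ÷ 1,440), far below the per-minute quota. Render background workers are not available on Render's free instance type, so the worker itself is the main fixed cost; free-tier quotas on both services change, so check current Render and Google AI Studio pricing. If traffic spikes, Render does not queue requests for you: the bot has to space out or retry its own Gemini calls.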
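
The daily average is low, but mentions can arrive in bursts. A client-side sliding-window limiter keeps any 60-second window under the quota. This is a sketch: `MinuteRateLimiter` is a hypothetical class, and its default of 60 calls per minute is taken from the table above, not from Google's current limits.

```python
import time
from collections import deque
from typing import Callable, Deque


class MinuteRateLimiter:
    """Sliding-window limiter: at most `max_calls` within any `window` seconds."""

    def __init__(
        self,
        max_calls: int = 60,
        window: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_calls = max_calls
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.calls: Deque[float] = deque()

    def wait(self) -> float:
        """Block until another call is allowed; return the seconds spent waiting."""
        waited = 0.0
        while True:
            now = self.clock()
            while self.calls and now - self.calls[0] >= self.window:
                self.calls.popleft()          # forget calls that left the window
            if len(self.calls) < self.max_calls:
                self.calls.append(now)
                return waited
            pause = self.window - (now - self.calls[0])
            self.sleep(pause)
            waited += pause
```

Call `limiter.wait()` immediately before each Gemini request. The `clock` and `sleep` parameters exist so the limiter can be tested without real waiting.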


12. Frequently asked questions

Q1: Will the bot read my private messages?
A: No. It only touches public posts and only when explicitly mentioned. The source code is open for review.

Q2: What if the audio contains sensitive personal details?
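A: Media is held in memory and never written to the bot's disk. Audio, however, is uploaded to a short-lived URL for Gemini (see section 8), so it briefly leaves the bot. prompt.txt already says “Do not guess personal names”; tighten it further if needed, e.g. “Do not repeat personal details heard in audio.”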

Q3: Can it reply in other languages?
A: Replace “Answer in English” with “Answer in the user’s language” in prompt.txt. Gemini auto-detects.


13. Customization: make the bot yours

  • Signature line – append “— via bskyScribe ✨” in format_transcription_reply.
  • NSFW filter – check returned text for disallowed words before posting.
  • Backfill old posts – loop through your 2023 archive and add descriptions retroactively.
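
The first two customizations can be sketched together. `add_signature`, `passes_filter`, and `prepare_reply` are hypothetical helpers, not changes to the real `format_transcription_reply`, and `BLOCKLIST` holds placeholder words. The signature is the one suggested above; the body is shortened so that body plus signature still fit in 250 characters.

```python
import re
from typing import Iterable, Optional

SIGNATURE = " \u2014 via bskyScribe \u2728"   # the signature line suggested above
BLOCKLIST = {"badword1", "badword2"}           # hypothetical placeholder words


def passes_filter(text: str, blocklist: Iterable[str] = BLOCKLIST) -> bool:
    """True if no blocklisted word appears as a whole word (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return words.isdisjoint(w.lower() for w in blocklist)


def add_signature(text: str, signature: str = SIGNATURE, limit: int = 250) -> str:
    """Append the signature, shortening the body so the total stays within `limit`."""
    room = limit - len(signature)
    body = text if len(text) <= room else text[: room - 1].rstrip() + "…"
    return body + signature


def prepare_reply(text: str) -> Optional[str]:
    """Return the final reply text, or None if it should not be posted."""
    if not passes_filter(text):
        return None
    return add_signature(text)
```

A word list is a blunt instrument: it misses misspellings and flags innocent uses, so treat it as a last safety net behind the prompt rules.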

14. Contributing: from user to co-maintainer

  1. Fork the repo.
  2. Create a branch feature/emoji-support.
  3. Run python -m pytest tests/ to confirm nothing breaks.
  4. Open a pull request titled Add emoji support for video summaries (#42).
  5. A maintainer will review within 48 h and credit you in the release notes.

15. Closing thought: information should never be silent

bskyScribe is not the flashiest AI project on GitHub, but it may be the kindest.
It takes a multimodal model that costs fractions of a cent and turns it into a 250-character note slipped under every locked door.
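If you want your own posts to speak up, install the code and add your keys today, then reply to any post that has media with:

@bskyscribe.bsky.social describe this

(If you deployed your own copy, mention your bot's handle instead.)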

You will see, perhaps for the first time, how gentle technology can be.

Accessibility is not a feature; it is the path that lets everyone walk in the sun.