IntraScribe: A Local-First Voice Transcription & Collaboration Platform
For companies, schools, and government offices that can’t — or won’t — send data to the cloud.
1. What Is IntraScribe?
Imagine finishing a two-hour meeting and having a clean, editable transcript—complete with speaker names and a concise AI summary—before you’ve even left the room.
IntraScribe makes that possible without ever sending audio outside your building.
In plain language:
- Real-time speech-to-text that runs on your own server
- Automatic speaker diarization ("Who said what?")
- AI-generated summaries in Markdown
- Full data sovereignty: no cloud, no external APIs
2. Why Local-First Matters
| Scenario | Risk with Cloud Services | IntraScribe Approach |
|---|---|---|
| Sensitive R&D meeting | IP could leak | Everything stays on-prem |
| Student counseling session | FERPA/GDPR violations | Data never leaves campus |
| Hospital case review | HIPAA non-compliance | Air-gapped install |
| Command-and-control center | 2-second cloud latency | Sub-500 ms local latency |
3. Core Features in Everyday Terms
| Feature | What You See | What Happens Under the Hood |
|---|---|---|
| Real-time transcription | Words appear as you speak | Browser → WebRTC → FunASR model → SSE stream back |
| Speaker labels | "Alice: …", "Bob: …" | Pyannote model slices audio by voice |
| Batch re-transcription | Higher accuracy after the call | Original + cached audio re-processed on GPU |
| Editable transcript | Double-click to fix typos | Postgres update → real-time refresh for all viewers |
| AI summary & title | One-click Markdown report | LiteLLM picks the best model from config.yaml |
| Template library | Company-branded formats | Save per-user or system-wide templates |
| Session management | Start, pause, re-transcribe, delete | REST endpoints + Supabase Realtime |
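The template library above fills company-branded formats with session data. As a minimal sketch of how `{{...}}` placeholders might be substituted (the placeholder names match those listed in the Customization section, but IntraScribe's actual rendering logic may differ):

```python
import re

def render_template(template: str, values: dict) -> str:
    """Replace {{name}} placeholders with values; unknown names are left intact."""
    def sub(match: re.Match) -> str:
        key = match.group(1)
        return str(values.get(key, match.group(0)))
    return re.sub(r"\{\{(\w+)\}\}", sub, template)

template = "# Meeting Notes\n\nSpeakers: {{speakers}}\nDuration: {{duration}}\n\n{{transcript}}"
print(render_template(template, {
    "speakers": "Alice, Bob",
    "duration": "42 min",
    "transcript": "Alice: Let's ship it.",
}))
```

Leaving unknown placeholders intact (rather than erasing them) makes it obvious when a template references a field the session did not provide.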
4. Who Should Use It?
- Enterprise IT: locked-down VLANs, strict infosec reviews, zero-trust architecture.
- Universities & Research Labs: lecture capture, thesis defenses, multi-language seminars.
- Government & Defense: classified briefings, inter-agency coordination.
- Healthcare & Legal: patient consults, depositions, contract negotiations.
5. End-to-End Workflow (3-Minute Overview)
1. Create Session: click "Start Recording" → browser asks for mic permission → backend returns a `session_id`.
2. Live Transcription: audio flows via WebRTC; text chunks arrive via Server-Sent Events (SSE) in under 500 ms.
3. Stop & Finalize: click "Stop" → browser closes the WebRTC connection → server uploads the full audio to Supabase Storage → a GPU batch job starts.
4. Auto Enhance: the batch job runs noise reduction → speaker diarization → high-accuracy re-transcription → Postgres update.
5. AI Summary: click "Summarize"; LiteLLM uses your template to produce a Markdown report and a one-line title.
6. Edit & Share: double-click any segment or speaker label; changes sync to all teammates in real time.
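The live-transcription step above delivers text chunks as SSE events. The exact payload format is not documented here, so the sketch below assumes each event carries a JSON body with `speaker` and `text` fields; it shows a minimal parser for raw `data:` lines from a `text/event-stream` response:

```python
import json

def parse_sse(stream: str) -> list[dict]:
    """Parse raw SSE text into a list of JSON event payloads.

    Events are separated by a blank line; each carries one `data:` line.
    The field names used below (speaker, text) are assumptions, not
    taken from IntraScribe's documented API.
    """
    events = []
    for block in stream.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data:"):
                events.append(json.loads(line[len("data:"):].strip()))
    return events

raw = 'data: {"speaker": "Alice", "text": "Hello"}\n\ndata: {"speaker": "Bob", "text": "Hi"}\n\n'
for ev in parse_sse(raw):
    print(f'{ev["speaker"]}: {ev["text"]}')
```

In the browser the same stream is consumed with the built-in `EventSource` API; a hand-rolled parser like this is only needed outside the browser.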
6. Tech Stack (High-Level)
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js (App Router) + React + TypeScript + Tailwind CSS | Fast, modern UI |
| Backend | FastAPI (Python, uv-managed) | REST, SSE, WebRTC endpoints |
| ASR | FunASR (local) | Chinese + English speech recognition |
| Diarization | pyannote.audio | Voice fingerprinting |
| AI Generation | LiteLLM (ollama, OpenAI, Azure fallbacks) | Summaries & titles |
| Storage | Supabase (Postgres + Auth + Storage + Realtime) | ACID data, file blobs, row-level security |
| Media Processing | FFmpeg | Transcode, slice, metadata extraction |
7. Folder Layout at a Glance
    intrascribe/
    ├─ backend/
    │  ├─ app/
    │  │  ├─ api.py                       # REST & SSE routes
    │  │  ├─ services.py                  # Business logic
    │  │  ├─ stt_adapter.py               # FunASR wrapper
    │  │  ├─ speaker_diarization.py
    │  │  ├─ batch_transcription.py
    │  │  ├─ audio_processing_service.py
    │  │  ├─ audio_converter.py           # FFmpeg wrapper
    │  │  ├─ schemas.py, models.py        # DTOs & domain models
    │  │  └─ clients.py, repositories.py
    │  ├─ main_v1.py                      # WebRTC entry point
    │  ├─ config.yaml                     # AI & ASR settings
    │  └─ pyproject.toml
    ├─ web/                               # Next.js frontend
    └─ supabase/
       ├─ database_schema.sql
       └─ migrations/
8. Installation Guide (Copy-Paste Ready)
8.1 Prerequisites
- Node.js 18+
- Python 3.10+ with `uv` (Python package and project manager)
- FFmpeg
- (Optional) ollama with `qwen3:8b` for local AI summaries
    # Ubuntu / Debian
    sudo apt update && sudo apt install ffmpeg

    # macOS (Homebrew)
    brew install ffmpeg

    # Install Supabase CLI
    curl -fsSL https://raw.githubusercontent.com/supabase/cli/main/install.sh | bash
8.2 Clone & Start Database
    git clone https://github.com/your-org/intrascribe.git
    cd intrascribe/supabase
    supabase start
    # If you hit a 502, skip edge-runtime:
    # supabase start -x edge-runtime

Copy the printed URLs/keys; you'll need them next.

    supabase db reset   # Seed tables, RLS, functions
8.3 Environment Files
`web/.env.local`:

    NEXT_PUBLIC_SUPABASE_URL=http://127.0.0.1:54321
    NEXT_PUBLIC_SUPABASE_ANON_KEY=eyJhbGc...
    BACKEND_URL=http://localhost:8000

`backend/.env`:

    SUPABASE_URL=http://127.0.0.1:54321
    SUPABASE_ANON_KEY=eyJhbGc...
    SUPABASE_SERVICE_ROLE_KEY=eyJhbGc...
    HUGGINGFACE_TOKEN=hf_...
    PYANNOTE_MODEL=pyannote/speaker-diarization-3.1
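Besides the env files, `backend/config.yaml` holds the AI and ASR settings (see the folder layout above). The fragment below is purely illustrative: the key names are guesses at a LiteLLM-style model list, not IntraScribe's actual schema, so copy from the repository's shipped `config.yaml` rather than from here.

```yaml
# Illustrative only: these key names are assumptions, not the real schema.
llm:
  models:
    - name: ollama/qwen3:8b      # local-first default
    - name: gpt-4o-mini          # optional cloud fallback, if allowed
asr:
  language: auto                 # FunASR model/language selection
```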
8.4 Launch the Stack
    # Terminal 1 – Backend
    cd backend
    uv sync
    uv run main_v1.py
    # Listening on http://localhost:8000

    # Terminal 2 – Frontend
    cd web
    npm install
    npm run dev
    # Open http://localhost:3000
Register an account and you’re in.
9. API Quick Reference
All endpoints live under `/api/v1`.

| Purpose | Method & Path | Notes |
|---|---|---|
| Health check | `GET /health` | Returns 200 OK |
| Create session | `POST /sessions` | Returns `{session_id}` |
| Finalize session | `POST /sessions/{id}/finalize` | Triggers batch job |
| Re-transcribe | `POST /sessions/{id}/retranscribe` | Uses latest model |
| Delete session | `DELETE /sessions/{id}` | Also deletes audio files |
| Generate summary | `POST /sessions/{id}/summarize` | Optional `template_id` |
| Rename speaker | `POST /sessions/{id}/rename-speaker` | Updates all segments |
| Live SSE stream | `GET /transcript?webrtc_id=...` | Requires Bearer token |

Authentication: attach the Supabase JWT as `Authorization: Bearer <token>`.
10. Common Troubleshooting
| Symptom | Likely Cause | Quick Fix |
|---|---|---|
| No real-time text | Mic permission denied | Browser settings → allow mic |
| Diarization fails | Missing HF token | Add `HUGGINGFACE_TOKEN` to `.env` |
| Transcode error | FFmpeg missing | Re-install FFmpeg, ensure it is in PATH |
| Summary empty | Model mis-configured | Check `config.yaml` for correct LiteLLM keys |
| Frontend 404 on API | Proxy mis-configured | Verify `BACKEND_URL` in `.env.local` |
11. Customization & Extensibility
- Swap Speech Engine: implement `BaseASR` in `stt_adapter.py`, then update `config.yaml`.
- Custom Summarization Prompts: add new Markdown templates in the UI; use the `{{transcript}}`, `{{speakers}}`, and `{{duration}}` placeholders.
- Hardware Front-End: replace WebRTC with WebSocket, gRPC, or raw UDP; the backend remains unchanged.
- LDAP / SSO Integration: Supabase Auth supports SAML 2.0 and generic OAuth providers.
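To illustrate the engine-swap point: `BaseASR` lives in `stt_adapter.py` per the text, but its actual method names are not documented here, so the `transcribe` signature below is a hypothetical shape, not the real interface.

```python
from abc import ABC, abstractmethod

class BaseASR(ABC):
    """Hypothetical adapter interface; the real one is in stt_adapter.py."""

    @abstractmethod
    def transcribe(self, audio_bytes: bytes, sample_rate: int) -> str:
        """Return the transcript for a chunk of PCM audio."""

class EchoASR(BaseASR):
    """Toy engine that reports what it received, standing in for a real model."""

    def transcribe(self, audio_bytes: bytes, sample_rate: int) -> str:
        return f"[{len(audio_bytes)} bytes @ {sample_rate} Hz]"

# The rest of the pipeline only sees the abstract type, so engines swap freely.
engine: BaseASR = EchoASR()
print(engine.transcribe(b"\x00" * 1600, sample_rate=16000))
```

A new engine (Whisper, Vosk, a cloud-free commercial SDK) then only needs one subclass plus a `config.yaml` entry pointing at it.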
12. Roadmap: What’s Next?
| Status | Feature |
|---|---|
| ✅ Available | Real-time transcription, speaker labels, AI summary |
| 🚧 In Dev | Edge-device capture (Raspberry Pi, microphone arrays) |
| 🔮 Planned | Conversational AI: "What did Alice conclude in the last meeting?" |
13. FAQ: One-Line Answers
- Does it need internet? Only for the first model download; then fully offline.
- GPU required? No; CPU works, just slower.
- Multiple concurrent sessions? Yes; one browser tab = one capture stream.
- Mobile support? Works in any modern mobile browser; add to home screen as a PWA.
- Commercial license? MIT: use, modify, and resell freely.
14. Final Word
IntraScribe isn’t magic.
It’s a carefully glued stack of open-source pieces—FunASR for ears, pyannote for memory, LiteLLM for a brain—running on your hardware, your network, your terms.
If you’ve read this far, the quickest way to feel the difference is to spin it up locally.
Five minutes from now, you could be watching your own words appear on screen—and staying there.