Stoned.AI — live-streamed human + AI conversation show, both sides voiced via local Kokoro TTS. Governance docs 00-09, README, .gitignore. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Architecture Plan

## Current State
- No implementation exists yet. This is a greenfield project.
- The Arena project (`/home/svc-admin/ai-projects/projects/arena`) provides reusable infrastructure:
  - `src/arena/tts.py` — Kokoro TTS backend (`ArenaTTSManager`, `KokoroBackend`)
  - `/opt/models/kokoro` — downloaded Kokoro voice models
  - `pykokoro` — installed Python package
- Pattern for SSE-based real-time conversation delivery
- Pattern for WAV serving and browser audio playback
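Reusing Arena's TTS module implies making its source importable from the new project. A minimal sketch, assuming Arena's source stays at the path listed above; the boolean return lets a caller fall back to a vendored copy of `tts.py` when the Arena checkout is absent:

```python
import sys
from pathlib import Path

# Path taken from the Arena project location described above.
ARENA_SRC = Path("/home/svc-admin/ai-projects/projects/arena/src")

def enable_arena_imports(arena_src: Path = ARENA_SRC) -> bool:
    """Add Arena's src directory to sys.path so `arena.tts` can be imported.

    Returns True if the directory exists and was added, False otherwise,
    so the caller knows to fall back to a vendored copy of the module.
    """
    if arena_src.is_dir():
        sys.path.insert(0, str(arena_src))
        return True
    return False
```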
## Target State
A lightweight Python web server (`stoned-web`) with two browser-facing views:

- Host view (`/host`) — Jason's control panel. Text input box, send button, voice selection per speaker, session start/stop, status display.
- Broadcast view (`/broadcast`) — Clean, OBS-capturable page. Scrolling conversation cards only. No controls. Styled for stream.
Both views receive conversation turns over Server-Sent Events. The broadcast view is the OBS browser source. The host view is what Jason operates on his own screen.
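On the server side, delivering turns to both views reduces to formatting SSE frames that the browsers' `EventSource` connections consume. A minimal sketch; the `turn` event name and JSON payload shape are assumptions, not a fixed protocol:

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame.

    SSE frames are plain text: an `event:` line, a `data:` line, and a
    blank-line terminator. The same frame is written to every connected
    client, so /host and /broadcast see identical turns.
    """
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```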
## Design Principles
- Principle 1: Text-in, voice-out for both sides. The host types; the system voices. The AI generates text; the system voices. No microphone dependency.
- Principle 2: Reuse Arena TTS infrastructure. Do not reimplement Kokoro synthesis. Import and use `ArenaTTSManager` directly from the arena package, or copy the relevant module.
- Principle 3: Broadcast view is read-only. The `/broadcast` URL has zero interactive elements. It exists only for OBS to consume.
- Principle 4: One AI at a time. The session has exactly one human speaker and one AI speaker. Multi-AI is not in scope.
## Major Components
- Component: Web Server (`src/stoned_ai/web.py`)
  - Purpose: HTTP server handling both views, SSE streams, session state, and audio file serving.
  - Responsibilities: Accept host message submissions. Dispatch AI calls. Trigger TTS for both sides. Serve WAV files. Push turns to connected SSE clients.
  - Dependencies: `stoned_ai/tts.py`, `stoned_ai/ai.py`, standard library (`http.server` or a lightweight framework).
- Component: TTS Layer (`src/stoned_ai/tts.py`)
  - Purpose: Synthesize WAV audio for any speaker given a voice ID and text.
  - Responsibilities: Wrap `ArenaTTSManager` (or import the Arena `tts.py` module directly). Store generated WAVs in a session-scoped directory. Return a browser-fetchable path.
  - Dependencies: `pykokoro`, `/opt/models/kokoro`.
- Component: AI Backend (`src/stoned_ai/ai.py`)
  - Purpose: Call the configured AI model and return a clean text response.
  - Responsibilities: Accept conversation history and a prompt. Call the model CLI or API. Return cleaned text. Initially wraps `codex exec` or `gemini -p`; the Claude API is added later.
  - Dependencies: `subprocess` (for CLI backends), the `anthropic` SDK (for the Claude backend, Phase 2).
- Component: Cleaning Engine (`src/stoned_ai/clean.py`)
  - Purpose: Strip CLI noise from AI responses.
  - Responsibilities: Apply regex filters for Codex and Gemini banner lines, warnings, and token counts.
  - Dependencies: None beyond the stdlib. Can be copied from Arena's `clean.py` and extended.
- Component: Broadcast View (`/broadcast`)
  - Purpose: Clean, OBS-capturable HTML page.
  - Responsibilities: Connect to the SSE stream. Render conversation cards. Play audio. Never show controls.
  - Dependencies: Browser-side JavaScript only.
- Component: Host View (`/host`)
  - Purpose: Jason's control panel for operating the show.
  - Responsibilities: Text input and send. Voice selection per speaker. Session start/stop. Status display. Mirrors the conversation feed.
  - Dependencies: Browser-side JavaScript only.
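The Cleaning Engine's line-level regex filtering can be sketched as follows. The specific noise patterns here are placeholders; the real Codex and Gemini banner formats will drive the actual list:

```python
import re

# Hypothetical noise patterns; real Codex/Gemini banner text may differ.
_NOISE_PATTERNS = [
    re.compile(r"^(Loaded cached credentials|Data collection is disabled)\.?$", re.I),
    re.compile(r"^\[?warn(ing)?\]?:", re.I),
    re.compile(r"^tokens? used:\s*\d+", re.I),
]

def clean_response(raw: str) -> str:
    """Drop any line matching a known CLI-noise pattern; keep the rest."""
    kept = [
        line for line in raw.splitlines()
        if not any(p.match(line.strip()) for p in _NOISE_PATTERNS)
    ]
    return "\n".join(kept).strip()
```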
## Data Flow
1. Jason opens `/host` in his browser and `/broadcast` in OBS as a browser source.
2. Jason starts a session, selects voices for himself and the AI, and enters the opening topic or first message.
3. Jason types his message and hits send.
4. The server receives the message and queues it as a "host turn."
5. The server calls Kokoro TTS for Jason's voice, stores the WAV, and pushes the turn to all SSE clients.
6. Both views render the host card. Both play the WAV audio.
7. The server calls the AI backend with the conversation history.
8. The AI returns a text response. The server cleans it.
9. The server calls Kokoro TTS for the AI voice, stores the WAV, and pushes the AI turn to all SSE clients.
10. Both views render the AI card. Both play the WAV audio.
11. Repeat from step 3.
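The turn sequence above can be sketched as a small session object. `synthesize` and `ask_ai` are injected stand-ins for the TTS layer and AI backend, not their real interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    speaker: str   # "host" or "ai"
    text: str
    wav_path: str  # browser-fetchable path to the synthesized audio

@dataclass
class Session:
    """Minimal session state for one host/AI exchange (sketch only)."""
    synthesize: callable   # (speaker, text) -> wav path
    ask_ai: callable       # (history) -> AI reply text
    history: list = field(default_factory=list)

    def host_turn(self, text: str) -> list:
        """Voice the host message, then fetch and voice the AI reply.

        Returns both turns; the web server would push each one to all
        connected SSE clients as it completes.
        """
        host = Turn("host", text, self.synthesize("host", text))
        self.history.append(host)
        reply = self.ask_ai(self.history)   # already-cleaned AI text
        ai = Turn("ai", reply, self.synthesize("ai", reply))
        self.history.append(ai)
        return [host, ai]
```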
## Key Decisions
- Decision 1: Copy or import Arena's TTS module rather than duplicating Kokoro logic.
  - Why: `ArenaTTSManager` is already tested and handles session audio, path safety, and pipeline caching.
  - Tradeoff: Creates a dependency on Arena's internal code. Mitigated by treating it as a stable utility layer.
- Decision 2: Two separate URLs for host and broadcast.
  - Why: The host needs controls, and OBS must not capture controls. Mixing them on one page creates layout complexity and accidental-capture risk.
  - Tradeoff: Two SSE connections instead of one. Acceptable at this scale.
- Decision 3: Start with CLI-based AI backends (Codex/Gemini); add the Claude API in Phase 2.
  - Why: Both CLIs are already present and working on `svc-ai`. Fastest path to a functional prototype.
  - Tradeoff: CLI output noise requires cleaning. The Claude API (Phase 2) is cleaner but needs an API key and the `anthropic` SDK.
- Decision 4: No speech-to-text. The host types.
  - Why: Eliminates microphone capture, audio routing, and STT accuracy problems. Aligns with how Jason already works.
  - Tradeoff: The host must type during the live stream. This is the intended format — the typing is part of the show.
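Decision 3's CLI wrapping might look like the following sketch. The exact argument shapes (`codex exec <prompt>`, `gemini -p <prompt>`) are assumptions based on the commands named above, and the raw stdout still needs the cleaning engine applied:

```python
import subprocess

def ask_cli(prompt: str, backend: str = "codex") -> str:
    """Call a CLI AI backend and return its raw stdout.

    Command shapes follow the decision above; adjust if the installed
    CLI versions take different flags. Output is NOT cleaned here.
    """
    if backend == "codex":
        cmd = ["codex", "exec", prompt]
    elif backend == "gemini":
        cmd = ["gemini", "-p", prompt]
    else:
        raise ValueError(f"unknown backend: {backend}")
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    result.check_returncode()
    return result.stdout
```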
## Rejected Alternatives
- Alternative: Using Arena's existing `arena-web` server with modifications.
  - Why rejected: Arena is an AI-to-AI tool. Retrofitting a human-in-the-loop mode and a separate broadcast view would require significant changes to Arena's core, risking regressions. A clean separate project is lower risk and lower coupling.
- Alternative: Streaming audio from `svc-ai` to a Windows machine via virtual audio cable.
  - Why rejected: The browser-source approach in OBS is simpler, more reliable, and already proven in the Arena project. All audio plays in the browser, which OBS captures directly.
## Open Questions
- Question 1: Should the Claude API backend use `claude-sonnet-4-6` as the default, or should the model be configurable per session?
- Question 2: Should conversation history be capped at a rolling window to prevent prompt length creep, or left unbounded for the initial version?
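If Question 2 lands on a rolling window, a `deque` with `maxlen` is the natural stdlib answer, since it silently drops the oldest turns as new ones arrive. A sketch; `max_turns` is a hypothetical knob, and `None` preserves the unbounded initial-version behavior:

```python
from collections import deque

def make_history(max_turns=None):
    """Conversation history container.

    With max_turns set, the deque evicts the oldest turn on overflow,
    keeping prompt length bounded. With max_turns=None it is unbounded.
    """
    return deque(maxlen=max_turns)
```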
## Signature
- Document role: governing
- Created by: Claude (supervisor)
- Created at: 2026-04-12
- Revision status: initial
- Future revision rule: this document may be revised only by the user or by an explicitly authorized supervisor revision