Stoned.AI — live-streamed human + AI conversation show, both sides voiced via local Kokoro TTS. Governance docs 00-09, README, .gitignore. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Architecture Plan

## Current State
- No implementation exists yet. This is a greenfield project.
- The Arena project (`/home/svc-admin/ai-projects/projects/arena`) provides reusable infrastructure:
  - `src/arena/tts.py` — Kokoro TTS backend (`ArenaTTSManager`, `KokoroBackend`)
  - `/opt/models/kokoro` — downloaded Kokoro voice models
  - `pykokoro` — installed Python package
- Pattern for SSE-based real-time conversation delivery
- Pattern for WAV serving and browser audio playback
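Reusing Arena's TTS module implies making its source importable from the new project. A minimal sketch, assuming Arena's source stays at the path listed above; the boolean return lets a caller fall back to a vendored copy of `tts.py` when the Arena checkout is absent:

```python
import sys
from pathlib import Path

# Path taken from the Arena project location described above.
ARENA_SRC = Path("/home/svc-admin/ai-projects/projects/arena/src")

def enable_arena_imports(arena_src: Path = ARENA_SRC) -> bool:
    """Add Arena's src directory to sys.path so `arena.tts` can be imported.

    Returns True if the directory exists and was added, False otherwise,
    so the caller knows to fall back to a vendored copy of the module.
    """
    if arena_src.is_dir():
        sys.path.insert(0, str(arena_src))
        return True
    return False
```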
## Target State
A lightweight Python web server (`stoned-web`) with two browser-facing views:

- Host view (`/host`) — Jason's control panel. Text input box, send button, voice selection per speaker, session start/stop, status display.
- Broadcast view (`/broadcast`) — Clean, OBS-capturable page. Scrolling conversation cards only. No controls. Styled for stream.
Both views receive conversation turns over Server-Sent Events. The broadcast view is the OBS browser source. The host view is what Jason operates on his own screen.
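On the server side, delivering turns to both views reduces to formatting SSE frames that the browsers' `EventSource` connections consume. A minimal sketch; the `turn` event name and JSON payload shape are assumptions, not a fixed protocol:

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame.

    SSE frames are plain text: an `event:` line, a `data:` line, and a
    blank-line terminator. The same frame is written to every connected
    client, so /host and /broadcast see identical turns.
    """
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```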
## Design Principles
- Principle 1: Text-in, voice-out for both sides. The host types; the system voices. The AI generates text; the system voices. No microphone dependency.
- Principle 2: Reuse Arena TTS infrastructure. Do not reimplement Kokoro synthesis. Import and use `ArenaTTSManager` directly from the arena package, or copy the relevant module.
- Principle 3: Broadcast view is read-only. The `/broadcast` URL has zero interactive elements. It exists only for OBS to consume.
- Principle 4: One AI at a time. The session has exactly one human speaker and one AI speaker. Multi-AI is not in scope.
## Major Components
- Component: Web Server (`src/stoned_ai/web.py`)
  - Purpose: HTTP server handling both views, SSE streams, session state, and audio file serving.
  - Responsibilities: Accept host message submissions. Dispatch AI calls. Trigger TTS for both sides. Serve WAV files. Push turns to connected SSE clients.
  - Dependencies: `stoned_ai/tts.py`, `stoned_ai/ai.py`, standard library (`http.server` or a lightweight framework).
- Component: TTS Layer (`src/stoned_ai/tts.py`)
  - Purpose: Synthesize WAV audio for any speaker given a voice ID and text.
  - Responsibilities: Wrap `ArenaTTSManager` (or import the Arena `tts.py` module directly). Store generated WAVs in a session-scoped directory. Return a browser-fetchable path.
  - Dependencies: `pykokoro`, `/opt/models/kokoro`.
- Component: AI Backend (`src/stoned_ai/ai.py`)
  - Purpose: Call the configured AI model and return a clean text response.
  - Responsibilities: Accept conversation history and a prompt. Call the model CLI or API. Return cleaned text. Initially wraps `codex exec` or `gemini -p`; the Claude API is added later.
  - Dependencies: `subprocess` (for CLI backends), the `anthropic` SDK (for the Claude backend, Phase 2).
- Component: Cleaning Engine (`src/stoned_ai/clean.py`)
  - Purpose: Strip CLI noise from AI responses.
  - Responsibilities: Apply regex filters for Codex and Gemini banner lines, warnings, and token counts.
  - Dependencies: None beyond the stdlib. Can be copied from Arena's `clean.py` and extended.
- Component: Broadcast View (`/broadcast`)
  - Purpose: Clean, OBS-capturable HTML page.
  - Responsibilities: Connect to the SSE stream. Render conversation cards. Play audio. Never show controls.
  - Dependencies: Browser-side JavaScript only.
- Component: Host View (`/host`)
  - Purpose: Jason's control panel for operating the show.
  - Responsibilities: Text input and send. Voice selection per speaker. Session start/stop. Status display. Mirrors the conversation feed.
  - Dependencies: Browser-side JavaScript only.
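The Cleaning Engine's line-level regex filtering can be sketched as follows. The specific noise patterns here are placeholders; the real Codex and Gemini banner formats will drive the actual list:

```python
import re

# Hypothetical noise patterns; real Codex/Gemini banner text may differ.
_NOISE_PATTERNS = [
    re.compile(r"^(Loaded cached credentials|Data collection is disabled)\.?$", re.I),
    re.compile(r"^\[?warn(ing)?\]?:", re.I),
    re.compile(r"^tokens? used:\s*\d+", re.I),
]

def clean_response(raw: str) -> str:
    """Drop any line matching a known CLI-noise pattern; keep the rest."""
    kept = [
        line for line in raw.splitlines()
        if not any(p.match(line.strip()) for p in _NOISE_PATTERNS)
    ]
    return "\n".join(kept).strip()
```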
## Data Flow
1. Jason opens `/host` in his browser and `/broadcast` in OBS as a browser source.
2. Jason starts a session, selects voices for himself and the AI, and enters the opening topic or first message.
3. Jason types his message and hits send.
4. The server receives the message and queues it as a "host turn."
5. The server calls Kokoro TTS for Jason's voice, stores the WAV, and pushes the turn to all SSE clients.
6. Both views render the host card. Both play the WAV audio.
7. The server calls the AI backend with the conversation history.
8. The AI returns a text response. The server cleans it.
9. The server calls Kokoro TTS for the AI voice, stores the WAV, and pushes the AI turn to all SSE clients.
10. Both views render the AI card. Both play the WAV audio.
11. Repeat from step 3.
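The turn sequence above can be sketched as a small session object. `synthesize` and `ask_ai` are injected stand-ins for the TTS layer and AI backend, not their real interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    speaker: str   # "host" or "ai"
    text: str
    wav_path: str  # browser-fetchable path to the synthesized audio

@dataclass
class Session:
    """Minimal session state for one host/AI exchange (sketch only)."""
    synthesize: callable   # (speaker, text) -> wav path
    ask_ai: callable       # (history) -> AI reply text
    history: list = field(default_factory=list)

    def host_turn(self, text: str) -> list:
        """Voice the host message, then fetch and voice the AI reply.

        Returns both turns; the web server would push each one to all
        connected SSE clients as it completes.
        """
        host = Turn("host", text, self.synthesize("host", text))
        self.history.append(host)
        reply = self.ask_ai(self.history)   # already-cleaned AI text
        ai = Turn("ai", reply, self.synthesize("ai", reply))
        self.history.append(ai)
        return [host, ai]
```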
## Key Decisions
- Decision 1: Copy or import Arena's TTS module rather than duplicating Kokoro logic.
  - Why: `ArenaTTSManager` is already tested and handles session audio, path safety, and pipeline caching.
  - Tradeoff: Creates a dependency on Arena's internal code. Mitigated by treating it as a stable utility layer.
- Decision 2: Two separate URLs for host and broadcast.
  - Why: The host needs controls, and OBS must not capture controls. Mixing them on one page creates layout complexity and accidental-capture risk.
  - Tradeoff: Two SSE connections instead of one. Acceptable at this scale.
- Decision 3: Start with CLI-based AI backends (Codex/Gemini); add the Claude API in Phase 2.
  - Why: Both CLIs are already present and working on `svc-ai`. Fastest path to a functional prototype.
  - Tradeoff: CLI output noise requires cleaning. The Claude API (Phase 2) is cleaner but needs an API key and the `anthropic` SDK.
- Decision 4: No speech-to-text. The host types.
  - Why: Eliminates microphone capture, audio routing, and STT accuracy problems. Aligns with how Jason already works.
  - Tradeoff: The host must type during the live stream. This is the intended format — the typing is part of the show.
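Decision 3's CLI wrapping might look like the following sketch. The exact argument shapes (`codex exec <prompt>`, `gemini -p <prompt>`) are assumptions based on the commands named above, and the raw stdout still needs the cleaning engine applied:

```python
import subprocess

def ask_cli(prompt: str, backend: str = "codex") -> str:
    """Call a CLI AI backend and return its raw stdout.

    Command shapes follow the decision above; adjust if the installed
    CLI versions take different flags. Output is NOT cleaned here.
    """
    if backend == "codex":
        cmd = ["codex", "exec", prompt]
    elif backend == "gemini":
        cmd = ["gemini", "-p", prompt]
    else:
        raise ValueError(f"unknown backend: {backend}")
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    result.check_returncode()
    return result.stdout
```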
## Rejected Alternatives
- Alternative: Using Arena's existing `arena-web` server with modifications.
  - Why rejected: Arena is an AI-to-AI tool. Retrofitting a human-in-the-loop mode and a separate broadcast view would require significant changes to Arena's core, risking regressions. A clean separate project is lower risk and lower coupling.
- Alternative: Streaming audio from `svc-ai` to a Windows machine via virtual audio cable.
  - Why rejected: The browser-source approach in OBS is simpler, more reliable, and already proven in the Arena project. All audio plays in the browser, which OBS captures directly.
## Open Questions
- Question 1: Should the Claude API backend use `claude-sonnet-4-6` as the default, or should the model be configurable per session?
- Question 2: Should conversation history be capped at a rolling window to prevent prompt length creep, or left unbounded for the initial version?
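If Question 2 lands on a rolling window, a `deque` with `maxlen` is the natural stdlib answer, since it silently drops the oldest turns as new ones arrive. A sketch; `max_turns` is a hypothetical knob, and `None` preserves the unbounded initial-version behavior:

```python
from collections import deque

def make_history(max_turns=None):
    """Conversation history container.

    With max_turns set, the deque evicts the oldest turn on overflow,
    keeping prompt length bounded. With max_turns=None it is unbounded.
    """
    return deque(maxlen=max_turns)
```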
## Signature
- Document role: governing
- Created by: Claude (supervisor)
- Created at: 2026-04-12
- Revision status: initial
- Future revision rule: this document may be revised only by the user or by an explicitly authorized supervisor revision