Initialize project governance and baseline structure
Stoned.AI: live-streamed human + AI conversation show, both sides voiced via local Kokoro TTS.
Governance docs 00-09, README, .gitignore.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
112
docs/02-ARCHITECTURE-PLAN.md
Normal file
@@ -0,0 +1,112 @@
# Architecture Plan

## Current State

- No implementation exists yet. This is a greenfield project.
- The Arena project (`/home/svc-admin/ai-projects/projects/arena`) provides reusable infrastructure:
  - `src/arena/tts.py`: Kokoro TTS backend (`ArenaTTSManager`, `KokoroBackend`)
  - `/opt/models/kokoro`: downloaded Kokoro voice models
  - `pykokoro`: installed Python package
  - Pattern for SSE-based real-time conversation delivery
  - Pattern for WAV serving and browser audio playback

## Target State

A lightweight Python web server (`stoned-web`) with two browser-facing views:

1. **Host view** (`/host`): Jason's control panel. Text input box, send button, voice selection per speaker, session start/stop, status display.
2. **Broadcast view** (`/broadcast`): Clean, OBS-capturable page. Scrolling conversation cards only. No controls. Styled for stream.

Both views receive conversation turns over Server-Sent Events. The broadcast view is the OBS browser source. The host view is what Jason operates on his own screen.
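
As an illustration of the SSE delivery described above, the sketch below serializes one conversation turn into the wire format that a browser `EventSource` expects on a `text/event-stream` response. The function name, the `turn` event name, and the payload fields (`speaker`, `text`, `audio_url`) are assumptions made for this example; the plan does not define them.

```python
import json


def format_sse_event(turn: dict, event: str = "turn") -> bytes:
    """Serialize one conversation turn as a single Server-Sent Events frame.

    The field names inside `turn` are hypothetical; the plan does not fix a
    payload schema.
    """
    # json.dumps escapes newlines inside strings, so the payload always fits
    # on one `data:` line.
    payload = json.dumps(turn, ensure_ascii=False)
    # An SSE frame is a set of `field: value` lines terminated by a blank line.
    return f"event: {event}\ndata: {payload}\n\n".encode("utf-8")


if __name__ == "__main__":
    frame = format_sse_event(
        {"speaker": "host", "text": "Welcome to the show.", "audio_url": "/audio/turn-001.wav"}
    )
    print(frame.decode("utf-8"), end="")
```

Because both views consume the same frames, the server can write identical bytes to every connected client and leave rendering differences to each page's JavaScript.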

## Design Principles

- Principle 1: **Text-in, voice-out for both sides.** The host types; the system voices. The AI generates text; the system voices. No microphone dependency.
- Principle 2: **Reuse Arena TTS infrastructure.** Do not reimplement Kokoro synthesis. Import and use `ArenaTTSManager` directly from the arena package or copy the relevant module.
- Principle 3: **Broadcast view is read-only.** The `/broadcast` URL has zero interactive elements. It exists only for OBS to consume.
- Principle 4: **One AI at a time.** The session has exactly one human speaker and one AI speaker. Multi-AI is not in scope.

## Major Components

- Component: **Web Server (`src/stoned_ai/web.py`)**
  - Purpose: HTTP server handling both views, SSE streams, session state, and audio file serving.
  - Responsibilities: Accept host message submissions. Dispatch AI calls. Trigger TTS for both sides. Serve WAV files. Push turns to connected SSE clients.
  - Dependencies: `stoned_ai/tts.py`, `stoned_ai/ai.py`, standard library (`http.server` or a lightweight framework).

- Component: **TTS Layer (`src/stoned_ai/tts.py`)**
  - Purpose: Synthesize WAV audio for any speaker given a voice ID and text.
  - Responsibilities: Wrap `ArenaTTSManager` (or import the Arena `tts.py` module directly). Store generated WAVs in a session-scoped directory. Return a browser-fetchable path.
  - Dependencies: `pykokoro`, `/opt/models/kokoro`.

- Component: **AI Backend (`src/stoned_ai/ai.py`)**
  - Purpose: Call the configured AI model and return a clean text response.
  - Responsibilities: Accept conversation history and a prompt. Call the model CLI or API. Return cleaned text. Initially wraps `codex exec` or `gemini -p`. Claude API added later.
  - Dependencies: `subprocess` (for CLI backends), `anthropic` SDK (for Claude backend, Phase 2).

- Component: **Cleaning Engine (`src/stoned_ai/clean.py`)**
  - Purpose: Strip CLI noise from AI responses.
  - Responsibilities: Apply regex filters for Codex and Gemini banner lines, warnings, token counts.
  - Dependencies: None beyond stdlib. Can be copied from Arena's `clean.py` and extended.

- Component: **Broadcast View (`/broadcast`)**
  - Purpose: Clean, OBS-capturable HTML page.
  - Responsibilities: Connect to the SSE stream. Render conversation cards. Play audio. Never show controls.
  - Dependencies: Browser-side JavaScript only.

- Component: **Host View (`/host`)**
  - Purpose: Jason's control panel for operating the show.
  - Responsibilities: Text input and send. Voice selection per speaker. Session start/stop. Status display. Mirrors the conversation feed.
  - Dependencies: Browser-side JavaScript only.
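
The TTS Layer stores WAVs in a session-scoped directory and the Web Server serves them, so a requested filename has to be confined to that directory. The sketch below shows one generic way to do that check with `pathlib`. It is not Arena's code: `resolve_audio_path` and the example directory are hypothetical, and `ArenaTTSManager` may already cover this case.

```python
from pathlib import Path


def resolve_audio_path(session_dir: Path, requested: str) -> Path:
    """Map a requested filename to a WAV path inside the session directory.

    Hypothetical helper: raises ValueError for anything that is not a .wav
    file or that would resolve outside `session_dir`.
    """
    base = session_dir.resolve()
    # resolve() collapses ".." segments; an absolute `requested` replaces `base`.
    candidate = (base / requested).resolve()
    if candidate.suffix.lower() != ".wav" or not candidate.is_relative_to(base):
        raise ValueError(f"refusing to serve {requested!r}")
    return candidate


if __name__ == "__main__":
    demo_dir = Path("/tmp/stoned-ai-session-demo")  # placeholder directory
    print(resolve_audio_path(demo_dir, "turn-001.wav"))
    for bad in ("../secret.wav", "/etc/passwd"):
        try:
            resolve_audio_path(demo_dir, bad)
        except ValueError as exc:
            print("rejected:", exc)
```

`Path.is_relative_to` requires Python 3.9 or newer; on older interpreters the same check can be written with `candidate.relative_to(base)` inside a `try` block.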

## Data Flow

1. Jason opens `/host` in his browser and `/broadcast` in OBS as a browser source.
2. Jason starts a session, selects voices for himself and the AI, enters the opening topic or first message.
3. Jason types his message and hits send.
4. Server receives the message, queues it as a "host turn."
5. Server calls Kokoro TTS for Jason's voice, stores the WAV, pushes the turn to all SSE clients.
6. Both views render the host card. Both play the WAV audio.
7. Server calls the AI backend with the conversation history.
8. AI returns a text response. Server cleans it.
9. Server calls Kokoro TTS for the AI voice, stores the WAV, pushes the AI turn to all SSE clients.
10. Both views render the AI card. Both play the WAV audio.
11. Repeat from step 3.
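
Steps 4 through 10 can be sketched as a single server-side cycle. Everything named here (`Turn`, `Session`, `handle_host_message`, the per-client queues) is an assumption for illustration. The TTS and AI calls are injected as plain callables so the flow can be exercised without Kokoro or a model CLI.

```python
import queue
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Turn:
    speaker: str    # "host" or "ai"
    text: str
    audio_url: str  # browser-fetchable WAV path returned by the TTS layer


@dataclass
class Session:
    synthesize: Callable[[str, str], str]  # (voice_id, text) -> WAV path
    ask_ai: Callable[[list], str]          # history -> cleaned reply text
    host_voice: str
    ai_voice: str
    history: list = field(default_factory=list)
    subscribers: list = field(default_factory=list)

    def subscribe(self) -> queue.Queue:
        """Register one SSE client (host or broadcast view)."""
        q = queue.Queue()
        self.subscribers.append(q)
        return q

    def _publish(self, speaker: str, voice: str, text: str) -> Turn:
        # Voice the text, record the turn, then fan it out to every client.
        turn = Turn(speaker, text, self.synthesize(voice, text))
        self.history.append(turn)
        for q in self.subscribers:
            q.put(turn)
        return turn

    def handle_host_message(self, text: str) -> None:
        self._publish("host", self.host_voice, text)  # steps 4-6
        reply = self.ask_ai(self.history)             # steps 7-8
        self._publish("ai", self.ai_voice, reply)     # steps 9-10


if __name__ == "__main__":
    session = Session(
        synthesize=lambda voice, text: f"/audio/{voice}-{len(text)}.wav",  # stub TTS
        ask_ai=lambda history: f"echo: {history[-1].text}",                # stub AI
        host_voice="host_voice",
        ai_voice="ai_voice",
    )
    broadcast = session.subscribe()
    session.handle_host_message("hello")
    while not broadcast.empty():
        print(broadcast.get())
```

Publishing the host turn before calling the AI means viewers hear the host's line while the model is still generating, which matches the ordering of steps 5 and 7. A real server would likely run the AI call off the request thread.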

## Key Decisions

- Decision 1: **Copy or import Arena's TTS module rather than duplicating Kokoro logic.**
  - Why: `ArenaTTSManager` is already tested and handles session audio, path safety, and pipeline caching.
  - Tradeoff: Creates a dependency on Arena's internal code. Mitigated by treating it as a stable utility layer.

- Decision 2: **Two separate URLs for host and broadcast.**
  - Why: The host needs controls. OBS must not capture controls. Mixing them on one page creates layout complexity and accidental capture risk.
  - Tradeoff: Two SSE connections instead of one. Acceptable at this scale.

- Decision 3: **Start with CLI-based AI backends (Codex/Gemini), add Claude API in Phase 2.**
  - Why: Both CLIs are already present and working on `svc-ai`. Fastest path to a functional prototype.
  - Tradeoff: CLI output noise requires cleaning. Claude API (Phase 2) is cleaner but needs an API key and the `anthropic` SDK.

- Decision 4: **No speech-to-text. Host types.**
  - Why: Eliminates microphone capture, audio routing, and STT accuracy problems. Aligns with how Jason already works.
  - Tradeoff: Host must type during the live stream. This is the intended format: the typing is part of the show.
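
To make the Decision 3 tradeoff concrete, here is a rough sketch of a CLI-backed call followed by noise cleaning. The `codex exec` and `gemini -p` invocations come from this plan. The function names and the two noise patterns are hypothetical placeholders: real filters would have to be derived from captured CLI output or from Arena's `clean.py`.

```python
import re
import subprocess

# Command prefixes named in the plan; the prompt is appended as the final argument.
CLI_BACKENDS = {
    "codex": ["codex", "exec"],
    "gemini": ["gemini", "-p"],
}

# Hypothetical noise patterns, for illustration only.
NOISE_PATTERNS = [
    re.compile(r"^\s*tokens used\b", re.IGNORECASE),
    re.compile(r"^\s*\[?warning\]?[:\s]", re.IGNORECASE),
]


def build_command(backend: str, prompt: str) -> list[str]:
    """Return the argv list for one CLI call."""
    try:
        prefix = CLI_BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return [*prefix, prompt]


def clean_response(raw: str) -> str:
    """Drop lines matching any noise pattern and trim surrounding whitespace."""
    kept = [
        line
        for line in raw.splitlines()
        if not any(pattern.match(line) for pattern in NOISE_PATTERNS)
    ]
    return "\n".join(kept).strip()


def ask(backend: str, prompt: str, timeout: float = 120.0) -> str:
    """Run the CLI and return cleaned stdout (requires the CLI to be installed)."""
    result = subprocess.run(
        build_command(backend, prompt),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return clean_response(result.stdout)


if __name__ == "__main__":
    print(build_command("gemini", "Say hello."))
    print(clean_response("Warning: slow start\nHello there.\ntokens used: 42\n"))
```

Passing the command as an argument list rather than a shell string keeps the host's typed text from being interpreted by a shell.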

## Rejected Alternatives

- Alternative: Using Arena's existing `arena-web` server with modifications.
  - Why rejected: Arena is an AI-to-AI tool. Retrofitting a human-in-the-loop mode and a separate broadcast view would require significant changes to Arena's core, risking regressions. A clean separate project is lower risk and lower coupling.

- Alternative: Streaming audio from `svc-ai` to a Windows machine via virtual audio cable.
  - Why rejected: The browser-source approach in OBS is simpler, more reliable, and already proven in the Arena project. All audio plays in the browser, which OBS captures directly.

## Open Questions

- Question 1: Should the Claude API backend use `claude-sonnet-4-6` as the default, or should the model be configurable per session?
- Question 2: Should conversation history be capped at a rolling window to prevent prompt length creep, or left unbounded for the initial version?
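
For Question 2, both options can sit behind one small helper, which may make it easier to defer the decision. This is only a sketch: `make_history` and its `max_turns` parameter are hypothetical names, and the window size shown is arbitrary.

```python
from collections import deque
from typing import Optional


def make_history(max_turns: Optional[int] = None) -> deque:
    """Create a history buffer.

    With max_turns=None the buffer is unbounded; with an integer it keeps only
    the most recent `max_turns` entries and silently drops older ones.
    """
    return deque(maxlen=max_turns)


if __name__ == "__main__":
    rolling = make_history(max_turns=3)
    for n in range(5):
        rolling.append(f"turn {n}")
    print(list(rolling))
```

A fixed turn count is the simplest cap. Capping by characters or tokens would track prompt length more closely, at the cost of more bookkeeping.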

## Signature

- Document role: governing
- Created by: Claude (supervisor)
- Created at: 2026-04-12
- Revision status: initial
- Future revision rule: this document may be revised only by the user or by an explicitly authorized supervisor revision