# Architecture Plan

## Current State

- No implementation exists yet. This is a greenfield project.
- The Arena project (`/home/svc-admin/ai-projects/projects/arena`) provides reusable infrastructure:
  - `src/arena/tts.py`: Kokoro TTS backend (`ArenaTTSManager`, `KokoroBackend`)
  - `/opt/models/kokoro`: downloaded Kokoro voice models
  - `pykokoro`: installed Python package
  - Pattern for SSE-based real-time conversation delivery
  - Pattern for WAV serving and browser audio playback

## Target State

A new "Stoned" mode implemented directly within the existing **Arena** project (`/home/svc-admin/ai-projects/projects/arena`).

1. **Host view** (`https://arena.accursedbinkie.com`): The existing Arena control panel, updated with a "Human Input" box and a "Stoned" mode preset.
2. **Broadcast view** (`/broadcast`): A new clean, OBS-capturable route added to the Arena web server.

Both views receive conversation turns over the existing Arena SSE stream.
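
As a rough illustration of what one conversation turn could look like on the SSE wire, the sketch below serializes a turn as a `text/event-stream` frame. The `Turn` fields, the `turn` event name, and the audio path layout are hypothetical; Arena's actual event schema is not described in this plan.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Turn:
    """Hypothetical turn payload; Arena's real schema may differ."""
    turn_id: int
    speaker: str    # e.g. "host" or "ai"
    text: str
    audio_url: str  # browser-fetchable WAV path (assumed layout)


def format_sse(turn: Turn, event: str = "turn") -> str:
    """Serialize one turn as a Server-Sent Events frame."""
    # json.dumps escapes newlines, so the payload fits on a single data: line.
    data = json.dumps(asdict(turn))
    # A blank line terminates the frame.
    return f"id: {turn.turn_id}\nevent: {event}\ndata: {data}\n\n"
```

Under this assumption, the host and broadcast views would parse the same frame, so neither page needs its own payload format.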

## Design Principles

- Principle 1: **Text-in, voice-out for both sides.** (Unchanged)
- Principle 2: **Direct integration into Arena.** No separate server. Leverage Arena's `ArenaHub` and `ArenaTTSManager` directly.
- Principle 3: **Broadcast view is read-only.** (Unchanged)
- Principle 4: **Human-in-the-loop support.** Add a `human` agent runner to Arena that waits for UI input.
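
One way Principle 4 might be realized is a runner that blocks on a thread-safe queue until the host submits text. This is a minimal sketch, not Arena's runner interface: the class name `HumanRunner` and the methods `submit` and `next_turn` are hypothetical.

```python
import queue
from typing import Optional


class HumanRunner:
    """Hypothetical agent runner that waits for text from the host UI."""

    def __init__(self) -> None:
        self._inbox: "queue.Queue[str]" = queue.Queue()

    def submit(self, text: str) -> None:
        """Called by the web handler when the Human Input box is submitted."""
        text = text.strip()
        if text:  # ignore empty or whitespace-only submissions
            self._inbox.put(text)

    def next_turn(self, timeout: Optional[float] = None) -> Optional[str]:
        """Block until a message arrives; return None if the timeout expires."""
        try:
            return self._inbox.get(timeout=timeout)
        except queue.Empty:
            return None
```

A queue keeps the HTTP handler thread and the conversation loop decoupled: the handler only calls `submit`, and the loop only calls `next_turn`.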

## Major Components

- Component: **Web Server (`src/stoned_ai/web.py`)**
  - Purpose: HTTP server handling both views, SSE streams, session state, and audio file serving.
  - Responsibilities: Accept host message submissions. Dispatch AI calls. Trigger TTS for both sides. Serve WAV files. Push turns to connected SSE clients.
  - Dependencies: `stoned_ai/tts.py`, `stoned_ai/ai.py`, standard library (`http.server` or a lightweight framework).

- Component: **TTS Layer (`src/stoned_ai/tts.py`)**
  - Purpose: Synthesize WAV audio for any speaker given a voice ID and text.
  - Responsibilities: Wrap `ArenaTTSManager` (or import the Arena `tts.py` module directly). Store generated WAVs in a session-scoped directory. Return a browser-fetchable path.
  - Dependencies: `pykokoro`, `/opt/models/kokoro`.

- Component: **AI Backend (`src/stoned_ai/ai.py`)**
  - Purpose: Call the configured AI model and return a clean text response.
  - Responsibilities: Accept conversation history and a prompt. Call the model CLI or API. Return cleaned text. Initially wraps `codex exec` or `gemini -p`. Claude API added later.
  - Dependencies: `subprocess` (for CLI backends), `anthropic` SDK (for Claude backend, Phase 2).

- Component: **Cleaning Engine (`src/stoned_ai/clean.py`)**
  - Purpose: Strip CLI noise from AI responses.
  - Responsibilities: Apply regex filters for Codex and Gemini banner lines, warnings, token counts.
  - Dependencies: None beyond stdlib. Can be copied from Arena's `clean.py` and extended.

- Component: **Broadcast View (`/broadcast`)**
  - Purpose: Clean, OBS-capturable HTML page.
  - Responsibilities: Connect to the SSE stream. Render conversation cards. Play audio. Never show controls.
  - Dependencies: Browser-side JavaScript only.

- Component: **Host View (`/host`)**
  - Purpose: Jason's control panel for operating the show.
  - Responsibilities: Text input and send. Voice selection per speaker. Session start/stop. Status display. Mirrors the conversation feed.
  - Dependencies: Browser-side JavaScript only.
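
The Cleaning Engine's line-filtering approach could look roughly like the sketch below. The patterns are illustrative placeholders only: the real Codex and Gemini banner, warning, and token-count formats would have to be collected from actual CLI output, and Arena's existing `clean.py` may structure this differently.

```python
import re

# Illustrative patterns only (assumptions, not observed CLI output).
NOISE_PATTERNS = [
    re.compile(r"^\s*warning:", re.IGNORECASE),
    re.compile(r"^\s*tokens used\b", re.IGNORECASE),
    re.compile(r"^\s*-{3,}\s*$"),  # horizontal-rule style banner lines
]


def clean_response(raw: str) -> str:
    """Drop lines matching any noise pattern, then collapse blank runs."""
    kept = [
        line
        for line in raw.splitlines()
        if not any(p.search(line) for p in NOISE_PATTERNS)
    ]
    text = "\n".join(kept)
    # Allow at most one blank line in a row after noise lines are removed.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Filtering whole lines rather than substrings keeps the risk of eating real response text low, at the cost of missing noise that shares a line with content.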

## Data Flow

1. Jason opens `/host` in his browser and `/broadcast` in OBS as a browser source.
2. Jason starts a session, selects voices for himself and the AI, and enters the opening topic or first message.
3. Jason types his message and hits send.
4. Server receives the message and queues it as a "host turn."
5. Server calls Kokoro TTS for Jason's voice, stores the WAV, and pushes the turn to all SSE clients.
6. Both views render the host card. Both play the WAV audio.
7. Server calls the AI backend with the conversation history.
8. AI returns a text response. Server cleans it.
9. Server calls Kokoro TTS for the AI voice, stores the WAV, and pushes the AI turn to all SSE clients.
10. Both views render the AI card. Both play the WAV audio.
11. Repeat from step 3.
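
Steps 4 through 10 can be sketched as a single function with the TTS, AI, cleaning, and SSE pieces injected as callables. Every name here (`run_exchange`, the dict keys, the callable signatures) is hypothetical and chosen only to make the ordering concrete; it is not Arena's API.

```python
from typing import Callable, Dict, List

TurnDict = Dict[str, str]


def run_exchange(
    host_text: str,
    history: List[TurnDict],
    synthesize: Callable[[str, str], str],   # (speaker, text) -> WAV URL
    ask_ai: Callable[[List[TurnDict]], str], # history -> raw model reply
    clean: Callable[[str], str],             # raw reply -> cleaned text
    publish: Callable[[TurnDict], None],     # push a turn to all SSE clients
) -> None:
    """Run one host turn followed by one AI turn."""
    # Steps 4-6: synthesize the host's text, record it, and broadcast it.
    host_turn = {"speaker": "host", "text": host_text,
                 "audio": synthesize("host", host_text)}
    history.append(host_turn)
    publish(host_turn)

    # Steps 7-10: query the AI with the full history, clean, synthesize, broadcast.
    reply = clean(ask_ai(history))
    ai_turn = {"speaker": "ai", "text": reply,
               "audio": synthesize("ai", reply)}
    history.append(ai_turn)
    publish(ai_turn)
```

Injecting the callables keeps the ordering testable without Kokoro or a model CLI installed.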

## Key Decisions

- Decision 1: **Copy or import Arena's TTS module rather than duplicating Kokoro logic.**
  - Why: `ArenaTTSManager` is already tested and handles session audio, path safety, and pipeline caching.
  - Tradeoff: Creates a dependency on Arena's internal code. Mitigated by treating it as a stable utility layer.

- Decision 2: **Two separate URLs for host and broadcast.**
  - Why: The host needs controls. OBS must not capture controls. Mixing them on one page creates layout complexity and accidental capture risk.
  - Tradeoff: Two SSE connections instead of one. Acceptable at this scale.

- Decision 3: **Start with CLI-based AI backends (Codex/Gemini), add Claude API in Phase 2.**
  - Why: Both CLIs are already present and working on `svc-ai`. Fastest path to a functional prototype.
  - Tradeoff: CLI output noise requires cleaning. Claude API (Phase 2) is cleaner but needs an API key and the `anthropic` SDK.

- Decision 4: **No speech-to-text. Host types.**
  - Why: Eliminates microphone capture, audio routing, and STT accuracy problems. Aligns with how Jason already works.
  - Tradeoff: Host must type during the live stream. This is the intended format: the typing is part of the show.
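
For Decision 3, a CLI-backed call might be wrapped as below. The base commands `codex exec` and `gemini -p` come from this plan; passing the prompt as the final argument, the `timeout` default, and the function names are assumptions, and no other flags are shown because none are specified here.

```python
import subprocess
from typing import List

# Base commands as named in this plan; further flags are deliberately omitted.
BACKENDS = {
    "codex": ["codex", "exec"],
    "gemini": ["gemini", "-p"],
}


def build_command(backend: str, prompt: str) -> List[str]:
    """Return the argv list for one non-interactive CLI call."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    # Assumption: the prompt is passed as the last argument.
    return [*BACKENDS[backend], prompt]


def call_backend(backend: str, prompt: str, timeout: float = 120.0) -> str:
    """Run the CLI and return raw stdout; noise cleaning happens separately."""
    result = subprocess.run(
        build_command(backend, prompt),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing an argv list instead of a shell string avoids quoting problems when the host's message contains quotes or shell metacharacters.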

## Rejected Alternatives

- Alternative: Using Arena's existing `arena-web` server with modifications.
  - Why rejected: Arena is an AI-to-AI tool. Retrofitting a human-in-the-loop mode and a separate broadcast view would require significant changes to Arena's core, risking regressions. A clean separate project is lower risk and lower coupling.

- Alternative: Streaming audio from `svc-ai` to a Windows machine via virtual audio cable.
  - Why rejected: The browser-source approach in OBS is simpler, more reliable, and already proven in the Arena project. All audio plays in the browser, which OBS captures directly.

## Open Questions

- Question 1: Should the Claude API backend use `claude-sonnet-4-6` as the default, or should the model be configurable per session?
- Question 2: Should conversation history be capped at a rolling window to prevent prompt length creep, or left unbounded for the initial version?
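
To make Question 2 concrete, the sketch below shows how one history container could support both options, so the choice becomes a configuration value and not a code change. The class name `History` and the `max_turns` parameter are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class History:
    """Conversation history with an optional rolling-window cap."""

    def __init__(self, max_turns: Optional[int] = None) -> None:
        # deque(maxlen=None) is unbounded; an integer drops the oldest turns.
        self._turns: Deque[Dict[str, str]] = deque(maxlen=max_turns)

    def add(self, speaker: str, text: str) -> None:
        self._turns.append({"speaker": speaker, "text": text})

    def as_list(self) -> List[Dict[str, str]]:
        """Return the retained turns, oldest first."""
        return list(self._turns)
```

With `History(max_turns=2)`, adding a third turn silently discards the first; `History()` keeps everything. A turn-count cap is only a proxy for prompt length, since individual turns vary in size.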

## Signature

- Document role: governing
- Created by: Claude (supervisor)
- Created at: 2026-04-12
- Revision status: initial
- Future revision rule: this document may be revised only by the user or by an explicitly authorized supervisor revision