stoned-ai/docs/02-ARCHITECTURE-PLAN.md

# Architecture Plan

## Current State

- No implementation exists yet. This is a greenfield project.
- The Arena project (`/home/svc-admin/ai-projects/projects/arena`) provides reusable infrastructure:
  - `src/arena/tts.py` — Kokoro TTS backend (`ArenaTTSManager`, `KokoroBackend`)
  - `/opt/models/kokoro` — downloaded Kokoro voice models
  - `pykokoro` — installed Python package
  - Pattern for SSE-based real-time conversation delivery
  - Pattern for WAV serving and browser audio playback

## Target State

A new "Stoned" mode implemented directly within the existing **Arena** project (`/home/svc-admin/ai-projects/projects/arena`).

1. **Host view** (`https://arena.accursedbinkie.com`) — The existing Arena control panel, updated with a "Human Input" box and a "Stoned" mode preset.
2. **Broadcast view** (`/broadcast`) — A new clean, OBS-capturable route added to the Arena web server.

Both views receive conversation turns over the existing Arena SSE stream.
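
As a rough illustration of what a conversation turn might look like on the SSE stream, the sketch below encodes one event frame. The event name `turn` and the payload fields (`speaker`, `text`, `audio_url`) are assumptions made for this example, not Arena's actual wire format.

```python
import json


def format_sse_event(event: str, data: dict) -> bytes:
    """Encode one Server-Sent Events frame: event name, one JSON data line, blank line."""
    # json.dumps escapes newlines, so the payload always fits on a single data line.
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n".encode("utf-8")


# Hypothetical turn payload; the field names are illustrative assumptions.
frame = format_sse_event(
    "turn",
    {"speaker": "host", "text": "Hello", "audio_url": "/audio/0001.wav"},
)
```

Because both views subscribe to the same stream, a single frame like this would be enough for each page to render the card and fetch the WAV.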

## Design Principles

- Principle 1: **Text-in, voice-out for both sides.**
- Principle 2: **Direct integration into Arena.** No separate server. Leverage Arena's `ArenaHub` and `ArenaTTSManager` directly.
- Principle 3: **Broadcast view is read-only.**
- Principle 4: **Human-in-the-loop support.** Add a `human` agent runner to Arena that waits for UI input.
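
One way the `human` agent runner from Principle 4 could wait for UI input is a thread-safe queue between the HTTP handler and the conversation loop. This is a minimal sketch: the class name `HumanRunner` and its methods are hypothetical and do not reflect Arena's real runner interface.

```python
import queue
from typing import Optional


class HumanRunner:
    """Hypothetical agent runner that blocks until the host submits text from the UI."""

    def __init__(self) -> None:
        self._inbox: "queue.Queue[str]" = queue.Queue()

    def submit(self, text: str) -> None:
        """Called by the (assumed) HTTP handler when the Human Input box is submitted."""
        text = text.strip()
        if text:  # ignore empty or whitespace-only submissions
            self._inbox.put(text)

    def next_turn(self, timeout: Optional[float] = None) -> Optional[str]:
        """Wait for the next host message; return None if the timeout expires."""
        try:
            return self._inbox.get(timeout=timeout)
        except queue.Empty:
            return None
```

With `timeout=None` the call blocks indefinitely, which matches a show where the AI only speaks after the host has typed something.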


## Major Components

- Component: **Web Server (`src/stoned_ai/web.py`)**
  - Purpose: HTTP server handling both views, SSE streams, session state, and audio file serving.
  - Responsibilities: Accept host message submissions. Dispatch AI calls. Trigger TTS for both sides. Serve WAV files. Push turns to connected SSE clients.
  - Dependencies: `stoned_ai/tts.py`, `stoned_ai/ai.py`, standard library (`http.server` or a lightweight framework).

- Component: **TTS Layer (`src/stoned_ai/tts.py`)**
  - Purpose: Synthesize WAV audio for any speaker given a voice ID and text.
  - Responsibilities: Wrap `ArenaTTSManager` (or import the Arena `tts.py` module directly). Store generated WAVs in a session-scoped directory. Return a browser-fetchable path.
  - Dependencies: `pykokoro`, `/opt/models/kokoro`.

- Component: **AI Backend (`src/stoned_ai/ai.py`)**
  - Purpose: Call the configured AI model and return a clean text response.
  - Responsibilities: Accept conversation history and a prompt. Call the model CLI or API. Return cleaned text. Initially wraps `codex exec` or `gemini -p`; a Claude API backend is added in Phase 2.
  - Dependencies: `subprocess` (for CLI backends), `anthropic` SDK (for Claude backend, Phase 2).

- Component: **Cleaning Engine (`src/stoned_ai/clean.py`)**
  - Purpose: Strip CLI noise from AI responses.
  - Responsibilities: Apply regex filters for Codex and Gemini banner lines, warnings, token counts.
  - Dependencies: None beyond stdlib. Can be copied from Arena's `clean.py` and extended.

- Component: **Broadcast View (`/broadcast`)**
  - Purpose: Clean, OBS-capturable HTML page.
  - Responsibilities: Connect to the SSE stream. Render conversation cards. Play audio. Never show controls.
  - Dependencies: Browser-side JavaScript only.

- Component: **Host View (`/host`)**
  - Purpose: Jason's control panel for operating the show.
  - Responsibilities: Text input and send. Voice selection per speaker. Session start/stop. Status display. Mirrors the conversation feed.
  - Dependencies: Browser-side JavaScript only.
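
To make the Cleaning Engine's job concrete, here is a minimal line-filter sketch. The patterns are illustrative assumptions only; the real banner, warning, and token-count formats would need to be taken from actual Codex and Gemini output or from Arena's `clean.py`.

```python
import re

# Illustrative patterns only (assumptions): replace with formats observed in
# real CLI output before relying on this filter.
NOISE_PATTERNS = [
    re.compile(r"^\s*tokens used\b.*$", re.IGNORECASE),
    re.compile(r"^\s*warning:.*$", re.IGNORECASE),
]


def clean_response(raw: str) -> str:
    """Drop every line matching a noise pattern and trim surrounding whitespace."""
    kept = [
        line
        for line in raw.splitlines()
        if not any(pattern.match(line) for pattern in NOISE_PATTERNS)
    ]
    return "\n".join(kept).strip()
```

Filtering whole lines keeps the rule set easy to extend: adding a backend-specific banner is one more compiled pattern in the list.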

## Data Flow

1. Jason opens `/host` in his browser and `/broadcast` in OBS as a browser source.
2. Jason starts a session, selects voices for himself and the AI, enters the opening topic or first message.
3. Jason types his message and hits send.
4. Server receives the message, queues it as a "host turn."
5. Server calls Kokoro TTS for Jason's voice, stores the WAV, pushes the turn to all SSE clients.
6. Both views render the host card. Both play the WAV audio.
7. Server calls the AI backend with the conversation history.
8. AI returns a text response. Server cleans it.
9. Server calls Kokoro TTS for the AI voice, stores the WAV, pushes the AI turn to all SSE clients.
10. Both views render the AI card. Both play the WAV audio.
11. Repeat from step 3.
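
Steps 4 through 10 can be sketched as a single exchange function. The TTS, AI, and SSE layers are passed in as callables so the ordering is visible without committing to any real interface; the `Turn` dataclass and all parameter names are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    speaker: str    # "host" or "ai"
    text: str
    audio_url: str  # browser-fetchable WAV path returned by the TTS layer


def run_exchange(
    host_text: str,
    history: List[Turn],
    synthesize: Callable[[str, str], str],  # (speaker, text) -> audio URL
    ask_ai: Callable[[List[Turn]], str],    # history -> cleaned reply text
    publish: Callable[[Turn], None],        # push one turn to all SSE clients
) -> None:
    """Run one host turn followed by one AI turn, publishing each as it is ready."""
    # Steps 4-6: synthesize the host's message and publish it first.
    host_turn = Turn("host", host_text, synthesize("host", host_text))
    history.append(host_turn)
    publish(host_turn)

    # Steps 7-10: ask the AI with the updated history, then synthesize and publish.
    reply = ask_ai(history)
    ai_turn = Turn("ai", reply, synthesize("ai", reply))
    history.append(ai_turn)
    publish(ai_turn)
```

Publishing the host turn before calling the AI means viewers hear the host's line while the model is still generating its reply.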

## Key Decisions

- Decision 1: **Reuse Arena's TTS module (by import or copy) rather than reimplementing Kokoro logic.**
  - Why: `ArenaTTSManager` is already tested and handles session audio, path safety, and pipeline caching.
  - Tradeoff: Creates a dependency on Arena's internal code. Mitigated by treating it as a stable utility layer.

- Decision 2: **Two separate URLs for host and broadcast.**
  - Why: The host needs controls. OBS must not capture controls. Mixing them on one page creates layout complexity and accidental capture risk.
  - Tradeoff: Two SSE connections instead of one. Acceptable at this scale.

- Decision 3: **Start with CLI-based AI backends (Codex/Gemini), add Claude API in Phase 2.**
  - Why: Both CLIs are already present and working on `svc-ai`. Fastest path to a functional prototype.
  - Tradeoff: CLI output noise requires cleaning. Claude API (Phase 2) is cleaner but needs an API key and the `anthropic` SDK.

- Decision 4: **No speech-to-text. Host types.**
  - Why: Eliminates microphone capture, audio routing, and STT accuracy problems. Aligns with how Jason already works.
  - Tradeoff: Host must type during the live stream. This is the intended format — the typing is part of the show.
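
For Decision 3, a CLI-backed AI call might look like the sketch below. The command prefixes `codex exec` and `gemini -p` come from this plan; passing the prompt as the final argument, the timeout value, and every function name here are assumptions to check against each CLI's own help output.

```python
import subprocess
from typing import Callable, List, Sequence

# Command prefixes taken from the plan. Appending the prompt as the last
# argument is an assumption, not confirmed CLI behavior.
BACKEND_COMMANDS = {
    "codex": ["codex", "exec"],
    "gemini": ["gemini", "-p"],
}


def build_command(backend: str, prompt: str) -> List[str]:
    """Return the argv list for one backend call."""
    if backend not in BACKEND_COMMANDS:
        raise ValueError(f"unknown backend: {backend!r}")
    return [*BACKEND_COMMANDS[backend], prompt]


def _run(cmd: Sequence[str], timeout: float) -> str:
    """Run the CLI, raising on a non-zero exit or timeout, and return raw stdout."""
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=True
    )
    return result.stdout


def call_cli_backend(
    backend: str,
    prompt: str,
    timeout: float = 120.0,
    runner: Callable[[Sequence[str], float], str] = _run,
) -> str:
    """Call the chosen CLI and return its uncleaned output."""
    return runner(build_command(backend, prompt), timeout)
```

The injectable `runner` keeps the wrapper testable without either CLI installed, and the returned text would still need to pass through the cleaning step.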

## Rejected Alternatives

- Alternative: Using Arena's existing `arena-web` server with modifications.
  - Why rejected: Arena is an AI-to-AI tool. Retrofitting a human-in-the-loop mode and a separate broadcast view would require significant changes to Arena's core, risking regressions. A clean separate project is lower risk and lower coupling.

- Alternative: Streaming audio from `svc-ai` to a Windows machine via virtual audio cable.
  - Why rejected: The browser-source approach in OBS is simpler, more reliable, and already proven in the Arena project. All audio plays in the browser, which OBS captures directly.

## Open Questions

- Question 1: Should the Claude API backend use `claude-sonnet-4-6` as the default, or should the model be configurable per session?
- Question 2: Should conversation history be capped at a rolling window to prevent prompt length creep, or left unbounded for the initial version?
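
If Question 2 is resolved in favor of a rolling window, the cap could be as small as the sketch below. The function name, the `(speaker, text)` message shape, and the convention that a non-positive limit means unbounded are all assumptions for illustration.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (speaker, text); an assumed shape for this example


def window_history(history: List[Message], max_turns: int) -> List[Message]:
    """Return the most recent max_turns messages; max_turns <= 0 means unbounded."""
    if max_turns <= 0:
        return list(history)
    return history[-max_turns:]
```

Treating the limit as a single integer lets the initial version run unbounded and switch to a cap later without changing the call site.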

## Signature

- Document role: governing
- Created by: Claude (supervisor)
- Created at: 2026-04-12
- Revision status: initial
- Future revision rule: this document may be revised only by the user or by an explicitly authorized supervisor revision