Architecture
Voice Mirror is a voice-native IDE: you build real desktop apps and websites by voice, watch them render live in a sandbox App Preview, and the in-app AI can see and drive the running app — the same surface you watch. The north-star loop is voice → build → see → fix.
It is a Tauri 2 desktop application with a Rust backend and a Svelte 5 frontend. It runs as a floating orb overlay that expands into a full VS Code-style workspace. Voice, screen awareness, terminal, and browser automation are supporting capabilities around the see-and-drive core.
Voice Mirror is currently in alpha, under active development, and Windows-first. The App Preview see-and-drive loop, native-app driving, and push-to-talk are Windows-only today; macOS and Linux support is planned.
System Overview
    Tauri 2 desktop app (transparent, always-on-top, frameless)
    |
    +-- Rust backend (src-tauri/)
    |     Voice pipeline: STT (Whisper / whisper.cpp), TTS (Kokoro + Edge fallback),
    |                     VAD (Silero), wake word (OpenWakeWord)
    |     App Preview:    live capture + see-and-drive (CDP + UI Automation engines)
    |     MCP server:     45 tools across 5 groups (native Rust binary, stdio JSON-RPC)
    |     Providers:      CLI agents (PTY) + local API + cloud API
    |     IPC:            Named pipe (length-prefixed JSON) between MCP binary and Tauri app
    |     Services:       capture/stream, window-follow, browser bridge, dev-server
    |                     manager, file/inbox watchers, input hook, logger
    |
    +-- Svelte 5 frontend (src/)
    |     VS Code-style IDE: command palette, editor tabs, file tree, LSP,
    |                        integrated terminals, dev-server status bar, chat
    |     Floating orb overlay with animated voice states
    |
    +-- Vite build system (ghostty-web WASM terminal, CodeMirror 6 editor)

App Preview: See and Drive
App Preview is the centerpiece of the architecture. It is a live, true-size view of the app being built, so you and the AI look at the same running surface.
How the AI sees — a live stream of the running app:
| Source | Capture method |
|---|---|
| Web / Tauri / WebView2 / Electron | CDP screencast |
| Native Windows windows | Windows Graphics Capture (MJPEG to localhost) |
How the AI drives — it reads an accessibility/element tree exposed as @e{n} refs, then clicks and types against those refs. There are two engines behind one identical tool surface:
| Engine | Drives | Notes |
|---|---|---|
| CDP | Web, Tauri, WebView2, Electron apps | AX tree via Accessibility.getFullAXTree; transport-agnostic parser shared between the in-process Lens webview (WebView2 COM) and external apps launched with --remote-debugging-port (CDP WebSocket) |
| UI Automation (UIA) | Native Windows apps — Notepad, Calculator, Settings, Win32 / WinForms / WPF / Qt | Microsoft UI Automation COM API on a dedicated MTA worker thread |
Both engines emit the same - {role} "{name}" @e{n} tree lines and the same JSON shape, so the AI cannot tell which engine is underneath. As a result, the AI drives a native, non-web Windows app through the same @ref model it uses for websites, which goes beyond what a typical IDE offers.
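The shared tree format can be sketched as a depth-first walk that emits one line per node and records what each ref points at. This is a minimal illustration, not the project's parser: `UiNode`, `NodePath`, and `render_tree` are hypothetical names, refs are assumed to start at `@e1`, and nesting is assumed to be shown with two-space indentation.

```rust
use std::collections::HashMap;

/// Hypothetical node shape; a real engine would carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct UiNode {
    pub role: String,
    pub name: String,
    pub children: Vec<UiNode>,
}

/// Stand-in for an engine-specific handle (for example a CDP node id or a
/// UIA element). Modeled here as the child-index path from the root.
pub type NodePath = Vec<usize>;

/// Walks the tree depth-first, emitting one `- {role} "{name}" @e{n}` line
/// per node and mapping each ref to the node it names.
pub fn render_tree(root: &UiNode) -> (Vec<String>, HashMap<String, NodePath>) {
    let mut lines = Vec::new();
    let mut refs = HashMap::new();
    let mut path = Vec::new();
    walk(root, 0, &mut path, &mut lines, &mut refs);
    (lines, refs)
}

fn walk(
    node: &UiNode,
    depth: usize,
    path: &mut NodePath,
    lines: &mut Vec<String>,
    refs: &mut HashMap<String, NodePath>,
) {
    // Assumption: refs are numbered in visit order, starting at 1.
    let id = format!("@e{}", refs.len() + 1);
    let indent = "  ".repeat(depth);
    lines.push(format!("{}- {} \"{}\" {}", indent, node.role, node.name, id));
    refs.insert(id, path.clone());
    for (i, child) in node.children.iter().enumerate() {
        path.push(i);
        walk(child, depth + 1, path, lines, refs);
        path.pop();
    }
}
```

Because a click or type action only needs the ref string, either engine can resolve `@e3` against its own handle type without the caller knowing which one is in use.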
Two-way focus sync (window-follow): an OS focus-event hook (SetWinEventHook) arbitrates between the window the user last brought to the foreground and the window the AI last drove. The most recent action wins, with a short hysteresis to avoid thrashing. The preview auto-follows whichever window was last touched and, if that window closes, re-points to a surviving window.
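The arbitration rule can be sketched as a small state machine that is fed both user focus events and AI drive events. This is an illustration under stated assumptions, not the actual window-follow service: `FollowArbiter`, `WindowId`, and the millisecond timestamps are hypothetical, and hysteresis is modeled as a minimum time between target switches.

```rust
/// Hypothetical window handle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WindowId(pub u64);

pub struct FollowArbiter {
    hysteresis_ms: u64,
    target: Option<WindowId>,
    last_switch_ms: u64,
    /// Last-touch time for every window that is still open.
    touches: Vec<(WindowId, u64)>,
}

impl FollowArbiter {
    pub fn new(hysteresis_ms: u64) -> Self {
        FollowArbiter { hysteresis_ms, target: None, last_switch_ms: 0, touches: Vec::new() }
    }

    /// Call when the user focuses `window` or the AI drives it.
    pub fn touch(&mut self, window: WindowId, now_ms: u64) -> Option<WindowId> {
        match self.touches.iter_mut().find(|(w, _)| *w == window) {
            Some(entry) => entry.1 = now_ms,
            None => self.touches.push((window, now_ms)),
        }
        self.settle(now_ms)
    }

    /// Call when a window closes; a closed target is replaced immediately.
    pub fn closed(&mut self, window: WindowId, now_ms: u64) -> Option<WindowId> {
        self.touches.retain(|(w, _)| *w != window);
        if self.target == Some(window) {
            self.target = self.most_recent();
            self.last_switch_ms = now_ms;
        }
        self.target
    }

    /// Re-evaluates the target; intended to run on a timer as well, so a
    /// touch that arrived during the hysteresis window is applied later.
    pub fn settle(&mut self, now_ms: u64) -> Option<WindowId> {
        let winner = self.most_recent();
        let settled = now_ms.saturating_sub(self.last_switch_ms) >= self.hysteresis_ms;
        if winner != self.target && (self.target.is_none() || settled) {
            self.target = winner;
            self.last_switch_ms = now_ms;
        }
        self.target
    }

    fn most_recent(&self) -> Option<WindowId> {
        self.touches.iter().max_by_key(|(_, t)| *t).map(|(w, _)| *w)
    }
}
```

Treating user and AI events identically keeps the rule to "latest touch wins"; the hysteresis only delays a switch, it never discards the newer touch.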
App Preview and native-app driving are Windows-only today.
The MCP tools behind App Preview live in the capture group: sandbox_start, sandbox_attach, sandbox_snapshot, sandbox_screenshot, sandbox_click, sandbox_type, sandbox_close_window, plus capture_list_windows, capture_window, capture_browser, and list_ports.
Two Binaries, One Crate
Voice Mirror compiles into two Rust binaries from a single Cargo crate.
1. Tauri App Binary
The main desktop application. It manages windows, IPC, the voice pipeline, App Preview capture/streaming, providers, and the backend services (capture/stream, window-follow, browser bridge, dev-server manager, file watcher, inbox watcher, input hook, logger).
2. MCP Binary (voice-mirror-mcp)
A separate native Rust binary that implements the MCP (Model Context Protocol) server. It speaks stdio JSON-RPC to AI agents (such as Claude Code) and talks to the Tauri app over a named pipe (with a file-inbox fallback).
Key files:
- src-tauri/src/bin/mcp.rs: binary entry point
- src-tauri/src/mcp/server.rs: JSON-RPC protocol handler
- src-tauri/src/mcp/tools.rs: tool registry (45 tools, 5 groups)
- src-tauri/src/mcp/pipe_router.rs: concurrent message routing
Communication flow:
    Claude Code <-> stdio JSON-RPC <-> voice-mirror-mcp <-> named pipe <-> Tauri app

The Electron / Node MCP server still in the repo is legacy and is not the active target.
Voice Pipeline
The entire voice pipeline is native Rust, with no Python backend.
Speech-to-Text (STT):
| Engine | Notes |
|---|---|
| Whisper (local, whisper.cpp) | Default; model size base (tiny/base/small downloadable). Runs on CPU; optional CUDA GPU acceleration (stt_use_gpu) speeds up dictation and falls back to CPU when no GPU is available |
Text-to-Speech (TTS):
| Engine | Notes |
|---|---|
| Kokoro (local ONNX) | Default; runs locally on CPU, with GPU acceleration when available |
| Edge TTS | Free Microsoft cloud voices; automatic fallback when the local Kokoro model is not present |
Voice Activity Detection (VAD): Silero ONNX, with an energy-based fallback.
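An energy-based fallback of the kind mentioned above is usually a root-mean-square threshold with a short hangover so brief pauses do not cut speech off. The sketch below illustrates that general technique only; `EnergyVad`, the threshold, and the hangover length are assumptions and not values taken from Voice Mirror.

```rust
/// Root-mean-square energy of one frame of mono samples in [-1.0, 1.0].
pub fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

/// Hypothetical energy-threshold detector with a hangover period.
pub struct EnergyVad {
    threshold: f32,
    hangover_frames: u32,
    quiet_run: u32,
    speaking: bool,
}

impl EnergyVad {
    pub fn new(threshold: f32, hangover_frames: u32) -> Self {
        EnergyVad { threshold, hangover_frames, quiet_run: 0, speaking: false }
    }

    /// Feeds one frame and returns whether speech is considered active.
    pub fn push_frame(&mut self, frame: &[f32]) -> bool {
        if rms(frame) >= self.threshold {
            // Loud frame: speech starts or continues.
            self.speaking = true;
            self.quiet_run = 0;
        } else if self.speaking {
            // Quiet frame while speaking: tolerate up to `hangover_frames`.
            self.quiet_run += 1;
            if self.quiet_run > self.hangover_frames {
                self.speaking = false;
                self.quiet_run = 0;
            }
        }
        self.speaking
    }
}
```

A fixed threshold is simple but sensitive to microphone gain, which is one reason a model-based detector such as Silero is the primary path and this approach only the fallback.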
The voice stack runs fully on CPU and offline by default; enabling GPU acceleration (CUDA) makes dictation and speech noticeably faster.
Wake word: OpenWakeWord, default phrase “hey_claude”.
TTS is sentence-level streaming and interruptible.
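Sentence-level streaming with interruption can be sketched as: split the reply into sentences, synthesize them one at a time, and check an interrupt flag before each one. This is a simplified illustration; `split_sentences`, `speak_streaming`, and the punctuation-based boundary rule are assumptions, and the `synth` callback stands in for whichever TTS engine is active.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Splits at '.', '!' or '?' when followed by whitespace or end of input,
/// so a decimal such as "1.2" is not treated as a boundary.
pub fn split_sentences(text: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        current.push(c);
        let at_boundary = matches!(c, '.' | '!' | '?')
            && chars.peek().map_or(true, |n| n.is_whitespace());
        if at_boundary {
            let sentence = current.trim().to_string();
            if !sentence.is_empty() {
                out.push(sentence);
            }
            current.clear();
        }
    }
    let tail = current.trim();
    if !tail.is_empty() {
        out.push(tail.to_string());
    }
    out
}

/// Speaks sentence by sentence and stops as soon as `interrupt` is set.
/// Returns the number of sentences handed to `synth`.
pub fn speak_streaming<F: FnMut(&str)>(
    text: &str,
    interrupt: &AtomicBool,
    mut synth: F,
) -> usize {
    let mut spoken = 0;
    for sentence in split_sentences(text) {
        if interrupt.load(Ordering::Relaxed) {
            break;
        }
        synth(&sentence);
        spoken += 1;
    }
    spoken
}
```

Working per sentence means audio for the first sentence can start before the rest is synthesized, and an interruption costs at most one sentence of latency in this model.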
Activation Modes:
| Mode | Trigger | Use Case |
|---|---|---|
| Wake Word (default) | “Hey Claude” | Hands-free, background listening |
| Push-to-Talk (Windows) | Mouse button or hotkey | Quick questions |
| Call Mode | Always on | Continuous conversation |
AI Providers
Switch providers without restarting; API keys are auto-detected from environment variables.
CLI agents (spawned in a PTY terminal): Claude Code, OpenCode, Codex, Gemini CLI, Kimi CLI.
Local API providers (auto-detected): Ollama, LM Studio, Jan.
Cloud API providers: OpenAI, Anthropic, Google Gemini, Grok (xAI), Groq, Mistral, OpenRouter, DeepSeek.
75+ models are also reachable via OpenCode’s gateway.
IPC: Named Pipes
The Tauri app and MCP binary communicate over Windows named pipes using length-prefixed JSON frames.
- protocol.rs: defines the McpToApp and AppToMcp message enums
- pipe_server.rs (Tauri side): dispatches incoming requests to handlers
- pipe_client.rs (MCP side): connects and sends/receives frames
- pipe_router.rs (MCP side): routes responses by request_id (oneshot for browser/sandbox responses, mpsc for user messages)
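Length-prefixed framing can be sketched over any byte stream, which is how a named pipe appears to the reader and writer. The text does not state the prefix width or byte order, so the 4-byte little-endian length and the `MAX_FRAME_LEN` guard below are assumptions; `write_frame` and `read_frame` are hypothetical names, and the JSON is treated as an opaque string.

```rust
use std::io::{self, Read, Write};

/// Hypothetical upper bound that rejects corrupt or hostile length prefixes.
const MAX_FRAME_LEN: usize = 16 * 1024 * 1024;

/// Writes one frame: a 4-byte little-endian length (assumed), then the JSON bytes.
pub fn write_frame<W: Write>(w: &mut W, json: &str) -> io::Result<()> {
    if json.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    let len = json.len() as u32;
    w.write_all(&len.to_le_bytes())?;
    w.write_all(json.as_bytes())?;
    w.flush()
}

/// Reads one frame, blocking until the whole body has arrived.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<String> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_le_bytes(len_buf) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut body = vec![0u8; len];
    r.read_exact(&mut body)?;
    String::from_utf8(body).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

The length prefix lets the reader recover message boundaries without scanning for delimiters, so a JSON payload may contain newlines or any other byte sequence.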
Browser Automation
Browser automation is CDP-based. The MCP browser_action tool routes through the named pipe to the Tauri app’s in-process Lens WebView2, which is driven via WebView2’s CallDevToolsProtocolMethod (CDP). Accessibility snapshots use Accessibility.getFullAXTree, parsed into the same @e{n} ref map that subsequent click/type/etc. actions target. This is the same CDP path and AX-tree parser that the web/Tauri/Electron App Preview uses.
- services/browser_bridge.rs: dispatches browser actions; uses WebView2 ExecuteScript for JS return values, CapturePreview for screenshots, and CallDevToolsProtocolMethod for CDP
- services/cdp.rs: transport-agnostic AX-tree parser and @ref model, shared by the browser bridge and the sandbox CDP path
- Direct HTTP sub-actions (search, fetch) use reqwest without the pipe
MCP Tools (45 Total, 5 Groups)
Tools load dynamically, so only what’s needed is active at any time, and auto-unload after roughly 15 idle calls. The default voice-assistant profile loads core + memory + browser (with capture always loaded on top). See the MCP Tools Reference for the complete list.
| Group | Always loaded | Tools | Count |
|---|---|---|---|
| core | yes | voice_send, voice_inbox, voice_listen, voice_status, get_logs | 5 |
| capture | yes | App Preview see-and-drive: sandbox_* (start/attach/snapshot/screenshot/click/type/close_window), capture_list_windows, capture_window, capture_browser, list_ports | 11 |
| memory | no | memory_search, memory_get, memory_remember, memory_forget, memory_stats, memory_flush | 6 |
| browser | no | browser_action — one unified tool dispatching ~50 sub-actions (navigate, click, type, screenshot, snapshot, search, fetch, cookies, storage, …) | 1 |
| n8n | no | Workflow / execution / credential / tag / node automation | 22 |
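The load and auto-unload behavior can be sketched as a per-group idle counter that resets when a group is used and advances when a call goes elsewhere. This is one plausible reading of "roughly 15 idle calls", not the actual registry: `ToolGroups`, `IDLE_LIMIT`, and the counting rule are assumptions.

```rust
use std::collections::HashMap;

/// Assumed idle threshold, taken from "roughly 15 idle calls".
const IDLE_LIMIT: u32 = 15;

struct GroupState {
    always_loaded: bool,
    loaded: bool,
    idle_calls: u32,
}

/// Hypothetical registry tracking which tool groups are currently active.
pub struct ToolGroups {
    groups: HashMap<&'static str, GroupState>,
}

impl ToolGroups {
    /// Starts with only the always-loaded groups (core, capture) active.
    pub fn new() -> Self {
        let table = [
            ("core", true),
            ("capture", true),
            ("memory", false),
            ("browser", false),
            ("n8n", false),
        ];
        let mut groups = HashMap::new();
        for &(name, always) in table.iter() {
            groups.insert(
                name,
                GroupState { always_loaded: always, loaded: always, idle_calls: 0 },
            );
        }
        ToolGroups { groups }
    }

    /// Loads a group on demand; returns false for an unknown group.
    pub fn load(&mut self, group: &str) -> bool {
        match self.groups.get_mut(group) {
            Some(state) => {
                state.loaded = true;
                state.idle_calls = 0;
                true
            }
            None => false,
        }
    }

    pub fn is_loaded(&self, group: &str) -> bool {
        self.groups.get(group).map_or(false, |s| s.loaded)
    }

    /// Records a tool call. Calls into unloaded groups are rejected; every
    /// other optional group ages by one and unloads at the idle limit.
    pub fn record_call(&mut self, group: &str) -> bool {
        if !self.is_loaded(group) {
            return false;
        }
        for (name, state) in self.groups.iter_mut() {
            if *name == group {
                state.idle_calls = 0;
            } else if state.loaded && !state.always_loaded {
                state.idle_calls += 1;
                if state.idle_calls >= IDLE_LIMIT {
                    state.loaded = false;
                    state.idle_calls = 0;
                }
            }
        }
        true
    }
}
```

Under this model the default voice-assistant profile would amount to calling `load("memory")` and `load("browser")` after `new()`.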
VS Code-Style IDE Surface
The expanded panel is a multi-panel workspace combining a code editor, file tree, live browser preview, integrated terminals, and chat.
- Command palette with Go-to-File / Line / Symbol
- LSP integration — definitions, references, rename, format
- Per-tab editor buffers — CodeMirror 6 (JS, TS, Rust, CSS, HTML, JSON, Markdown, Python) with dirty indicators and a diff viewer
- Integrated terminals — xterm / ghostty-web WASM emulator
- Dev-server manager — Node and Python detection / auto-start (including venv setup), with bottom status-bar start / stop / restart
- Real-time chat streaming with inline tool-activity cards
- First-run “Get Started” 9-step tutorial
The live browser preview is a native WebView2 child window positioned by absolute pixel coordinates so it renders above DOM elements; panel resizing syncs bounds via a ResizeObserver.
Memory
A 3-tier persistent memory system (core / stable / notes) with hybrid semantic + keyword search, exposed through the memory MCP group.
Self-Diagnostics
Voice Mirror runs runtime self-diagnostics: component health contracts, crash/hang self-reporting, and structured logging routed into per-channel Output logs (app, cli, voice, mcp, browser, frontend, preview, plus dynamic per-project channels). Logs are queryable from the AI side via the get_logs tool.
UI States
Floating Orb (Collapsed)
A draggable circle with animated state indicators:
| State | Animation |
|---|---|
| Idle / Listening | Gentle pulse |
| Recording | Fast pulse |
| Speaking | Wave effect |
| Thinking | Spin |
Theme System
- Colorblind-safe default preset, plus a Light preset
- Custom themes — import / export as JSON, with real-time preview
Security Model
- In-process browser: browser tools drive the built-in WebView2 (via CDP), isolated from your everyday browser sessions
- Tool-mediated actions: every action flows through LLM → tool schema → handler
- MCP tool gating: groups load on demand; the LLM can only call loaded tools
- Capture on request: App Preview capture and screenshots are triggered by tool calls or user request, not passive
- Destructive tool confirmation: tools like memory_forget require explicit confirmation
Getting Started
    git clone https://github.com/contextmirror/voice-mirror
    npm install
    npm run dev

See Development for the full setup, including Rust toolchain requirements.