# Voice Service Page Requirements

This document contains the complete frontend requirements for the voice service page, enabling users to interact with Google Cloud voice services including Speech-to-Text, Translation, Text-to-Speech, and real-time voice interpretation. All UI components should use user settings as defaults but allow overrides.

## Table of Contents

1. [Overview](#overview)
2. [Page Structure and Layout](#page-structure-and-layout)
3. [User Interactions and Functionality](#user-interactions-and-functionality)
4. [Backend Routes and API Integration](#backend-routes-and-api-integration)
5. [Field and Attribute Reference](#field-and-attribute-reference)
6. [Dynamic Rendering Guidelines](#dynamic-rendering-guidelines)

---

## Overview

The voice service page enables users to interact with Google Cloud voice services through a unified interface. The frontend consists of a single page (`/voice` or `/voice-service`) with different sections/tabs for each feature:

- **Speech-to-Text Section** - Record speech and convert to text
- **Translation Section** - Translate text between languages
- **Voice Interpretation Section** - Record speech and get translated text
- **Text-to-Speech Section** - Convert text to spoken audio
- **Settings Section** - Configure default voice settings
- **Real-time WebSocket Features** - Live streaming for speech-to-text and text-to-speech

All features use user settings as defaults (loaded from `/api/voice-google/settings`) but allow users to override settings per operation. The page supports both HTTP REST endpoints for file-based processing and WebSocket endpoints for real-time streaming.
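
The defaults-with-override rule described above could be sketched roughly as follows. This is a minimal TypeScript sketch under stated assumptions, not the page's actual implementation: `resolveSettings` and the two interface names are hypothetical, while the field names and the `user_settings` / `default_settings` shape follow the settings response documented in this file (where `user_settings` may be null and `default_settings` is always present).

```typescript
// Voice settings fields as listed in this document's settings response.
interface VoiceSettings {
  sttLanguage: string;
  ttsLanguage: string;
  ttsVoice: string;
  translationEnabled: boolean;
  targetLanguage: string;
}

// Payload of GET /api/voice-google/settings; user_settings may be null.
interface SettingsResponse {
  success: boolean;
  data: {
    user_settings: VoiceSettings | null;
    default_settings: VoiceSettings;
  };
}

// Hypothetical helper: pick the baseline (user settings if present,
// otherwise backend defaults), then apply per-operation overrides on top.
function resolveSettings(
  response: SettingsResponse,
  overrides: Partial<VoiceSettings> = {},
): VoiceSettings {
  const base = response.data.user_settings ?? response.data.default_settings;
  return { ...base, ...overrides };
}

// Example: a user without saved settings overrides only the TTS language.
const exampleResponse: SettingsResponse = {
  success: true,
  data: {
    user_settings: null,
    default_settings: {
      sttLanguage: "de-DE",
      ttsLanguage: "de-DE",
      ttsVoice: "de-DE-Wavenet-A",
      translationEnabled: true,
      targetLanguage: "en-US",
    },
  },
};

const forThisOperation = resolveSettings(exampleResponse, { ttsLanguage: "en-US" });
console.log(forThisOperation.ttsLanguage, forThisOperation.sttLanguage);
```

Keeping the override merge in one place means each section can pass only the fields the user changed for that operation, leaving the stored settings untouched.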

**Key Principles:**
- **Settings-Driven**: User preferences loaded and applied automatically
- **Multi-Modal**: Supports both recording and real-time streaming
- **Language-Aware**: All operations support multiple languages
- **User-Scoped**: Settings are user-specific and stored per user

---

## Page Structure and Layout

### Voice Service Page (`/voice` or `/voice-service`)

The voice service page uses a tabbed or sectioned interface to organize different voice features. Each section is self-contained but shares common settings and reusable UI components.

### Reusable UI Components

The page is composed of reusable components that can be combined in different ways to create each feature section:

#### Language Selector Component

A dropdown selector for choosing languages, used across multiple sections:

- **Properties:**
  - Default value from user settings (configurable: `sttLanguage`, `ttsLanguage`, or custom)
  - Supports language codes with region (e.g., "de-DE", "en-US") or without region (e.g., "de", "en")
  - Options populated from `/api/voice-google/languages` (cached)
  - Shows language code and display name
  - Optional: Search/filter functionality
- **Usage:**
  - Voice Recording Section (Speech-to-Text mode): Single language selector (default: `sttLanguage`)
  - Text Processing Section (Translation mode): Two language selectors (source and target, defaults vary)
  - Voice Recording Section (Voice Interpretation mode): Two language selectors (source and target with region codes)
  - Text Processing Section (Text-to-Speech mode): Single language selector (default: `ttsLanguage`)
  - Settings Section: Multiple language selectors for different purposes

#### Voice Selector Component

A dropdown selector for choosing TTS voices, filtered by selected language:

- **Properties:**
  - Default value from user settings (`ttsVoice`)
  - Options populated from `/api/voice-google/voices?language_code={selectedLanguage}`
  - Automatically updates when language changes
  - Shows voice name, language, and gender/type information
  - Optional: Search/filter functionality
- **Usage:**
  - Text Processing Section (Text-to-Speech mode): Single voice selector
  - Settings Section: Single voice selector (filtered by TTS language)

#### Recording Controls Component

Controls for starting and stopping audio recording:

- **Properties:**
  - "Start Recording" button (shown when not recording)
  - "Stop Recording" button (shown when recording)
  - Recording indicator (red dot, timer showing duration)
  - Optional: Visual waveform or audio level indicator
  - Handles microphone permission requests
- **Usage:**
  - Voice Recording Section (Speech-to-Text mode): Recording controls + language selector
  - Voice Recording Section (Voice Interpretation mode): Recording controls + source/target language selectors
  - Voice Recording Section (WebSocket mode): Recording controls + WebSocket connection controls

#### Text Input Component

Multi-line textarea for text input:

- **Properties:**
  - Multi-line textarea
  - Optional: Character count
  - Optional: Placeholder text
  - Validation: Required or optional based on context
- **Usage:**
  - Text Processing Section (Translation mode): Text input + language selectors
  - Text Processing Section (Text-to-Speech mode): Text input + language/voice selectors
  - Text Processing Section (WebSocket mode): Text input + WebSocket controls

#### Result Display Component

Generic component for displaying operation results:

- **Properties:**
  - Text display area (read-only or editable)
  - Optional: Multiple text areas (e.g., original and translated)
  - Metadata display (confidence, language, audio info, etc.)
  - Action buttons: Copy, Clear, Download (context-dependent)
  - Optional: Additional actions (Swap, Regenerate, etc.)
- **Variants:**
  - **Text Result:** Displays transcribed or translated text with metadata
  - **Dual Text Result:** Displays original and translated text side-by-side
  - **Audio Result:** Displays audio player with download option and metadata
- **Usage:**
  - Voice Recording Section (Speech-to-Text mode): Text result with confidence, language, audio metadata
  - Text Processing Section (Translation mode): Dual text result with language information
  - Voice Recording Section (Voice Interpretation mode): Dual text result with confidence, languages, audio metadata
  - Text Processing Section (Text-to-Speech mode): Audio result with voice name, language, download option

#### Audio Player Component

HTML5 audio player for playback:

- **Properties:**
  - HTML5 audio element with controls (play, pause, volume, progress)
  - Download button
  - Metadata display (voice name, language, format)
  - Optional: Regenerate button
- **Usage:**
  - Text Processing Section (Text-to-Speech mode): Audio playback of generated speech
  - Text Processing Section (WebSocket mode): Sequential playback of audio chunks

#### WebSocket Connection Controls Component

Controls for managing WebSocket connections:

- **Properties:**
  - "Connect" button (establishes WebSocket connection)
  - "Disconnect" button (closes WebSocket connection)
  - Connection status indicator (connected, disconnected, connecting, error)
  - Connection ID display (optional)
- **Usage:**
  - Voice Recording Section (WebSocket mode): Connection controls + recording controls + live transcription
  - Text Processing Section (WebSocket mode): Connection controls + text input + audio playback

#### Live Transcription Display Component

Real-time transcription display for WebSocket mode:

- **Properties:**
  - Updates in real-time as results arrive
  - Distinguishes between interim results (grayed out, updating) and final results (confirmed, stable)
  - Shows confidence scores (optional)
  - Scrolls to latest content automatically
- **Usage:**
  - Voice Recording Section (WebSocket mode): Live transcription display

#### Status Indicators Component

Visual indicators for operation status:

- **Properties:**
  - Loading state (spinner, progress bar)
  - Error messages (red text, error icon)
  - Success messages (green text, success icon)
  - Microphone permission status
  - Recording status (idle, recording, processing)
  - Connection status (for WebSocket mode)
- **Usage:**
  - All sections: Loading, error, and success states
  - Recording sections: Microphone permission and recording status
  - WebSocket sections: Connection status

#### Settings Form Component

Form for managing user voice settings:

- **Properties:**
  - Multiple language selectors (STT, TTS, target)
  - Voice selector (filtered by TTS language)
  - Translation enabled toggle/checkbox
  - Conditional fields (target language shown when translation enabled)
  - Action buttons: Save, Reset to Defaults, Browse Languages, Browse Voices
- **Usage:**
  - Settings Section: Complete settings form

#### Language/Voice Browser Component

Modal or sidebar for browsing available languages and voices:

- **Properties:**
  - List display with search/filter functionality
  - Language browser: Shows language code and display name
  - Voice browser: Shows voice name, language, gender/type, with optional language filter
  - Select button to use selected item
  - Can be opened from language/voice selectors or settings
- **Usage:**
  - Language Browser: Opened from language selectors or settings
  - Voice Browser: Opened from voice selector or settings

### Page Sections

The page is organized into three main sections, each supporting multiple operation modes:

#### Text Processing Section

Supports operations that use text input: **Translation** and **Text-to-Speech**.
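
As a rough illustration of the section and mode structure laid out in this part of the document, the state of the page could be modelled as a small discriminated union. This is only a sketch: every type and function name here (`SectionState`, `supportsWebSocket`, `normalizeTransport`) is hypothetical, and the rule that only Text-to-Speech and Speech-to-Text offer the real-time toggle is taken from the mode descriptions in this document.

```typescript
// Hypothetical names for the modes of the two multi-mode sections.
type TextProcessingMode = "translation" | "text-to-speech";
type VoiceRecordingMode = "speech-to-text" | "voice-interpretation";
type Mode = TextProcessingMode | VoiceRecordingMode;
type Transport = "http" | "websocket";

// One variant per page section; Settings has no modes.
type SectionState =
  | { section: "text-processing"; mode: TextProcessingMode; transport: Transport }
  | { section: "voice-recording"; mode: VoiceRecordingMode; transport: Transport }
  | { section: "settings" };

// Per this document, the optional real-time toggle exists only for
// Text-to-Speech and Speech-to-Text.
function supportsWebSocket(mode: Mode): boolean {
  return mode === "text-to-speech" || mode === "speech-to-text";
}

// Fall back to HTTP when the selected mode has no WebSocket variant.
function normalizeTransport(mode: Mode, requested: Transport): Transport {
  return requested === "websocket" && !supportsWebSocket(mode) ? "http" : requested;
}

// Example: switching to Translation while the real-time toggle is on.
const afterModeSwitch: SectionState = {
  section: "text-processing",
  mode: "translation",
  transport: normalizeTransport("translation", "websocket"),
};
console.log(afterModeSwitch);
```

Modelling the transport next to the mode makes it harder to end up in a state such as "Translation over WebSocket", which the page does not describe.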

**Mode Toggle:**
- User can switch between "Translation" and "Text-to-Speech" modes
- Mode selection determines which components and options are displayed

**Translation Mode:**
- Text Input + Language Selectors (source/target) + Dual Text Result Display + Status Indicators
- Action button: "Translate"

**Text-to-Speech Mode:**
- Text Input + Language Selector + Voice Selector + Audio Result Display + Status Indicators
- Action button: "Generate Speech"

**WebSocket Real-time Mode (optional toggle):**
- For Text-to-Speech: WebSocket Connection Controls + Text Input + Audio Player + Status Indicators
- User can toggle between HTTP and WebSocket modes

#### Voice Recording Section

Supports operations that use audio recording: **Speech-to-Text** and **Voice Interpretation**.

**Mode Toggle:**
- User can switch between "Speech-to-Text" and "Voice Interpretation" modes
- Mode selection determines which components and options are displayed

**Speech-to-Text Mode:**
- Language Selector + Recording Controls + Text Result Display + Status Indicators

**Voice Interpretation Mode:**
- Language Selectors (source/target) + Recording Controls + Dual Text Result Display + Status Indicators

**WebSocket Real-time Mode (optional toggle):**
- For Speech-to-Text: WebSocket Connection Controls + Recording Controls + Live Transcription Display + Status Indicators
- User can toggle between HTTP and WebSocket modes

#### Settings Section

Standalone section for managing user voice settings:

- Settings Form + Status Indicators
- Includes: Language selectors, voice selector, translation toggle, action buttons

---

## User Interactions and Functionality

### Loading User Settings

**On Page Load:**
- Frontend calls `GET /api/voice-google/settings`
- If user settings exist, use them as defaults
- If no user settings, use default settings from response
- Pre-populate all language and voice selectors with settings
- Store settings in component state for quick access

**Settings Structure:**
```json
{
  "success": true,
  "data": {
    "user_settings": {
      "sttLanguage": "de-DE",
      "ttsLanguage": "de-DE",
      "ttsVoice": "de-DE-Wavenet-A",
      "translationEnabled": true,
      "targetLanguage": "en-US"
    },
    "default_settings": {
      "sttLanguage": "de-DE",
      "ttsLanguage": "de-DE",
      "ttsVoice": "de-DE-Wavenet-A",
      "translationEnabled": true,
      "targetLanguage": "en-US"
    }
  }
}
```

### Text Processing Section

The Text Processing Section supports two modes: **Translation** and **Text-to-Speech**. Users can switch between modes using a mode toggle. The section also supports an optional WebSocket real-time mode toggle.

#### Translation Mode (HTTP)

**Workflow:**
1. User switches to Translation mode (if not already selected)
2. User enters text in input field
3. User selects source language (default from settings or "de")
4. User selects target language (default from `targetLanguage` or "en")
5. User clicks "Translate" button
6. Frontend validates text is not empty
7. If validation fails:
   - Show error: "Please enter text to translate"
8. If validation passes:
   - Show loading state
   - Create FormData with text, sourceLanguage, targetLanguage
   - Submit to `POST /api/voice-google/translate`
   - Handle response:
     - Success: Display original and translated text side-by-side
     - Error: Show error message

**Additional Features:**
- Swap languages button: Swaps source and target languages
- Copy buttons: Copy original or translated text to clipboard
- Clear button: Clear input and results

**Form Validation:**
- Text input must not be empty
- Source and target languages must be different (optional validation)

**Error Handling:**
- 401 Unauthorized → Show authentication error
- 400 Bad Request → Show error details (empty text, invalid languages, etc.)
- 500 Internal Server Error → Show generic error message

#### Text-to-Speech Mode (HTTP)

**Workflow:**
1. User switches to Text-to-Speech mode (if not already selected)
2. User enters text in input field
3. User selects language (default from `ttsLanguage` setting)
4. User selects voice (default from `ttsVoice` setting)
   - Voice dropdown updates when language changes
   - Frontend calls `GET /api/voice-google/voices?language_code={selectedLanguage}`
   - Populate voice dropdown with available voices
5. User clicks "Generate Speech" button
6. Frontend validates text is not empty
7. If validation fails:
   - Show error: "Please enter text to convert to speech"
8. If validation passes:
   - Show loading state
   - Create FormData with text, language, voice
   - Submit to `POST /api/voice-google/text-to-speech`
   - Handle response:
     - Success: Create audio blob from response, display audio player, show download button
     - Error: Show error message

**Audio Playback:**
- Create HTML5 audio element from response blob
- Display audio controls (play, pause, volume, progress)
- Extract voice name and language from response headers (`X-Voice-Name`, `X-Language-Code`)
- Display voice information

**Form Validation:**
- Text input must not be empty
- Language must be selected
- Voice must be selected

**Error Handling:**
- 401 Unauthorized → Show authentication error
- 400 Bad Request → Show error details
- 500 Internal Server Error → Show generic error message

#### Text-to-Speech Mode (WebSocket Real-time)

**Connection Workflow:**
1. User switches to Text-to-Speech mode
2. User toggles to WebSocket real-time mode
3. User selects language and voice (defaults from settings)
4. User clicks "Connect" button
5. Frontend establishes WebSocket connection to `/api/voice-google/ws/text-to-speech?userId={userId}&language={language}&voice={voice}`
6. Backend sends connection confirmation
7. Frontend shows "Connected" status
8. User enters text and clicks "Speak" button
9. Frontend sends text via WebSocket: `{type: "text_to_speak", text: "..."}`
10. Backend processes text and sends audio: `{type: "audio_data", audio: base64_audio, format: "mp3"}`
11. Frontend decodes base64 audio, creates audio blob, plays audio
12. User can send multiple texts while connected
13. User clicks "Disconnect" to close WebSocket

**Audio Playback:**
- Queue audio chunks for sequential playback
- Or play chunks immediately as received
- Show playback progress

**Error Handling:**
- WebSocket connection errors → Show error message, allow retry
- Processing errors → Show error message from backend: `{type: "error", error: "..."}`

### Voice Recording Section

The Voice Recording Section supports two modes: **Speech-to-Text** and **Voice Interpretation**. Users can switch between modes using a mode toggle. The section also supports an optional WebSocket real-time mode toggle.

#### Speech-to-Text Mode (HTTP)

**Recording Workflow:**
1. User switches to Speech-to-Text mode (if not already selected)
2. User selects language (default from `sttLanguage` setting)
3. User clicks "Start Recording"
4. Frontend requests microphone access via `navigator.mediaDevices.getUserMedia()`
5. If permission denied:
   - Show error message: "Microphone access denied. Please enable microphone permissions in your browser settings."
   - Disable recording button
6. If permission granted:
   - Start MediaRecorder API
   - Show recording indicator (red dot, start timer)
   - Capture audio stream
7. User speaks into microphone
8. User clicks "Stop Recording"
9. Frontend stops MediaRecorder
10. Convert audio stream to audio file (Blob)
11. Validate audio file (size, format)
12. If validation fails:
    - Show error: "Invalid audio format or file too large"
13. If validation passes:
    - Show loading state
    - Create FormData with audio file and language
    - Submit to `POST /api/voice-google/speech-to-text`
    - Handle response:
      - Success: Display transcribed text, confidence, language, metadata
      - Error: Show error message

**Form Validation:**
- Language must be selected
- Audio file must be valid format (WebM, WAV, MP3, etc.)
- Audio file size must be within limits

**Error Handling:**
- 401 Unauthorized → Show authentication error, redirect to login
- 400 Bad Request → Show error details from response
- 500 Internal Server Error → Show generic error message

#### Voice Interpretation Mode (HTTP)

**Recording and Interpretation Workflow:**
1. User switches to Voice Interpretation mode (if not already selected)
2. User selects source language (default from `sttLanguage` setting)
3. User selects target language (default from `targetLanguage` setting)
4. User clicks "Start Recording"
5. Frontend requests microphone access
6. If permission granted:
   - Start recording
   - Show recording indicator
7. User speaks into microphone
8. User clicks "Stop Recording"
9. Frontend stops recording and converts to audio file
10. Validate audio file
11. If validation passes:
    - Show loading state
    - Create FormData with audioFile, fromLanguage, toLanguage
    - Submit to `POST /api/voice-google/realtime-interpreter`
    - Handle response:
      - Success: Display original text and translated text
      - Error: Show error message

**Form Validation:**
- Source and target languages must be selected
- Audio file must be valid

**Error Handling:**
- Same as Speech-to-Text errors

#### Speech-to-Text Mode (WebSocket Real-time)

**Connection Workflow:**
1. User switches to Speech-to-Text mode
2. User toggles to WebSocket real-time mode
3. User selects language (default from settings)
4. User clicks "Connect" button or switches to WebSocket mode
5. Frontend establishes WebSocket connection to `/api/voice-google/ws/speech-to-text?userId={userId}&language={language}`
6. Backend sends connection confirmation: `{type: "connected", connection_id, message}`
7. Frontend shows "Connected" status
8. User clicks "Start Recording"
9. Frontend requests microphone access
10. If permission granted:
    - Start MediaRecorder
    - Start capturing audio chunks
    - Encode chunks to base64
    - Send chunks via WebSocket: `{type: "audio_chunk", data: base64_audio, timestamp}`
11. Backend processes chunks and sends results:
    - `{type: "transcription_result", text, confidence, is_final}`
12. Frontend displays results:
    - Interim results (is_final: false) → Grayed out, updating
    - Final results (is_final: true) → Confirmed, stable
13. User clicks "Stop Recording" or "Disconnect"
14. Frontend stops recording and closes WebSocket

**Keep-Alive:**
- Frontend sends ping messages periodically: `{type: "ping", timestamp}`
- Backend responds with: `{type: "pong", timestamp}`

**Error Handling:**
- WebSocket connection errors → Show error message, allow retry
- Processing errors → Show error message from backend: `{type: "error", error: "..."}`

### Settings Section

**Viewing Settings:**
1. User navigates to Settings section
2. Frontend calls `GET /api/voice-google/settings`
3. Display current settings in form
4. If no user settings, show default settings

**Updating Settings:**
1. User modifies settings in form:
   - `sttLanguage` - Required
   - `ttsLanguage` - Required
   - `ttsVoice` - Required
   - `translationEnabled` - Optional (default: true)
   - `targetLanguage` - Optional (default: "en-US")
2. User clicks "Save Settings" button
3. Frontend validates required fields
4. If validation fails:
   - Show error: "Please fill in all required fields"
5. If validation passes:
   - Show loading state
   - Build settings object
   - Submit to `POST /api/voice-google/settings` with settings object
   - Handle response:
     - Success: Show success message, update UI, update defaults for other sections
     - Error: Show error message

**Voice Selection:**
- When user changes `ttsLanguage`, frontend should:
  1. Call `GET /api/voice-google/voices?language_code={newLanguage}`
  2. Update voice dropdown with filtered voices
  3. If current voice is not available in new language, select first available voice or clear selection

**Form Validation:**
- `sttLanguage` is required
- `ttsLanguage` is required
- `ttsVoice` is required
- `targetLanguage` is required if `translationEnabled` is true

**Error Handling:**
- 401 Unauthorized → Show authentication error
- 400 Bad Request → Show error details (missing required fields)
- 500 Internal Server Error → Show generic error message

### Discovering Available Languages and Voices

This functionality is available from multiple locations (Settings section, language selectors, voice selectors) and opens a modal/sidebar browser.

**Browsing Languages:**
1. User clicks "Load Available Languages" button (in Settings or language selector)
2. Frontend calls `GET /api/voice-google/languages`
3. Display languages in modal/sidebar
4. Show language code and display name
5. User can search/filter languages
6. User clicks "Select" to use language
7. Update language selector with selected language

**Browsing Voices:**
1. User clicks "Load Available Voices" button (in Settings or voice selector)
2. User optionally selects language filter
3. Frontend calls `GET /api/voice-google/voices?language_code={selectedLanguage}` (if filter selected)
   - Or `GET /api/voice-google/voices` (if no filter)
4. Display voices in modal/sidebar
5. Show voice name, language, gender/type
6. Group voices by language if no filter
7. User can search/filter voices
8. User clicks "Select" to use voice
9. Update voice selector with selected voice

---

## Backend Routes and API Integration

### Complete Route Reference

All backend routes used by the voice service page:

| Route | Method | Purpose | When Used | Access Control |
|-------|--------|---------|-----------|----------------|
| `/api/voice-google/speech-to-text` | POST | Convert speech to text | User stops recording | Current user only |
| `/api/voice-google/translate` | POST | Translate text | User clicks "Translate" | Current user only |
| `/api/voice-google/realtime-interpreter` | POST | Speech to translated text | User stops recording in interpreter mode | Current user only |
| `/api/voice-google/text-to-speech` | POST | Convert text to speech | User clicks "Generate Speech" | Current user only |
| `/api/voice-google/languages` | GET | Get available languages | User browses languages | Current user only |
| `/api/voice-google/voices` | GET | Get available voices | User browses voices or changes language | Current user only |
| `/api/voice-google/settings` | GET | Get voice settings | Page load, settings view | Current user only |
| `/api/voice-google/settings` | POST | Save voice settings | User saves settings | Current user only |
| `/api/voice-google/health` | GET | Health check | Optional: on page load | Current user only |
| `/api/voice-google/ws/speech-to-text` | WebSocket | Real-time speech-to-text | User connects WebSocket | Current user only |
| `/api/voice-google/ws/text-to-speech` | WebSocket | Real-time text-to-speech | User connects WebSocket | Current user only |
| `/api/voice-google/ws/realtime-interpreter` | WebSocket | Real-time interpretation | User connects WebSocket (future) | Current user only |

### API Request Patterns

**Speech-to-Text Request:**
```
POST /api/voice-google/speech-to-text
Content-Type: multipart/form-data

Body: {
  audioFile: (audio file),
  language: "de-DE"
}
```
- `audioFile` is required (audio file from recording)
- `language` is required (language code like "de-DE", "en-US")
- Handle 400 (invalid format), 401 (unauthorized), 500 errors

**Translation Request:**
```
POST /api/voice-google/translate
Content-Type: multipart/form-data

Body: {
  text: "Text to translate",
  sourceLanguage: "de",
  targetLanguage: "en"
}
```
- `text` is required (non-empty string)
- `sourceLanguage` is required (language code like "de", "en")
- `targetLanguage` is required (language code like "de", "en")
- Handle 400 (empty text, invalid languages), 401, 500 errors

**Real-time Interpreter Request:**
```
POST /api/voice-google/realtime-interpreter
Content-Type: multipart/form-data

Body: {
  audioFile: (audio file),
  fromLanguage: "de-DE",
  toLanguage: "en-US",
  connectionId: "optional-connection-id"
}
```
- `audioFile` is required
- `fromLanguage` is required (language code with region)
- `toLanguage` is required (language code with region)
- `connectionId` is optional
- Handle same errors as speech-to-text

**Text-to-Speech Request:**
```
POST /api/voice-google/text-to-speech
Content-Type: multipart/form-data

Body: {
  text: "Text to speak",
  language: "de-DE",
  voice: "de-DE-Wavenet-A"
}
```
- `text` is required (non-empty string)
- `language` is required (language code with region)
- `voice` is optional (voice name, defaults to system default)
- Response is audio file (audio/mpeg) with headers:
  - `X-Voice-Name`: Voice name used
  - `X-Language-Code`: Language code used
- Handle 400 (empty text), 401, 500 errors

**Get Languages Request:**
```
GET /api/voice-google/languages
```
- Returns: `{success: true, languages: [...]}`
- Languages array contains language objects with code and name
- Handle 400, 401, 500 errors

**Get Voices Request:**
```
GET /api/voice-google/voices?language_code=de-DE
```
- `language_code` query parameter is optional
- If provided, filters voices by language
- Returns: `{success: true, voices: [...], language_filter: "de-DE"}`
- Voices array contains voice objects with name, language, gender, etc.
- Handle 400, 401, 500 errors

**Get Settings Request:**
```
GET /api/voice-google/settings
```
- Returns: `{success: true, data: {user_settings, default_settings}}`
- `user_settings` may be null if no settings exist
- `default_settings` always present
- Handle 401, 500 errors

**Save Settings Request:**
```
POST /api/voice-google/settings
Content-Type: application/json

Body: {
  "sttLanguage": "de-DE",
  "ttsLanguage": "de-DE",
  "ttsVoice": "de-DE-Wavenet-A",
  "translationEnabled": true,
  "targetLanguage": "en-US"
}
```
- Required fields: `sttLanguage`, `ttsLanguage`, `ttsVoice`
- Optional fields: `translationEnabled` (default: true), `targetLanguage` (default: "en-US")
- Returns: `{success: true, message: "...", data: settings}`
- Handle 400 (missing required fields), 401, 500 errors

**WebSocket Connection:**
```
WebSocket: /api/voice-google/ws/speech-to-text?userId={userId}&language={language}
WebSocket: /api/voice-google/ws/text-to-speech?userId={userId}&language={language}&voice={voice}
```
- Query parameters: `userId`, `language`, `voice` (for TTS)
- Backend sends connection confirmation on connect
- Client sends messages: `{type: "audio_chunk", data: base64, timestamp}` or `{type: "text_to_speak", text: "..."}`
- Backend sends messages: `{type: "transcription_result", text, confidence, is_final}` or `{type: "audio_data", audio: base64, format: "mp3"}`
- Handle connection errors, processing errors

### Response Handling

**Speech-to-Text Response:**
```json
{
  "success": true,
  "text": "Transcribed text here",
  "confidence": 0.95,
  "language": "de-DE",
  "audio_info": {
    "size": 12345,
    "format": "webm",
    "estimated_duration": 3.5
  }
}
```

**Translation Response:**
```json
{
  "success": true,
  "original_text": "Original text",
  "translated_text": "Translated text",
  "source_language": "de",
  "target_language": "en"
}
```

**Real-time Interpreter Response:**
```json
{
  "success": true,
  "original_text": "Original transcribed text",
  "translated_text": "Translated text",
  "confidence": 0.95,
  "source_language": "de-DE",
  "target_language": "en-US",
  "audio_info": {
    "size": 12345,
    "format": "webm",
    "estimated_duration": 3.5
  }
}
```

**Text-to-Speech Response:**
- Binary audio file (audio/mpeg)
- Headers:
  - `Content-Type: audio/mpeg`
  - `Content-Disposition: attachment; filename=speech.mp3`
  - `X-Voice-Name: de-DE-Wavenet-A`
  - `X-Language-Code: de-DE`

**Languages Response:**
```json
{
  "success": true,
  "languages": [
    {
      "code": "de-DE",
      "name": "German (Germany)"
    },
    ...
  ]
}
```

**Voices Response:**
```json
{
  "success": true,
  "voices": [
    {
      "name": "de-DE-Wavenet-A",
      "language": "de-DE",
      "gender": "FEMALE",
      "ssml_gender": "FEMALE"
    },
    ...
  ],
  "language_filter": "de-DE"
}
```

**Settings Response:**
```json
{
  "success": true,
  "data": {
    "user_settings": {
      "sttLanguage": "de-DE",
      "ttsLanguage": "de-DE",
      "ttsVoice": "de-DE-Wavenet-A",
      "translationEnabled": true,
      "targetLanguage": "en-US"
    },
    "default_settings": {
      "sttLanguage": "de-DE",
      "ttsLanguage": "de-DE",
      "ttsVoice": "de-DE-Wavenet-A",
      "translationEnabled": true,
      "targetLanguage": "en-US"
    }
  }
}
```

**Error Responses:**
- 400 Bad Request → Display validation errors from response `detail` field
- 401 Unauthorized → Show authentication error, redirect to login
- 500 Internal Server Error → Show generic error message

---

## Field and Attribute Reference

### Voice Settings Fields

The following fields are used for voice settings.
These are not provided by a backend attributes endpoint but are defined by the API contract:

**Required Fields:**
- `sttLanguage` - Speech-to-Text language (text/select, editable, required, visible)
  - Format: Language code with region (e.g., "de-DE", "en-US")
  - Default: "de-DE"
- `ttsLanguage` - Text-to-Speech language (text/select, editable, required, visible)
  - Format: Language code with region (e.g., "de-DE", "en-US")
  - Default: "de-DE"
- `ttsVoice` - Text-to-Speech voice name (text/select, editable, required, visible)
  - Format: Voice identifier (e.g., "de-DE-Wavenet-A")
  - Default: "de-DE-Wavenet-A"
  - Options populated from `/api/voice-google/voices?language_code={ttsLanguage}`

**Optional Fields:**
- `translationEnabled` - Enable translation features (checkbox, editable, not required, visible)
  - Type: boolean
  - Default: true
- `targetLanguage` - Target language for translation (text/select, editable, not required, visible)
  - Format: Language code with region (e.g., "en-US", "fr-FR")
  - Default: "en-US"
  - Shown when `translationEnabled` is true

### Language and Voice Data Structures

**Language Object:**
- `code` - Language code (e.g., "de-DE", "en-US")
- `name` - Display name (e.g., "German (Germany)", "English (United States)")

**Voice Object:**
- `name` - Voice identifier (e.g., "de-DE-Wavenet-A")
- `language` - Language code (e.g., "de-DE")
- `gender` - Voice gender (e.g., "FEMALE", "MALE", "NEUTRAL")
- `ssml_gender` - SSML gender identifier
- Additional metadata may be available

---

## Dynamic Rendering Guidelines

### Settings-Driven Defaults

All voice operations should use user settings as defaults:

1. **On Page Load:**
   - Call `GET /api/voice-google/settings`
   - Store settings in component state
   - Pre-populate all selectors with settings values
   - Use `default_settings` if `user_settings` is null
2. **Language Selectors:**
   - Speech-to-Text: Default to `sttLanguage` from settings
   - Text-to-Speech: Default to `ttsLanguage` from settings
   - Translation Source: Default to language derived from `sttLanguage` (remove region code)
   - Translation Target: Default to `targetLanguage` from settings
3. **Voice Selector:**
   - Default to `ttsVoice` from settings
   - When language changes, fetch voices for new language
   - If default voice not available in new language, select first available
4. **After Settings Update:**
   - Refresh settings from backend
   - Update all selectors with new defaults
   - Show success message

### Audio Recording Implementation

**MediaRecorder API:**
1. Request microphone access: `navigator.mediaDevices.getUserMedia({ audio: true })`
2. Create MediaRecorder with appropriate MIME type:
   - Prefer WebM: `new MediaRecorder(stream, { mimeType: 'audio/webm' })`
   - Fallback to browser default
3. Start recording: `mediaRecorder.start()`
4. Capture data chunks: `mediaRecorder.ondataavailable`
5. Stop recording: `mediaRecorder.stop()`
6. Convert chunks to Blob: `new Blob(chunks, { type: 'audio/webm' })`
7. Create File object for upload: `new File([blob], 'recording.webm', { type: 'audio/webm' })`

**Recording Indicators:**
- Show visual indicator (red dot, pulsing animation)
- Display timer (MM:SS format)
- Show audio level/waveform (optional, using AudioContext API)

**Error Handling:**
- Microphone permission denied → Show clear error message with instructions
- No microphone available → Disable recording, show message
- Recording errors → Show error, allow retry

### Language and Voice Selection

**Language Dropdowns:**
1. Optionally fetch languages from `/api/voice-google/languages` on page load
2. Cache languages in component state
3. Display languages with code and name
4. Allow search/filter
5. Use settings default as initial selection

**Voice Dropdown:**
1. When TTS language changes:
   - Call `GET /api/voice-google/voices?language_code={selectedLanguage}`
   - Update dropdown with filtered voices
   - Select default voice if available, otherwise first voice
2. Display voice name and gender/type
3. Allow search/filter
4. Show loading state while fetching voices

### WebSocket Implementation

**Connection Management:**
1. Create WebSocket connection with query parameters
2. Handle connection events:
   - `onopen` → Show connected status
   - `onmessage` → Process incoming messages
   - `onerror` → Show error, allow retry
   - `onclose` → Show disconnected status, cleanup
3. Send keep-alive pings periodically (every 30 seconds)
4. Handle reconnection on disconnect

**Message Handling:**
1. Parse incoming JSON messages
2. Handle message types:
   - `connected` → Update connection status
   - `transcription_result` → Update transcription display (interim vs final)
   - `audio_data` → Decode base64, create blob, play audio
   - `error` → Show error message
   - `pong` → Update last ping time
3. Send messages:
   - `audio_chunk` → Encode audio to base64, include timestamp
   - `text_to_speak` → Include text
   - `ping` → Include timestamp

**Audio Chunking (Speech-to-Text):**
1. Capture audio chunks from MediaRecorder
2. Encode chunks to base64
3. Send chunks via WebSocket as they're captured
4. Buffer chunks if needed for reliable transmission

**Audio Playback (Text-to-Speech):**
1. Receive base64 audio chunks
2. Decode base64 to binary
3. Create audio blob
4. Queue for sequential playback or play immediately
5. Use HTML5 Audio API for playback

### Form Validation

**Voice Recording Section:**
- **Speech-to-Text Mode:** Language must be selected, audio file must exist and be valid format, audio file size must be within limits
- **Voice Interpretation Mode:** Source and target languages must be selected, audio file must be valid

**Text Processing Section:**
- **Translation Mode:** Text must not be empty, source and target languages must be selected, source and target languages should be different (optional validation)
- **Text-to-Speech Mode:** Text must not be empty, language must be selected, voice must be selected

**Settings Section:**
- `sttLanguage` is required
- `ttsLanguage` is required
- `ttsVoice` is required
- `targetLanguage` is required if `translationEnabled` is true

### Error Display

**Error Messages:**
- Display errors prominently (red text, error icon)
- Show specific error details from backend when available
- Provide actionable guidance (e.g., "Please enable microphone permissions")
- Allow retry for transient errors

**Loading States:**
- Show loading spinner during API calls
- Disable form inputs during processing
- Show progress indicators for long operations

### Audio Playback

**HTML5 Audio Player:**
1. Create audio element: `