Voice Service Page Requirements
This document contains the complete frontend requirements for the voice service page, enabling users to interact with Google Cloud voice services including Speech-to-Text, Translation, Text-to-Speech, and real-time voice interpretation. All UI components should use user settings as defaults but allow overrides.
Table of Contents
- Overview
- Page Structure and Layout
- User Interactions and Functionality
- Backend Routes and API Integration
- Field and Attribute Reference
- Dynamic Rendering Guidelines
Overview
The voice service page enables users to interact with Google Cloud voice services through a unified interface. The frontend consists of a single page (/voice or /voice-service) with different sections/tabs for each feature:
- Speech-to-Text Section - Record speech and convert to text
- Translation Section - Translate text between languages
- Voice Interpretation Section - Record speech and get translated text
- Text-to-Speech Section - Convert text to spoken audio
- Settings Section - Configure default voice settings
- Real-time WebSocket Features - Live streaming for speech-to-text and text-to-speech
All features use user settings as defaults (loaded from /api/voice-google/settings) but allow users to override settings per operation. The page supports both HTTP REST endpoints for file-based processing and WebSocket endpoints for real-time streaming.
Key Principles:
- Settings-Driven: User preferences loaded and applied automatically
- Multi-Modal: Supports both recording and real-time streaming
- Language-Aware: All operations support multiple languages
- User-Scoped: Settings are user-specific and stored per user
Page Structure and Layout
Voice Service Page (/voice or /voice-service)
The voice service page uses a tabbed or sectioned interface to organize different voice features. Each section is self-contained but shares common settings and reusable UI components.
Reusable UI Components
The page is composed of reusable components that can be combined in different ways to create each feature section:
Language Selector Component
A dropdown selector for choosing languages, used across multiple sections:
- Properties:
  - Default value from user settings (configurable: sttLanguage, ttsLanguage, or custom)
  - Supports language codes with region (e.g., "de-DE", "en-US") or without region (e.g., "de", "en")
  - Options populated from /api/voice-google/languages (cached)
  - Shows language code and display name
  - Optional: Search/filter functionality
- Usage:
  - Voice Recording Section (Speech-to-Text mode): Single language selector (default: sttLanguage)
  - Text Processing Section (Translation mode): Two language selectors (source and target, defaults vary)
  - Voice Recording Section (Voice Interpretation mode): Two language selectors (source and target with region codes)
  - Text Processing Section (Text-to-Speech mode): Single language selector (default: ttsLanguage)
  - Settings Section: Multiple language selectors for different purposes
Voice Selector Component
A dropdown selector for choosing TTS voices, filtered by selected language:
- Properties:
  - Default value from user settings (ttsVoice)
  - Options populated from /api/voice-google/voices?language_code={selectedLanguage}
  - Automatically updates when language changes
  - Shows voice name, language, and gender/type information
  - Optional: Search/filter functionality
- Usage:
  - Text Processing Section (Text-to-Speech mode): Single voice selector
  - Settings Section: Single voice selector (filtered by TTS language)
Recording Controls Component
Controls for starting and stopping audio recording:
- Properties:
  - "Start Recording" button (shown when not recording)
  - "Stop Recording" button (shown when recording)
  - Recording indicator (red dot, timer showing duration)
  - Optional: Visual waveform or audio level indicator
  - Handles microphone permission requests
- Usage:
  - Voice Recording Section (Speech-to-Text mode): Recording controls + language selector
  - Voice Recording Section (Voice Interpretation mode): Recording controls + source/target language selectors
  - Voice Recording Section (WebSocket mode): Recording controls + WebSocket connection controls
Text Input Component
Multi-line textarea for text input:
- Properties:
  - Multi-line textarea
  - Optional: Character count
  - Optional: Placeholder text
  - Validation: Required or optional based on context
- Usage:
  - Text Processing Section (Translation mode): Text input + language selectors
  - Text Processing Section (Text-to-Speech mode): Text input + language/voice selectors
  - Text Processing Section (WebSocket mode): Text input + WebSocket controls
Result Display Component
Generic component for displaying operation results:
- Properties:
  - Text display area (read-only or editable)
  - Optional: Multiple text areas (e.g., original and translated)
  - Metadata display (confidence, language, audio info, etc.)
  - Action buttons: Copy, Clear, Download (context-dependent)
  - Optional: Additional actions (Swap, Regenerate, etc.)
- Variants:
  - Text Result: Displays transcribed or translated text with metadata
  - Dual Text Result: Displays original and translated text side-by-side
  - Audio Result: Displays audio player with download option and metadata
- Usage:
  - Voice Recording Section (Speech-to-Text mode): Text result with confidence, language, audio metadata
  - Text Processing Section (Translation mode): Dual text result with language information
  - Voice Recording Section (Voice Interpretation mode): Dual text result with confidence, languages, audio metadata
  - Text Processing Section (Text-to-Speech mode): Audio result with voice name, language, download option
Audio Player Component
HTML5 audio player for playback:
- Properties:
  - HTML5 audio element with controls (play, pause, volume, progress)
  - Download button
  - Metadata display (voice name, language, format)
  - Optional: Regenerate button
- Usage:
  - Text Processing Section (Text-to-Speech mode): Audio playback of generated speech
  - Text Processing Section (WebSocket mode): Sequential playback of audio chunks
WebSocket Connection Controls Component
Controls for managing WebSocket connections:
- Properties:
  - "Connect" button (establishes WebSocket connection)
  - "Disconnect" button (closes WebSocket connection)
  - Connection status indicator (connected, disconnected, connecting, error)
  - Connection ID display (optional)
- Usage:
  - Voice Recording Section (WebSocket mode): Connection controls + recording controls + live transcription
  - Text Processing Section (WebSocket mode): Connection controls + text input + audio playback
Live Transcription Display Component
Real-time transcription display for WebSocket mode:
- Properties:
  - Updates in real-time as results arrive
  - Distinguishes between interim results (grayed out, updating) and final results (confirmed, stable)
  - Shows confidence scores (optional)
  - Scrolls to latest content automatically
- Usage:
  - Voice Recording Section (WebSocket mode): Live transcription display
Status Indicators Component
Visual indicators for operation status:
- Properties:
  - Loading state (spinner, progress bar)
  - Error messages (red text, error icon)
  - Success messages (green text, success icon)
  - Microphone permission status
  - Recording status (idle, recording, processing)
  - Connection status (for WebSocket mode)
- Usage:
  - All sections: Loading, error, and success states
  - Recording sections: Microphone permission and recording status
  - WebSocket sections: Connection status
Settings Form Component
Form for managing user voice settings:
- Properties:
  - Multiple language selectors (STT, TTS, target)
  - Voice selector (filtered by TTS language)
  - Translation enabled toggle/checkbox
  - Conditional fields (target language shown when translation enabled)
  - Action buttons: Save, Reset to Defaults, Browse Languages, Browse Voices
- Usage:
  - Settings Section: Complete settings form
Language/Voice Browser Component
Modal or sidebar for browsing available languages and voices:
- Properties:
  - List display with search/filter functionality
  - Language browser: Shows language code and display name
  - Voice browser: Shows voice name, language, gender/type, with optional language filter
  - Select button to use selected item
  - Can be opened from language/voice selectors or settings
- Usage:
  - Language Browser: Opened from language selectors or settings
  - Voice Browser: Opened from voice selector or settings
Page Sections
The page is organized into three main sections, each supporting multiple operation modes:
Text Processing Section
Supports operations that use text input: Translation and Text-to-Speech.
Mode Toggle:
- User can switch between "Translation" and "Text-to-Speech" modes
- Mode selection determines which components and options are displayed
Translation Mode:
- Text Input + Language Selectors (source/target) + Dual Text Result Display + Status Indicators
- Action button: "Translate"
Text-to-Speech Mode:
- Text Input + Language Selector + Voice Selector + Audio Result Display + Status Indicators
- Action button: "Generate Speech"
WebSocket Real-time Mode (optional toggle):
- For Text-to-Speech: WebSocket Connection Controls + Text Input + Audio Player + Status Indicators
- User can toggle between HTTP and WebSocket modes
Voice Recording Section
Supports operations that use audio recording: Speech-to-Text and Voice Interpretation.
Mode Toggle:
- User can switch between "Speech-to-Text" and "Voice Interpretation" modes
- Mode selection determines which components and options are displayed
Speech-to-Text Mode:
- Language Selector + Recording Controls + Text Result Display + Status Indicators
Voice Interpretation Mode:
- Language Selectors (source/target) + Recording Controls + Dual Text Result Display + Status Indicators
WebSocket Real-time Mode (optional toggle):
- For Speech-to-Text: WebSocket Connection Controls + Recording Controls + Live Transcription Display + Status Indicators
- User can toggle between HTTP and WebSocket modes
Settings Section
Standalone section for managing user voice settings:
- Settings Form + Status Indicators
- Includes: Language selectors, voice selector, translation toggle, action buttons
User Interactions and Functionality
Loading User Settings
On Page Load:
1. Frontend calls GET /api/voice-google/settings
2. If user settings exist, use them as defaults
3. If no user settings, use the default settings from the response
4. Pre-populate all language and voice selectors with these settings
5. Store settings in component state for quick access
Settings Structure:
{
"success": true,
"data": {
"user_settings": {
"sttLanguage": "de-DE",
"ttsLanguage": "de-DE",
"ttsVoice": "de-DE-Wavenet-A",
"translationEnabled": true,
"targetLanguage": "en-US"
},
"default_settings": {
"sttLanguage": "de-DE",
"ttsLanguage": "de-DE",
"ttsVoice": "de-DE-Wavenet-A",
"translationEnabled": true,
"targetLanguage": "en-US"
}
}
}
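The fallback rule above (prefer user_settings, otherwise default_settings) can be sketched as a small helper. The type and function names here are illustrative, not part of the API contract; only the field names and the response shape come from this document.

```typescript
// Settings shape from GET /api/voice-google/settings.
interface VoiceSettings {
  sttLanguage: string;
  ttsLanguage: string;
  ttsVoice: string;
  translationEnabled: boolean;
  targetLanguage: string;
}

interface SettingsResponse {
  success: boolean;
  data: {
    user_settings: VoiceSettings | null; // null when the user has no saved settings
    default_settings: VoiceSettings;     // always present
  };
}

// Prefer the user's settings; fall back to defaults when none exist.
function resolveSettings(resp: SettingsResponse): VoiceSettings {
  return resp.data.user_settings ?? resp.data.default_settings;
}
```

The resolved object is then stored in component state and used to pre-populate every selector on the page.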
Text Processing Section
The Text Processing Section supports two modes: Translation and Text-to-Speech. Users can switch between modes using a mode toggle. The section also supports an optional WebSocket real-time mode toggle.
Translation Mode (HTTP)
Workflow:
1. User switches to Translation mode (if not already selected)
2. User enters text in input field
3. User selects source language (default from settings or "de")
4. User selects target language (default from targetLanguage or "en")
5. User clicks "Translate" button
6. Frontend validates text is not empty
7. If validation fails:
   - Show error: "Please enter text to translate"
8. If validation passes:
   - Show loading state
   - Create FormData with text, sourceLanguage, targetLanguage
   - Submit to POST /api/voice-google/translate
9. Handle response:
   - Success: Display original and translated text side-by-side
   - Error: Show error message
Additional Features:
- Swap languages button: Swaps source and target languages
- Copy buttons: Copy original or translated text to clipboard
- Clear button: Clear input and results
Form Validation:
- Text input must not be empty
- Source and target languages must be different (optional validation)
Error Handling:
- 401 Unauthorized → Show authentication error
- 400 Bad Request → Show error details (empty text, invalid languages, etc.)
- 500 Internal Server Error → Show generic error message
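The client-side validation rules above can be sketched as a pure function. The function name is illustrative; the error strings and the optional same-language check come from this document.

```typescript
// Returns an error message, or null when the input is valid.
function validateTranslationInput(
  text: string,
  sourceLanguage: string,
  targetLanguage: string
): string | null {
  // Required: text must not be empty (whitespace-only counts as empty).
  if (text.trim().length === 0) return "Please enter text to translate";
  // Optional validation: source and target languages should differ.
  if (sourceLanguage === targetLanguage) {
    return "Source and target languages must be different";
  }
  return null;
}
```

Running this before building the FormData keeps a round-trip to POST /api/voice-google/translate from being wasted on input the backend would reject with 400.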
Text-to-Speech Mode (HTTP)
Workflow:
1. User switches to Text-to-Speech mode (if not already selected)
2. User enters text in input field
3. User selects language (default from ttsLanguage setting)
4. User selects voice (default from ttsVoice setting)
   - Voice dropdown updates when language changes
   - Frontend calls GET /api/voice-google/voices?language_code={selectedLanguage}
   - Populate voice dropdown with available voices
5. User clicks "Generate Speech" button
6. Frontend validates text is not empty
7. If validation fails:
   - Show error: "Please enter text to convert to speech"
8. If validation passes:
   - Show loading state
   - Create FormData with text, language, voice
   - Submit to POST /api/voice-google/text-to-speech
9. Handle response:
   - Success: Create audio blob from response, display audio player, show download button
   - Error: Show error message
Audio Playback:
- Create HTML5 audio element from response blob
- Display audio controls (play, pause, volume, progress)
- Extract voice name and language from response headers (X-Voice-Name, X-Language-Code)
- Display voice information
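A sketch of reading the two metadata headers named above. A plain header map is used here so the function stays self-contained; in the browser the values would come from response.headers.get(...). The function name is illustrative.

```typescript
// Pull TTS metadata out of the response headers.
function extractVoiceInfo(headers: Record<string, string>): {
  voiceName?: string;
  languageCode?: string;
} {
  return {
    voiceName: headers["X-Voice-Name"],
    languageCode: headers["X-Language-Code"],
  };
}
```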
Form Validation:
- Text input must not be empty
- Language must be selected
- Voice must be selected
Error Handling:
- 401 Unauthorized → Show authentication error
- 400 Bad Request → Show error details
- 500 Internal Server Error → Show generic error message
Text-to-Speech Mode (WebSocket Real-time)
Connection Workflow:
1. User switches to Text-to-Speech mode
2. User toggles to WebSocket real-time mode
3. User selects language and voice (defaults from settings)
4. User clicks "Connect" button
5. Frontend establishes WebSocket connection to /api/voice-google/ws/text-to-speech?userId={userId}&language={language}&voice={voice}
6. Backend sends connection confirmation
7. Frontend shows "Connected" status
8. User enters text and clicks "Speak" button
9. Frontend sends text via WebSocket: {type: "text_to_speak", text: "..."}
10. Backend processes text and sends audio: {type: "audio_data", audio: base64_audio, format: "mp3"}
11. Frontend decodes base64 audio, creates audio blob, plays audio
12. User can send multiple texts while connected
13. User clicks "Disconnect" to close WebSocket
Audio Playback:
- Queue audio chunks for sequential playback
- Or play chunks immediately as received
- Show playback progress
Error Handling:
- WebSocket connection errors → Show error message, allow retry
- Processing errors → Show error message from backend: {type: "error", error: "..."}
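Step 11 of the connection workflow (decode base64 audio before playback) can be sketched as follows. Buffer is used for brevity; in the browser, atob plus a Uint8Array would produce the bytes before wrapping them in a Blob for an HTML5 audio element. The message shape follows the documented protocol; the function name is illustrative.

```typescript
// audio_data message from the TTS WebSocket, per the documented protocol.
interface AudioDataMessage {
  type: "audio_data";
  audio: string;  // base64-encoded audio bytes
  format: string; // e.g. "mp3"
}

// Decode the base64 payload into raw bytes ready for Blob/playback.
function decodeAudioMessage(msg: AudioDataMessage): Uint8Array {
  return new Uint8Array(Buffer.from(msg.audio, "base64"));
}
```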
Voice Recording Section
The Voice Recording Section supports two modes: Speech-to-Text and Voice Interpretation. Users can switch between modes using a mode toggle. The section also supports an optional WebSocket real-time mode toggle.
Speech-to-Text Mode (HTTP)
Recording Workflow:
1. User switches to Speech-to-Text mode (if not already selected)
2. User selects language (default from sttLanguage setting)
3. User clicks "Start Recording"
4. Frontend requests microphone access via navigator.mediaDevices.getUserMedia()
5. If permission denied:
   - Show error message: "Microphone access denied. Please enable microphone permissions in your browser settings."
   - Disable recording button
6. If permission granted:
   - Start MediaRecorder API
   - Show recording indicator (red dot, start timer)
   - Capture audio stream
7. User speaks into microphone
8. User clicks "Stop Recording"
9. Frontend stops MediaRecorder
10. Convert audio stream to audio file (Blob)
11. Validate audio file (size, format)
12. If validation fails:
    - Show error: "Invalid audio format or file too large"
13. If validation passes:
    - Show loading state
    - Create FormData with audio file and language
    - Submit to POST /api/voice-google/speech-to-text
14. Handle response:
    - Success: Display transcribed text, confidence, language, metadata
    - Error: Show error message
Form Validation:
- Language must be selected
- Audio file must be valid format (WebM, WAV, MP3, etc.)
- Audio file size must be within limits
Error Handling:
- 401 Unauthorized → Show authentication error, redirect to login
- 400 Bad Request → Show error details from response
- 500 Internal Server Error → Show generic error message
Voice Interpretation Mode (HTTP)
Recording and Interpretation Workflow:
1. User switches to Voice Interpretation mode (if not already selected)
2. User selects source language (default from sttLanguage setting)
3. User selects target language (default from targetLanguage setting)
4. User clicks "Start Recording"
5. Frontend requests microphone access
6. If permission granted:
   - Start recording
   - Show recording indicator
7. User speaks into microphone
8. User clicks "Stop Recording"
9. Frontend stops recording and converts to audio file
10. Validate audio file
11. If validation passes:
    - Show loading state
    - Create FormData with audioFile, fromLanguage, toLanguage
    - Submit to POST /api/voice-google/realtime-interpreter
12. Handle response:
    - Success: Display original text and translated text
    - Error: Show error message
Form Validation:
- Source and target languages must be selected
- Audio file must be valid
Error Handling:
- Same as Speech-to-Text errors
Speech-to-Text Mode (WebSocket Real-time)
Connection Workflow:
1. User switches to Speech-to-Text mode
2. User toggles to WebSocket real-time mode
3. User selects language (default from settings)
4. User clicks "Connect" button or switches to WebSocket mode
5. Frontend establishes WebSocket connection to /api/voice-google/ws/speech-to-text?userId={userId}&language={language}
6. Backend sends connection confirmation: {type: "connected", connection_id, message}
7. Frontend shows "Connected" status
8. User clicks "Start Recording"
9. Frontend requests microphone access
10. If permission granted:
    - Start MediaRecorder
    - Start capturing audio chunks
    - Encode chunks to base64
    - Send chunks via WebSocket: {type: "audio_chunk", data: base64_audio, timestamp}
11. Backend processes chunks and sends results: {type: "transcription_result", text, confidence, is_final}
12. Frontend displays results:
    - Interim results (is_final: false) → Grayed out, updating
    - Final results (is_final: true) → Confirmed, stable
13. User clicks "Stop Recording" or "Disconnect"
14. Frontend stops recording and closes WebSocket
Keep-Alive:
- Frontend sends ping messages periodically: {type: "ping", timestamp}
- Backend responds with: {type: "pong", timestamp}
Error Handling:
- WebSocket connection errors → Show error message, allow retry
- Processing errors → Show error message from backend: {type: "error", error: "..."}
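The interim-vs-final display rule above can be sketched as a small state reducer: final results are appended to the stable transcript, while only the latest interim result is kept (to be rendered grayed out). The message shape follows the documented protocol; the state type and function name are illustrative.

```typescript
// transcription_result message from the STT WebSocket, per the protocol.
interface TranscriptionResult {
  type: "transcription_result";
  text: string;
  confidence: number;
  is_final: boolean;
}

interface TranscriptState {
  finalText: string;   // confirmed, stable text
  interimText: string; // grayed out, still changing
}

// Fold one result into the display state.
function applyResult(state: TranscriptState, r: TranscriptionResult): TranscriptState {
  if (r.is_final) {
    // Final result: append to the stable transcript, clear the interim line.
    return { finalText: (state.finalText + " " + r.text).trim(), interimText: "" };
  }
  // Interim result: replace the interim line, keep the stable transcript.
  return { ...state, interimText: r.text };
}
```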
Settings Section
Viewing Settings:
1. User navigates to Settings section
2. Frontend calls GET /api/voice-google/settings
3. Display current settings in form
4. If no user settings, show default settings
Updating Settings:
1. User modifies settings in form:
   - sttLanguage - Required
   - ttsLanguage - Required
   - ttsVoice - Required
   - translationEnabled - Optional (default: true)
   - targetLanguage - Optional (default: "en-US")
2. User clicks "Save Settings" button
3. Frontend validates required fields
4. If validation fails:
   - Show error: "Please fill in all required fields"
5. If validation passes:
   - Show loading state
   - Build settings object
   - Submit to POST /api/voice-google/settings with settings object
6. Handle response:
   - Success: Show success message, update UI, update defaults for other sections
   - Error: Show error message
Voice Selection:
- When user changes ttsLanguage, frontend should:
  - Call GET /api/voice-google/voices?language_code={newLanguage}
  - Update voice dropdown with filtered voices
  - If current voice is not available in new language, select first available voice or clear selection
Form Validation:
- sttLanguage is required
- ttsLanguage is required
- ttsVoice is required
- targetLanguage is required if translationEnabled is true
Error Handling:
- 401 Unauthorized → Show authentication error
- 400 Bad Request → Show error details (missing required fields)
- 500 Internal Server Error → Show generic error message
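The voice fallback rule above (keep the current voice if it exists for the new language, otherwise take the first available one) can be sketched as a pure helper. The Voice shape matches the voices response documented later; the function name is illustrative.

```typescript
// Voice object as returned by GET /api/voice-google/voices.
interface Voice {
  name: string;     // e.g. "de-DE-Wavenet-A"
  language: string; // e.g. "de-DE"
  gender: string;   // e.g. "FEMALE"
}

// Keep the current voice when still available; otherwise fall back to
// the first voice for the new language, or null when the list is empty.
function pickVoice(current: string, available: Voice[]): string | null {
  if (available.some((v) => v.name === current)) return current;
  return available.length > 0 ? available[0].name : null;
}
```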
Discovering Available Languages and Voices
This functionality is available from multiple locations (Settings section, language selectors, voice selectors) and opens a modal/sidebar browser.
Browsing Languages:
1. User clicks "Load Available Languages" button (in Settings or language selector)
2. Frontend calls GET /api/voice-google/languages
3. Display languages in modal/sidebar
4. Show language code and display name
5. User can search/filter languages
6. User clicks "Select" to use language
7. Update language selector with selected language
Browsing Voices:
1. User clicks "Load Available Voices" button (in Settings or voice selector)
2. User optionally selects language filter
3. Frontend calls GET /api/voice-google/voices?language_code={selectedLanguage} (if filter selected)
   - Or GET /api/voice-google/voices (if no filter)
4. Display voices in modal/sidebar
5. Show voice name, language, gender/type
6. Group voices by language if no filter
7. User can search/filter voices
8. User clicks "Select" to use voice
9. Update voice selector with selected voice
Backend Routes and API Integration
Complete Route Reference
All backend routes used by voice service page:
| Route | Method | Purpose | When Used | Access Control |
|---|---|---|---|---|
| /api/voice-google/speech-to-text | POST | Convert speech to text | User stops recording | Current user only |
| /api/voice-google/translate | POST | Translate text | User clicks "Translate" | Current user only |
| /api/voice-google/realtime-interpreter | POST | Speech to translated text | User stops recording in interpreter mode | Current user only |
| /api/voice-google/text-to-speech | POST | Convert text to speech | User clicks "Generate Speech" | Current user only |
| /api/voice-google/languages | GET | Get available languages | User browses languages | Current user only |
| /api/voice-google/voices | GET | Get available voices | User browses voices or changes language | Current user only |
| /api/voice-google/settings | GET | Get voice settings | Page load, settings view | Current user only |
| /api/voice-google/settings | POST | Save voice settings | User saves settings | Current user only |
| /api/voice-google/health | GET | Health check | Optional: on page load | Current user only |
| /api/voice-google/ws/speech-to-text | WebSocket | Real-time speech-to-text | User connects WebSocket | Current user only |
| /api/voice-google/ws/text-to-speech | WebSocket | Real-time text-to-speech | User connects WebSocket | Current user only |
| /api/voice-google/ws/realtime-interpreter | WebSocket | Real-time interpretation | User connects WebSocket (future) | Current user only |
API Request Patterns
Speech-to-Text Request:
POST /api/voice-google/speech-to-text
Content-Type: multipart/form-data
Body: {
audioFile: <file>,
language: "de-DE"
}
- audioFile is required (audio file from recording)
- language is required (language code like "de-DE", "en-US")
- Handle 400 (invalid format), 401 (unauthorized), 500 errors
Translation Request:
POST /api/voice-google/translate
Content-Type: multipart/form-data
Body: {
text: "Text to translate",
sourceLanguage: "de",
targetLanguage: "en"
}
- text is required (non-empty string)
- sourceLanguage is required (language code like "de", "en")
- targetLanguage is required (language code like "de", "en")
- Handle 400 (empty text, invalid languages), 401, 500 errors
Real-time Interpreter Request:
POST /api/voice-google/realtime-interpreter
Content-Type: multipart/form-data
Body: {
audioFile: <file>,
fromLanguage: "de-DE",
toLanguage: "en-US",
connectionId: "optional-connection-id"
}
- audioFile is required
- fromLanguage is required (language code with region)
- toLanguage is required (language code with region)
- connectionId is optional
- Handle same errors as speech-to-text
Text-to-Speech Request:
POST /api/voice-google/text-to-speech
Content-Type: multipart/form-data
Body: {
text: "Text to speak",
language: "de-DE",
voice: "de-DE-Wavenet-A"
}
- text is required (non-empty string)
- language is required (language code with region)
- voice is optional (voice name, defaults to system default)
- Response is an audio file (audio/mpeg) with headers:
  - X-Voice-Name: Voice name used
  - X-Language-Code: Language code used
- Handle 400 (empty text), 401, 500 errors
Get Languages Request:
GET /api/voice-google/languages
- Returns: {success: true, languages: [...]}
- Languages array contains language objects with code and name
- Handle 400, 401, 500 errors
Get Voices Request:
GET /api/voice-google/voices?language_code=de-DE
- language_code query parameter is optional
- If provided, filters voices by language
- Returns: {success: true, voices: [...], language_filter: "de-DE"}
- Voices array contains voice objects with name, language, gender, etc.
- Handle 400, 401, 500 errors
Get Settings Request:
GET /api/voice-google/settings
- Returns: {success: true, data: {user_settings, default_settings}}
- user_settings may be null if no settings exist
- default_settings always present
- Handle 401, 500 errors
Save Settings Request:
POST /api/voice-google/settings
Content-Type: application/json
Body: {
"sttLanguage": "de-DE",
"ttsLanguage": "de-DE",
"ttsVoice": "de-DE-Wavenet-A",
"translationEnabled": true,
"targetLanguage": "en-US"
}
- Required fields: sttLanguage, ttsLanguage, ttsVoice
- Optional fields: translationEnabled (default: true), targetLanguage (default: "en-US")
- Returns: {success: true, message: "...", data: settings}
- Handle 400 (missing required fields), 401, 500 errors
WebSocket Connection:
WebSocket: /api/voice-google/ws/speech-to-text?userId={userId}&language={language}
WebSocket: /api/voice-google/ws/text-to-speech?userId={userId}&language={language}&voice={voice}
- Query parameters: userId, language, voice (for TTS)
- Backend sends connection confirmation on connect
- Client sends messages: {type: "audio_chunk", data: base64, timestamp} or {type: "text_to_speak", text: "..."}
- Backend sends messages: {type: "transcription_result", text, confidence, is_final} or {type: "audio_data", audio: base64, format: "mp3"}
- Handle connection errors, processing errors
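The client-to-server messages listed above can be sketched as small builder functions that serialize the documented shapes. The function names are illustrative; the field names and type values come from this document.

```typescript
// Build the JSON strings sent over the voice WebSockets.
function audioChunkMessage(base64Audio: string): string {
  return JSON.stringify({ type: "audio_chunk", data: base64Audio, timestamp: Date.now() });
}

function textToSpeakMessage(text: string): string {
  return JSON.stringify({ type: "text_to_speak", text });
}

function pingMessage(): string {
  return JSON.stringify({ type: "ping", timestamp: Date.now() });
}
```

Each builder's output is passed directly to WebSocket.send().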
Response Handling
Speech-to-Text Response:
{
"success": true,
"text": "Transcribed text here",
"confidence": 0.95,
"language": "de-DE",
"audio_info": {
"size": 12345,
"format": "webm",
"estimated_duration": 3.5
}
}
Translation Response:
{
"success": true,
"original_text": "Original text",
"translated_text": "Translated text",
"source_language": "de",
"target_language": "en"
}
Real-time Interpreter Response:
{
"success": true,
"original_text": "Original transcribed text",
"translated_text": "Translated text",
"confidence": 0.95,
"source_language": "de-DE",
"target_language": "en-US",
"audio_info": {
"size": 12345,
"format": "webm",
"estimated_duration": 3.5
}
}
Text-to-Speech Response:
- Binary audio file (audio/mpeg)
- Headers:
Content-Type: audio/mpegContent-Disposition: attachment; filename=speech.mp3X-Voice-Name: de-DE-Wavenet-AX-Language-Code: de-DE
Languages Response:
{
"success": true,
"languages": [
{
"code": "de-DE",
"name": "German (Germany)"
},
...
]
}
Voices Response:
{
"success": true,
"voices": [
{
"name": "de-DE-Wavenet-A",
"language": "de-DE",
"gender": "FEMALE",
"ssml_gender": "FEMALE"
},
...
],
"language_filter": "de-DE"
}
Settings Response:
{
"success": true,
"data": {
"user_settings": {
"sttLanguage": "de-DE",
"ttsLanguage": "de-DE",
"ttsVoice": "de-DE-Wavenet-A",
"translationEnabled": true,
"targetLanguage": "en-US"
},
"default_settings": {
"sttLanguage": "de-DE",
"ttsLanguage": "de-DE",
"ttsVoice": "de-DE-Wavenet-A",
"translationEnabled": true,
"targetLanguage": "en-US"
}
}
}
Error Responses:
- 400 Bad Request → Display validation errors from the response detail field
- 500 Internal Server Error → Show generic error message
Field and Attribute Reference
Voice Settings Fields
The following fields are used for voice settings. These are not provided by a backend attributes endpoint but are defined by the API contract:
Required Fields:
- sttLanguage - Speech-to-Text language (text/select, editable, required, visible)
  - Format: Language code with region (e.g., "de-DE", "en-US")
  - Default: "de-DE"
- ttsLanguage - Text-to-Speech language (text/select, editable, required, visible)
  - Format: Language code with region (e.g., "de-DE", "en-US")
  - Default: "de-DE"
- ttsVoice - Text-to-Speech voice name (text/select, editable, required, visible)
  - Format: Voice identifier (e.g., "de-DE-Wavenet-A")
  - Default: "de-DE-Wavenet-A"
  - Options populated from /api/voice-google/voices?language_code={ttsLanguage}
Optional Fields:
- translationEnabled - Enable translation features (checkbox, editable, not required, visible)
  - Type: boolean
  - Default: true
- targetLanguage - Target language for translation (text/select, editable, not required, visible)
  - Format: Language code with region (e.g., "en-US", "fr-FR")
  - Default: "en-US"
  - Shown when translationEnabled is true
Language and Voice Data Structures
Language Object:
- code - Language code (e.g., "de-DE", "en-US")
- name - Display name (e.g., "German (Germany)", "English (United States)")
Voice Object:
- name - Voice identifier (e.g., "de-DE-Wavenet-A")
- language - Language code (e.g., "de-DE")
- gender - Voice gender (e.g., "FEMALE", "MALE", "NEUTRAL")
- ssml_gender - SSML gender identifier
- Additional metadata may be available
Dynamic Rendering Guidelines
Settings-Driven Defaults
All voice operations should use user settings as defaults:
- On Page Load:
  - Call GET /api/voice-google/settings
  - Store settings in component state
  - Pre-populate all selectors with settings values
  - Use default_settings if user_settings is null
- Language Selectors:
  - Speech-to-Text: Default to sttLanguage from settings
  - Text-to-Speech: Default to ttsLanguage from settings
  - Translation Source: Default to language derived from sttLanguage (remove region code)
  - Translation Target: Default to targetLanguage from settings
- Voice Selector:
  - Default to ttsVoice from settings
  - When language changes, fetch voices for new language
  - If default voice not available in new language, select first available
- After Settings Update:
  - Refresh settings from backend
  - Update all selectors with new defaults
  - Show success message
Audio Recording Implementation
MediaRecorder API:
- Request microphone access: navigator.mediaDevices.getUserMedia({ audio: true })
- Create MediaRecorder with appropriate MIME type:
  - Prefer WebM: new MediaRecorder(stream, { mimeType: 'audio/webm' })
  - Fall back to the browser default
- Start recording: mediaRecorder.start()
- Capture data chunks: mediaRecorder.ondataavailable
- Stop recording: mediaRecorder.stop()
- Convert chunks to Blob: new Blob(chunks, { type: 'audio/webm' })
- Create File object for upload: new File([blob], 'recording.webm', { type: 'audio/webm' })
Recording Indicators:
- Show visual indicator (red dot, pulsing animation)
- Display timer (MM:SS format)
- Show audio level/waveform (optional, using AudioContext API)
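The MM:SS timer mentioned above can be sketched as a small formatting helper; the function name is illustrative.

```typescript
// Format a recording duration in seconds as MM:SS.
function formatDuration(totalSeconds: number): string {
  const m = Math.floor(totalSeconds / 60);
  const s = Math.floor(totalSeconds % 60);
  return `${String(m).padStart(2, "0")}:${String(s).padStart(2, "0")}`;
}
```

Driven from a setInterval started alongside mediaRecorder.start(), this keeps the indicator in sync with the actual recording time.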
Error Handling:
- Microphone permission denied → Show clear error message with instructions
- No microphone available → Disable recording, show message
- Recording errors → Show error, allow retry
Language and Voice Selection
Language Dropdowns:
- Optionally fetch languages from `/api/voice-google/languages` on page load
- Cache languages in component state
- Display languages with code and name
- Allow search/filter
- Use settings default as initial selection
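The search/filter behavior can be a case-insensitive match over both code and display name. A sketch, assuming the languages endpoint returns objects with `code` and `name` fields:

```typescript
interface Language {
  code: string; // e.g. "de-DE"
  name: string; // e.g. "German (Germany)"
}

// Case-insensitive filter over both the language code and display name;
// an empty query returns the full cached list.
function filterLanguages(languages: Language[], query: string): Language[] {
  const q = query.trim().toLowerCase();
  if (!q) return languages;
  return languages.filter(
    (l) => l.code.toLowerCase().includes(q) || l.name.toLowerCase().includes(q)
  );
}
```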
Voice Dropdown:
- When TTS language changes:
  - Call `GET /api/voice-google/voices?language_code={selectedLanguage}`
  - Update dropdown with filtered voices
  - Select default voice if available, otherwise first voice
- Display voice name and gender/type
- Allow search/filter
- Show loading state while fetching voices
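The "default voice if available, otherwise first voice" rule is easy to isolate and test. A sketch; the `Voice` shape follows the voice fields listed in the API reference above:

```typescript
interface Voice {
  name: string;   // e.g. "de-DE-Wavenet-A"
  gender: string; // e.g. "FEMALE"
}

// Select the user's default voice if present in the freshly fetched list,
// otherwise fall back to the first available voice (or null if empty).
function pickVoice(voices: Voice[], defaultName: string): Voice | null {
  return voices.find((v) => v.name === defaultName) ?? voices[0] ?? null;
}
```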
WebSocket Implementation
Connection Management:
- Create WebSocket connection with query parameters
- Handle connection events:
  - `onopen` → Show connected status
  - `onmessage` → Process incoming messages
  - `onerror` → Show error, allow retry
  - `onclose` → Show disconnected status, cleanup
- Send keep-alive pings periodically (every 30 seconds)
- Handle reconnection on disconnect
Message Handling:
- Parse incoming JSON messages
- Handle message types:
  - `connected` → Update connection status
  - `transcription_result` → Update transcription display (interim vs. final)
  - `audio_data` → Decode base64, create blob, play audio
  - `error` → Show error message
  - `pong` → Update last ping time
- Send messages:
  - `audio_chunk` → Encode audio to base64, include timestamp
  - `text_to_speak` → Include text
  - `ping` → Include timestamp
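The message handling above can be sketched as a typed dispatcher plus builders for the three outgoing messages. The payload field names (`text`, `is_final`, `audio`, `message`, `timestamp`) are assumptions for illustration; adjust them to the backend's actual message schema:

```typescript
// Incoming messages from the server (types listed above).
// NOTE: payload field names are assumed, not confirmed by the backend spec.
type ServerMessage =
  | { type: "connected" }
  | { type: "transcription_result"; text: string; is_final: boolean }
  | { type: "audio_data"; audio: string } // base64-encoded audio
  | { type: "error"; message: string }
  | { type: "pong"; timestamp: number };

// Builders for the three outgoing message types.
function audioChunkMessage(base64Audio: string) {
  return { type: "audio_chunk", audio: base64Audio, timestamp: Date.now() };
}
function textToSpeakMessage(text: string) {
  return { type: "text_to_speak", text };
}
function pingMessage() {
  return { type: "ping", timestamp: Date.now() };
}

// Route an incoming message to the appropriate UI update callback.
function routeMessage(msg: ServerMessage, ui: {
  setConnected: (connected: boolean) => void;
  showTranscription: (text: string, isFinal: boolean) => void;
  playAudio: (base64: string) => void;
  showError: (message: string) => void;
  notePong: (timestamp: number) => void;
}): void {
  switch (msg.type) {
    case "connected": ui.setConnected(true); break;
    case "transcription_result": ui.showTranscription(msg.text, msg.is_final); break;
    case "audio_data": ui.playAudio(msg.audio); break;
    case "error": ui.showError(msg.message); break;
    case "pong": ui.notePong(msg.timestamp); break;
  }
}
```

Keeping the dispatcher pure (all side effects behind injected callbacks) makes the WebSocket layer testable without a live connection.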
Audio Chunking (Speech-to-Text):
- Capture audio chunks from MediaRecorder
- Encode chunks to base64
- Send chunks via WebSocket as they're captured
- Buffer chunks if needed for reliable transmission
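A minimal base64 encoder for raw chunk bytes, as a sketch. MediaRecorder delivers `Blob` chunks, so in practice you would first obtain the bytes with `await chunk.arrayBuffer()` and wrap them in a `Uint8Array`:

```typescript
// Encode a captured audio chunk (raw bytes) to base64 for the
// audio_chunk WebSocket message. Builds the binary string byte by byte
// instead of String.fromCharCode(...bytes), which can overflow the
// argument limit for large chunks.
function chunkToBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}
```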
Audio Playback (Text-to-Speech):
- Receive base64 audio chunks
- Decode base64 to binary
- Create audio blob
- Queue for sequential playback or play immediately
- Use HTML5 Audio API for playback
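The decode-and-queue steps can be sketched as a pure decoder plus a small sequential queue. The queue takes an injected `play` callback (for example, one that wraps an HTML5 `Audio` element and resolves on `onended`) so chunks never overlap. All names here are ours:

```typescript
// Decode a base64 audio payload back to raw bytes.
function base64ToBytes(base64: string): Uint8Array {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Minimal sequential playback queue: the injected play callback is
// expected to resolve when one chunk finishes, so the next chunk
// only starts after the previous one ends.
class PlaybackQueue {
  private queue: Uint8Array[] = [];
  private playing = false;

  constructor(private play: (bytes: Uint8Array) => Promise<void>) {}

  enqueue(bytes: Uint8Array): void {
    this.queue.push(bytes);
    void this.drain();
  }

  private async drain(): Promise<void> {
    if (this.playing) return;
    this.playing = true;
    while (this.queue.length > 0) {
      await this.play(this.queue.shift()!);
    }
    this.playing = false;
  }
}
```

In the browser, the `play` callback would typically build a blob URL (`URL.createObjectURL(new Blob([bytes]))`), assign it to an `Audio` element, and resolve in its `onended` handler.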
Form Validation
Voice Recording Section:
- Speech-to-Text Mode:
  - Language must be selected
  - Audio file must exist and be a valid format
  - Audio file size must be within limits
- Voice Interpretation Mode:
  - Source and target languages must be selected
  - Audio file must be valid
Text Processing Section:
- Translation Mode:
  - Text must not be empty
  - Source and target languages must be selected
  - Source and target languages should be different (optional validation)
- Text-to-Speech Mode:
  - Text must not be empty
  - Language must be selected
  - Voice must be selected
Settings Section:
- `sttLanguage` is required
- `ttsLanguage` is required
- `ttsVoice` is required
- `targetLanguage` is required if `translationEnabled` is true
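The settings rules translate directly into a validator that returns a list of human-readable errors (empty means valid). A sketch; the form shape mirrors the settings keys used in this document:

```typescript
interface SettingsForm {
  sttLanguage?: string;
  ttsLanguage?: string;
  ttsVoice?: string;
  targetLanguage?: string;
  translationEnabled?: boolean;
}

// Return the list of validation errors for the settings section;
// an empty array means the form may be submitted.
function validateSettings(form: SettingsForm): string[] {
  const errors: string[] = [];
  if (!form.sttLanguage) errors.push("sttLanguage is required");
  if (!form.ttsLanguage) errors.push("ttsLanguage is required");
  if (!form.ttsVoice) errors.push("ttsVoice is required");
  if (form.translationEnabled && !form.targetLanguage) {
    errors.push("targetLanguage is required when translation is enabled");
  }
  return errors;
}
```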
Error Display
Error Messages:
- Display errors prominently (red text, error icon)
- Show specific error details from backend when available
- Provide actionable guidance (e.g., "Please enable microphone permissions")
- Allow retry for transient errors
Loading States:
- Show loading spinner during API calls
- Disable form inputs during processing
- Show progress indicators for long operations
Audio Playback
HTML5 Audio Player:
- Create audio element: `<audio controls src={audioUrl} />`
- Handle audio events:
  - `onloadeddata` → Enable playback controls
  - `onerror` → Show error message
  - `onended` → Reset player state
- Provide download button: Create download link with audio blob
Audio Format Handling:
- Support MP3 (primary format from backend)
- Handle other formats if backend provides them
- Show format information to user
Key Principles
- Settings as Defaults: Always use user settings as defaults, but allow overrides
- Progressive Enhancement: HTTP mode works without WebSocket, WebSocket adds real-time features
- Error Recovery: Provide clear error messages and retry options
- User Feedback: Show loading states, success messages, and progress indicators
- Accessibility: Support keyboard navigation, screen readers, and proper ARIA labels
- Performance: Cache languages and voices, debounce API calls, optimize audio processing
- Browser Compatibility: Handle browser differences in MediaRecorder, WebSocket, and audio APIs
- Security: Validate all user inputs, handle authentication errors gracefully
- User-Scoped: All settings and operations are user-specific (backend enforces this)
Summary
This document provides complete frontend requirements for the voice service page. The page enables users to interact with Google Cloud voice services through a unified interface organized into three main sections.
Key Architecture Pattern: The voice service interface is a single page (/voice or /voice-service) organized into three main sections:
- Text Processing Section - Supports Translation and Text-to-Speech modes (both use text input)
- Voice Recording Section - Supports Speech-to-Text and Voice Interpretation modes (both use audio recording)
- Settings Section - Manages user voice settings
Each section supports mode toggles to switch between different operations, and optional WebSocket real-time mode toggles for live streaming. All features share common settings but allow per-operation overrides.
Settings-Driven: All operations use user settings as defaults, loaded from /api/voice-google/settings on page load. Users can override settings per operation.
Multi-Modal: Supports both HTTP REST endpoints for file-based processing and WebSocket endpoints for real-time streaming. WebSocket modes are integrated as optional toggles within the relevant sections.
Component-Based: The page is built from reusable UI components (Language Selector, Voice Selector, Recording Controls, Text Input, Result Display, etc.) that are combined in different ways to create each section and mode.
Security Note: All operations are user-scoped. Users can only access their own settings. The backend enforces this security.