Voice Service Page Requirements

This document contains the complete frontend requirements for the voice service page, which enables users to interact with Google Cloud voice services including Speech-to-Text, Translation, Text-to-Speech, and real-time voice interpretation. All UI components should use user settings as defaults but allow per-operation overrides.

Table of Contents

  1. Overview
  2. Page Structure and Layout
  3. User Interactions and Functionality
  4. Backend Routes and API Integration
  5. Field and Attribute Reference
  6. Dynamic Rendering Guidelines

Overview

The voice service page enables users to interact with Google Cloud voice services through a unified interface. The frontend consists of a single page (/voice or /voice-service) with different sections/tabs for each feature:

  • Speech-to-Text Section - Record speech and convert to text
  • Translation Section - Translate text between languages
  • Voice Interpretation Section - Record speech and get translated text
  • Text-to-Speech Section - Convert text to spoken audio
  • Settings Section - Configure default voice settings
  • Real-time WebSocket Features - Live streaming for speech-to-text and text-to-speech

All features use user settings as defaults (loaded from /api/voice-google/settings) but allow users to override settings per operation. The page supports both HTTP REST endpoints for file-based processing and WebSocket endpoints for real-time streaming.

Key Principles:

  • Settings-Driven: User preferences loaded and applied automatically
  • Multi-Modal: Supports both recording and real-time streaming
  • Language-Aware: All operations support multiple languages
  • User-Scoped: Settings are user-specific and stored per user

Page Structure and Layout

Voice Service Page (/voice or /voice-service)

The voice service page uses a tabbed or sectioned interface to organize different voice features. Each section is self-contained but shares common settings and reusable UI components.

Reusable UI Components

The page is composed of reusable components that can be combined in different ways to create each feature section:

Language Selector Component

A dropdown selector for choosing languages, used across multiple sections:

  • Properties:

    • Default value from user settings (configurable: sttLanguage, ttsLanguage, or custom)
    • Supports language codes with region (e.g., "de-DE", "en-US") or without region (e.g., "de", "en")
    • Options populated from /api/voice-google/languages (cached)
    • Shows language code and display name
    • Optional: Search/filter functionality
  • Usage:

    • Voice Recording Section (Speech-to-Text mode): Single language selector (default: sttLanguage)
    • Text Processing Section (Translation mode): Two language selectors (source and target, defaults vary)
    • Voice Recording Section (Voice Interpretation mode): Two language selectors (source and target with region codes)
    • Text Processing Section (Text-to-Speech mode): Single language selector (default: ttsLanguage)
    • Settings Section: Multiple language selectors for different purposes

Voice Selector Component

A dropdown selector for choosing TTS voices, filtered by selected language:

  • Properties:

    • Default value from user settings (ttsVoice)
    • Options populated from /api/voice-google/voices?language_code={selectedLanguage}
    • Automatically updates when language changes
    • Shows voice name, language, and gender/type information
    • Optional: Search/filter functionality
  • Usage:

    • Text Processing Section (Text-to-Speech mode): Single voice selector
    • Settings Section: Single voice selector (filtered by TTS language)

Recording Controls Component

Controls for starting and stopping audio recording:

  • Properties:

    • "Start Recording" button (shown when not recording)
    • "Stop Recording" button (shown when recording)
    • Recording indicator (red dot, timer showing duration)
    • Optional: Visual waveform or audio level indicator
    • Handles microphone permission requests
  • Usage:

    • Voice Recording Section (Speech-to-Text mode): Recording controls + language selector
    • Voice Recording Section (Voice Interpretation mode): Recording controls + source/target language selectors
    • Voice Recording Section (WebSocket mode): Recording controls + WebSocket connection controls

Text Input Component

Multi-line textarea for text input:

  • Properties:

    • Multi-line textarea
    • Optional: Character count
    • Optional: Placeholder text
    • Validation: Required or optional based on context
  • Usage:

    • Text Processing Section (Translation mode): Text input + language selectors
    • Text Processing Section (Text-to-Speech mode): Text input + language/voice selectors
    • Text Processing Section (WebSocket mode): Text input + WebSocket controls

Result Display Component

Generic component for displaying operation results:

  • Properties:

    • Text display area (read-only or editable)
    • Optional: Multiple text areas (e.g., original and translated)
    • Metadata display (confidence, language, audio info, etc.)
    • Action buttons: Copy, Clear, Download (context-dependent)
    • Optional: Additional actions (Swap, Regenerate, etc.)
  • Variants:

    • Text Result: Displays transcribed or translated text with metadata
    • Dual Text Result: Displays original and translated text side-by-side
    • Audio Result: Displays audio player with download option and metadata
  • Usage:

    • Voice Recording Section (Speech-to-Text mode): Text result with confidence, language, audio metadata
    • Text Processing Section (Translation mode): Dual text result with language information
    • Voice Recording Section (Voice Interpretation mode): Dual text result with confidence, languages, audio metadata
    • Text Processing Section (Text-to-Speech mode): Audio result with voice name, language, download option

Audio Player Component

HTML5 audio player for playback:

  • Properties:

    • HTML5 audio element with controls (play, pause, volume, progress)
    • Download button
    • Metadata display (voice name, language, format)
    • Optional: Regenerate button
  • Usage:

    • Text Processing Section (Text-to-Speech mode): Audio playback of generated speech
    • Text Processing Section (WebSocket mode): Sequential playback of audio chunks

WebSocket Connection Controls Component

Controls for managing WebSocket connections:

  • Properties:

    • "Connect" button (establishes WebSocket connection)
    • "Disconnect" button (closes WebSocket connection)
    • Connection status indicator (connected, disconnected, connecting, error)
    • Connection ID display (optional)
  • Usage:

    • Voice Recording Section (WebSocket mode): Connection controls + recording controls + live transcription
    • Text Processing Section (WebSocket mode): Connection controls + text input + audio playback

Live Transcription Display Component

Real-time transcription display for WebSocket mode:

  • Properties:

    • Updates in real-time as results arrive
    • Distinguishes between interim results (grayed out, updating) and final results (confirmed, stable)
    • Shows confidence scores (optional)
    • Scrolls to latest content automatically
  • Usage:

    • Voice Recording Section (WebSocket mode): Live transcription display

Status Indicators Component

Visual indicators for operation status:

  • Properties:

    • Loading state (spinner, progress bar)
    • Error messages (red text, error icon)
    • Success messages (green text, success icon)
    • Microphone permission status
    • Recording status (idle, recording, processing)
    • Connection status (for WebSocket mode)
  • Usage:

    • All sections: Loading, error, and success states
    • Recording sections: Microphone permission and recording status
    • WebSocket sections: Connection status

Settings Form Component

Form for managing user voice settings:

  • Properties:

    • Multiple language selectors (STT, TTS, target)
    • Voice selector (filtered by TTS language)
    • Translation enabled toggle/checkbox
    • Conditional fields (target language shown when translation enabled)
    • Action buttons: Save, Reset to Defaults, Browse Languages, Browse Voices
  • Usage:

    • Settings Section: Complete settings form

Language/Voice Browser Component

Modal or sidebar for browsing available languages and voices:

  • Properties:

    • List display with search/filter functionality
    • Language browser: Shows language code and display name
    • Voice browser: Shows voice name, language, gender/type, with optional language filter
    • Select button to use selected item
    • Can be opened from language/voice selectors or settings
  • Usage:

    • Language Browser: Opened from language selectors or settings
    • Voice Browser: Opened from voice selector or settings

Page Sections

The page is organized into three main sections, each supporting multiple operation modes:

Text Processing Section

Supports operations that use text input: Translation and Text-to-Speech.

Mode Toggle:

  • User can switch between "Translation" and "Text-to-Speech" modes
  • Mode selection determines which components and options are displayed

Translation Mode:

  • Text Input + Language Selectors (source/target) + Dual Text Result Display + Status Indicators
  • Action button: "Translate"

Text-to-Speech Mode:

  • Text Input + Language Selector + Voice Selector + Audio Result Display + Status Indicators
  • Action button: "Generate Speech"

WebSocket Real-time Mode (optional toggle):

  • For Text-to-Speech: WebSocket Connection Controls + Text Input + Audio Player + Status Indicators
  • User can toggle between HTTP and WebSocket modes

Voice Recording Section

Supports operations that use audio recording: Speech-to-Text and Voice Interpretation.

Mode Toggle:

  • User can switch between "Speech-to-Text" and "Voice Interpretation" modes
  • Mode selection determines which components and options are displayed

Speech-to-Text Mode:

  • Language Selector + Recording Controls + Text Result Display + Status Indicators

Voice Interpretation Mode:

  • Language Selectors (source/target) + Recording Controls + Dual Text Result Display + Status Indicators

WebSocket Real-time Mode (optional toggle):

  • For Speech-to-Text: WebSocket Connection Controls + Recording Controls + Live Transcription Display + Status Indicators
  • User can toggle between HTTP and WebSocket modes

Settings Section

Standalone section for managing user voice settings:

  • Settings Form + Status Indicators
  • Includes: Language selectors, voice selector, translation toggle, action buttons

User Interactions and Functionality

Loading User Settings

On Page Load:

  • Frontend calls GET /api/voice-google/settings
  • If user settings exist, use them as defaults
  • If no user settings, use default settings from response
  • Pre-populate all language and voice selectors with settings
  • Store settings in component state for quick access

Settings Structure:

{
  "success": true,
  "data": {
    "user_settings": {
      "sttLanguage": "de-DE",
      "ttsLanguage": "de-DE",
      "ttsVoice": "de-DE-Wavenet-A",
      "translationEnabled": true,
      "targetLanguage": "en-US"
    },
    "default_settings": {
      "sttLanguage": "de-DE",
      "ttsLanguage": "de-DE",
      "ttsVoice": "de-DE-Wavenet-A",
      "translationEnabled": true,
      "targetLanguage": "en-US"
    }
  }
}
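
The defaulting logic above can be sketched as follows. `resolveSettings` is an illustrative name, not part of the API; the field-by-field merge (so a partial user record still falls back per field) is an assumption beyond the document's whole-object fallback.

```javascript
// Sketch of settings resolution on page load (helper name is ours).
// The /api/voice-google/settings response provides user_settings (may be
// null) and default_settings; user values win when present.
function resolveSettings(response) {
  const { user_settings, default_settings } = response.data;
  // Merge field-by-field so a partial user record still gets defaults.
  return { ...default_settings, ...(user_settings ?? {}) };
}
```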

Text Processing Section

The Text Processing Section supports two modes: Translation and Text-to-Speech. Users can switch between modes using a mode toggle. The section also supports an optional WebSocket real-time mode toggle.

Translation Mode (HTTP)

Workflow:

  1. User switches to Translation mode (if not already selected)
  2. User enters text in input field
  3. User selects source language (default from settings or "de")
  4. User selects target language (default from targetLanguage or "en")
  5. User clicks "Translate" button
  6. Frontend validates text is not empty
  7. If validation fails:
    • Show error: "Please enter text to translate"
  8. If validation passes:
    • Show loading state
    • Create FormData with text, sourceLanguage, targetLanguage
    • Submit to POST /api/voice-google/translate
    • Handle response:
      • Success: Display original and translated text side-by-side
      • Error: Show error message
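
A minimal sketch of the submission steps above; `buildTranslateForm` and `translate` are illustrative names, and the form field names follow the documented API contract.

```javascript
// Build the multipart body documented for POST /api/voice-google/translate.
function buildTranslateForm(text, sourceLanguage, targetLanguage) {
  const form = new FormData();
  form.append("text", text);
  form.append("sourceLanguage", sourceLanguage);
  form.append("targetLanguage", targetLanguage);
  return form;
}

// Validate, submit, and return the parsed response (browser usage).
async function translate(text, sourceLanguage, targetLanguage) {
  if (!text.trim()) throw new Error("Please enter text to translate");
  const res = await fetch("/api/voice-google/translate", {
    method: "POST",
    body: buildTranslateForm(text, sourceLanguage, targetLanguage),
  });
  if (!res.ok) throw new Error(`Translate failed: ${res.status}`);
  return res.json(); // { success, original_text, translated_text, ... }
}
```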

Additional Features:

  • Swap languages button: Swaps source and target languages
  • Copy buttons: Copy original or translated text to clipboard
  • Clear button: Clear input and results

Form Validation:

  • Text input must not be empty
  • Source and target languages must be different (optional validation)

Error Handling:

  • 401 Unauthorized → Show authentication error
  • 400 Bad Request → Show error details (empty text, invalid languages, etc.)
  • 500 Internal Server Error → Show generic error message

Text-to-Speech Mode (HTTP)

Workflow:

  1. User switches to Text-to-Speech mode (if not already selected)
  2. User enters text in input field
  3. User selects language (default from ttsLanguage setting)
  4. User selects voice (default from ttsVoice setting)
    • Voice dropdown updates when language changes
    • Frontend calls GET /api/voice-google/voices?language_code={selectedLanguage}
    • Populate voice dropdown with available voices
  5. User clicks "Generate Speech" button
  6. Frontend validates text is not empty
  7. If validation fails:
    • Show error: "Please enter text to convert to speech"
  8. If validation passes:
    • Show loading state
    • Create FormData with text, language, voice
    • Submit to POST /api/voice-google/text-to-speech
    • Handle response:
      • Success: Create audio blob from response, display audio player, show download button
      • Error: Show error message

Audio Playback:

  • Create HTML5 audio element from response blob
  • Display audio controls (play, pause, volume, progress)
  • Extract voice name and language from response headers (X-Voice-Name, X-Language-Code)
  • Display voice information
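
The header extraction can be sketched like this; `extractTtsMeta` is an illustrative helper, and the commented browser usage is an assumption about how the blob would be played.

```javascript
// Pull the documented TTS metadata headers from a fetch Response.
function extractTtsMeta(headers) {
  return {
    voiceName: headers.get("X-Voice-Name"),
    languageCode: headers.get("X-Language-Code"),
  };
}

// Browser-only usage (not executed here):
// const res = await fetch("/api/voice-google/text-to-speech", { method: "POST", body: form });
// const meta = extractTtsMeta(res.headers);
// const audioUrl = URL.createObjectURL(await res.blob());
// new Audio(audioUrl).play();
```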

Form Validation:

  • Text input must not be empty
  • Language must be selected
  • Voice must be selected

Error Handling:

  • 401 Unauthorized → Show authentication error
  • 400 Bad Request → Show error details
  • 500 Internal Server Error → Show generic error message

Text-to-Speech Mode (WebSocket Real-time)

Connection Workflow:

  1. User switches to Text-to-Speech mode
  2. User toggles to WebSocket real-time mode
  3. User selects language and voice (defaults from settings)
  4. User clicks "Connect" button
  5. Frontend establishes WebSocket connection to /api/voice-google/ws/text-to-speech?userId={userId}&language={language}&voice={voice}
  6. Backend sends connection confirmation
  7. Frontend shows "Connected" status
  8. User enters text and clicks "Speak" button
  9. Frontend sends text via WebSocket: {type: "text_to_speak", text: "..."}
  10. Backend processes text and sends audio: {type: "audio_data", audio: base64_audio, format: "mp3"}
  11. Frontend decodes base64 audio, creates audio blob, plays audio
  12. User can send multiple texts while connected
  13. User clicks "Disconnect" to close WebSocket
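
The message exchange above can be sketched as follows; the message shapes come from the workflow, while the helper names are ours.

```javascript
// Build the outgoing "speak this" message for the TTS WebSocket.
function makeSpeakMessage(text) {
  return JSON.stringify({ type: "text_to_speak", text });
}

// Decode the base64 payload of an audio_data message into raw bytes.
// Wrap the result in new Blob([bytes], { type: "audio/" + msg.format })
// to hand it to an HTML5 audio element.
function decodeAudioPayload(message) {
  const binary = atob(message.audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}
```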

Audio Playback:

  • Queue audio chunks for sequential playback
  • Or play chunks immediately as received
  • Show playback progress

Error Handling:

  • WebSocket connection errors → Show error message, allow retry
  • Processing errors → Show error message from backend: {type: "error", error: "..."}

Voice Recording Section

The Voice Recording Section supports two modes: Speech-to-Text and Voice Interpretation. Users can switch between modes using a mode toggle. The section also supports an optional WebSocket real-time mode toggle.

Speech-to-Text Mode (HTTP)

Recording Workflow:

  1. User switches to Speech-to-Text mode (if not already selected)
  2. User selects language (default from sttLanguage setting)
  3. User clicks "Start Recording"
  4. Frontend requests microphone access via navigator.mediaDevices.getUserMedia()
  5. If permission denied:
    • Show error message: "Microphone access denied. Please enable microphone permissions in your browser settings."
    • Disable recording button
  6. If permission granted:
    • Start MediaRecorder API
    • Show recording indicator (red dot, start timer)
    • Capture audio stream
  7. User speaks into microphone
  8. User clicks "Stop Recording"
  9. Frontend stops MediaRecorder
  10. Convert audio stream to audio file (Blob)
  11. Validate audio file (size, format)
  12. If validation fails:
    • Show error: "Invalid audio format or file too large"
  13. If validation passes:
    • Show loading state
    • Create FormData with audio file and language
    • Submit to POST /api/voice-google/speech-to-text
    • Handle response:
      • Success: Display transcribed text, confidence, language, metadata
      • Error: Show error message

Form Validation:

  • Language must be selected
  • Audio file must be valid format (WebM, WAV, MP3, etc.)
  • Audio file size must be within limits
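
A sketch of the client-side checks above. The 10 MiB cap and the exact MIME list are assumed example values; the document does not specify the backend's actual limits.

```javascript
// Illustrative pre-upload validation of a recorded audio Blob.
const ALLOWED_TYPES = ["audio/webm", "audio/wav", "audio/mpeg"]; // assumed list
const MAX_BYTES = 10 * 1024 * 1024; // assumed 10 MiB cap

function validateAudioBlob(blob) {
  const typeOk = ALLOWED_TYPES.some((t) => blob.type.startsWith(t));
  if (!typeOk || blob.size === 0 || blob.size > MAX_BYTES) {
    return { ok: false, error: "Invalid audio format or file too large" };
  }
  return { ok: true };
}
```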

Error Handling:

  • 401 Unauthorized → Show authentication error, redirect to login
  • 400 Bad Request → Show error details from response
  • 500 Internal Server Error → Show generic error message

Voice Interpretation Mode (HTTP)

Recording and Interpretation Workflow:

  1. User switches to Voice Interpretation mode (if not already selected)
  2. User selects source language (default from sttLanguage setting)
  3. User selects target language (default from targetLanguage setting)
  4. User clicks "Start Recording"
  5. Frontend requests microphone access
  6. If permission granted:
    • Start recording
    • Show recording indicator
  7. User speaks into microphone
  8. User clicks "Stop Recording"
  9. Frontend stops recording and converts to audio file
  10. Validate audio file
  11. If validation passes:
    • Show loading state
    • Create FormData with audioFile, fromLanguage, toLanguage
    • Submit to POST /api/voice-google/realtime-interpreter
    • Handle response:
      • Success: Display original text and translated text
      • Error: Show error message

Form Validation:

  • Source and target languages must be selected
  • Audio file must be valid

Error Handling:

  • Same as Speech-to-Text errors

Speech-to-Text Mode (WebSocket Real-time)

Connection Workflow:

  1. User switches to Speech-to-Text mode
  2. User toggles to WebSocket real-time mode
  3. User selects language (default from settings)
  4. User clicks "Connect" button or switches to WebSocket mode
  5. Frontend establishes WebSocket connection to /api/voice-google/ws/speech-to-text?userId={userId}&language={language}
  6. Backend sends connection confirmation: {type: "connected", connection_id, message}
  7. Frontend shows "Connected" status
  8. User clicks "Start Recording"
  9. Frontend requests microphone access
  10. If permission granted:
    • Start MediaRecorder
    • Start capturing audio chunks
    • Encode chunks to base64
    • Send chunks via WebSocket: {type: "audio_chunk", data: base64_audio, timestamp}
  11. Backend processes chunks and sends results:
    • {type: "transcription_result", text, confidence, is_final}
  12. Frontend displays results:
    • Interim results (is_final: false) → Grayed out, updating
    • Final results (is_final: true) → Confirmed, stable
  13. User clicks "Stop Recording" or "Disconnect"
  14. Frontend stops recording and closes WebSocket
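
The interim/final display logic in steps 11–12 can be sketched as a small state reducer; `applyTranscriptionMessage` is an illustrative name.

```javascript
// Interim results replace each other; final results are appended and the
// interim buffer is cleared. State shape is { finalText, interimText }.
function applyTranscriptionMessage(state, msg) {
  if (msg.type !== "transcription_result") return state;
  if (msg.is_final) {
    return { finalText: (state.finalText + " " + msg.text).trim(), interimText: "" };
  }
  return { ...state, interimText: msg.text }; // rendered grayed out, still updating
}
```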

Keep-Alive:

  • Frontend sends ping messages periodically: {type: "ping", timestamp}
  • Backend responds with: {type: "pong", timestamp}

Error Handling:

  • WebSocket connection errors → Show error message, allow retry
  • Processing errors → Show error message from backend: {type: "error", error: "..."}

Settings Section

Viewing Settings:

  1. User navigates to Settings section
  2. Frontend calls GET /api/voice-google/settings
  3. Display current settings in form
  4. If no user settings, show default settings

Updating Settings:

  1. User modifies settings in form:
    • sttLanguage - Required
    • ttsLanguage - Required
    • ttsVoice - Required
    • translationEnabled - Optional (default: true)
    • targetLanguage - Optional (default: "en-US")
  2. User clicks "Save Settings" button
  3. Frontend validates required fields
  4. If validation fails:
    • Show error: "Please fill in all required fields"
  5. If validation passes:
    • Show loading state
    • Build settings object
    • Submit to POST /api/voice-google/settings with settings object
    • Handle response:
      • Success: Show success message, update UI, update defaults for other sections
      • Error: Show error message

Voice Selection:

  • When user changes ttsLanguage, frontend should:
    1. Call GET /api/voice-google/voices?language_code={newLanguage}
    2. Update voice dropdown with filtered voices
    3. If current voice is not available in new language, select first available voice or clear selection
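
The fallback in step 3 can be sketched as follows; `pickVoice` is an illustrative name.

```javascript
// Keep the current voice if it exists in the new language's list,
// otherwise fall back to the first available voice (or null if none).
function pickVoice(voices, currentName) {
  if (voices.some((v) => v.name === currentName)) return currentName;
  return voices.length > 0 ? voices[0].name : null;
}
```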

Form Validation:

  • sttLanguage is required
  • ttsLanguage is required
  • ttsVoice is required
  • targetLanguage is required if translationEnabled is true
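
The rules above can be sketched as a validator; `validateSettings` is an illustrative name.

```javascript
// Collect validation errors for the settings form per the rules above.
function validateSettings(s) {
  const errors = [];
  if (!s.sttLanguage) errors.push("sttLanguage is required");
  if (!s.ttsLanguage) errors.push("ttsLanguage is required");
  if (!s.ttsVoice) errors.push("ttsVoice is required");
  if (s.translationEnabled && !s.targetLanguage) {
    errors.push("targetLanguage is required when translation is enabled");
  }
  return errors; // empty array means the form may be submitted
}
```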

Error Handling:

  • 401 Unauthorized → Show authentication error
  • 400 Bad Request → Show error details (missing required fields)
  • 500 Internal Server Error → Show generic error message

Discovering Available Languages and Voices

This functionality is available from multiple locations (Settings section, language selectors, voice selectors) and opens a modal/sidebar browser.

Browsing Languages:

  1. User clicks "Load Available Languages" button (in Settings or language selector)
  2. Frontend calls GET /api/voice-google/languages
  3. Display languages in modal/sidebar
  4. Show language code and display name
  5. User can search/filter languages
  6. User clicks "Select" to use language
  7. Update language selector with selected language

Browsing Voices:

  1. User clicks "Load Available Voices" button (in Settings or voice selector)
  2. User optionally selects language filter
  3. Frontend calls GET /api/voice-google/voices?language_code={selectedLanguage} (if filter selected)
    • Or GET /api/voice-google/voices (if no filter)
  4. Display voices in modal/sidebar
  5. Show voice name, language, gender/type
  6. Group voices by language if no filter
  7. User can search/filter voices
  8. User clicks "Select" to use voice
  9. Update voice selector with selected voice

Backend Routes and API Integration

Complete Route Reference

All backend routes used by voice service page:

| Route | Method | Purpose | When Used | Access Control |
| --- | --- | --- | --- | --- |
| /api/voice-google/speech-to-text | POST | Convert speech to text | User stops recording | Current user only |
| /api/voice-google/translate | POST | Translate text | User clicks "Translate" | Current user only |
| /api/voice-google/realtime-interpreter | POST | Speech to translated text | User stops recording in interpreter mode | Current user only |
| /api/voice-google/text-to-speech | POST | Convert text to speech | User clicks "Generate Speech" | Current user only |
| /api/voice-google/languages | GET | Get available languages | User browses languages | Current user only |
| /api/voice-google/voices | GET | Get available voices | User browses voices or changes language | Current user only |
| /api/voice-google/settings | GET | Get voice settings | Page load, settings view | Current user only |
| /api/voice-google/settings | POST | Save voice settings | User saves settings | Current user only |
| /api/voice-google/health | GET | Health check | Optional: on page load | Current user only |
| /api/voice-google/ws/speech-to-text | WebSocket | Real-time speech-to-text | User connects WebSocket | Current user only |
| /api/voice-google/ws/text-to-speech | WebSocket | Real-time text-to-speech | User connects WebSocket | Current user only |
| /api/voice-google/ws/realtime-interpreter | WebSocket | Real-time interpretation | User connects WebSocket (future) | Current user only |

API Request Patterns

Speech-to-Text Request:

POST /api/voice-google/speech-to-text
Content-Type: multipart/form-data
Body: {
  audioFile: <file>,
  language: "de-DE"
}
  • audioFile is required (audio file from recording)
  • language is required (language code like "de-DE", "en-US")
  • Handle 400 (invalid format), 401 (unauthorized), 500 errors

Translation Request:

POST /api/voice-google/translate
Content-Type: multipart/form-data
Body: {
  text: "Text to translate",
  sourceLanguage: "de",
  targetLanguage: "en"
}
  • text is required (non-empty string)
  • sourceLanguage is required (language code like "de", "en")
  • targetLanguage is required (language code like "de", "en")
  • Handle 400 (empty text, invalid languages), 401, 500 errors

Real-time Interpreter Request:

POST /api/voice-google/realtime-interpreter
Content-Type: multipart/form-data
Body: {
  audioFile: <file>,
  fromLanguage: "de-DE",
  toLanguage: "en-US",
  connectionId: "optional-connection-id"
}
  • audioFile is required
  • fromLanguage is required (language code with region)
  • toLanguage is required (language code with region)
  • connectionId is optional
  • Handle same errors as speech-to-text

Text-to-Speech Request:

POST /api/voice-google/text-to-speech
Content-Type: multipart/form-data
Body: {
  text: "Text to speak",
  language: "de-DE",
  voice: "de-DE-Wavenet-A"
}
  • text is required (non-empty string)
  • language is required (language code with region)
  • voice is optional (voice name, defaults to system default)
  • Response is audio file (audio/mpeg) with headers:
    • X-Voice-Name: Voice name used
    • X-Language-Code: Language code used
  • Handle 400 (empty text), 401, 500 errors

Get Languages Request:

GET /api/voice-google/languages
  • Returns: {success: true, languages: [...]}
  • Languages array contains language objects with code and name
  • Handle 400, 401, 500 errors

Get Voices Request:

GET /api/voice-google/voices?language_code=de-DE
  • language_code query parameter is optional
  • If provided, filters voices by language
  • Returns: {success: true, voices: [...], language_filter: "de-DE"}
  • Voices array contains voice objects with name, language, gender, etc.
  • Handle 400, 401, 500 errors

Get Settings Request:

GET /api/voice-google/settings
  • Returns: {success: true, data: {user_settings, default_settings}}
  • user_settings may be null if no settings exist
  • default_settings always present
  • Handle 401, 500 errors

Save Settings Request:

POST /api/voice-google/settings
Content-Type: application/json
Body: {
  "sttLanguage": "de-DE",
  "ttsLanguage": "de-DE",
  "ttsVoice": "de-DE-Wavenet-A",
  "translationEnabled": true,
  "targetLanguage": "en-US"
}
  • Required fields: sttLanguage, ttsLanguage, ttsVoice
  • Optional fields: translationEnabled (default: true), targetLanguage (default: "en-US")
  • Returns: {success: true, message: "...", data: settings}
  • Handle 400 (missing required fields), 401, 500 errors

WebSocket Connection:

WebSocket: /api/voice-google/ws/speech-to-text?userId={userId}&language={language}
WebSocket: /api/voice-google/ws/text-to-speech?userId={userId}&language={language}&voice={voice}
  • Query parameters: userId, language, voice (for TTS)
  • Backend sends connection confirmation on connect
  • Client sends messages: {type: "audio_chunk", data: base64, timestamp} or {type: "text_to_speak", text: "..."}
  • Backend sends messages: {type: "transcription_result", text, confidence, is_final} or {type: "audio_data", audio: base64, format: "mp3"}
  • Handle connection errors, processing errors
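
Building the documented WebSocket URLs can be sketched as follows; `buildWsUrl` is an illustrative helper, and the base URL is an assumed example.

```javascript
// Compose a WebSocket URL from base, path, and query parameters.
// URLSearchParams handles encoding of language codes and IDs.
function buildWsUrl(base, path, params) {
  const qs = new URLSearchParams(params).toString();
  return `${base}${path}?${qs}`;
}
```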

Response Handling

Speech-to-Text Response:

{
  "success": true,
  "text": "Transcribed text here",
  "confidence": 0.95,
  "language": "de-DE",
  "audio_info": {
    "size": 12345,
    "format": "webm",
    "estimated_duration": 3.5
  }
}

Translation Response:

{
  "success": true,
  "original_text": "Original text",
  "translated_text": "Translated text",
  "source_language": "de",
  "target_language": "en"
}

Real-time Interpreter Response:

{
  "success": true,
  "original_text": "Original transcribed text",
  "translated_text": "Translated text",
  "confidence": 0.95,
  "source_language": "de-DE",
  "target_language": "en-US",
  "audio_info": {
    "size": 12345,
    "format": "webm",
    "estimated_duration": 3.5
  }
}

Text-to-Speech Response:

  • Binary audio file (audio/mpeg)
  • Headers:
    • Content-Type: audio/mpeg
    • Content-Disposition: attachment; filename=speech.mp3
    • X-Voice-Name: de-DE-Wavenet-A
    • X-Language-Code: de-DE

Languages Response:

{
  "success": true,
  "languages": [
    {
      "code": "de-DE",
      "name": "German (Germany)"
    },
    ...
  ]
}

Voices Response:

{
  "success": true,
  "voices": [
    {
      "name": "de-DE-Wavenet-A",
      "language": "de-DE",
      "gender": "FEMALE",
      "ssml_gender": "FEMALE"
    },
    ...
  ],
  "language_filter": "de-DE"
}

Settings Response:

{
  "success": true,
  "data": {
    "user_settings": {
      "sttLanguage": "de-DE",
      "ttsLanguage": "de-DE",
      "ttsVoice": "de-DE-Wavenet-A",
      "translationEnabled": true,
      "targetLanguage": "en-US"
    },
    "default_settings": {
      "sttLanguage": "de-DE",
      "ttsLanguage": "de-DE",
      "ttsVoice": "de-DE-Wavenet-A",
      "translationEnabled": true,
      "targetLanguage": "en-US"
    }
  }
}

Error Responses:

  • 400 Bad Request → Display validation errors from response detail field
  • 401 Unauthorized → Show authentication error, redirect to login
  • 500 Internal Server Error → Show generic error message

Field and Attribute Reference

Voice Settings Fields

The following fields are used for voice settings. These are not provided by a backend attributes endpoint but are defined by the API contract:

Required Fields:

  • sttLanguage - Speech-to-Text language (text/select, editable, required, visible)
    • Format: Language code with region (e.g., "de-DE", "en-US")
    • Default: "de-DE"
  • ttsLanguage - Text-to-Speech language (text/select, editable, required, visible)
    • Format: Language code with region (e.g., "de-DE", "en-US")
    • Default: "de-DE"
  • ttsVoice - Text-to-Speech voice name (text/select, editable, required, visible)
    • Format: Voice identifier (e.g., "de-DE-Wavenet-A")
    • Default: "de-DE-Wavenet-A"
    • Options populated from /api/voice-google/voices?language_code={ttsLanguage}

Optional Fields:

  • translationEnabled - Enable translation features (checkbox, editable, not required, visible)
    • Type: boolean
    • Default: true
  • targetLanguage - Target language for translation (text/select, editable, not required, visible)
    • Format: Language code with region (e.g., "en-US", "fr-FR")
    • Default: "en-US"
    • Shown when translationEnabled is true

Language and Voice Data Structures

Language Object:

  • code - Language code (e.g., "de-DE", "en-US")
  • name - Display name (e.g., "German (Germany)", "English (United States)")

Voice Object:

  • name - Voice identifier (e.g., "de-DE-Wavenet-A")
  • language - Language code (e.g., "de-DE")
  • gender - Voice gender (e.g., "FEMALE", "MALE", "NEUTRAL")
  • ssml_gender - SSML gender identifier
  • Additional metadata may be available

Dynamic Rendering Guidelines

Settings-Driven Defaults

All voice operations should use user settings as defaults:

  1. On Page Load:

    • Call GET /api/voice-google/settings
    • Store settings in component state
    • Pre-populate all selectors with settings values
    • Use default_settings if user_settings is null
  2. Language Selectors:

    • Speech-to-Text: Default to sttLanguage from settings
    • Text-to-Speech: Default to ttsLanguage from settings
    • Translation Source: Default to language derived from sttLanguage (remove region code)
    • Translation Target: Default to targetLanguage from settings
  3. Voice Selector:

    • Default to ttsVoice from settings
    • When language changes, fetch voices for new language
    • If default voice not available in new language, select first available
  4. After Settings Update:

    • Refresh settings from backend
    • Update all selectors with new defaults
    • Show success message
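
The default-resolution steps above can be sketched as small pure helpers. This is an illustrative sketch, assuming the settings response contains `user_settings` and `default_settings` objects as described earlier; the function names are not part of the API contract.

```javascript
// Resolve effective settings: prefer user_settings, fall back to
// default_settings when the user has not saved settings yet.
// A shallow merge is enough because all settings fields are flat.
function resolveSettings(response) {
  const base = response.default_settings || {};
  const user = response.user_settings || {};
  return { ...base, ...user };
}

// Derive the translation source language from sttLanguage by
// stripping the region code ("de-DE" -> "de").
function sourceLanguageFromStt(sttLanguage) {
  return sttLanguage.split('-')[0];
}
```

The merged object can then be used to pre-populate every selector on page load.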

Audio Recording Implementation

MediaRecorder API:

  1. Request microphone access: navigator.mediaDevices.getUserMedia({ audio: true })
  2. Create MediaRecorder with appropriate MIME type:
    • Prefer WebM: new MediaRecorder(stream, { mimeType: 'audio/webm' })
    • Fallback to browser default
  3. Start recording: mediaRecorder.start()
  4. Capture data chunks: mediaRecorder.ondataavailable
  5. Stop recording: mediaRecorder.stop()
  6. Convert chunks to Blob: new Blob(chunks, { type: 'audio/webm' })
  7. Create File object for upload: new File([blob], 'recording.webm', { type: 'audio/webm' })
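
The seven steps above can be sketched as follows. This is a browser-only sketch (permission errors are left to the caller, per the Error Handling notes below); `stopSignal` is an illustrative way for the UI to end the recording, and the MIME-type check is injected so it can be exercised outside a browser.

```javascript
// Choose MediaRecorder options: prefer WebM, fall back to the
// browser default. The support check is injected for testability.
function pickRecorderOptions(isTypeSupported) {
  return isTypeSupported('audio/webm') ? { mimeType: 'audio/webm' } : undefined;
}

// Record microphone audio and resolve with a File ready for upload.
// Browser-only: relies on getUserMedia and MediaRecorder.
// `stopSignal` is a Promise the caller resolves to stop recording.
async function recordAudio(stopSignal) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const options = pickRecorderOptions((t) => MediaRecorder.isTypeSupported(t));
  const recorder = new MediaRecorder(stream, options);
  const chunks = [];
  recorder.ondataavailable = (e) => { if (e.data.size > 0) chunks.push(e.data); };

  const done = new Promise((resolve) => {
    recorder.onstop = () => {
      stream.getTracks().forEach((t) => t.stop()); // release the microphone
      const blob = new Blob(chunks, { type: 'audio/webm' });
      resolve(new File([blob], 'recording.webm', { type: 'audio/webm' }));
    };
  });

  recorder.start();
  stopSignal.then(() => recorder.stop());
  return done;
}
```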

Recording Indicators:

  • Show visual indicator (red dot, pulsing animation)
  • Display timer (MM:SS format)
  • Show audio level/waveform (optional, using AudioContext API)
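
The MM:SS timer can be implemented with a small formatter, for example:

```javascript
// Format elapsed seconds as MM:SS for the recording timer display.
function formatTimer(totalSeconds) {
  const m = Math.floor(totalSeconds / 60);
  const s = Math.floor(totalSeconds % 60);
  return `${String(m).padStart(2, '0')}:${String(s).padStart(2, '0')}`;
}
```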

Error Handling:

  • Microphone permission denied → Show clear error message with instructions
  • No microphone available → Disable recording, show message
  • Recording errors → Show error, allow retry

Language and Voice Selection

Language Dropdowns:

  1. Optionally fetch languages from /api/voice-google/languages on page load
  2. Cache languages in component state
  3. Display languages with code and name
  4. Allow search/filter
  5. Use settings default as initial selection
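
The fetch-and-cache flow plus the search/filter behavior can be sketched as below. This assumes the `/api/voice-google/languages` endpoint returns an array of Language objects (`{ code, name }`, per the reference above); if the backend wraps the list in an envelope, adjust accordingly.

```javascript
// Cache the language list after the first fetch; later calls reuse it.
let languageCache = null;

async function loadLanguages() {
  if (!languageCache) {
    const res = await fetch('/api/voice-google/languages');
    if (!res.ok) throw new Error(`Failed to load languages: ${res.status}`);
    languageCache = await res.json();
  }
  return languageCache;
}

// Case-insensitive search over code and display name for the dropdown.
function filterLanguages(languages, query) {
  const q = query.trim().toLowerCase();
  if (!q) return languages;
  return languages.filter(
    (l) => l.code.toLowerCase().includes(q) || l.name.toLowerCase().includes(q)
  );
}
```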

Voice Dropdown:

  1. When TTS language changes:
    • Call GET /api/voice-google/voices?language_code={selectedLanguage}
    • Update dropdown with filtered voices
    • Select default voice if available, otherwise first voice
  2. Display voice name and gender/type
  3. Allow search/filter
  4. Show loading state while fetching voices
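
The "default voice if available, otherwise first voice" rule is a small pure function over the Voice objects described above:

```javascript
// Given the voices available for the newly selected language, keep the
// user's default voice if it is still available; otherwise fall back to
// the first voice in the list (or null if the language has no voices).
function chooseVoice(voices, defaultVoiceName) {
  const match = voices.find((v) => v.name === defaultVoiceName);
  return match || voices[0] || null;
}
```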

WebSocket Implementation

Connection Management:

  1. Create WebSocket connection with query parameters
  2. Handle connection events:
    • onopen → Show connected status
    • onmessage → Process incoming messages
    • onerror → Show error, allow retry
    • onclose → Show disconnected status, cleanup
  3. Send keep-alive pings periodically (every 30 seconds)
  4. Handle reconnection on disconnect
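
A minimal sketch of the connection lifecycle, assuming the ping message shape shown in the Message Handling notes (`{ type: 'ping', timestamp }`); the URL, status labels, and reconnect policy are left to the caller.

```javascript
// Build a WebSocket URL with query parameters (e.g. language, voice).
function buildWsUrl(base, params) {
  const qs = new URLSearchParams(params).toString();
  return qs ? `${base}?${qs}` : base;
}

// Open a connection with status callbacks and a 30-second keep-alive ping.
function connect(url, { onStatus, onMessage }) {
  const ws = new WebSocket(url);
  let pingTimer = null;
  ws.onopen = () => {
    onStatus('connected');
    pingTimer = setInterval(
      () => ws.send(JSON.stringify({ type: 'ping', timestamp: Date.now() })),
      30000
    );
  };
  ws.onmessage = (e) => onMessage(JSON.parse(e.data));
  ws.onerror = () => onStatus('error');
  ws.onclose = () => {
    clearInterval(pingTimer); // stop pinging a dead connection
    onStatus('disconnected'); // the caller may schedule a reconnect here
  };
  return ws;
}
```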

Message Handling:

  1. Parse incoming JSON messages
  2. Handle message types:
    • connected → Update connection status
    • transcription_result → Update transcription display (interim vs final)
    • audio_data → Decode base64, create blob, play audio
    • error → Show error message
    • pong → Update last ping time
  3. Send messages:
    • audio_chunk → Encode audio to base64, include timestamp
    • text_to_speak → Include text
    • ping → Include timestamp
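
The incoming-message handling can be centralized in a dispatcher. The message types follow the list above, but the payload field names (`is_final`, `text`, `audio`, `message`, `timestamp`) are assumptions for illustration; use whatever field names the backend actually sends.

```javascript
// Dispatch an incoming WebSocket message to the matching handler.
// Unknown types are routed to the error handler.
function handleMessage(msg, handlers) {
  switch (msg.type) {
    case 'connected':
      return handlers.onConnected(msg);
    case 'transcription_result':
      // Interim results update in place; final results are appended.
      return msg.is_final ? handlers.onFinal(msg.text) : handlers.onInterim(msg.text);
    case 'audio_data':
      return handlers.onAudio(msg.audio); // base64-encoded audio chunk
    case 'error':
      return handlers.onError(msg.message);
    case 'pong':
      return handlers.onPong(msg.timestamp);
    default:
      return handlers.onError(`Unknown message type: ${msg.type}`);
  }
}
```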

Audio Chunking (Speech-to-Text):

  1. Capture audio chunks from MediaRecorder
  2. Encode chunks to base64
  3. Send chunks via WebSocket as they're captured
  4. Buffer chunks if needed for reliable transmission
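
Steps 2 and 3 can be sketched as small helpers. The `audio` field name in the outgoing message is an assumption, since this document does not spell out the exact `audio_chunk` payload shape.

```javascript
// Convert binary audio data to a base64 string.
// btoa is available in browsers and in Node.js 16+.
function arrayBufferToBase64(buf) {
  const bytes = new Uint8Array(buf);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Encode a recorded audio chunk (Blob) to base64 for transmission.
async function encodeChunk(blob) {
  return arrayBufferToBase64(await blob.arrayBuffer());
}

// Wrap the encoded chunk in an audio_chunk message with a timestamp.
function audioChunkMessage(base64Audio) {
  return JSON.stringify({ type: 'audio_chunk', audio: base64Audio, timestamp: Date.now() });
}
```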

Audio Playback (Text-to-Speech):

  1. Receive base64 audio chunks
  2. Decode base64 to binary
  3. Create audio blob
  4. Queue for sequential playback or play immediately
  5. Use the HTML5 Audio API for playback
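
The queue-or-play-immediately choice in step 4 can be implemented as a small sequential queue. The play function is injected so the ordering logic stands on its own; in the page it would wrap the HTML5 Audio API (e.g. `new Audio(URL.createObjectURL(blob))`, resolving on the `ended` event).

```javascript
// Sequential playback queue for incoming TTS audio chunks.
// `play` takes one chunk and returns a Promise that resolves when
// playback of that chunk has finished.
function createPlaybackQueue(play) {
  const queue = [];
  let playing = false;

  async function drain() {
    if (playing) return;
    playing = true;
    while (queue.length > 0) {
      await play(queue.shift()); // play chunks strictly in arrival order
    }
    playing = false;
  }

  return {
    enqueue(chunk) {
      queue.push(chunk);
      drain();
    },
  };
}
```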

Form Validation

Voice Recording Section:

  • Speech-to-Text Mode: Language must be selected; audio file must exist and be in a valid format; audio file size must be within limits
  • Voice Interpretation Mode: Source and target languages must be selected; audio file must be valid

Text Processing Section:

  • Translation Mode: Text must not be empty; source and target languages must be selected; source and target languages should be different (optional validation)
  • Text-to-Speech Mode: Text must not be empty; language must be selected; voice must be selected

Settings Section:

  • sttLanguage is required
  • ttsLanguage is required
  • ttsVoice is required
  • targetLanguage is required if translationEnabled is true
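
The settings rules above translate directly into a validation helper (the function name and error strings are illustrative):

```javascript
// Validate the settings form per the rules above; returns a list of
// human-readable errors (an empty list means the form is valid).
function validateSettings(s) {
  const errors = [];
  if (!s.sttLanguage) errors.push('Speech-to-Text language is required');
  if (!s.ttsLanguage) errors.push('Text-to-Speech language is required');
  if (!s.ttsVoice) errors.push('Text-to-Speech voice is required');
  if (s.translationEnabled && !s.targetLanguage) {
    errors.push('Target language is required when translation is enabled');
  }
  return errors;
}
```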

Error Display

Error Messages:

  • Display errors prominently (red text, error icon)
  • Show specific error details from backend when available
  • Provide actionable guidance (e.g., "Please enable microphone permissions")
  • Allow retry for transient errors

Loading States:

  • Show loading spinner during API calls
  • Disable form inputs during processing
  • Show progress indicators for long operations

Audio Playback

HTML5 Audio Player:

  1. Create audio element: <audio controls src={audioUrl} />
  2. Handle audio events:
    • onloadeddata → Enable playback controls
    • onerror → Show error message
    • onended → Reset player state
  3. Provide download button: Create download link with audio blob

Audio Format Handling:

  • Support MP3 (primary format from backend)
  • Handle other formats if backend provides them
  • Show format information to user

Key Principles

  • Settings as Defaults: Always use user settings as defaults, but allow overrides
  • Progressive Enhancement: HTTP mode works without WebSocket, WebSocket adds real-time features
  • Error Recovery: Provide clear error messages and retry options
  • User Feedback: Show loading states, success messages, and progress indicators
  • Accessibility: Support keyboard navigation, screen readers, and proper ARIA labels
  • Performance: Cache languages and voices, debounce API calls, optimize audio processing
  • Browser Compatibility: Handle browser differences in MediaRecorder, WebSocket, and audio APIs
  • Security: Validate all user inputs, handle authentication errors gracefully
  • User-Scoped: All settings and operations are user-specific (backend enforces this)

Summary

This document provides the complete frontend requirements for the voice service page, which enables users to interact with Google Cloud voice services through a unified interface.

Key Architecture Pattern: The voice service interface is a single page (/voice or /voice-service) organized into three main sections:

  • Text Processing Section - Supports Translation and Text-to-Speech modes (both use text input)
  • Voice Recording Section - Supports Speech-to-Text and Voice Interpretation modes (both use audio recording)
  • Settings Section - Manages user voice settings

Each section supports mode toggles to switch between different operations, and optional WebSocket real-time mode toggles for live streaming. All features share common settings but allow per-operation overrides.

Settings-Driven: All operations use user settings as defaults, loaded from /api/voice-google/settings on page load. Users can override settings per operation.

Multi-Modal: Supports both HTTP REST endpoints for file-based processing and WebSocket endpoints for real-time streaming. WebSocket modes are integrated as optional toggles within the relevant sections.

Component-Based: The page is built from reusable UI components (Language Selector, Voice Selector, Recording Controls, Text Input, Result Display, etc.) that are combined in different ways to create each section and mode.

Security Note: All operations are user-scoped. Users can only access their own settings. The backend enforces this security.