Voice Service Customer Journeys

This document describes the customer journeys for the Google Cloud voice services, focusing on how users interact with the speech-to-text, translation, text-to-speech, and real-time voice interpretation features.

Table of Contents

  1. Overview
  2. Customer Journey 1: Converting Speech to Text
  3. Customer Journey 2: Translating Text
  4. Customer Journey 3: Real-time Voice Interpretation
  5. Customer Journey 4: Converting Text to Speech
  6. Customer Journey 5: Managing Voice Settings
  7. Customer Journey 6: Discovering Available Languages and Voices
  8. Customer Journey 7: Real-time Speech-to-Text via WebSocket
  9. Customer Journey 8: Real-time Text-to-Speech via WebSocket

Overview

The voice service routes (/api/voice-google) enable users to interact with Google Cloud voice services including Speech-to-Text, Translation, Text-to-Speech, and real-time voice interpretation. These routes support both HTTP REST endpoints for file-based processing and WebSocket endpoints for real-time streaming.

Key Principles:

  • User-Centric: Documentation organized around what users want to accomplish
  • Multi-Modal: Supports both file upload and real-time streaming
  • Language-Aware: All operations support multiple languages with user-configurable defaults
  • Settings-Driven: User preferences stored and applied automatically
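
As a reading aid, the routes used in the journeys of this document can be collected in one place. The following is a minimal TypeScript sketch, not the project's actual client code: the names `VOICE_ROUTES` and `toWebSocketUrl` are hypothetical, and it assumes the WebSocket endpoints are served from the same origin as the REST endpoints.

```typescript
// Paths as they appear in the journeys of this document.
const VOICE_ROUTES = {
  speechToText: "/api/voice-google/speech-to-text",
  translate: "/api/voice-google/translate",
  realtimeInterpreter: "/api/voice-google/realtime-interpreter",
  textToSpeech: "/api/voice-google/text-to-speech",
  settings: "/api/voice-google/settings",
  languages: "/api/voice-google/languages",
  voices: "/api/voice-google/voices",
  wsSpeechToText: "/api/voice-google/ws/speech-to-text",
  wsTextToSpeech: "/api/voice-google/ws/text-to-speech",
};

// Derives a WebSocket URL from an HTTP(S) origin: http -> ws, https -> wss.
// Assumption: the socket endpoints share the origin of the REST endpoints.
function toWebSocketUrl(origin: string, path: string): string {
  return origin.replace(/^http/, "ws") + path;
}
```

In a browser, `origin` would typically be `window.location.origin`.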

Customer Journey 1: Converting Speech to Text

User Goal

"I want to record my speech and convert it into text."

User Story

As a user, I want to record my speech through the frontend microphone and get the transcribed text with confidence scores and language detection.

User Story Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    
    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Click "Start Recording"
    Frontend->>Frontend: Request microphone access
    alt Microphone access denied
        Frontend-->>User: Show permission error
    else Microphone access granted
        Frontend->>Frontend: Start audio recording
        Frontend-->>User: Show recording indicator (red dot, timer)
        
        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio stream
        
        User->>Frontend: Click "Stop Recording"
        Frontend->>Frontend: Stop audio recording
        Frontend->>Frontend: Convert audio stream to audio file
        Frontend->>Frontend: Validate audio file (size, format)
        
        alt File validation fails
            Frontend-->>User: Show validation error
        else File validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/speech-to-text<br/>(audioFile, language)
            
            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Invalid audio format (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show format error message
            else Speech recognition fails (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show recognition error
            else Success (200)
                Backend-->>Frontend: {success: true, text, confidence, language, audio_info}
                Frontend->>Frontend: Display transcribed text
                Frontend->>Frontend: Display confidence score
                Frontend->>Frontend: Display detected language
                Frontend->>Frontend: Display audio metadata
                Frontend-->>User: Show transcription result with metadata
            end
        end
    end
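
The validation and display steps of this flow could look roughly like the TypeScript sketch below. The size limit and the list of accepted formats are assumptions, since this document does not state the real rules, and the confidence score is assumed to be a fraction between 0 and 1. The function and type names are hypothetical.

```typescript
// Hypothetical limits: this document does not state the real size or format rules.
const MAX_AUDIO_BYTES = 10 * 1024 * 1024;
const ALLOWED_AUDIO_TYPES = ["audio/webm", "audio/wav", "audio/ogg", "audio/mpeg"];

// Minimal structural type; a browser File or Blob satisfies it.
interface AudioFileInfo {
  size: number; // bytes
  type: string; // MIME type, possibly with parameters
}

// Returns a user-facing error message, or null when the file passes.
function validateAudioFile(file: AudioFileInfo): string | null {
  if (file.size === 0) return "The recording is empty.";
  if (file.size > MAX_AUDIO_BYTES) return "The recording is too large.";
  // Recorders may append codec parameters, e.g. "audio/webm;codecs=opus".
  const baseType = file.type.split(";")[0].trim().toLowerCase();
  if (ALLOWED_AUDIO_TYPES.indexOf(baseType) === -1) {
    return "Unsupported audio format: " + (baseType || "unknown");
  }
  return null;
}

// Success payload fields as listed in the diagram; audio_info is left opaque.
interface SpeechToTextResult {
  success: true;
  text: string;
  confidence: number; // assumed to be a fraction between 0 and 1
  language: string;
  audio_info: unknown;
}

// Formats the metadata line shown next to the transcription.
function formatTranscriptionMeta(result: SpeechToTextResult): string {
  const percent = Math.round(result.confidence * 100);
  return result.language + ", " + percent + "% confidence";
}
```

Running the validation before the upload lets the frontend show the "validation error" branch without a round trip to the backend.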

Customer Journey 2: Translating Text

User Goal

"I want to translate text from one language to another."

User Story

As a user, I want to enter text and translate it between languages, seeing both the original and translated text.

User Story Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    
    User->>Frontend: Enter text to translate
    User->>Frontend: Select source language (default: de)
    User->>Frontend: Select target language (default: en)
    User->>Frontend: Click "Translate"
    
    Frontend->>Frontend: Validate text is not empty
    alt Text is empty
        Frontend-->>User: Show validation error
    else Text validation passes
        Frontend->>Frontend: Show loading state
        Frontend->>Backend: POST /api/voice-google/translate<br/>(text, sourceLanguage, targetLanguage)
        
        alt Authentication fails (401)
            Backend-->>Frontend: 401 Unauthorized
            Frontend-->>User: Show authentication error
        else Empty text (400)
            Backend-->>Frontend: 400 Bad Request
            Frontend-->>User: Show empty text error
        else Translation fails (400)
            Backend-->>Frontend: 400 Bad Request + error details
            Frontend-->>User: Show translation error
        else Success (200)
            Backend-->>Frontend: {success: true, original_text, translated_text, source_language, target_language}
            Frontend->>Frontend: Display original text
            Frontend->>Frontend: Display translated text
            Frontend->>Frontend: Display language information
            Frontend-->>User: Show translation result
        end
    end
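
The branches of this flow can be sketched as two small TypeScript helpers: one that builds the request fields, and one that maps the HTTP status to a UI outcome. This is an illustration under assumptions: the helper names are hypothetical, and `error` is an assumed field name for the "error details" of a 400 response, which this document does not specify.

```typescript
// Request fields as named in the diagram.
interface TranslateRequest {
  text: string;
  sourceLanguage: string;
  targetLanguage: string;
}

// Returns null for empty text so the frontend can show its validation
// error without calling the backend.
function buildTranslateRequest(
  text: string,
  sourceLanguage: string = "de",
  targetLanguage: string = "en"
): TranslateRequest | null {
  if (text.trim().length === 0) return null;
  return { text, sourceLanguage, targetLanguage };
}

type TranslateOutcome =
  | { kind: "success"; original: string; translated: string }
  | { kind: "auth_error" }
  | { kind: "bad_request"; detail: string }
  | { kind: "unexpected"; status: number };

// Reads a string property from an untyped JSON value.
function readString(body: unknown, key: string): string | undefined {
  if (typeof body !== "object" || body === null) return undefined;
  const value = (body as Record<string, unknown>)[key];
  return typeof value === "string" ? value : undefined;
}

// Maps status code and parsed body to one of the diagram's branches.
function interpretTranslateResponse(status: number, body: unknown): TranslateOutcome {
  if (status === 401) return { kind: "auth_error" };
  if (status === 400) {
    // "error" is a hypothetical field name for the error details.
    return { kind: "bad_request", detail: readString(body, "error") || "Translation failed." };
  }
  const original = readString(body, "original_text");
  const translated = readString(body, "translated_text");
  if (status === 200 && original !== undefined && translated !== undefined) {
    return { kind: "success", original, translated };
  }
  return { kind: "unexpected", status };
}
```

The `unexpected` outcome is an addition of this sketch; it covers status codes that the diagram does not list.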

Customer Journey 3: Real-time Voice Interpretation

User Goal

"I want to speak in one language and get the translated text in another language."

User Story

As a user, I want to record my speech through the frontend microphone and receive both the transcribed text in the source language and its translation in the target language.

User Story Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    
    User->>Frontend: Select source language (default: de-DE)
    User->>Frontend: Select target language (default: en-US)
    User->>Frontend: Click "Start Recording"
    Frontend->>Frontend: Request microphone access
    alt Microphone access denied
        Frontend-->>User: Show permission error
    else Microphone access granted
        Frontend->>Frontend: Start audio recording
        Frontend-->>User: Show recording indicator (red dot, timer)
        
        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio stream
        
        User->>Frontend: Click "Stop Recording"
        Frontend->>Frontend: Stop audio recording
        Frontend->>Frontend: Convert audio stream to audio file
        Frontend->>Frontend: Validate audio file
        
        alt File validation fails
            Frontend-->>User: Show validation error
        else File validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/realtime-interpreter<br/>(audioFile, fromLanguage, toLanguage, connectionId?)
            
            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Invalid audio format (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show format error
            else Interpretation fails (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show interpretation error
            else Success (200)
                Backend-->>Frontend: {success: true, original_text, translated_text, confidence, source_language, target_language, audio_info}
                Frontend->>Frontend: Display original transcribed text
                Frontend->>Frontend: Display translated text
                Frontend->>Frontend: Display confidence score
                Frontend->>Frontend: Display language information
                Frontend->>Frontend: Display audio metadata
                Frontend-->>User: Show interpretation result with both texts
            end
        end
    end
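
Because the success payload of this journey carries seven fields, a runtime check before rendering can be useful. The following TypeScript sketch shows one way to do that, together with a helper that collects the non-file request fields and only includes `connectionId` when it is set. The helper names are hypothetical, and the field types are inferred from the names in the diagram rather than confirmed by it.

```typescript
// Success payload fields as listed in the diagram; audio_info is left opaque.
interface InterpretationResult {
  success: true;
  original_text: string;
  translated_text: string;
  confidence: number;
  source_language: string;
  target_language: string;
  audio_info?: unknown;
}

const INTERPRETATION_STRING_FIELDS = [
  "original_text",
  "translated_text",
  "source_language",
  "target_language",
];

// Runtime check that a parsed JSON value has the documented shape.
function isInterpretationResult(value: unknown): value is InterpretationResult {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  if (record["success"] !== true || typeof record["confidence"] !== "number") return false;
  return INTERPRETATION_STRING_FIELDS.every((field) => typeof record[field] === "string");
}

// Collects the request fields that accompany the audio file.
// connectionId is optional in the diagram, so it is omitted when undefined.
function interpreterFields(
  fromLanguage: string = "de-DE",
  toLanguage: string = "en-US",
  connectionId?: string
): Record<string, string> {
  const fields: Record<string, string> = { fromLanguage, toLanguage };
  if (connectionId !== undefined) fields["connectionId"] = connectionId;
  return fields;
}
```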

Customer Journey 4: Converting Text to Speech

User Goal

"I want to convert written text into spoken audio."

User Story

As a user, I want to enter text, select a language and voice, and receive an audio file of the spoken text.

User Story Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    
    User->>Frontend: Enter text to speak
    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Select voice (optional, default from settings)
    User->>Frontend: Click "Generate Speech"
    
    Frontend->>Frontend: Validate text is not empty
    alt Text is empty
        Frontend-->>User: Show validation error
    else Text validation passes
        Frontend->>Frontend: Show loading state
        Frontend->>Backend: POST /api/voice-google/text-to-speech<br/>(text, language, voice?)
        
        alt Authentication fails (401)
            Backend-->>Frontend: 401 Unauthorized
            Frontend-->>User: Show authentication error
        else Empty text (400)
            Backend-->>Frontend: 400 Bad Request
            Frontend-->>User: Show empty text error
        else Text-to-Speech fails (400)
            Backend-->>Frontend: 400 Bad Request + error details
            Frontend-->>User: Show TTS error
        else Success (200)
            Backend-->>Frontend: Audio file (audio/mpeg)<br/>Headers: X-Voice-Name, X-Language-Code
            Frontend->>Frontend: Create audio blob from response
            Frontend->>Frontend: Create download link or audio player
            Frontend->>Frontend: Display voice name and language
            Frontend-->>User: Show audio player with download option
        end
    end
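
The success branch reads two custom headers alongside the audio body. A minimal TypeScript sketch of that step follows; it takes a header lookup function instead of a concrete response object so that it stays independent of the HTTP client. `describeSpeechAudio`, `HeaderLookup`, and the download file name scheme are hypothetical.

```typescript
// Lookup function for response headers. With fetch this could be
// (name) => response.headers.get(name).
type HeaderLookup = (name: string) => string | null;

interface SpeechAudioInfo {
  voiceName: string | null;
  languageCode: string | null;
  downloadName: string;
}

// Returns display metadata for an audio/mpeg response, or null when the
// response is not the audio payload described in the diagram.
function describeSpeechAudio(getHeader: HeaderLookup): SpeechAudioInfo | null {
  const contentType = (getHeader("Content-Type") || "").split(";")[0].trim().toLowerCase();
  if (contentType !== "audio/mpeg") return null;
  const voiceName = getHeader("X-Voice-Name");
  const languageCode = getHeader("X-Language-Code");
  // Hypothetical file name scheme for the download link.
  const downloadName = "speech-" + (languageCode || "audio") + ".mp3";
  return { voiceName, languageCode, downloadName };
}
```

If the frontend and the backend are served from different origins, the browser only exposes custom headers such as X-Voice-Name to scripts when the server lists them in `Access-Control-Expose-Headers`.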

Customer Journey 5: Managing Voice Settings

User Goal

"I want to configure my default voice settings for speech recognition, translation, and text-to-speech."

User Story

As a user, I want to view and update my voice settings including default languages, voice preferences, and translation settings.

User Story Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    
    User->>Frontend: Navigate to voice settings
    Frontend->>Backend: GET /api/voice-google/settings
    Backend-->>Frontend: {success: true, data: {user_settings, default_settings}}
    Frontend->>Frontend: Pre-populate form with user_settings, falling back to default_settings
    Frontend-->>User: Display settings form
    
    alt User views settings only
        User->>Frontend: View current settings
        Frontend-->>User: Display settings (read-only or editable)
    else User updates settings
        User->>Frontend: Modify settings (sttLanguage, ttsLanguage, ttsVoice, translationEnabled, targetLanguage)
        User->>Frontend: Click "Save Settings"
        
        Frontend->>Frontend: Validate required fields (sttLanguage, ttsLanguage, ttsVoice)
        alt Validation fails
            Frontend-->>User: Show validation errors
        else Validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/settings<br/>(settings object)
            
            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Missing required field (400)
                Backend-->>Frontend: 400 Bad Request + field name
                Frontend-->>User: Show missing field error
            else Save fails (500)
                Backend-->>Frontend: 500 Internal Server Error
                Frontend-->>User: Show save error
            else Success (200)
                Backend-->>Frontend: {success: true, message, data: settings}
                Frontend->>Frontend: Update UI with saved settings
                Frontend->>Frontend: Show success message
                Frontend-->>User: Display confirmation and updated settings
            end
        end
    end
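
The pre-population and validation steps of this flow could be sketched in TypeScript as follows. This assumes that `user_settings` may be missing or partial and that `default_settings` is always complete; neither is confirmed by this document. The type and function names are hypothetical, while the setting names come from the diagram.

```typescript
// Setting names as listed in the diagram.
interface VoiceSettings {
  sttLanguage: string;
  ttsLanguage: string;
  ttsVoice: string;
  translationEnabled: boolean;
  targetLanguage: string;
}

// Assumed shape of the `data` object returned by GET /api/voice-google/settings.
interface SettingsData {
  user_settings: Partial<VoiceSettings> | null;
  default_settings: VoiceSettings;
}

const REQUIRED_FIELDS: Array<keyof VoiceSettings> = ["sttLanguage", "ttsLanguage", "ttsVoice"];

// Starts from the defaults and overlays whatever the user has saved.
function initialFormValues(data: SettingsData): VoiceSettings {
  const saved: Partial<VoiceSettings> = data.user_settings || {};
  return { ...data.default_settings, ...saved };
}

// Returns the names of required fields that are missing or blank.
function missingRequiredFields(settings: VoiceSettings): string[] {
  return REQUIRED_FIELDS.filter((field) => {
    const value = settings[field];
    return typeof value !== "string" || value.trim() === "";
  });
}
```

An empty result from `missingRequiredFields` corresponds to the "Validation passes" branch; otherwise the returned names can be used to mark the offending form fields.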

Customer Journey 6: Discovering Available Languages and Voices

User Goal

"I want to see what languages and voices are available for speech and translation services."

User Story

As a user, I want to browse available languages and filter voices by language to configure my preferences.

User Story Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    
    opt User wants to see available languages
        User->>Frontend: Navigate to language selection
        Frontend->>Backend: GET /api/voice-google/languages
        Backend-->>Frontend: {success: true, languages: [...]}
        Frontend->>Frontend: Display language list
        Frontend-->>User: Show available languages
    end
    
    opt User wants to see available voices
        User->>Frontend: Navigate to voice selection
        User->>Frontend: Optionally select language filter
        Frontend->>Backend: GET /api/voice-google/voices<br/>?language_code=de-DE (optional)
        Backend-->>Frontend: {success: true, voices: [...], language_filter: "de-DE"}
        Frontend->>Frontend: Display voice list
        Frontend->>Frontend: Group voices by language if no filter
        Frontend-->>User: Show available voices
    end
    
    opt User filters voices by language
        User->>Frontend: Select language in filter
        Frontend->>Backend: GET /api/voice-google/voices?language_code=selected
        Backend-->>Frontend: Filtered voices list
        Frontend->>Frontend: Update voice list
        Frontend-->>User: Show filtered voices
    end
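
The optional `language_code` filter and the "group voices by language" step can be sketched in TypeScript as shown below. The shape of a voice entry is an assumption (this document only shows `voices: [...]`), and the names `VoiceEntry`, `voicesUrl`, and `groupVoicesByLanguage` are hypothetical.

```typescript
// Hypothetical voice shape: this document does not list the fields of a voice entry.
interface VoiceEntry {
  name: string;
  language_code: string;
}

// Builds the voices URL; the filter is only appended when a language is selected.
function voicesUrl(languageCode?: string): string {
  const base = "/api/voice-google/voices";
  return languageCode ? base + "?language_code=" + encodeURIComponent(languageCode) : base;
}

// Groups an unfiltered voice list by language code, preserving input order.
function groupVoicesByLanguage(voices: VoiceEntry[]): Record<string, VoiceEntry[]> {
  // A prototype-free object avoids collisions with inherited keys.
  const groups: Record<string, VoiceEntry[]> = Object.create(null);
  for (const voice of voices) {
    const bucket = groups[voice.language_code] || [];
    bucket.push(voice);
    groups[voice.language_code] = bucket;
  }
  return groups;
}
```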

Customer Journey 7: Real-time Speech-to-Text via WebSocket

User Goal

"I want to get real-time transcription as I speak, without uploading a complete audio file."

User Story

As a user, I want to establish a WebSocket connection and stream audio chunks to receive live transcription results.

User Story Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    
    User->>Frontend: Initiate real-time speech-to-text
    User->>Frontend: Select language (default: de-DE)
    Frontend->>Frontend: Start audio capture
    Frontend->>Backend: WebSocket: /api/voice-google/ws/speech-to-text<br/>?userId=user&language=de-DE
    
    Backend->>Backend: Accept WebSocket connection
    Backend->>Backend: Initialize voice interface
    Backend-->>Frontend: {type: "connected", connection_id, message}
    Frontend-->>User: Show "Connected" status
    
    loop User speaks
        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio chunk
        Frontend->>Frontend: Encode audio chunk to base64
        Frontend->>Backend: {type: "audio_chunk", data: base64_audio, timestamp}
        
        alt Processing error
            Backend->>Backend: Log error
            Backend-->>Frontend: {type: "error", error: "..."}
            Frontend-->>User: Show error message
        else Success
            Backend->>Backend: Process audio chunk (Speech-to-Text)
            Backend-->>Frontend: {type: "transcription_result", text, confidence, is_final}
            Frontend->>Frontend: Update transcription display
            Frontend-->>User: Show live transcription (interim or final)
        end
    end
    
    opt User sends ping
        User->>Frontend: Keep-alive ping
        Frontend->>Backend: {type: "ping", timestamp}
        Backend-->>Frontend: {type: "pong", timestamp}
    end
    
    opt User stops or disconnects
        User->>Frontend: Stop recording
        Frontend->>Backend: Close WebSocket
        Backend->>Backend: Cleanup connection
        Frontend-->>User: Show disconnected status
    end
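
On the frontend, the message handling in this loop can be modelled as a pure reducer over the server messages. The TypeScript sketch below is an illustration under assumptions: it treats each interim result as replacing the previous one and each final result as appended to the committed text, which this document does not specify. The type of `timestamp` and all function and type names are hypothetical.

```typescript
// Server messages named in the diagram.
type SttServerMessage =
  | { type: "connected"; connection_id: string; message: string }
  | { type: "transcription_result"; text: string; confidence: number; is_final: boolean }
  | { type: "error"; error: string }
  | { type: "pong"; timestamp: number | string };

interface TranscriptState {
  committed: string; // text from final results
  interim: string; // latest non-final hypothesis
  lastError: string | null;
}

const EMPTY_TRANSCRIPT: TranscriptState = { committed: "", interim: "", lastError: null };

// Builds the socket path with the query parameters shown in the diagram.
function sttSocketPath(userId: string, language: string = "de-DE"): string {
  return (
    "/api/voice-google/ws/speech-to-text?userId=" +
    encodeURIComponent(userId) +
    "&language=" +
    encodeURIComponent(language)
  );
}

// Pure reducer: returns the next display state for one raw socket message.
function applySttMessage(state: TranscriptState, raw: string): TranscriptState {
  let msg: SttServerMessage;
  try {
    msg = JSON.parse(raw) as SttServerMessage;
  } catch (e) {
    return { ...state, lastError: "Malformed server message" };
  }
  switch (msg.type) {
    case "transcription_result":
      if (msg.is_final) {
        const joined = state.committed ? state.committed + " " + msg.text : msg.text;
        return { committed: joined, interim: "", lastError: null };
      }
      return { ...state, interim: msg.text, lastError: null };
    case "error":
      return { ...state, lastError: msg.error };
    default:
      return state; // "connected" and "pong" do not change the transcript
  }
}
```

Keeping the reducer free of WebSocket calls means it can be wired to a socket's message handler and also exercised with plain strings.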

Customer Journey 8: Real-time Text-to-Speech via WebSocket

User Goal

"I want to send text and receive audio streams in real-time without waiting for a complete file."

User Story

As a user, I want to establish a WebSocket connection, send text messages, and receive audio chunks for playback.

User Story Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    
    User->>Frontend: Initiate real-time text-to-speech
    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Select voice (default: de-DE-Wavenet-A)
    Frontend->>Backend: WebSocket: /api/voice-google/ws/text-to-speech<br/>?userId=user&language=de-DE&voice=de-DE-Wavenet-A
    
    Backend->>Backend: Accept WebSocket connection
    Backend-->>Frontend: {type: "connected", connection_id, message}
    Frontend-->>User: Show "Connected" status
    
    loop User sends text
        User->>Frontend: Enter text to speak
        User->>Frontend: Click "Speak" or send
        Frontend->>Backend: {type: "text_to_speak", text: "..."}
        
        alt Processing error
            Backend->>Backend: Log error
            Backend-->>Frontend: {type: "error", error: "..."}
            Frontend-->>User: Show error message
        else Success
            Backend->>Backend: Process text-to-speech
            Backend-->>Frontend: {type: "audio_data", audio: base64_audio, format: "mp3"}
            Frontend->>Frontend: Decode base64 audio
            Frontend->>Frontend: Create audio blob
            Frontend->>Frontend: Play audio or queue for playback
            Frontend-->>User: Play generated speech audio
        end
    end
    
    opt User sends ping
        User->>Frontend: Keep-alive ping
        Frontend->>Backend: {type: "ping", timestamp}
        Backend-->>Frontend: {type: "pong", timestamp}
    end
    
    opt User disconnects
        User->>Frontend: Close connection
        Frontend->>Backend: Close WebSocket
        Backend->>Backend: Cleanup connection
        Frontend-->>User: Show disconnected status
    end
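
The "Play audio or queue for playback" step implies that clips arriving while another clip is playing must wait their turn. A minimal TypeScript sketch of such a queue follows. `PlaybackQueue`, `PlayClip`, and `textToSpeakMessage` are hypothetical names, and the actual decoding and playback are left to a callback because this document does not describe them.

```typescript
// Plays one base64-encoded clip and calls onDone when playback has ended.
// In a browser this callback could decode the clip and play it through an
// audio element (an assumption, not described in this document).
type PlayClip = (audioBase64: string, onDone: () => void) => void;

// Plays clips strictly one after another, in arrival order.
class PlaybackQueue {
  private pending: string[] = [];
  private playing = false;
  private playClip: PlayClip;

  constructor(playClip: PlayClip) {
    this.playClip = playClip;
  }

  enqueue(audioBase64: string): void {
    this.pending.push(audioBase64);
    this.playNext();
  }

  private playNext(): void {
    if (this.playing) return;
    const next = this.pending.shift();
    if (next === undefined) return;
    this.playing = true;
    this.playClip(next, () => {
      this.playing = false;
      this.playNext();
    });
  }
}

// Serializes the client message shown in the diagram.
function textToSpeakMessage(text: string): string {
  return JSON.stringify({ type: "text_to_speak", text });
}
```

Each `audio_data` message would pass its `audio` field to `enqueue`, so a second sentence sent before the first has finished playing does not overlap with it.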