# Voice Service Customer Journeys

This document describes the customer journeys for the Google Cloud voice services: how users interact with the speech-to-text, translation, text-to-speech, and real-time voice interpretation features.

## Table of Contents

1. [Overview](#overview)
2. [Customer Journey 1: Converting Speech to Text](#customer-journey-1-converting-speech-to-text)
3. [Customer Journey 2: Translating Text](#customer-journey-2-translating-text)
4. [Customer Journey 3: Real-time Voice Interpretation](#customer-journey-3-real-time-voice-interpretation)
5. [Customer Journey 4: Converting Text to Speech](#customer-journey-4-converting-text-to-speech)
6. [Customer Journey 5: Managing Voice Settings](#customer-journey-5-managing-voice-settings)
7. [Customer Journey 6: Discovering Available Languages and Voices](#customer-journey-6-discovering-available-languages-and-voices)
8. [Customer Journey 7: Real-time Speech-to-Text via WebSocket](#customer-journey-7-real-time-speech-to-text-via-websocket)
9. [Customer Journey 8: Real-time Text-to-Speech via WebSocket](#customer-journey-8-real-time-text-to-speech-via-websocket)

---

## Overview

The voice service routes (`/api/voice-google`) enable users to interact with Google Cloud voice services including Speech-to-Text, Translation, Text-to-Speech, and real-time voice interpretation. These routes support both HTTP REST endpoints for file-based processing and WebSocket endpoints for real-time streaming.

**Key Principles:**
- **User-Centric**: The documentation is organized around what users want to accomplish
- **Multi-Modal**: The routes support both file upload and real-time streaming
- **Language-Aware**: All operations support multiple languages with user-configurable defaults
- **Settings-Driven**: User preferences are stored and applied automatically

---

## Customer Journey 1: Converting Speech to Text

### User Goal
"I want to record my speech and convert it into text."

### User Story
As a user, I want to record my speech through the frontend microphone and get the transcribed text with confidence scores and language detection.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Click "Start Recording"
    Frontend->>Frontend: Request microphone access
    alt Microphone access denied
        Frontend-->>User: Show permission error
    else Microphone access granted
        Frontend->>Frontend: Start audio recording
        Frontend-->>User: Show recording indicator (red dot, timer)

        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio stream

        User->>Frontend: Click "Stop Recording"
        Frontend->>Frontend: Stop audio recording
        Frontend->>Frontend: Convert audio stream to audio file
        Frontend->>Frontend: Validate audio file (size, format)

        alt File validation fails
            Frontend-->>User: Show validation error
        else File validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/speech-to-text<br/>(audioFile, language)

            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Invalid audio format (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show format error message
            else Speech recognition fails (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show recognition error
            else Success (200)
                Backend-->>Frontend: {success: true, text, confidence, language, audio_info}
                Frontend->>Frontend: Display transcribed text
                Frontend->>Frontend: Display confidence score
                Frontend->>Frontend: Display detected language
                Frontend->>Frontend: Display audio metadata
                Frontend-->>User: Show transcription result with metadata
            end
        end
    end
```
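The client-side "Validate audio file (size, format)" step could be sketched as follows. The size limit and the list of accepted MIME types are assumptions made for illustration, since this document does not state the backend's actual limits, and `validateAudioFile` is a hypothetical helper name.

```typescript
// Hypothetical limits: the real size and format rules are not stated in this document.
const MAX_AUDIO_BYTES = 10 * 1024 * 1024;
const ACCEPTED_AUDIO_TYPES = ["audio/webm", "audio/wav", "audio/ogg", "audio/mpeg"];

// Minimal shape shared by browser File and Blob objects.
interface AudioFileLike {
  size: number;
  type: string;
}

interface AudioValidation {
  ok: boolean;
  reason?: string;
}

function validateAudioFile(file: AudioFileLike): AudioValidation {
  if (file.size === 0) {
    return { ok: false, reason: "Recording is empty" };
  }
  if (file.size > MAX_AUDIO_BYTES) {
    return { ok: false, reason: "Recording is too large" };
  }
  // Drop codec parameters such as "audio/webm;codecs=opus" before comparing.
  const semicolon = file.type.indexOf(";");
  const rawType = semicolon === -1 ? file.type : file.type.slice(0, semicolon);
  const baseType = rawType.trim().toLowerCase();
  if (ACCEPTED_AUDIO_TYPES.indexOf(baseType) === -1) {
    return { ok: false, reason: "Unsupported audio format: " + baseType };
  }
  return { ok: true };
}
```

Running this check before the upload lets the frontend show the validation error without a round trip; the backend can still reject the file with a 400 if its own rules differ.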

---

## Customer Journey 2: Translating Text

### User Goal
"I want to translate text from one language to another."

### User Story
As a user, I want to enter text and translate it between languages, seeing both the original and translated text.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Enter text to translate
    User->>Frontend: Select source language (default: de)
    User->>Frontend: Select target language (default: en)
    User->>Frontend: Click "Translate"

    Frontend->>Frontend: Validate text is not empty
    alt Text is empty
        Frontend-->>User: Show validation error
    else Text validation passes
        Frontend->>Frontend: Show loading state
        Frontend->>Backend: POST /api/voice-google/translate<br/>(text, sourceLanguage, targetLanguage)

        alt Authentication fails (401)
            Backend-->>Frontend: 401 Unauthorized
            Frontend-->>User: Show authentication error
        else Empty text (400)
            Backend-->>Frontend: 400 Bad Request
            Frontend-->>User: Show empty text error
        else Translation fails (400)
            Backend-->>Frontend: 400 Bad Request + error details
            Frontend-->>User: Show translation error
        else Success (200)
            Backend-->>Frontend: {success: true, original_text, translated_text, source_language, target_language}
            Frontend->>Frontend: Display original text
            Frontend->>Frontend: Display translated text
            Frontend->>Frontend: Display language information
            Frontend-->>User: Show translation result
        end
    end
```
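A minimal sketch of the two frontend-side pieces of this flow: building the request only when the text is non-empty, and checking that a response has the success shape shown in the diagram. The field names come from the diagram; the helper names are hypothetical, and whether the backend expects JSON or form encoding is not stated here.

```typescript
interface TranslateRequest {
  text: string;
  sourceLanguage: string;
  targetLanguage: string;
}

interface TranslateSuccess {
  success: true;
  original_text: string;
  translated_text: string;
  source_language: string;
  target_language: string;
}

// Returns null for empty or whitespace-only text so the caller can show
// the validation error instead of calling the backend.
function buildTranslateRequest(
  text: string,
  sourceLanguage: string = "de",
  targetLanguage: string = "en"
): TranslateRequest | null {
  if (text.trim().length === 0) {
    return null;
  }
  return { text: text, sourceLanguage: sourceLanguage, targetLanguage: targetLanguage };
}

// Narrows an unknown JSON payload to the success shape from the diagram.
function isTranslateSuccess(payload: unknown): payload is TranslateSuccess {
  if (typeof payload !== "object" || payload === null) {
    return false;
  }
  const fields = payload as { [key: string]: unknown };
  return (
    fields["success"] === true &&
    typeof fields["original_text"] === "string" &&
    typeof fields["translated_text"] === "string" &&
    typeof fields["source_language"] === "string" &&
    typeof fields["target_language"] === "string"
  );
}
```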

---

## Customer Journey 3: Real-time Voice Interpretation

### User Goal
"I want to speak in one language and get the translated text in another language."

### User Story
As a user, I want to record my speech through the frontend microphone and receive both the transcribed text in the source language and its translation in the target language.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Select source language (default: de-DE)
    User->>Frontend: Select target language (default: en-US)
    User->>Frontend: Click "Start Recording"
    Frontend->>Frontend: Request microphone access
    alt Microphone access denied
        Frontend-->>User: Show permission error
    else Microphone access granted
        Frontend->>Frontend: Start audio recording
        Frontend-->>User: Show recording indicator (red dot, timer)

        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio stream

        User->>Frontend: Click "Stop Recording"
        Frontend->>Frontend: Stop audio recording
        Frontend->>Frontend: Convert audio stream to audio file
        Frontend->>Frontend: Validate audio file

        alt File validation fails
            Frontend-->>User: Show validation error
        else File validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/realtime-interpreter<br/>(audioFile, fromLanguage, toLanguage, connectionId?)

            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Invalid audio format (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show format error
            else Interpretation fails (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show interpretation error
            else Success (200)
                Backend-->>Frontend: {success: true, original_text, translated_text, confidence, source_language, target_language, audio_info}
                Frontend->>Frontend: Display original transcribed text
                Frontend->>Frontend: Display translated text
                Frontend->>Frontend: Display confidence score
                Frontend->>Frontend: Display language information
                Frontend->>Frontend: Display audio metadata
                Frontend-->>User: Show interpretation result with both texts
            end
        end
    end
```
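The response branches above can be reduced to a small mapping from HTTP status to user feedback. The sketch below is illustrative only: `feedbackForStatus` and the message texts are hypothetical, and the error details are passed in as a plain string because the exact shape of the backend's error body is not documented here.

```typescript
interface UiFeedback {
  kind: "result" | "auth" | "request" | "unexpected";
  message: string;
}

function feedbackForStatus(status: number, errorDetails?: string): UiFeedback {
  if (status === 200) {
    return { kind: "result", message: "Interpretation complete" };
  }
  if (status === 401) {
    return { kind: "auth", message: "Please sign in again" };
  }
  if (status === 400) {
    // Format errors and interpretation failures share status 400, so the
    // backend's error details are the only way to tell them apart.
    return {
      kind: "request",
      message: errorDetails ? errorDetails : "The recording could not be processed"
    };
  }
  // Any status the diagram does not cover is reported generically.
  return { kind: "unexpected", message: "Unexpected response (HTTP " + status + ")" };
}
```

Because two distinct failures return the same status code, a frontend that wants different messages for them has to rely on the error details rather than on the status alone.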

---

## Customer Journey 4: Converting Text to Speech

### User Goal
"I want to convert written text into spoken audio."

### User Story
As a user, I want to enter text, select a language and voice, and receive an audio file of the spoken text.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Enter text to speak
    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Select voice (optional, default from settings)
    User->>Frontend: Click "Generate Speech"

    Frontend->>Frontend: Validate text is not empty
    alt Text is empty
        Frontend-->>User: Show validation error
    else Text validation passes
        Frontend->>Frontend: Show loading state
        Frontend->>Backend: POST /api/voice-google/text-to-speech<br/>(text, language, voice?)

        alt Authentication fails (401)
            Backend-->>Frontend: 401 Unauthorized
            Frontend-->>User: Show authentication error
        else Empty text (400)
            Backend-->>Frontend: 400 Bad Request
            Frontend-->>User: Show empty text error
        else Text-to-Speech fails (400)
            Backend-->>Frontend: 400 Bad Request + error details
            Frontend-->>User: Show TTS error
        else Success (200)
            Backend-->>Frontend: Audio file (audio/mpeg)<br/>Headers: X-Voice-Name, X-Language-Code
            Frontend->>Frontend: Create audio blob from response
            Frontend->>Frontend: Create download link or audio player
            Frontend->>Frontend: Display voice name and language
            Frontend-->>User: Show audio player with download option
        end
    end
```
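On success the voice name and language arrive in response headers rather than in a JSON body. A rough sketch of reading them, assuming a reader object with a `get` method (the `Headers` object returned by `fetch` fits this shape); the fallback values and the download file name pattern are made up for illustration.

```typescript
// Structural subset of the fetch Headers interface.
interface HeaderReader {
  get(name: string): string | null;
}

interface SpeechMetadata {
  voiceName: string;
  languageCode: string;
  downloadName: string;
}

function readSpeechMetadata(headers: HeaderReader): SpeechMetadata {
  // Fallbacks are illustrative; the diagram implies both headers are present on success.
  const voiceName = headers.get("X-Voice-Name") || "unknown-voice";
  const languageCode = headers.get("X-Language-Code") || "unknown-language";
  return {
    voiceName: voiceName,
    languageCode: languageCode,
    // Hypothetical naming scheme for the download link; audio/mpeg maps to .mp3.
    downloadName: "speech-" + languageCode + "-" + voiceName + ".mp3"
  };
}
```

For cross-origin requests, browsers only expose custom headers such as these to scripts when the server lists them in `Access-Control-Expose-Headers`.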

---

## Customer Journey 5: Managing Voice Settings

### User Goal
"I want to configure my default voice settings for speech recognition, translation, and text-to-speech."

### User Story
As a user, I want to view and update my voice settings including default languages, voice preferences, and translation settings.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Navigate to voice settings
    Frontend->>Backend: GET /api/voice-google/settings
    Backend-->>Frontend: {success: true, data: {user_settings, default_settings}}
    Frontend->>Frontend: Pre-populate form with user_settings, falling back to default_settings
    Frontend-->>User: Display settings form

    alt User views settings only
        User->>Frontend: View current settings
        Frontend-->>User: Display settings (read-only or editable)
    else User updates settings
        User->>Frontend: Modify settings (sttLanguage, ttsLanguage, ttsVoice, translationEnabled, targetLanguage)
        User->>Frontend: Click "Save Settings"

        Frontend->>Frontend: Validate required fields (sttLanguage, ttsLanguage, ttsVoice)
        alt Validation fails
            Frontend-->>User: Show validation errors
        else Validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/settings<br/>(settings object)

            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Missing required field (400)
                Backend-->>Frontend: 400 Bad Request + field name
                Frontend-->>User: Show missing field error
            else Save fails (500)
                Backend-->>Frontend: 500 Internal Server Error
                Frontend-->>User: Show save error
            else Success (200)
                Backend-->>Frontend: {success: true, message, data: settings}
                Frontend->>Frontend: Update UI with saved settings
                Frontend->>Frontend: Show success message
                Frontend-->>User: Display confirmation and updated settings
            end
        end
    end
```
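The pre-population and required-field steps could look roughly like this. The setting names are taken from the "Modify settings" step; the assumption that `user_settings` and `default_settings` use those same keys, and that `user_settings` may be null or partial, is not confirmed by this document. Both helper names are hypothetical.

```typescript
interface VoiceSettings {
  sttLanguage: string;
  ttsLanguage: string;
  ttsVoice: string;
  translationEnabled: boolean;
  targetLanguage: string;
}

// Assumed shape of the `data` object returned by GET /api/voice-google/settings.
interface SettingsData {
  user_settings: Partial<VoiceSettings> | null;
  default_settings: VoiceSettings;
}

// Prefer each saved user value and fall back to the default for anything missing.
function initialFormValues(data: SettingsData): VoiceSettings {
  const user: Partial<VoiceSettings> = data.user_settings || {};
  const fallback = data.default_settings;
  return {
    sttLanguage: user.sttLanguage !== undefined ? user.sttLanguage : fallback.sttLanguage,
    ttsLanguage: user.ttsLanguage !== undefined ? user.ttsLanguage : fallback.ttsLanguage,
    ttsVoice: user.ttsVoice !== undefined ? user.ttsVoice : fallback.ttsVoice,
    translationEnabled:
      user.translationEnabled !== undefined ? user.translationEnabled : fallback.translationEnabled,
    targetLanguage: user.targetLanguage !== undefined ? user.targetLanguage : fallback.targetLanguage
  };
}

// Returns the names of the required fields that are missing or blank.
function missingRequiredFields(settings: Partial<VoiceSettings>): string[] {
  const required: Array<keyof VoiceSettings> = ["sttLanguage", "ttsLanguage", "ttsVoice"];
  const missing: string[] = [];
  for (const field of required) {
    const value = settings[field];
    if (typeof value !== "string" || value.trim().length === 0) {
      missing.push(field);
    }
  }
  return missing;
}
```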

---

## Customer Journey 6: Discovering Available Languages and Voices

### User Goal
"I want to see what languages and voices are available for speech and translation services."

### User Story
As a user, I want to browse available languages and filter voices by language to configure my preferences.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    opt User wants to see available languages
        User->>Frontend: Navigate to language selection
        Frontend->>Backend: GET /api/voice-google/languages
        Backend-->>Frontend: {success: true, languages: [...]}
        Frontend->>Frontend: Display language list
        Frontend-->>User: Show available languages
    end

    opt User wants to see available voices
        User->>Frontend: Navigate to voice selection
        User->>Frontend: Optionally select language filter
        Frontend->>Backend: GET /api/voice-google/voices<br/>?language_code=de-DE (optional)
        Backend-->>Frontend: {success: true, voices: [...], language_filter: "de-DE"}
        Frontend->>Frontend: Display voice list
        Frontend->>Frontend: Group voices by language if no filter
        Frontend-->>User: Show available voices
    end

    opt User filters voices by language
        User->>Frontend: Select language in filter
        Frontend->>Backend: GET /api/voice-google/voices?language_code=selected
        Backend-->>Frontend: {success: true, voices: [...], language_filter: selected}
        Frontend->>Frontend: Update voice list
        Frontend-->>User: Show filtered voices
    end
```
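As a sketch of the voice listing steps, the helpers below build the request path with the optional `language_code` filter and group an unfiltered voice list by language. The `Voice` shape (`name`, `language_code`) is an assumption, since the contents of the `voices` array are not specified in this document, and both function names are hypothetical.

```typescript
// Assumed shape of one entry in the `voices` array.
interface Voice {
  name: string;
  language_code: string;
}

// Builds the request path, adding the filter only when a language is selected.
function voicesPath(languageCode?: string): string {
  const base = "/api/voice-google/voices";
  if (!languageCode) {
    return base;
  }
  return base + "?language_code=" + encodeURIComponent(languageCode);
}

// Groups voices by language for display when no filter is active.
function groupVoicesByLanguage(voices: Voice[]): { [languageCode: string]: Voice[] } {
  const groups: { [languageCode: string]: Voice[] } = {};
  for (const voice of voices) {
    const existing = groups[voice.language_code];
    if (existing) {
      existing.push(voice);
    } else {
      groups[voice.language_code] = [voice];
    }
  }
  return groups;
}
```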

---

## Customer Journey 7: Real-time Speech-to-Text via WebSocket

### User Goal
"I want to get real-time transcription as I speak, without uploading a complete audio file."

### User Story
As a user, I want to establish a WebSocket connection and stream audio chunks to receive live transcription results.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Initiate real-time speech-to-text
    User->>Frontend: Select language (default: de-DE)
    Frontend->>Frontend: Start audio capture
    Frontend->>Backend: WebSocket: /api/voice-google/ws/speech-to-text<br/>?userId=user&language=de-DE

    Backend->>Backend: Accept WebSocket connection
    Backend->>Backend: Initialize voice interface
    Backend-->>Frontend: {type: "connected", connection_id, message}
    Frontend-->>User: Show "Connected" status

    loop User speaks
        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio chunk
        Frontend->>Frontend: Encode audio chunk to base64
        Frontend->>Backend: {type: "audio_chunk", data: base64_audio, timestamp}

        Backend->>Backend: Process audio chunk (Speech-to-Text)
        alt Processing error
            Backend->>Backend: Log error
            Backend-->>Frontend: {type: "error", error: "..."}
            Frontend-->>User: Show error message
        else Success
            Backend-->>Frontend: {type: "transcription_result", text, confidence, is_final}
            Frontend->>Frontend: Update transcription display
            Frontend-->>User: Show live transcription (interim or final)
        end
    end

    opt User sends ping
        User->>Frontend: Keep-alive ping
        Frontend->>Backend: {type: "ping", timestamp}
        Backend-->>Frontend: {type: "pong", timestamp}
    end

    opt User stops or disconnects
        User->>Frontend: Stop recording
        Frontend->>Backend: Close WebSocket
        Backend->>Backend: Cleanup connection
        Frontend-->>User: Show disconnected status
    end
```
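The message handling in the loop above can be sketched without any WebSocket plumbing: one function builds the `audio_chunk` message, another folds incoming messages into display state. Message types and field names follow the diagram; the numeric timestamp, the `TranscriptState` shape, and the rule of joining final results with a space are assumptions. The base64 encoder is written out by hand so the sketch also runs outside a browser, where a frontend would more likely use the built-in `btoa`.

```typescript
const BASE64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard base64 encoding of raw bytes, with "=" padding.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i];
    const b1 = i + 1 < bytes.length ? bytes[i + 1] : 0;
    const b2 = i + 2 < bytes.length ? bytes[i + 2] : 0;
    out += BASE64_ALPHABET.charAt(b0 >> 2);
    out += BASE64_ALPHABET.charAt(((b0 & 3) << 4) | (b1 >> 4));
    out += i + 1 < bytes.length ? BASE64_ALPHABET.charAt(((b1 & 15) << 2) | (b2 >> 6)) : "=";
    out += i + 2 < bytes.length ? BASE64_ALPHABET.charAt(b2 & 63) : "=";
  }
  return out;
}

// Serializes one captured chunk; the timestamp format is an assumption.
function audioChunkMessage(bytes: Uint8Array, timestamp: number): string {
  return JSON.stringify({ type: "audio_chunk", data: toBase64(bytes), timestamp: timestamp });
}

interface TranscriptState {
  finalText: string;   // confirmed text from results with is_final = true
  interimText: string; // latest provisional text, replaced by each new result
  lastError: string | null;
}

// Folds one server message into the display state. Throws if `raw` is not valid JSON.
function applyServerMessage(state: TranscriptState, raw: string): TranscriptState {
  const msg = JSON.parse(raw) as { type?: string; text?: string; is_final?: boolean; error?: string };
  if (msg.type === "transcription_result" && typeof msg.text === "string") {
    if (msg.is_final) {
      const joined = state.finalText ? state.finalText + " " + msg.text : msg.text;
      return { finalText: joined, interimText: "", lastError: null };
    }
    return { finalText: state.finalText, interimText: msg.text, lastError: null };
  }
  if (msg.type === "error") {
    return {
      finalText: state.finalText,
      interimText: state.interimText,
      lastError: msg.error || "Unknown error"
    };
  }
  // "connected", "pong", and unknown message types leave the transcript unchanged.
  return state;
}
```

Keeping interim and final text separate means a provisional result can be overwritten by the next one without disturbing text that has already been confirmed.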

---

## Customer Journey 8: Real-time Text-to-Speech via WebSocket

### User Goal
"I want to send text and receive audio streams in real-time without waiting for a complete file."

### User Story
As a user, I want to establish a WebSocket connection, send text messages, and receive audio chunks for playback.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Initiate real-time text-to-speech
    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Select voice (default: de-DE-Wavenet-A)
    Frontend->>Backend: WebSocket: /api/voice-google/ws/text-to-speech<br/>?userId=user&language=de-DE&voice=de-DE-Wavenet-A

    Backend->>Backend: Accept WebSocket connection
    Backend-->>Frontend: {type: "connected", connection_id, message}
    Frontend-->>User: Show "Connected" status

    loop User sends text
        User->>Frontend: Enter text to speak
        User->>Frontend: Click "Speak" or send
        Frontend->>Backend: {type: "text_to_speak", text: "..."}

        Backend->>Backend: Process text-to-speech
        alt Processing error
            Backend->>Backend: Log error
            Backend-->>Frontend: {type: "error", error: "..."}
            Frontend-->>User: Show error message
        else Success
            Backend-->>Frontend: {type: "audio_data", audio: base64_audio, format: "mp3"}
            Frontend->>Frontend: Decode base64 audio
            Frontend->>Frontend: Create audio blob
            Frontend->>Frontend: Play audio or queue for playback
            Frontend-->>User: Play generated speech audio
        end
    end

    opt User sends ping
        User->>Frontend: Keep-alive ping
        Frontend->>Backend: {type: "ping", timestamp}
        Backend-->>Frontend: {type: "pong", timestamp}
    end

    opt User disconnects
        User->>Frontend: Close connection
        Frontend->>Backend: Close WebSocket
        Backend->>Backend: Cleanup connection
        Frontend-->>User: Show disconnected status
    end
```
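The connection path, the outgoing `text_to_speak` message, and the decoding of an incoming `audio_data` message might be sketched as below. Query parameters and message fields follow the diagram; the function names are hypothetical, the scheme and host (`ws://` or `wss://`) are left out because they are not stated here, and the base64 decoder is hand-written so the example runs outside a browser as well.

```typescript
// Builds the WebSocket path with the query parameters shown in the diagram.
function textToSpeechSocketPath(
  userId: string,
  language: string = "de-DE",
  voice: string = "de-DE-Wavenet-A"
): string {
  return (
    "/api/voice-google/ws/text-to-speech" +
    "?userId=" + encodeURIComponent(userId) +
    "&language=" + encodeURIComponent(language) +
    "&voice=" + encodeURIComponent(voice)
  );
}

function textToSpeakMessage(text: string): string {
  return JSON.stringify({ type: "text_to_speak", text: text });
}

// Decodes standard base64 (with or without "=" padding) into raw bytes.
function fromBase64(encoded: string): Uint8Array {
  const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  const clean = encoded.replace(/=+$/, "");
  const bytes = new Uint8Array(Math.floor((clean.length * 3) / 4));
  let buffer = 0;
  let bits = 0;
  let index = 0;
  for (let i = 0; i < clean.length; i++) {
    const value = alphabet.indexOf(clean.charAt(i));
    if (value === -1) {
      throw new Error("Invalid base64 character");
    }
    // Keep only the low 16 bits; at most 14 are ever pending.
    buffer = ((buffer << 6) | value) & 0xffff;
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      bytes[index++] = (buffer >> bits) & 255;
    }
  }
  return bytes;
}

interface PlayableAudio {
  bytes: Uint8Array;
  mimeType: string;
}

// Returns null for any message that is not a well-formed audio_data message.
function decodeAudioMessage(raw: string): PlayableAudio | null {
  const msg = JSON.parse(raw) as { type?: string; audio?: string; format?: string };
  if (msg.type !== "audio_data" || typeof msg.audio !== "string") {
    return null;
  }
  const mimeType = msg.format === "mp3" ? "audio/mpeg" : "application/octet-stream";
  return { bytes: fromBase64(msg.audio), mimeType: mimeType };
}
```

In a browser, the decoded bytes and MIME type are what the "Create audio blob" step would pass to the `Blob` constructor before playback.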