# Voice Service Customer Journeys
This document describes customer journeys for the Google Cloud Voice Services, focusing on how users interact with speech-to-text, translation, text-to-speech, and real-time voice interpretation features.
## Table of Contents
1. [Overview](#overview)
2. [Customer Journey 1: Converting Speech to Text](#customer-journey-1-converting-speech-to-text)
3. [Customer Journey 2: Translating Text](#customer-journey-2-translating-text)
4. [Customer Journey 3: Real-time Voice Interpretation](#customer-journey-3-real-time-voice-interpretation)
5. [Customer Journey 4: Converting Text to Speech](#customer-journey-4-converting-text-to-speech)
6. [Customer Journey 5: Managing Voice Settings](#customer-journey-5-managing-voice-settings)
7. [Customer Journey 6: Discovering Available Languages and Voices](#customer-journey-6-discovering-available-languages-and-voices)
8. [Customer Journey 7: Real-time Speech-to-Text via WebSocket](#customer-journey-7-real-time-speech-to-text-via-websocket)
9. [Customer Journey 8: Real-time Text-to-Speech via WebSocket](#customer-journey-8-real-time-text-to-speech-via-websocket)
---
## Overview
The voice service routes under `/api/voice-google` give users access to Google Cloud Speech-to-Text, Translation, and Text-to-Speech, plus a combined voice interpretation flow. HTTP REST endpoints handle file-based processing, and WebSocket endpoints handle real-time streaming.
**Key Principles:**
- **User-Centric**: Documentation organized around what users want to accomplish
- **Multi-Modal**: Supports both file upload and real-time streaming
- **Language-Aware**: All operations support multiple languages with user-configurable defaults
- **Settings-Driven**: User preferences stored and applied automatically
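
As a quick orientation, the routes that appear in the journeys can be collected in one place. The sketch below is only illustrative: the paths are the ones named in this document, while the constant names and the `toWebSocketUrl` helper are hypothetical and not part of the service.

```typescript
// Base path taken from the overview; all identifiers here are illustrative.
const VOICE_API_BASE = "/api/voice-google";

// REST routes referenced in the journeys of this document.
const VOICE_REST_ROUTES = {
  speechToText: `${VOICE_API_BASE}/speech-to-text`,
  translate: `${VOICE_API_BASE}/translate`,
  realtimeInterpreter: `${VOICE_API_BASE}/realtime-interpreter`,
  textToSpeech: `${VOICE_API_BASE}/text-to-speech`,
  settings: `${VOICE_API_BASE}/settings`,
  languages: `${VOICE_API_BASE}/languages`,
  voices: `${VOICE_API_BASE}/voices`,
} as const;

// WebSocket routes used by the streaming journeys.
const VOICE_WS_ROUTES = {
  speechToText: `${VOICE_API_BASE}/ws/speech-to-text`,
  textToSpeech: `${VOICE_API_BASE}/ws/text-to-speech`,
} as const;

// Turns an HTTP(S) origin into the matching WebSocket URL for a route:
// "https://host" becomes "wss://host", "http://host" becomes "ws://host".
function toWebSocketUrl(origin: string, route: string): string {
  return origin.replace(/^http/, "ws") + route;
}
```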
---
## Customer Journey 1: Converting Speech to Text
### User Goal
"I want to record my speech and convert it into text."
### User Story
As a user, I want to record my speech through the frontend microphone and receive the transcribed text along with a confidence score and the detected language.
### User Story Flow
```mermaid
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Select language (default: de-DE)
User->>Frontend: Click "Start Recording"
Frontend->>Frontend: Request microphone access
alt Microphone access denied
Frontend-->>User: Show permission error
else Microphone access granted
Frontend->>Frontend: Start audio recording
Frontend-->>User: Show recording indicator (red dot, timer)
User->>Frontend: Speak into microphone
Frontend->>Frontend: Capture audio stream
User->>Frontend: Click "Stop Recording"
Frontend->>Frontend: Stop audio recording
Frontend->>Frontend: Convert audio stream to audio file
Frontend->>Frontend: Validate audio file (size, format)
alt File validation fails
Frontend-->>User: Show validation error
else File validation passes
Frontend->>Frontend: Show loading state
Frontend->>Backend: POST /api/voice-google/speech-to-text<br/>(audioFile, language)
alt Authentication fails (401)
Backend-->>Frontend: 401 Unauthorized
Frontend-->>User: Show authentication error
else Invalid audio format (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show format error message
else Speech recognition fails (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show recognition error
else Success (200)
Backend-->>Frontend: {success: true, text, confidence, language, audio_info}
Frontend->>Frontend: Display transcribed text
Frontend->>Frontend: Display confidence score
Frontend->>Frontend: Display detected language
Frontend->>Frontend: Display audio metadata
Frontend-->>User: Show transcription result with metadata
end
end
end
```
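
The validation and response-handling branches above could be sketched as two small pure functions. This is a minimal sketch under stated assumptions: the size limit, the accepted MIME types, the `error` field in failure bodies, and all function and type names are hypothetical. Only the success fields (`success`, `text`, `confidence`, `language`, `audio_info`) come from the diagram.

```typescript
// Assumed client-side limits; the real values are not stated in this document.
const MAX_AUDIO_BYTES = 10 * 1024 * 1024;
const ACCEPTED_AUDIO_TYPES = ["audio/webm", "audio/wav", "audio/mpeg"];

// Returns null when the recording passes validation, otherwise a user-facing message.
function validateRecording(sizeBytes: number, mimeType: string): string | null {
  if (sizeBytes <= 0) return "The recording is empty.";
  if (sizeBytes > MAX_AUDIO_BYTES) return "The recording is too large.";
  // Drop codec parameters such as "audio/webm;codecs=opus".
  const semicolon = mimeType.indexOf(";");
  const baseType = (semicolon === -1 ? mimeType : mimeType.slice(0, semicolon)).trim().toLowerCase();
  if (ACCEPTED_AUDIO_TYPES.indexOf(baseType) === -1) return "Unsupported audio format.";
  return null;
}

// Success payload as listed in the diagram; audio_info is left opaque.
interface SpeechToTextResult {
  success: true;
  text: string;
  confidence: number;
  language: string;
  audio_info: unknown;
}

type SttOutcome =
  | { kind: "result"; result: SpeechToTextResult }
  | { kind: "error"; message: string };

// Maps an HTTP status and parsed JSON body onto the branches of the diagram.
function interpretSttResponse(status: number, body: unknown): SttOutcome {
  const fields: Record<string, unknown> =
    typeof body === "object" && body !== null ? (body as Record<string, unknown>) : {};
  if (status === 200 && fields["success"] === true && typeof fields["text"] === "string") {
    return { kind: "result", result: body as SpeechToTextResult };
  }
  if (status === 401) return { kind: "error", message: "You are not signed in." };
  // Assumption: error details arrive in an "error" string field.
  const detail = fields["error"];
  return { kind: "error", message: typeof detail === "string" ? detail : "Speech recognition failed." };
}
```

Keeping both functions free of browser APIs makes the branches easy to unit test. The actual upload would wrap them around a `fetch` call to `/api/voice-google/speech-to-text`.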
---
## Customer Journey 2: Translating Text
### User Goal
"I want to translate text from one language to another."
### User Story
As a user, I want to enter text and translate it between languages, seeing both the original and translated text.
### User Story Flow
```mermaid
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Enter text to translate
User->>Frontend: Select source language (default: de)
User->>Frontend: Select target language (default: en)
User->>Frontend: Click "Translate"
Frontend->>Frontend: Validate text is not empty
alt Text is empty
Frontend-->>User: Show validation error
else Text validation passes
Frontend->>Frontend: Show loading state
Frontend->>Backend: POST /api/voice-google/translate<br/>(text, sourceLanguage, targetLanguage)
alt Authentication fails (401)
Backend-->>Frontend: 401 Unauthorized
Frontend-->>User: Show authentication error
else Empty text (400)
Backend-->>Frontend: 400 Bad Request
Frontend-->>User: Show empty text error
else Translation fails (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show translation error
else Success (200)
Backend-->>Frontend: {success: true, original_text, translated_text, source_language, target_language}
Frontend->>Frontend: Display original text
Frontend->>Frontend: Display translated text
Frontend->>Frontend: Display language information
Frontend-->>User: Show translation result
end
end
```
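
The empty-text check and request payload from the flow above might look like the following sketch. The defaults (`de`, `en`) and the field names come from the diagram. The JSON encoding of the body, the function name, and the error wording are assumptions.

```typescript
// Request fields as named in the diagram.
interface TranslateRequest {
  text: string;
  sourceLanguage: string;
  targetLanguage: string;
}

type TranslateRequestResult =
  | { ok: true; body: string }
  | { ok: false; error: string };

// Validates the input and builds the request body.
// Assumption: the endpoint accepts a JSON body; the source does not state the encoding.
function buildTranslateRequest(
  text: string,
  sourceLanguage: string = "de",
  targetLanguage: string = "en",
): TranslateRequestResult {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    // Mirrors the "Text is empty" branch so no request is sent.
    return { ok: false, error: "Enter some text to translate." };
  }
  const payload: TranslateRequest = { text: trimmed, sourceLanguage, targetLanguage };
  return { ok: true, body: JSON.stringify(payload) };
}
```

A caller would send `body` with `POST /api/voice-google/translate` only when `ok` is true, which keeps the client-side validation branch separate from the server's own 400 response for empty text.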
---
## Customer Journey 3: Real-time Voice Interpretation
### User Goal
"I want to speak in one language and get the translated text in another language."
### User Story
As a user, I want to record my speech through the frontend microphone and receive both the transcribed text in the source language and its translation in the target language.
### User Story Flow
```mermaid
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Select source language (default: de-DE)
User->>Frontend: Select target language (default: en-US)
User->>Frontend: Click "Start Recording"
Frontend->>Frontend: Request microphone access
alt Microphone access denied
Frontend-->>User: Show permission error
else Microphone access granted
Frontend->>Frontend: Start audio recording
Frontend-->>User: Show recording indicator (red dot, timer)
User->>Frontend: Speak into microphone
Frontend->>Frontend: Capture audio stream
User->>Frontend: Click "Stop Recording"
Frontend->>Frontend: Stop audio recording
Frontend->>Frontend: Convert audio stream to audio file
Frontend->>Frontend: Validate audio file
alt File validation fails
Frontend-->>User: Show validation error
else File validation passes
Frontend->>Frontend: Show loading state
Frontend->>Backend: POST /api/voice-google/realtime-interpreter<br/>(audioFile, fromLanguage, toLanguage, connectionId?)
alt Authentication fails (401)
Backend-->>Frontend: 401 Unauthorized
Frontend-->>User: Show authentication error
else Invalid audio format (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show format error
else Interpretation fails (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show interpretation error
else Success (200)
Backend-->>Frontend: {success: true, original_text, translated_text, confidence, source_language, target_language, audio_info}
Frontend->>Frontend: Display original transcribed text
Frontend->>Frontend: Display translated text
Frontend->>Frontend: Display confidence score
Frontend->>Frontend: Display language information
Frontend->>Frontend: Display audio metadata
Frontend-->>User: Show interpretation result with both texts
end
end
end
```
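
To illustrate the request fields and the success payload of this journey, here is a hedged sketch. The field names are taken from the diagram. The helper names, the optional-field handling for `connectionId`, and the assumption that `confidence` is a fraction between 0 and 1 are not confirmed by the source.

```typescript
// Text fields that accompany the audio file; connectionId is only sent when present.
function interpreterFields(
  fromLanguage: string = "de-DE",
  toLanguage: string = "en-US",
  connectionId?: string,
): Record<string, string> {
  const fields: Record<string, string> = { fromLanguage, toLanguage };
  if (connectionId) fields["connectionId"] = connectionId;
  return fields;
}

// Success payload as listed in the diagram; audio_info is left opaque.
interface InterpretationResult {
  success: true;
  original_text: string;
  translated_text: string;
  confidence: number;
  source_language: string;
  target_language: string;
  audio_info?: unknown;
}

// Runtime check before the frontend renders both texts.
function isInterpretationResult(value: unknown): value is InterpretationResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    v["success"] === true &&
    typeof v["original_text"] === "string" &&
    typeof v["translated_text"] === "string" &&
    typeof v["confidence"] === "number" &&
    typeof v["source_language"] === "string" &&
    typeof v["target_language"] === "string"
  );
}

// Display model pairing the source and target text.
// Assumption: confidence is reported as a fraction between 0 and 1.
function toInterpretationView(result: InterpretationResult) {
  return {
    sourceLine: `[${result.source_language}] ${result.original_text}`,
    targetLine: `[${result.target_language}] ${result.translated_text}`,
    confidencePercent: Math.round(result.confidence * 100),
  };
}
```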
---
## Customer Journey 4: Converting Text to Speech
### User Goal
"I want to convert written text into spoken audio."
### User Story
As a user, I want to enter text, select a language and voice, and receive an audio file of the spoken text.
### User Story Flow
```mermaid
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Enter text to speak
User->>Frontend: Select language (default: de-DE)
User->>Frontend: Select voice (optional, default from settings)
User->>Frontend: Click "Generate Speech"
Frontend->>Frontend: Validate text is not empty
alt Text is empty
Frontend-->>User: Show validation error
else Text validation passes
Frontend->>Frontend: Show loading state
Frontend->>Backend: POST /api/voice-google/text-to-speech<br/>(text, language, voice?)
alt Authentication fails (401)
Backend-->>Frontend: 401 Unauthorized
Frontend-->>User: Show authentication error
else Empty text (400)
Backend-->>Frontend: 400 Bad Request
Frontend-->>User: Show empty text error
else Text-to-Speech fails (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show TTS error
else Success (200)
Backend-->>Frontend: Audio file (audio/mpeg)<br/>Headers: X-Voice-Name, X-Language-Code
Frontend->>Frontend: Create audio blob from response
Frontend->>Frontend: Create download link or audio player
Frontend->>Frontend: Display voice name and language
Frontend-->>User: Show audio player with download option
end
end
```
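
The success branch returns the audio as the response body and the voice metadata as headers. Reading those headers could be sketched as below. The header names come from the diagram. The `HeaderReader` interface, the function name, and the download file name are hypothetical. In a browser, `response.headers` from `fetch` satisfies `HeaderReader`, though for cross-origin requests custom headers are only readable if the server exposes them through `Access-Control-Expose-Headers`.

```typescript
// Minimal view of a header collection; the Fetch API's Headers object matches it.
interface HeaderReader {
  get(name: string): string | null;
}

interface SpeechMetadata {
  voiceName: string | null;
  languageCode: string | null;
  downloadName: string;
}

// Extracts the voice metadata shown next to the audio player.
function readSpeechMetadata(headers: HeaderReader): SpeechMetadata {
  const voiceName = headers.get("X-Voice-Name");
  const languageCode = headers.get("X-Language-Code");
  // Hypothetical naming scheme for the download link; the body is audio/mpeg.
  const downloadName = `speech-${languageCode ?? "unknown"}.mp3`;
  return { voiceName, languageCode, downloadName };
}
```

The audio itself would come from `await response.blob()`, which can then back an `<audio>` element or a download link via `URL.createObjectURL`.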
---
## Customer Journey 5: Managing Voice Settings
### User Goal
"I want to configure my default voice settings for speech recognition, translation, and text-to-speech."
### User Story
As a user, I want to view and update my voice settings including default languages, voice preferences, and translation settings.
### User Story Flow
```mermaid
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Navigate to voice settings
Frontend->>Backend: GET /api/voice-google/settings
Backend-->>Frontend: {success: true, data: {user_settings, default_settings}}
Frontend->>Frontend: Pre-populate form with user_settings or default_settings
Frontend-->>User: Display settings form
alt User views settings only
User->>Frontend: View current settings
Frontend-->>User: Display settings (read-only or editable)
else User updates settings
User->>Frontend: Modify settings (sttLanguage, ttsLanguage, ttsVoice, translationEnabled, targetLanguage)
User->>Frontend: Click "Save Settings"
Frontend->>Frontend: Validate required fields (sttLanguage, ttsLanguage, ttsVoice)
alt Validation fails
Frontend-->>User: Show validation errors
else Validation passes
Frontend->>Frontend: Show loading state
Frontend->>Backend: POST /api/voice-google/settings<br/>(settings object)
alt Authentication fails (401)
Backend-->>Frontend: 401 Unauthorized
Frontend-->>User: Show authentication error
else Missing required field (400)
Backend-->>Frontend: 400 Bad Request + field name
Frontend-->>User: Show missing field error
else Save fails (500)
Backend-->>Frontend: 500 Internal Server Error
Frontend-->>User: Show save error
else Success (200)
Backend-->>Frontend: {success: true, message, data: settings}
Frontend->>Frontend: Update UI with saved settings
Frontend->>Frontend: Show success message
Frontend-->>User: Display confirmation and updated settings
end
end
end
```
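
Pre-populating the form and checking the required fields could be sketched as follows. The setting names and the three required fields come from the diagram. The assumption that `user_settings` may be missing or partial, and the function names, are illustrative only.

```typescript
// Settings fields as named in the "Modify settings" step.
interface VoiceSettings {
  sttLanguage: string;
  ttsLanguage: string;
  ttsVoice: string;
  translationEnabled: boolean;
  targetLanguage: string;
}

// Fields the flow validates before saving.
const REQUIRED_SETTINGS_FIELDS = ["sttLanguage", "ttsLanguage", "ttsVoice"] as const;

// Uses the stored user settings where present and falls back to the defaults.
// Assumption: user_settings can be null or incomplete for a new user.
function initialFormValues(
  userSettings: Partial<VoiceSettings> | null,
  defaultSettings: VoiceSettings,
): VoiceSettings {
  return { ...defaultSettings, ...(userSettings ?? {}) };
}

// Returns the names of required fields that are missing or blank.
function missingRequiredFields(settings: Partial<VoiceSettings>): string[] {
  return REQUIRED_SETTINGS_FIELDS.filter((field) => {
    const value = settings[field];
    return typeof value !== "string" || value.trim() === "";
  });
}
```

An empty array from `missingRequiredFields` corresponds to the "Validation passes" branch. Otherwise the returned names can be used to highlight the offending inputs.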
---
## Customer Journey 6: Discovering Available Languages and Voices
### User Goal
"I want to see what languages and voices are available for speech and translation services."
### User Story
As a user, I want to browse available languages and filter voices by language to configure my preferences.
### User Story Flow
```mermaid
sequenceDiagram
participant User
participant Frontend
participant Backend
opt User wants to see available languages
User->>Frontend: Navigate to language selection
Frontend->>Backend: GET /api/voice-google/languages
Backend-->>Frontend: {success: true, languages: [...]}
Frontend->>Frontend: Display language list
Frontend-->>User: Show available languages
end
opt User wants to see available voices
User->>Frontend: Navigate to voice selection
User->>Frontend: Optionally select language filter
Frontend->>Backend: GET /api/voice-google/voices<br/>?language_code=de-DE (optional)
Backend-->>Frontend: {success: true, voices: [...], language_filter: "de-DE"}
Frontend->>Frontend: Display voice list
Frontend->>Frontend: Group voices by language if no filter
Frontend-->>User: Show available voices
end
opt User filters voices by language
User->>Frontend: Select language in filter
Frontend->>Backend: GET /api/voice-google/voices?language_code=selected
Backend-->>Frontend: Filtered voices list
Frontend->>Frontend: Update voice list
Frontend-->>User: Show filtered voices
end
```
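
The optional `language_code` filter and the "group voices by language if no filter" step might be implemented along these lines. The route and query parameter are from the diagram. The shape of a voice entry (`name`, `language_code`) is an assumption, since the document only shows `voices: [...]`.

```typescript
// Assumed shape of one entry in the voices list.
interface Voice {
  name: string;
  language_code: string;
}

// Builds the voices URL, adding the filter only when a language is selected.
function voicesUrl(languageCode?: string): string {
  const base = "/api/voice-google/voices";
  return languageCode ? `${base}?language_code=${encodeURIComponent(languageCode)}` : base;
}

// Groups an unfiltered voice list by language for display.
function groupVoicesByLanguage(voices: Voice[]): Record<string, Voice[]> {
  const groups: Record<string, Voice[]> = {};
  for (const voice of voices) {
    const existing = groups[voice.language_code];
    if (existing) {
      existing.push(voice);
    } else {
      groups[voice.language_code] = [voice];
    }
  }
  return groups;
}
```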
---
## Customer Journey 7: Real-time Speech-to-Text via WebSocket
### User Goal
"I want to get real-time transcription as I speak, without uploading a complete audio file."
### User Story
As a user, I want to establish a WebSocket connection and stream audio chunks to receive live transcription results.
### User Story Flow
```mermaid
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Initiate real-time speech-to-text
User->>Frontend: Select language (default: de-DE)
Frontend->>Frontend: Start audio capture
Frontend->>Backend: WebSocket: /api/voice-google/ws/speech-to-text<br/>?userId=user&language=de-DE
Backend->>Backend: Accept WebSocket connection
Backend->>Backend: Initialize voice interface
Backend-->>Frontend: {type: "connected", connection_id, message}
Frontend-->>User: Show "Connected" status
loop User speaks
User->>Frontend: Speak into microphone
Frontend->>Frontend: Capture audio chunk
Frontend->>Frontend: Encode audio chunk to base64
Frontend->>Backend: {type: "audio_chunk", data: base64_audio, timestamp}
Backend->>Backend: Process audio chunk (Speech-to-Text)
alt Processing error
Backend->>Backend: Log error
Backend-->>Frontend: {type: "error", error: "..."}
Frontend-->>User: Show error message
else Success
Backend-->>Frontend: {type: "transcription_result", text, confidence, is_final}
Frontend->>Frontend: Update transcription display
Frontend-->>User: Show live transcription (interim or final)
end
end
opt Keep-alive ping
User->>Frontend: Keep-alive ping
Frontend->>Backend: {type: "ping", timestamp}
Backend-->>Frontend: {type: "pong", timestamp}
end
alt User stops or disconnects
User->>Frontend: Stop recording
Frontend->>Backend: Close WebSocket
Backend->>Backend: Cleanup connection
Frontend-->>User: Show disconnected status
end
```
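
The message exchange above lends itself to a small state reducer on the frontend. The sketch below models the server messages from the diagram as a discriminated union. The state shape, the function names, the numeric `timestamp`, and the rule that interim results replace each other while final results are appended are assumptions, not documented behavior.

```typescript
// Server-to-client messages as they appear in the diagram.
type SttServerMessage =
  | { type: "connected"; connection_id: string; message: string }
  | { type: "transcription_result"; text: string; confidence: number; is_final: boolean }
  | { type: "error"; error: string }
  | { type: "pong"; timestamp: number };

interface TranscriptState {
  connected: boolean;
  finalText: string;
  interimText: string;
  lastError: string | null;
}

// Serializes one captured chunk; the audio is expected to be base64-encoded already.
function audioChunkMessage(base64Audio: string, timestamp: number): string {
  return JSON.stringify({ type: "audio_chunk", data: base64Audio, timestamp });
}

// Applies one server message to the transcript state without mutating it.
function applySttMessage(state: TranscriptState, message: SttServerMessage): TranscriptState {
  switch (message.type) {
    case "connected":
      return { ...state, connected: true };
    case "transcription_result":
      if (message.is_final) {
        // Assumption: final results accumulate and clear the interim text.
        const finalText = (state.finalText + " " + message.text).trim();
        return { ...state, finalText, interimText: "" };
      }
      // Assumption: each interim result replaces the previous one.
      return { ...state, interimText: message.text };
    case "error":
      return { ...state, lastError: message.error };
    default:
      // "pong" only confirms the connection is alive.
      return state;
  }
}
```

In a browser, each `WebSocket` `message` event would be parsed with `JSON.parse` and passed to `applySttMessage`, and the resulting state would drive the live transcription display.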
---
## Customer Journey 8: Real-time Text-to-Speech via WebSocket
### User Goal
"I want to send text and receive audio streams in real-time without waiting for a complete file."
### User Story
As a user, I want to establish a WebSocket connection, send text messages, and receive audio chunks for playback.
### User Story Flow
```mermaid
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Initiate real-time text-to-speech
User->>Frontend: Select language (default: de-DE)
User->>Frontend: Select voice (default: de-DE-Wavenet-A)
Frontend->>Backend: WebSocket: /api/voice-google/ws/text-to-speech<br/>?userId=user&language=de-DE&voice=de-DE-Wavenet-A
Backend->>Backend: Accept WebSocket connection
Backend-->>Frontend: {type: "connected", connection_id, message}
Frontend-->>User: Show "Connected" status
loop User sends text
User->>Frontend: Enter text to speak
User->>Frontend: Click "Speak" or send
Frontend->>Backend: {type: "text_to_speak", text: "..."}
Backend->>Backend: Process text-to-speech
alt Processing error
Backend->>Backend: Log error
Backend-->>Frontend: {type: "error", error: "..."}
Frontend-->>User: Show error message
else Success
Backend-->>Frontend: {type: "audio_data", audio: base64_audio, format: "mp3"}
Frontend->>Frontend: Decode base64 audio
Frontend->>Frontend: Create audio blob
Frontend->>Frontend: Play audio or queue for playback
Frontend-->>User: Play generated speech audio
end
end
opt Keep-alive ping
User->>Frontend: Keep-alive ping
Frontend->>Backend: {type: "ping", timestamp}
Backend-->>Frontend: {type: "pong", timestamp}
end
alt User disconnects
User->>Frontend: Close connection
Frontend->>Backend: Close WebSocket
Backend->>Backend: Cleanup connection
Frontend-->>User: Show disconnected status
end
```
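
A hedged sketch of the client side of this exchange follows. The message types and fields (`text_to_speak`, `audio_data`, `audio`, `format`, `error`) come from the diagram, and the function names are hypothetical. Note one deliberate substitution: instead of decoding the base64 payload into a blob, `toAudioSrc` builds a `data:` URL, which an `<audio>` element can also play. That is an alternative to the blob step in the flow, not a description of the actual frontend.

```typescript
// Server-to-client messages as they appear in the diagram.
type TtsServerMessage =
  | { type: "connected"; connection_id: string }
  | { type: "audio_data"; audio: string; format: string }
  | { type: "error"; error: string }
  | { type: "pong" };

// Builds the outgoing message, or returns null for blank input.
function textToSpeakMessage(text: string): string | null {
  const trimmed = text.trim();
  return trimmed.length === 0 ? null : JSON.stringify({ type: "text_to_speak", text: trimmed });
}

// Parses a raw WebSocket frame; unknown or malformed messages yield null.
function parseTtsMessage(raw: string): TtsServerMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const fields = value as Record<string, unknown>;
  const type = fields["type"];
  const audio = fields["audio"];
  const format = fields["format"];
  const error = fields["error"];
  const connectionId = fields["connection_id"];
  if (type === "audio_data" && typeof audio === "string" && typeof format === "string") {
    return { type: "audio_data", audio, format };
  }
  if (type === "error" && typeof error === "string") return { type: "error", error };
  if (type === "connected" && typeof connectionId === "string") {
    return { type: "connected", connection_id: connectionId };
  }
  if (type === "pong") return { type: "pong" };
  return null;
}

// Turns an audio_data payload into a playable source; "mp3" maps to audio/mpeg.
function toAudioSrc(audio: string, format: string): string {
  const mimeType = format === "mp3" ? "audio/mpeg" : `audio/${format}`;
  return `data:${mimeType};base64,${audio}`;
}
```

For long utterances a blob created from the decoded bytes is usually the lighter option, since a `data:` URL keeps the whole base64 string in memory.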