Voice Service Customer Journeys
This document describes customer journeys for the Google Cloud Voice Services, focusing on how users interact with speech-to-text, translation, text-to-speech, and real-time voice interpretation features.
Table of Contents
- Overview
- Customer Journey 1: Converting Speech to Text
- Customer Journey 2: Translating Text
- Customer Journey 3: Real-time Voice Interpretation
- Customer Journey 4: Converting Text to Speech
- Customer Journey 5: Managing Voice Settings
- Customer Journey 6: Discovering Available Languages and Voices
- Customer Journey 7: Real-time Speech-to-Text via WebSocket
- Customer Journey 8: Real-time Text-to-Speech via WebSocket
Overview
The voice service routes (/api/voice-google) give users access to Google Cloud Speech-to-Text, Translation, and Text-to-Speech, as well as voice interpretation. They comprise HTTP REST endpoints for file-based processing and WebSocket endpoints for real-time streaming.
Key Principles:
- User-Centric: Documentation organized around what users want to accomplish
- Multi-Modal: Supports both file upload and real-time streaming
- Language-Aware: All operations support multiple languages with user-configurable defaults
- Settings-Driven: User preferences stored and applied automatically
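As a quick orientation, the routes that appear in the journeys below can be collected into one table. This is only a sketch: the route labels and the `endpoint_url` helper are hypothetical names chosen for illustration, and the base URL is whatever host the backend is deployed on.

```python
# Route prefix from the overview; the paths below are collected from the journeys.
PREFIX = "/api/voice-google"

# label -> (transport, method, path). The labels are chosen for this sketch only.
ROUTES = {
    "speech_to_text": ("http", "POST", "/speech-to-text"),
    "translate": ("http", "POST", "/translate"),
    "realtime_interpreter": ("http", "POST", "/realtime-interpreter"),
    "text_to_speech": ("http", "POST", "/text-to-speech"),
    "get_settings": ("http", "GET", "/settings"),
    "save_settings": ("http", "POST", "/settings"),
    "languages": ("http", "GET", "/languages"),
    "voices": ("http", "GET", "/voices"),
    # WebSocket connections start as an HTTP GET upgrade request.
    "ws_speech_to_text": ("ws", "GET", "/ws/speech-to-text"),
    "ws_text_to_speech": ("ws", "GET", "/ws/text-to-speech"),
}


def endpoint_url(base_url: str, name: str) -> str:
    """Build the absolute URL for a labelled route under the given base URL."""
    if not base_url.startswith(("http://", "https://")):
        raise ValueError("base_url must start with http:// or https://")
    transport, _method, path = ROUTES[name]
    url = base_url.rstrip("/") + PREFIX + path
    if transport == "ws":
        # Swap the scheme: http becomes ws, https becomes wss.
        url = "ws" + url[len("http"):]
    return url
```

For example, `endpoint_url("http://localhost:8000", "ws_speech_to_text")` yields a `ws://` URL, while the REST routes keep the original scheme.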
Customer Journey 1: Converting Speech to Text
User Goal
"I want to record my speech and convert it into text."
User Story
As a user, I want to record my speech with my microphone in the frontend and get the transcribed text along with a confidence score and the detected language.
User Story Flow
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Select language (default: de-DE)
User->>Frontend: Click "Start Recording"
Frontend->>Frontend: Request microphone access
alt Microphone access denied
Frontend-->>User: Show permission error
else Microphone access granted
Frontend->>Frontend: Start audio recording
Frontend-->>User: Show recording indicator (red dot, timer)
User->>Frontend: Speak into microphone
Frontend->>Frontend: Capture audio stream
User->>Frontend: Click "Stop Recording"
Frontend->>Frontend: Stop audio recording
Frontend->>Frontend: Convert audio stream to audio file
Frontend->>Frontend: Validate audio file (size, format)
alt File validation fails
Frontend-->>User: Show validation error
else File validation passes
Frontend->>Frontend: Show loading state
Frontend->>Backend: POST /api/voice-google/speech-to-text<br/>(audioFile, language)
alt Authentication fails (401)
Backend-->>Frontend: 401 Unauthorized
Frontend-->>User: Show authentication error
else Invalid audio format (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show format error message
else Speech recognition fails (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show recognition error
else Success (200)
Backend-->>Frontend: {success: true, text, confidence, language, audio_info}
Frontend->>Frontend: Display transcribed text
Frontend->>Frontend: Display confidence score
Frontend->>Frontend: Display detected language
Frontend->>Frontend: Display audio metadata
Frontend-->>User: Show transcription result with metadata
end
end
end
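The two decision points in this flow, client-side validation and response handling, might be sketched as follows. The size limit, the accepted MIME types, and the `error` field of the 400 payload are assumptions; the diagram only says the frontend checks "size, format" and that the backend returns "error details".

```python
import json
from typing import Optional

# Hypothetical limits: the flow only states that size and format are checked.
MAX_BYTES = 10 * 1024 * 1024
ALLOWED_TYPES = {"audio/webm", "audio/wav", "audio/mpeg", "audio/ogg"}


def validate_recording(size_bytes: int, mime_type: str) -> Optional[str]:
    """Return an error message, or None when the recording may be uploaded."""
    if size_bytes == 0:
        return "The recording is empty."
    if size_bytes > MAX_BYTES:
        return "The recording is too large."
    if mime_type not in ALLOWED_TYPES:
        return f"Unsupported audio format: {mime_type}"
    return None


def describe_stt_response(status: int, body: str) -> str:
    """Turn a speech-to-text response into the text shown to the user."""
    if status == 401:
        return "Authentication error: please sign in again."
    if status == 400:
        # Assumed shape: a JSON object with an optional "error" field.
        try:
            detail = json.loads(body).get("error", "")
        except (json.JSONDecodeError, AttributeError):
            detail = ""
        return f"Could not process the audio: {detail or 'unknown reason'}"
    if status == 200:
        result = json.loads(body)
        if result.get("success"):
            return (f"{result['text']} "
                    f"(confidence {result['confidence']:.0%}, "
                    f"language {result['language']})")
    return f"Unexpected response (HTTP {status})."
```

Keeping both functions free of network and UI calls makes each branch of the diagram easy to test in isolation.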
Customer Journey 2: Translating Text
User Goal
"I want to translate text from one language to another."
User Story
As a user, I want to enter text and translate it between languages, seeing both the original and translated text.
User Story Flow
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Enter text to translate
User->>Frontend: Select source language (default: de)
User->>Frontend: Select target language (default: en)
User->>Frontend: Click "Translate"
Frontend->>Frontend: Validate text is not empty
alt Text is empty
Frontend-->>User: Show validation error
else Text validation passes
Frontend->>Frontend: Show loading state
Frontend->>Backend: POST /api/voice-google/translate<br/>(text, sourceLanguage, targetLanguage)
alt Authentication fails (401)
Backend-->>Frontend: 401 Unauthorized
Frontend-->>User: Show authentication error
else Empty text (400)
Backend-->>Frontend: 400 Bad Request
Frontend-->>User: Show empty text error
else Translation fails (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show translation error
else Success (200)
Backend-->>Frontend: {success: true, original_text, translated_text, source_language, target_language}
Frontend->>Frontend: Display original text
Frontend->>Frontend: Display translated text
Frontend->>Frontend: Display language information
Frontend-->>User: Show translation result
end
end
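A minimal sketch of the request and result handling for this journey is shown below. It assumes the three fields are sent as a JSON body, which the diagram does not state, and the helper names are hypothetical.

```python
import json

# Defaults as shown in the flow above.
DEFAULT_SOURCE = "de"
DEFAULT_TARGET = "en"


def build_translate_request(text: str,
                            source_language: str = DEFAULT_SOURCE,
                            target_language: str = DEFAULT_TARGET) -> str:
    """Validate the input and return a body for POST /api/voice-google/translate.

    Sending the fields as JSON is an assumption of this sketch.
    """
    if not text.strip():
        # Mirrors the frontend check "Validate text is not empty".
        raise ValueError("Text must not be empty.")
    return json.dumps({
        "text": text,
        "sourceLanguage": source_language,
        "targetLanguage": target_language,
    })


def format_translation(payload: dict) -> str:
    """Render the success payload as original and translated text, one per line."""
    return (f"[{payload['source_language']}] {payload['original_text']}\n"
            f"[{payload['target_language']}] {payload['translated_text']}")
```

Rejecting whitespace-only input on the client avoids a round trip that would end in the "Empty text (400)" branch.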
Customer Journey 3: Real-time Voice Interpretation
User Goal
"I want to speak in one language and get the translated text in another language."
User Story
As a user, I want to record my speech with my microphone in the frontend and receive both the transcribed text in the source language and its translation in the target language.
User Story Flow
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Select source language (default: de-DE)
User->>Frontend: Select target language (default: en-US)
User->>Frontend: Click "Start Recording"
Frontend->>Frontend: Request microphone access
alt Microphone access denied
Frontend-->>User: Show permission error
else Microphone access granted
Frontend->>Frontend: Start audio recording
Frontend-->>User: Show recording indicator (red dot, timer)
User->>Frontend: Speak into microphone
Frontend->>Frontend: Capture audio stream
User->>Frontend: Click "Stop Recording"
Frontend->>Frontend: Stop audio recording
Frontend->>Frontend: Convert audio stream to audio file
Frontend->>Frontend: Validate audio file
alt File validation fails
Frontend-->>User: Show validation error
else File validation passes
Frontend->>Frontend: Show loading state
Frontend->>Backend: POST /api/voice-google/realtime-interpreter<br/>(audioFile, fromLanguage, toLanguage, connectionId?)
alt Authentication fails (401)
Backend-->>Frontend: 401 Unauthorized
Frontend-->>User: Show authentication error
else Invalid audio format (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show format error
else Interpretation fails (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show interpretation error
else Success (200)
Backend-->>Frontend: {success: true, original_text, translated_text, confidence, source_language, target_language, audio_info}
Frontend->>Frontend: Display original transcribed text
Frontend->>Frontend: Display translated text
Frontend->>Frontend: Display confidence score
Frontend->>Frontend: Display language information
Frontend->>Frontend: Display audio metadata
Frontend-->>User: Show interpretation result with both texts
end
end
end
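The request fields and the success payload of this journey could be modelled as below. This is a sketch under assumptions: the non-file fields are treated as plain form fields next to `audioFile`, and `Interpretation` is a hypothetical name for the parsed result.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


def interpreter_form_fields(from_language: str = "de-DE",
                            to_language: str = "en-US",
                            connection_id: Optional[str] = None) -> Dict[str, str]:
    """Return the non-file fields sent alongside audioFile."""
    form = {"fromLanguage": from_language, "toLanguage": to_language}
    if connection_id is not None:
        # connectionId is optional in the flow, so it is only sent when known.
        form["connectionId"] = connection_id
    return form


@dataclass(frozen=True)
class Interpretation:
    original_text: str
    translated_text: str
    confidence: float
    source_language: str
    target_language: str
    audio_info: dict


def parse_interpretation(payload: dict) -> Interpretation:
    """Convert a success payload into a typed result; fail loudly otherwise."""
    if not payload.get("success"):
        raise ValueError("Response does not report success.")
    # A missing key raises KeyError, which surfaces contract changes early.
    return Interpretation(**{f.name: payload[f.name] for f in fields(Interpretation)})
```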
Customer Journey 4: Converting Text to Speech
User Goal
"I want to convert written text into spoken audio."
User Story
As a user, I want to enter text, select a language and voice, and receive an audio file of the spoken text.
User Story Flow
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Enter text to speak
User->>Frontend: Select language (default: de-DE)
User->>Frontend: Select voice (optional, default from settings)
User->>Frontend: Click "Generate Speech"
Frontend->>Frontend: Validate text is not empty
alt Text is empty
Frontend-->>User: Show validation error
else Text validation passes
Frontend->>Frontend: Show loading state
Frontend->>Backend: POST /api/voice-google/text-to-speech<br/>(text, language, voice?)
alt Authentication fails (401)
Backend-->>Frontend: 401 Unauthorized
Frontend-->>User: Show authentication error
else Empty text (400)
Backend-->>Frontend: 400 Bad Request
Frontend-->>User: Show empty text error
else Text-to-Speech fails (400)
Backend-->>Frontend: 400 Bad Request + error details
Frontend-->>User: Show TTS error
else Success (200)
Backend-->>Frontend: Audio file (audio/mpeg)<br/>Headers: X-Voice-Name, X-Language-Code
Frontend->>Frontend: Create audio blob from response
Frontend->>Frontend: Create download link or audio player
Frontend->>Frontend: Display voice name and language
Frontend-->>User: Show audio player with download option
end
end
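Unlike the other REST routes, the success response here is binary audio with metadata in headers. One way to handle it is sketched below; `SpeechAudio`, `read_tts_response`, and the download filename pattern are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class SpeechAudio:
    data: bytes
    voice_name: str
    language_code: str
    filename: str  # suggested download name, made up for this sketch


def read_tts_response(status: int, headers: Mapping[str, str], body: bytes) -> SpeechAudio:
    """Extract the audio bytes and voice metadata from a text-to-speech response."""
    if status != 200:
        raise RuntimeError(f"Text-to-speech failed with HTTP {status}.")
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if not lowered.get("content-type", "").startswith("audio/mpeg"):
        raise ValueError("Expected an audio/mpeg response.")
    voice = lowered.get("x-voice-name", "")
    language = lowered.get("x-language-code", "")
    return SpeechAudio(
        data=body,
        voice_name=voice,
        language_code=language,
        filename=f"speech-{voice or 'default'}.mp3",
    )
```

The returned bytes correspond to the "Create audio blob from response" step, and the two header values feed the "Display voice name and language" step.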
Customer Journey 5: Managing Voice Settings
User Goal
"I want to configure my default voice settings for speech recognition, translation, and text-to-speech."
User Story
As a user, I want to view and update my voice settings including default languages, voice preferences, and translation settings.
User Story Flow
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Navigate to voice settings
Frontend->>Backend: GET /api/voice-google/settings
Backend-->>Frontend: {success: true, data: {user_settings, default_settings}}
Frontend->>Frontend: Pre-populate form with user_settings or default_settings
Frontend-->>User: Display settings form
alt User views settings only
User->>Frontend: View current settings
Frontend-->>User: Display settings (read-only or editable)
else User updates settings
User->>Frontend: Modify settings (sttLanguage, ttsLanguage, ttsVoice, translationEnabled, targetLanguage)
User->>Frontend: Click "Save Settings"
Frontend->>Frontend: Validate required fields (sttLanguage, ttsLanguage, ttsVoice)
alt Validation fails
Frontend-->>User: Show validation errors
else Validation passes
Frontend->>Frontend: Show loading state
Frontend->>Backend: POST /api/voice-google/settings<br/>(settings object)
alt Authentication fails (401)
Backend-->>Frontend: 401 Unauthorized
Frontend-->>User: Show authentication error
else Missing required field (400)
Backend-->>Frontend: 400 Bad Request + field name
Frontend-->>User: Show missing field error
else Save fails (500)
Backend-->>Frontend: 500 Internal Server Error
Frontend-->>User: Show save error
else Success (200)
Backend-->>Frontend: {success: true, message, data: settings}
Frontend->>Frontend: Update UI with saved settings
Frontend->>Frontend: Show success message
Frontend-->>User: Display confirmation and updated settings
end
end
end
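The pre-population and validation steps of this journey might look like the sketch below. Two points are assumptions: that saved user values are overlaid on the defaults field by field (the flow only says "user_settings or default_settings"), and that both objects use the camelCase field names listed in the update step.

```python
from typing import List

# Required fields as listed in the frontend validation step.
REQUIRED_FIELDS = ("sttLanguage", "ttsLanguage", "ttsVoice")


def initial_form_values(data: dict) -> dict:
    """Pre-populate the form from the `data` object of GET /settings.

    Assumption: saved user values override defaults field by field.
    """
    values = dict(data.get("default_settings") or {})
    values.update(data.get("user_settings") or {})
    return values


def missing_required(settings: dict) -> List[str]:
    """Return the required field names that are absent or blank."""
    return [name for name in REQUIRED_FIELDS
            if not str(settings.get(name) or "").strip()]
```

Running `missing_required` before the POST mirrors the frontend check; the backend still answers with 400 and the field name if a required field slips through.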
Customer Journey 6: Discovering Available Languages and Voices
User Goal
"I want to see what languages and voices are available for speech and translation services."
User Story
As a user, I want to browse available languages and filter voices by language to configure my preferences.
User Story Flow
sequenceDiagram
participant User
participant Frontend
participant Backend
alt User wants to see available languages
User->>Frontend: Navigate to language selection
Frontend->>Backend: GET /api/voice-google/languages
Backend-->>Frontend: {success: true, languages: [...]}
Frontend->>Frontend: Display language list
Frontend-->>User: Show available languages
end
alt User wants to see available voices
User->>Frontend: Navigate to voice selection
User->>Frontend: Optionally select language filter
Frontend->>Backend: GET /api/voice-google/voices<br/>?language_code=de-DE (optional)
Backend-->>Frontend: {success: true, voices: [...], language_filter: "de-DE"}
Frontend->>Frontend: Display voice list
Frontend->>Frontend: Group voices by language if no filter
Frontend-->>User: Show available voices
end
alt User filters voices by language
User->>Frontend: Select language in filter
Frontend->>Backend: GET /api/voice-google/voices?language_code=selected
Backend-->>Frontend: Filtered voices list
Frontend->>Frontend: Update voice list
Frontend-->>User: Show filtered voices
end
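Building the voices request and grouping the unfiltered result could be sketched as follows. The shape of a voice entry is not given above, so the `name` and `language_code` keys are assumptions, as are the helper names.

```python
from collections import defaultdict
from typing import Dict, List, Optional
from urllib.parse import urlencode

VOICES_PATH = "/api/voice-google/voices"


def voices_url(language_code: Optional[str] = None) -> str:
    """Return the voices path, adding the language_code filter only when set."""
    if language_code:
        return f"{VOICES_PATH}?{urlencode({'language_code': language_code})}"
    return VOICES_PATH


def group_by_language(voices: List[dict]) -> Dict[str, List[str]]:
    """Group voice names by language for the unfiltered view.

    Assumes each voice entry carries "name" and "language_code" keys.
    """
    grouped = defaultdict(list)
    for voice in voices:
        grouped[voice["language_code"]].append(voice["name"])
    # Sort languages and names so the list renders in a stable order.
    return {language: sorted(names) for language, names in sorted(grouped.items())}
```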
Customer Journey 7: Real-time Speech-to-Text via WebSocket
User Goal
"I want to get real-time transcription as I speak, without uploading a complete audio file."
User Story
As a user, I want to establish a WebSocket connection and stream audio chunks to receive live transcription results.
User Story Flow
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Initiate real-time speech-to-text
User->>Frontend: Select language (default: de-DE)
Frontend->>Frontend: Start audio capture
Frontend->>Backend: WebSocket: /api/voice-google/ws/speech-to-text<br/>?userId=user&language=de-DE
Backend->>Backend: Accept WebSocket connection
Backend->>Backend: Initialize voice interface
Backend-->>Frontend: {type: "connected", connection_id, message}
Frontend-->>User: Show "Connected" status
loop User speaks
User->>Frontend: Speak into microphone
Frontend->>Frontend: Capture audio chunk
Frontend->>Frontend: Encode audio chunk to base64
Frontend->>Backend: {type: "audio_chunk", data: base64_audio, timestamp}
alt Processing error
Backend->>Backend: Log error
Backend-->>Frontend: {type: "error", error: "..."}
Frontend-->>User: Show error message
else Success
Backend->>Backend: Process audio chunk (Speech-to-Text)
Backend-->>Frontend: {type: "transcription_result", text, confidence, is_final}
Frontend->>Frontend: Update transcription display
Frontend-->>User: Show live transcription (interim or final)
end
end
alt User sends ping
User->>Frontend: Keep-alive ping
Frontend->>Backend: {type: "ping", timestamp}
Backend-->>Frontend: {type: "pong", timestamp}
end
alt User stops or disconnects
User->>Frontend: Stop recording
Frontend->>Backend: Close WebSocket
Backend->>Backend: Cleanup connection
Frontend-->>User: Show disconnected status
end
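The message exchange in this flow can be sketched without a WebSocket library by separating message construction from state handling. Several details are assumptions: the timestamp is taken to be epoch seconds, an interim result is assumed to replace the previous interim text, and `LiveTranscript` is a hypothetical name.

```python
import base64
import json
import time
from typing import List, Optional


def audio_chunk_message(chunk: bytes, timestamp: Optional[float] = None) -> str:
    """Encode one captured audio chunk as an "audio_chunk" message."""
    return json.dumps({
        "type": "audio_chunk",
        "data": base64.b64encode(chunk).decode("ascii"),
        # Epoch seconds is an assumption; the flow does not define the format.
        "timestamp": time.time() if timestamp is None else timestamp,
    })


class LiveTranscript:
    """Tracks connection status and the text shown during live transcription."""

    def __init__(self) -> None:
        self.status = "connecting"
        self.final_segments: List[str] = []
        self.interim = ""
        self.last_error: Optional[str] = None

    def handle(self, raw: str) -> None:
        """Apply one server message to the local state."""
        message = json.loads(raw)
        kind = message.get("type")
        if kind == "connected":
            self.status = "connected"
        elif kind == "transcription_result":
            if message.get("is_final"):
                self.final_segments.append(message["text"])
                self.interim = ""
            else:
                # Assumed: a newer interim result replaces the previous one.
                self.interim = message["text"]
        elif kind == "error":
            self.last_error = message.get("error")
        # "pong" and unknown message types leave the state unchanged.

    @property
    def display_text(self) -> str:
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

A real client would send the output of `audio_chunk_message` over the socket and pass every received frame to `handle`.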
Customer Journey 8: Real-time Text-to-Speech via WebSocket
User Goal
"I want to send text and receive audio streams in real-time without waiting for a complete file."
User Story
As a user, I want to establish a WebSocket connection, send text messages, and receive audio chunks for playback.
User Story Flow
sequenceDiagram
participant User
participant Frontend
participant Backend
User->>Frontend: Initiate real-time text-to-speech
User->>Frontend: Select language (default: de-DE)
User->>Frontend: Select voice (default: de-DE-Wavenet-A)
Frontend->>Backend: WebSocket: /api/voice-google/ws/text-to-speech<br/>?userId=user&language=de-DE&voice=de-DE-Wavenet-A
Backend->>Backend: Accept WebSocket connection
Backend-->>Frontend: {type: "connected", connection_id, message}
Frontend-->>User: Show "Connected" status
loop User sends text
User->>Frontend: Enter text to speak
User->>Frontend: Click "Speak" or send
Frontend->>Backend: {type: "text_to_speak", text: "..."}
alt Processing error
Backend->>Backend: Log error
Backend-->>Frontend: {type: "error", error: "..."}
Frontend-->>User: Show error message
else Success
Backend->>Backend: Process text-to-speech
Backend-->>Frontend: {type: "audio_data", audio: base64_audio, format: "mp3"}
Frontend->>Frontend: Decode base64 audio
Frontend->>Frontend: Create audio blob
Frontend->>Frontend: Play audio or queue for playback
Frontend-->>User: Play generated speech audio
end
end
alt User sends ping
User->>Frontend: Keep-alive ping
Frontend->>Backend: {type: "ping", timestamp}
Backend-->>Frontend: {type: "pong", timestamp}
end
alt User disconnects
User->>Frontend: Close connection
Frontend->>Backend: Close WebSocket
Backend->>Backend: Cleanup connection
Frontend-->>User: Show disconnected status
end
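For this journey, the connection path and the two message directions might be sketched as below. The function names are hypothetical, and the example stops at decoding; playback and queuing depend on the frontend's audio API.

```python
import base64
import json
from typing import Optional
from urllib.parse import urlencode

WS_TTS_PATH = "/api/voice-google/ws/text-to-speech"


def tts_socket_path(user_id: str = "user", language: str = "de-DE",
                    voice: str = "de-DE-Wavenet-A") -> str:
    """Return the WebSocket path with the query parameters shown in the flow."""
    query = urlencode({"userId": user_id, "language": language, "voice": voice})
    return f"{WS_TTS_PATH}?{query}"


def text_to_speak_message(text: str) -> str:
    """Build the "text_to_speak" message for one piece of text."""
    if not text.strip():
        raise ValueError("Text must not be empty.")
    return json.dumps({"type": "text_to_speak", "text": text})


def decode_audio_message(raw: str) -> Optional[bytes]:
    """Return MP3 bytes for an "audio_data" message, or None for other types."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "error":
        raise RuntimeError(message.get("error") or "Unknown server error")
    if kind != "audio_data":
        return None  # for example "connected" or "pong"
    if message.get("format") != "mp3":
        raise ValueError(f"Unsupported audio format: {message.get('format')}")
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(message["audio"], validate=True)
```

The bytes returned by `decode_audio_message` correspond to the "Create audio blob" step.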