# Voice Service Customer Journeys

This document describes customer journeys for the Google Cloud Voice Services, focusing on how users interact with speech-to-text, translation, text-to-speech, and real-time voice interpretation features.

## Table of Contents

1. [Overview](#overview)
2. [Customer Journey 1: Converting Speech to Text](#customer-journey-1-converting-speech-to-text)
3. [Customer Journey 2: Translating Text](#customer-journey-2-translating-text)
4. [Customer Journey 3: Real-time Voice Interpretation](#customer-journey-3-real-time-voice-interpretation)
5. [Customer Journey 4: Converting Text to Speech](#customer-journey-4-converting-text-to-speech)
6. [Customer Journey 5: Managing Voice Settings](#customer-journey-5-managing-voice-settings)
7. [Customer Journey 6: Discovering Available Languages and Voices](#customer-journey-6-discovering-available-languages-and-voices)
8. [Customer Journey 7: Real-time Speech-to-Text via WebSocket](#customer-journey-7-real-time-speech-to-text-via-websocket)
9. [Customer Journey 8: Real-time Text-to-Speech via WebSocket](#customer-journey-8-real-time-text-to-speech-via-websocket)

---

## Overview

The voice service routes (`/api/voice-google`) enable users to interact with Google Cloud voice services including Speech-to-Text, Translation, Text-to-Speech, and real-time voice interpretation. These routes support both HTTP REST endpoints for file-based processing and WebSocket endpoints for real-time streaming.

**Key Principles:**

- **User-Centric**: Documentation organized around what users want to accomplish
- **Multi-Modal**: Supports both file upload and real-time streaming
- **Language-Aware**: All operations support multiple languages with user-configurable defaults
- **Settings-Driven**: User preferences stored and applied automatically

---

## Customer Journey 1: Converting Speech to Text

### User Goal

"I want to record my speech and convert it into text."

### User Story

As a user, I want to record my speech through the frontend microphone and get the transcribed text with confidence scores and language detection.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Click "Start Recording"
    Frontend->>Frontend: Request microphone access

    alt Microphone access denied
        Frontend-->>User: Show permission error
    else Microphone access granted
        Frontend->>Frontend: Start audio recording
        Frontend-->>User: Show recording indicator (red dot, timer)
        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio stream
        User->>Frontend: Click "Stop Recording"
        Frontend->>Frontend: Stop audio recording
        Frontend->>Frontend: Convert audio stream to audio file
        Frontend->>Frontend: Validate audio file (size, format)

        alt File validation fails
            Frontend-->>User: Show validation error
        else File validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/speech-to-text<br/>(audioFile, language)

            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Invalid audio format (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show format error message
            else Speech recognition fails (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show recognition error
            else Success (200)
                Backend-->>Frontend: {success: true, text, confidence, language, audio_info}
                Frontend->>Frontend: Display transcribed text
                Frontend->>Frontend: Display confidence score
                Frontend->>Frontend: Display detected language
                Frontend->>Frontend: Display audio metadata
                Frontend-->>User: Show transcription result with metadata
            end
        end
    end
```

---

## Customer Journey 2: Translating Text

### User Goal

"I want to translate text from one language to another."

### User Story

As a user, I want to enter text and translate it between languages, seeing both the original and translated text.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Enter text to translate
    User->>Frontend: Select source language (default: de)
    User->>Frontend: Select target language (default: en)
    User->>Frontend: Click "Translate"
    Frontend->>Frontend: Validate text is not empty

    alt Text is empty
        Frontend-->>User: Show validation error
    else Text validation passes
        Frontend->>Frontend: Show loading state
        Frontend->>Backend: POST /api/voice-google/translate<br/>(text, sourceLanguage, targetLanguage)

        alt Authentication fails (401)
            Backend-->>Frontend: 401 Unauthorized
            Frontend-->>User: Show authentication error
        else Empty text (400)
            Backend-->>Frontend: 400 Bad Request
            Frontend-->>User: Show empty text error
        else Translation fails (400)
            Backend-->>Frontend: 400 Bad Request + error details
            Frontend-->>User: Show translation error
        else Success (200)
            Backend-->>Frontend: {success: true, original_text, translated_text, source_language, target_language}
            Frontend->>Frontend: Display original text
            Frontend->>Frontend: Display translated text
            Frontend->>Frontend: Display language information
            Frontend-->>User: Show translation result
        end
    end
```

---

## Customer Journey 3: Real-time Voice Interpretation

### User Goal

"I want to speak in one language and get the translated text in another language."

### User Story

As a user, I want to record my speech through the frontend microphone and receive both the transcribed text in the source language and its translation in the target language.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Select source language (default: de-DE)
    User->>Frontend: Select target language (default: en-US)
    User->>Frontend: Click "Start Recording"
    Frontend->>Frontend: Request microphone access

    alt Microphone access denied
        Frontend-->>User: Show permission error
    else Microphone access granted
        Frontend->>Frontend: Start audio recording
        Frontend-->>User: Show recording indicator (red dot, timer)
        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio stream
        User->>Frontend: Click "Stop Recording"
        Frontend->>Frontend: Stop audio recording
        Frontend->>Frontend: Convert audio stream to audio file
        Frontend->>Frontend: Validate audio file

        alt File validation fails
            Frontend-->>User: Show validation error
        else File validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/realtime-interpreter<br/>(audioFile, fromLanguage, toLanguage, connectionId?)

            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Invalid audio format (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show format error
            else Interpretation fails (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show interpretation error
            else Success (200)
                Backend-->>Frontend: {success: true, original_text, translated_text, confidence, source_language, target_language, audio_info}
                Frontend->>Frontend: Display original transcribed text
                Frontend->>Frontend: Display translated text
                Frontend->>Frontend: Display confidence score
                Frontend->>Frontend: Display language information
                Frontend->>Frontend: Display audio metadata
                Frontend-->>User: Show interpretation result with both texts
            end
        end
    end
```

---

## Customer Journey 4: Converting Text to Speech

### User Goal

"I want to convert written text into spoken audio."

### User Story

As a user, I want to enter text, select a language and voice, and receive an audio file of the spoken text.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Enter text to speak
    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Select voice (optional, default from settings)
    User->>Frontend: Click "Generate Speech"
    Frontend->>Frontend: Validate text is not empty

    alt Text is empty
        Frontend-->>User: Show validation error
    else Text validation passes
        Frontend->>Frontend: Show loading state
        Frontend->>Backend: POST /api/voice-google/text-to-speech<br/>(text, language, voice?)

        alt Authentication fails (401)
            Backend-->>Frontend: 401 Unauthorized
            Frontend-->>User: Show authentication error
        else Empty text (400)
            Backend-->>Frontend: 400 Bad Request
            Frontend-->>User: Show empty text error
        else Text-to-Speech fails (400)
            Backend-->>Frontend: 400 Bad Request + error details
            Frontend-->>User: Show TTS error
        else Success (200)
            Backend-->>Frontend: Audio file (audio/mpeg)<br/>Headers: X-Voice-Name, X-Language-Code
            Frontend->>Frontend: Create audio blob from response
            Frontend->>Frontend: Create download link or audio player
            Frontend->>Frontend: Display voice name and language
            Frontend-->>User: Show audio player with download option
        end
    end
```

---

## Customer Journey 5: Managing Voice Settings

### User Goal

"I want to configure my default voice settings for speech recognition, translation, and text-to-speech."

### User Story

As a user, I want to view and update my voice settings including default languages, voice preferences, and translation settings.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Navigate to voice settings
    Frontend->>Backend: GET /api/voice-google/settings
    Backend-->>Frontend: {success: true, data: {user_settings, default_settings}}
    Frontend->>Frontend: Pre-populate form with user_settings or default_settings
    Frontend-->>User: Display settings form

    alt User views settings only
        User->>Frontend: View current settings
        Frontend-->>User: Display settings (read-only or editable)
    else User updates settings
        User->>Frontend: Modify settings (sttLanguage, ttsLanguage, ttsVoice, translationEnabled, targetLanguage)
        User->>Frontend: Click "Save Settings"
        Frontend->>Frontend: Validate required fields (sttLanguage, ttsLanguage, ttsVoice)

        alt Validation fails
            Frontend-->>User: Show validation errors
        else Validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/settings<br/>(settings object)

            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Missing required field (400)
                Backend-->>Frontend: 400 Bad Request + field name
                Frontend-->>User: Show missing field error
            else Save fails (500)
                Backend-->>Frontend: 500 Internal Server Error
                Frontend-->>User: Show save error
            else Success (200)
                Backend-->>Frontend: {success: true, message, data: settings}
                Frontend->>Frontend: Update UI with saved settings
                Frontend->>Frontend: Show success message
                Frontend-->>User: Display confirmation and updated settings
            end
        end
    end
```

---

## Customer Journey 6: Discovering Available Languages and Voices

### User Goal

"I want to see what languages and voices are available for speech and translation services."

### User Story

As a user, I want to browse available languages and filter voices by language to configure my preferences.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    alt User wants to see available languages
        User->>Frontend: Navigate to language selection
        Frontend->>Backend: GET /api/voice-google/languages
        Backend-->>Frontend: {success: true, languages: [...]}
        Frontend->>Frontend: Display language list
        Frontend-->>User: Show available languages
    end

    alt User wants to see available voices
        User->>Frontend: Navigate to voice selection
        User->>Frontend: Optionally select language filter
        Frontend->>Backend: GET /api/voice-google/voices<br/>?language_code=de-DE (optional)
        Backend-->>Frontend: {success: true, voices: [...], language_filter: "de-DE"}
        Frontend->>Frontend: Display voice list
        Frontend->>Frontend: Group voices by language if no filter
        Frontend-->>User: Show available voices
    end

    alt User filters voices by language
        User->>Frontend: Select language in filter
        Frontend->>Backend: GET /api/voice-google/voices?language_code=selected
        Backend-->>Frontend: Filtered voices list
        Frontend->>Frontend: Update voice list
        Frontend-->>User: Show filtered voices
    end
```

---

## Customer Journey 7: Real-time Speech-to-Text via WebSocket

### User Goal

"I want to get real-time transcription as I speak, without uploading a complete audio file."

### User Story

As a user, I want to establish a WebSocket connection and stream audio chunks to receive live transcription results.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Initiate real-time speech-to-text
    User->>Frontend: Select language (default: de-DE)
    Frontend->>Frontend: Start audio capture
    Frontend->>Backend: WebSocket: /api/voice-google/ws/speech-to-text<br/>?userId=user&language=de-DE
    Backend->>Backend: Accept WebSocket connection
    Backend->>Backend: Initialize voice interface
    Backend-->>Frontend: {type: "connected", connection_id, message}
    Frontend-->>User: Show "Connected" status

    loop User speaks
        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio chunk
        Frontend->>Frontend: Encode audio chunk to base64
        Frontend->>Backend: {type: "audio_chunk", data: base64_audio, timestamp}

        alt Processing error
            Backend->>Backend: Log error
            Backend-->>Frontend: {type: "error", error: "..."}
            Frontend-->>User: Show error message
        else Success
            Backend->>Backend: Process audio chunk (Speech-to-Text)
            Backend-->>Frontend: {type: "transcription_result", text, confidence, is_final}
            Frontend->>Frontend: Update transcription display
            Frontend-->>User: Show live transcription (interim or final)
        end
    end

    alt User sends ping
        User->>Frontend: Keep-alive ping
        Frontend->>Backend: {type: "ping", timestamp}
        Backend-->>Frontend: {type: "pong", timestamp}
    end

    alt User stops or disconnects
        User->>Frontend: Stop recording
        Frontend->>Backend: Close WebSocket
        Backend->>Backend: Cleanup connection
        Frontend-->>User: Show disconnected status
    end
```

---

## Customer Journey 8: Real-time Text-to-Speech via WebSocket

### User Goal

"I want to send text and receive audio streams in real-time without waiting for a complete file."

### User Story

As a user, I want to establish a WebSocket connection, send text messages, and receive audio chunks for playback.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Initiate real-time text-to-speech
    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Select voice (default: de-DE-Wavenet-A)
    Frontend->>Backend: WebSocket: /api/voice-google/ws/text-to-speech<br/>?userId=user&language=de-DE&voice=de-DE-Wavenet-A
    Backend->>Backend: Accept WebSocket connection
    Backend-->>Frontend: {type: "connected", connection_id, message}
    Frontend-->>User: Show "Connected" status

    loop User sends text
        User->>Frontend: Enter text to speak
        User->>Frontend: Click "Speak" or send
        Frontend->>Backend: {type: "text_to_speak", text: "..."}

        alt Processing error
            Backend->>Backend: Log error
            Backend-->>Frontend: {type: "error", error: "..."}
            Frontend-->>User: Show error message
        else Success
            Backend->>Backend: Process text-to-speech
            Backend-->>Frontend: {type: "audio_data", audio: base64_audio, format: "mp3"}
            Frontend->>Frontend: Decode base64 audio
            Frontend->>Frontend: Create audio blob
            Frontend->>Frontend: Play audio or queue for playback
            Frontend-->>User: Play generated speech audio
        end
    end

    alt User sends ping
        User->>Frontend: Keep-alive ping
        Frontend->>Backend: {type: "ping", timestamp}
        Backend-->>Frontend: {type: "pong", timestamp}
    end

    alt User disconnects
        User->>Frontend: Close connection
        Frontend->>Backend: Close WebSocket
        Backend->>Backend: Cleanup connection
        Frontend-->>User: Show disconnected status
    end
```
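
### Message Handling Sketch

The message shapes exchanged in Journeys 7 and 8 can be sketched as a few small helpers. This is a minimal Python illustration under stated assumptions, not the service's actual client: the helper names (`build_ws_url`, `audio_chunk_message`, `text_to_speak_message`, `handle_server_message`), the placeholder host `example.invalid`, and the epoch-millisecond timestamp format are all hypothetical. Only the message `type` values, field names, paths, and query parameters are taken from the diagrams.

```python
import base64
import json
import time
from typing import Any, Tuple
from urllib.parse import urlencode

# Hypothetical host; only the /api/voice-google/ws path comes from the diagrams.
BASE_WS_URL = "wss://example.invalid/api/voice-google/ws"


def build_ws_url(endpoint: str, **params: str) -> str:
    """Build a connection URL such as .../ws/text-to-speech?userId=...&language=..."""
    return f"{BASE_WS_URL}/{endpoint}?{urlencode(params)}"


def audio_chunk_message(chunk: bytes) -> str:
    """Journey 7: wrap a raw audio chunk as a base64-encoded 'audio_chunk' message."""
    return json.dumps({
        "type": "audio_chunk",
        "data": base64.b64encode(chunk).decode("ascii"),
        "timestamp": int(time.time() * 1000),  # assumption: epoch milliseconds
    })


def text_to_speak_message(text: str) -> str:
    """Journey 8: wrap text as a 'text_to_speak' message."""
    return json.dumps({"type": "text_to_speak", "text": text})


def handle_server_message(raw: str) -> Tuple[str, Any]:
    """Dispatch on the server message 'type' and return (kind, payload)."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "connected":
        return kind, msg["connection_id"]
    if kind == "transcription_result":
        return kind, (msg["text"], msg["confidence"], msg["is_final"])
    if kind == "audio_data":
        # Decode the base64 payload back into raw audio bytes for playback.
        return kind, base64.b64decode(msg["audio"])
    if kind == "pong":
        return kind, msg.get("timestamp")
    if kind == "error":
        return kind, msg.get("error")
    # Anything not shown in the diagrams is surfaced rather than dropped.
    return "unknown", msg
```

A client would send the output of `audio_chunk_message` or `text_to_speak_message` over an open socket and pass each incoming frame to `handle_server_message`. Keeping these helpers free of any socket library makes the protocol logic testable without a running backend.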