# Voice Service Customer Journeys

This document describes customer journeys for the Google Cloud Voice Services, focusing on how users interact with speech-to-text, translation, text-to-speech, and real-time voice interpretation features.

## Table of Contents

1. [Overview](#overview)
2. [Customer Journey 1: Converting Speech to Text](#customer-journey-1-converting-speech-to-text)
3. [Customer Journey 2: Translating Text](#customer-journey-2-translating-text)
4. [Customer Journey 3: Real-time Voice Interpretation](#customer-journey-3-real-time-voice-interpretation)
5. [Customer Journey 4: Converting Text to Speech](#customer-journey-4-converting-text-to-speech)
6. [Customer Journey 5: Managing Voice Settings](#customer-journey-5-managing-voice-settings)
7. [Customer Journey 6: Discovering Available Languages and Voices](#customer-journey-6-discovering-available-languages-and-voices)
8. [Customer Journey 7: Real-time Speech-to-Text via WebSocket](#customer-journey-7-real-time-speech-to-text-via-websocket)
9. [Customer Journey 8: Real-time Text-to-Speech via WebSocket](#customer-journey-8-real-time-text-to-speech-via-websocket)

---

## Overview

The voice service routes (`/api/voice-google`) enable users to interact with Google Cloud voice services including Speech-to-Text, Translation, Text-to-Speech, and real-time voice interpretation. These routes support both HTTP REST endpoints for file-based processing and WebSocket endpoints for real-time streaming.

**Key Principles:**

- **User-Centric**: Documentation organized around what users want to accomplish
- **Multi-Modal**: Supports both file upload and real-time streaming
- **Language-Aware**: All operations support multiple languages with user-configurable defaults
- **Settings-Driven**: User preferences stored and applied automatically

---

## Customer Journey 1: Converting Speech to Text

### User Goal

"I want to record my speech and convert it into text."

### User Story

As a user, I want to record my speech through the frontend microphone and get the transcribed text with confidence scores and language detection.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Click "Start Recording"
    Frontend->>Frontend: Request microphone access
    alt Microphone access denied
        Frontend-->>User: Show permission error
    else Microphone access granted
        Frontend->>Frontend: Start audio recording
        Frontend-->>User: Show recording indicator (red dot, timer)

        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio stream

        User->>Frontend: Click "Stop Recording"
        Frontend->>Frontend: Stop audio recording
        Frontend->>Frontend: Convert audio stream to audio file
        Frontend->>Frontend: Validate audio file (size, format)

        alt File validation fails
            Frontend-->>User: Show validation error
        else File validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/speech-to-text<br/>(audioFile, language)

            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Invalid audio format (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show format error message
            else Speech recognition fails (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show recognition error
            else Success (200)
                Backend-->>Frontend: {success: true, text, confidence, language, audio_info}
                Frontend->>Frontend: Display transcribed text
                Frontend->>Frontend: Display confidence score
                Frontend->>Frontend: Display detected language
                Frontend->>Frontend: Display audio metadata
                Frontend-->>User: Show transcription result with metadata
            end
        end
    end
```
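The "Validate audio file (size, format)" step in the flow above could be sketched roughly as follows. The size cap and the list of accepted extensions are hypothetical placeholders, since this journey does not state the real limits; the function name is also illustrative.

```python
from pathlib import PurePath

# Hypothetical limits: the journey does not specify the real size cap or formats.
MAX_AUDIO_BYTES = 10 * 1024 * 1024
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".webm", ".flac"}


def validate_audio_file(filename: str, size_bytes: int) -> list[str]:
    """Return validation errors; an empty list means the file may be uploaded."""
    errors = []
    if size_bytes <= 0:
        errors.append("Recording is empty")
    elif size_bytes > MAX_AUDIO_BYTES:
        errors.append("Recording exceeds the size limit")
    # Compare the extension case-insensitively.
    if PurePath(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        errors.append("Unsupported audio format")
    return errors
```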

---

## Customer Journey 2: Translating Text

### User Goal

"I want to translate text from one language to another."

### User Story

As a user, I want to enter text and translate it between languages, seeing both the original and translated text.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Enter text to translate
    User->>Frontend: Select source language (default: de)
    User->>Frontend: Select target language (default: en)
    User->>Frontend: Click "Translate"

    Frontend->>Frontend: Validate text is not empty
    alt Text is empty
        Frontend-->>User: Show validation error
    else Text validation passes
        Frontend->>Frontend: Show loading state
        Frontend->>Backend: POST /api/voice-google/translate<br/>(text, sourceLanguage, targetLanguage)

        alt Authentication fails (401)
            Backend-->>Frontend: 401 Unauthorized
            Frontend-->>User: Show authentication error
        else Empty text (400)
            Backend-->>Frontend: 400 Bad Request
            Frontend-->>User: Show empty text error
        else Translation fails (400)
            Backend-->>Frontend: 400 Bad Request + error details
            Frontend-->>User: Show translation error
        else Success (200)
            Backend-->>Frontend: {success: true, original_text, translated_text, source_language, target_language}
            Frontend->>Frontend: Display original text
            Frontend->>Frontend: Display translated text
            Frontend->>Frontend: Display language information
            Frontend-->>User: Show translation result
        end
    end
```
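As an illustration of the client-side check and request payload in the flow above, here is a minimal sketch. The field names and language defaults come from the diagram; sending them as a JSON body is an assumption, and `build_translate_request` is a hypothetical helper name.

```python
import json


def build_translate_request(
    text: str, source_language: str = "de", target_language: str = "en"
) -> str:
    """Validate the input and return a body for POST /api/voice-google/translate."""
    # Mirrors the "Validate text is not empty" step before any request is sent.
    if not text.strip():
        raise ValueError("Text to translate must not be empty")
    # Assumption: the endpoint accepts a JSON body with these field names.
    payload = {
        "text": text,
        "sourceLanguage": source_language,
        "targetLanguage": target_language,
    }
    return json.dumps(payload, ensure_ascii=False)
```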

---

## Customer Journey 3: Real-time Voice Interpretation

### User Goal

"I want to speak in one language and get the translated text in another language."

### User Story

As a user, I want to record my speech through the frontend microphone and receive both the transcribed text in the source language and its translation in the target language.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Select source language (default: de-DE)
    User->>Frontend: Select target language (default: en-US)
    User->>Frontend: Click "Start Recording"
    Frontend->>Frontend: Request microphone access
    alt Microphone access denied
        Frontend-->>User: Show permission error
    else Microphone access granted
        Frontend->>Frontend: Start audio recording
        Frontend-->>User: Show recording indicator (red dot, timer)

        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio stream

        User->>Frontend: Click "Stop Recording"
        Frontend->>Frontend: Stop audio recording
        Frontend->>Frontend: Convert audio stream to audio file
        Frontend->>Frontend: Validate audio file

        alt File validation fails
            Frontend-->>User: Show validation error
        else File validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/realtime-interpreter<br/>(audioFile, fromLanguage, toLanguage, connectionId?)

            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Invalid audio format (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show format error
            else Interpretation fails (400)
                Backend-->>Frontend: 400 Bad Request + error details
                Frontend-->>User: Show interpretation error
            else Success (200)
                Backend-->>Frontend: {success: true, original_text, translated_text, confidence, source_language, target_language, audio_info}
                Frontend->>Frontend: Display original transcribed text
                Frontend->>Frontend: Display translated text
                Frontend->>Frontend: Display confidence score
                Frontend->>Frontend: Display language information
                Frontend->>Frontend: Display audio metadata
                Frontend-->>User: Show interpretation result with both texts
            end
        end
    end
```
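The response branches in the flow above (401, 400, 200) might be handled along these lines. This is a sketch only: the success fields are taken from the diagram, while the `error` key for failure details and the choice of exception types are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class InterpretationResult:
    original_text: str
    translated_text: str
    confidence: Optional[float]
    source_language: str
    target_language: str


def parse_interpreter_response(status: int, body: dict[str, Any]) -> InterpretationResult:
    """Map a realtime-interpreter response to a result, or raise a user-facing error."""
    if status == 401:
        raise PermissionError("Authentication error: please sign in again")
    if status == 400:
        # Assumption: error details arrive under an "error" key.
        raise ValueError(f"Interpretation failed: {body.get('error', 'no details provided')}")
    if status != 200 or not body.get("success"):
        raise RuntimeError(f"Unexpected response (HTTP {status})")
    return InterpretationResult(
        original_text=body["original_text"],
        translated_text=body["translated_text"],
        confidence=body.get("confidence"),
        source_language=body["source_language"],
        target_language=body["target_language"],
    )
```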

---

## Customer Journey 4: Converting Text to Speech

### User Goal

"I want to convert written text into spoken audio."

### User Story

As a user, I want to enter text, select a language and voice, and receive an audio file of the spoken text.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Enter text to speak
    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Select voice (optional, default from settings)
    User->>Frontend: Click "Generate Speech"

    Frontend->>Frontend: Validate text is not empty
    alt Text is empty
        Frontend-->>User: Show validation error
    else Text validation passes
        Frontend->>Frontend: Show loading state
        Frontend->>Backend: POST /api/voice-google/text-to-speech<br/>(text, language, voice?)

        alt Authentication fails (401)
            Backend-->>Frontend: 401 Unauthorized
            Frontend-->>User: Show authentication error
        else Empty text (400)
            Backend-->>Frontend: 400 Bad Request
            Frontend-->>User: Show empty text error
        else Text-to-Speech fails (400)
            Backend-->>Frontend: 400 Bad Request + error details
            Frontend-->>User: Show TTS error
        else Success (200)
            Backend-->>Frontend: Audio file (audio/mpeg)<br/>Headers: X-Voice-Name, X-Language-Code
            Frontend->>Frontend: Create audio blob from response
            Frontend->>Frontend: Create download link or audio player
            Frontend->>Frontend: Display voice name and language
            Frontend-->>User: Show audio player with download option
        end
    end
```
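On success the endpoint returns binary audio plus two metadata headers rather than JSON. A rough sketch of reading that response is shown below; the header names and content type come from the diagram, while `describe_tts_response` and the shape of its return value are hypothetical.

```python
from typing import Mapping, Optional


def describe_tts_response(headers: Mapping[str, str], audio: bytes) -> dict[str, Optional[object]]:
    """Extract voice metadata from a successful text-to-speech response."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    content_type = lowered.get("content-type", "")
    if not content_type.startswith("audio/mpeg"):
        raise ValueError(f"Expected audio/mpeg, got {content_type!r}")
    return {
        "voice": lowered.get("x-voice-name"),
        "language": lowered.get("x-language-code"),
        "size_bytes": len(audio),
    }
```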

---

## Customer Journey 5: Managing Voice Settings

### User Goal

"I want to configure my default voice settings for speech recognition, translation, and text-to-speech."

### User Story

As a user, I want to view and update my voice settings including default languages, voice preferences, and translation settings.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Navigate to voice settings
    Frontend->>Backend: GET /api/voice-google/settings
    Backend-->>Frontend: {success: true, data: {user_settings, default_settings}}
    Frontend->>Frontend: Pre-populate form with user_settings or default_settings
    Frontend-->>User: Display settings form

    alt User views settings only
        User->>Frontend: View current settings
        Frontend-->>User: Display settings (read-only or editable)
    else User updates settings
        User->>Frontend: Modify settings (sttLanguage, ttsLanguage, ttsVoice, translationEnabled, targetLanguage)
        User->>Frontend: Click "Save Settings"

        Frontend->>Frontend: Validate required fields (sttLanguage, ttsLanguage, ttsVoice)
        alt Validation fails
            Frontend-->>User: Show validation errors
        else Validation passes
            Frontend->>Frontend: Show loading state
            Frontend->>Backend: POST /api/voice-google/settings<br/>(settings object)

            alt Authentication fails (401)
                Backend-->>Frontend: 401 Unauthorized
                Frontend-->>User: Show authentication error
            else Missing required field (400)
                Backend-->>Frontend: 400 Bad Request + field name
                Frontend-->>User: Show missing field error
            else Save fails (500)
                Backend-->>Frontend: 500 Internal Server Error
                Frontend-->>User: Show save error
            else Success (200)
                Backend-->>Frontend: {success: true, message, data: settings}
                Frontend->>Frontend: Update UI with saved settings
                Frontend->>Frontend: Show success message
                Frontend-->>User: Display confirmation and updated settings
            end
        end
    end
```
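Two steps in the flow above lend themselves to a short sketch: choosing the values that pre-populate the form, and checking the three required fields before saving. The field names are taken from the diagram; the helper names are hypothetical, and treating a blank string as missing is an assumption.

```python
REQUIRED_FIELDS = ("sttLanguage", "ttsLanguage", "ttsVoice")


def initial_form_values(data: dict) -> dict:
    """Use saved user settings when present, otherwise fall back to the defaults."""
    return dict(data.get("user_settings") or data.get("default_settings") or {})


def missing_required_fields(settings: dict) -> list[str]:
    """Return the required field names that are absent or blank."""
    # Assumption: whitespace-only values count as missing.
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(settings.get(name) or "").strip()
    ]
```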

---

## Customer Journey 6: Discovering Available Languages and Voices

### User Goal

"I want to see what languages and voices are available for speech and translation services."

### User Story

As a user, I want to browse available languages and filter voices by language to configure my preferences.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    alt User wants to see available languages
        User->>Frontend: Navigate to language selection
        Frontend->>Backend: GET /api/voice-google/languages
        Backend-->>Frontend: {success: true, languages: [...]}
        Frontend->>Frontend: Display language list
        Frontend-->>User: Show available languages
    end

    alt User wants to see available voices
        User->>Frontend: Navigate to voice selection
        User->>Frontend: Optionally select language filter
        Frontend->>Backend: GET /api/voice-google/voices<br/>?language_code=de-DE (optional)
        Backend-->>Frontend: {success: true, voices: [...], language_filter: "de-DE"}
        Frontend->>Frontend: Display voice list
        Frontend->>Frontend: Group voices by language if no filter
        Frontend-->>User: Show available voices
    end

    alt User filters voices by language
        User->>Frontend: Select language in filter
        Frontend->>Backend: GET /api/voice-google/voices?language_code=selected
        Backend-->>Frontend: Filtered voices list
        Frontend->>Frontend: Update voice list
        Frontend-->>User: Show filtered voices
    end
```
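The optional `language_code` filter and the "group voices by language" step above could look roughly like this. The path and query parameter come from the diagram. The shape of a voice entry is not specified in this document, so the `language_code` and `name` keys used for grouping are assumptions.

```python
from collections import defaultdict
from typing import Optional
from urllib.parse import urlencode

VOICES_PATH = "/api/voice-google/voices"


def voices_url(language_code: Optional[str] = None) -> str:
    """Build the voices URL, adding the filter only when a language is selected."""
    if not language_code:
        return VOICES_PATH
    return f"{VOICES_PATH}?{urlencode({'language_code': language_code})}"


def group_by_language(voices: list[dict]) -> dict[str, list[str]]:
    """Group voice names by language for display when no filter is active."""
    # Assumption: each voice entry has "language_code" and "name" keys.
    grouped: dict[str, list[str]] = defaultdict(list)
    for voice in voices:
        grouped[voice["language_code"]].append(voice["name"])
    return dict(grouped)
```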

---

## Customer Journey 7: Real-time Speech-to-Text via WebSocket

### User Goal

"I want to get real-time transcription as I speak, without uploading a complete audio file."

### User Story

As a user, I want to establish a WebSocket connection and stream audio chunks to receive live transcription results.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Initiate real-time speech-to-text
    User->>Frontend: Select language (default: de-DE)
    Frontend->>Frontend: Start audio capture
    Frontend->>Backend: WebSocket: /api/voice-google/ws/speech-to-text<br/>?userId=user&language=de-DE

    Backend->>Backend: Accept WebSocket connection
    Backend->>Backend: Initialize voice interface
    Backend-->>Frontend: {type: "connected", connection_id, message}
    Frontend-->>User: Show "Connected" status

    loop User speaks
        User->>Frontend: Speak into microphone
        Frontend->>Frontend: Capture audio chunk
        Frontend->>Frontend: Encode audio chunk to base64
        Frontend->>Backend: {type: "audio_chunk", data: base64_audio, timestamp}

        alt Processing error
            Backend->>Backend: Log error
            Backend-->>Frontend: {type: "error", error: "..."}
            Frontend-->>User: Show error message
        else Success
            Backend->>Backend: Process audio chunk (Speech-to-Text)
            Backend-->>Frontend: {type: "transcription_result", text, confidence, is_final}
            Frontend->>Frontend: Update transcription display
            Frontend-->>User: Show live transcription (interim or final)
        end
    end

    alt User sends ping
        User->>Frontend: Keep-alive ping
        Frontend->>Backend: {type: "ping", timestamp}
        Backend-->>Frontend: {type: "pong", timestamp}
    end

    alt User stops or disconnects
        User->>Frontend: Stop recording
        Frontend->>Backend: Close WebSocket
        Backend->>Backend: Cleanup connection
        Frontend-->>User: Show disconnected status
    end
```
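The message framing in the loop above can be illustrated without a live socket. The sketch below builds an `audio_chunk` message and folds incoming server messages into the text shown to the user. Message types and fields follow the diagram. The timestamp unit (epoch milliseconds) and the way interim results are displayed after the finalized segments are assumptions.

```python
import base64
import json
import time
from typing import Optional


def audio_chunk_message(chunk: bytes, timestamp_ms: Optional[int] = None) -> str:
    """Encode one captured audio chunk as an "audio_chunk" WebSocket message."""
    if timestamp_ms is None:
        # Assumption: timestamps are epoch milliseconds.
        timestamp_ms = int(time.time() * 1000)
    return json.dumps({
        "type": "audio_chunk",
        "data": base64.b64encode(chunk).decode("ascii"),
        "timestamp": timestamp_ms,
    })


def apply_server_message(raw: str, final_segments: list[str]) -> str:
    """Update final_segments from one server message and return the text to display."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "transcription_result":
        if message.get("is_final"):
            final_segments.append(message["text"])
            return " ".join(final_segments)
        # Interim results are shown but not stored, so later updates can replace them.
        return " ".join(final_segments + [message["text"]])
    if kind == "error":
        return f"Error: {message.get('error', 'unknown')}"
    # "connected" and "pong" leave the transcript unchanged.
    return " ".join(final_segments)
```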

---

## Customer Journey 8: Real-time Text-to-Speech via WebSocket

### User Goal

"I want to send text and receive audio streams in real-time without waiting for a complete file."

### User Story

As a user, I want to establish a WebSocket connection, send text messages, and receive audio chunks for playback.

### User Story Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend

    User->>Frontend: Initiate real-time text-to-speech
    User->>Frontend: Select language (default: de-DE)
    User->>Frontend: Select voice (default: de-DE-Wavenet-A)
    Frontend->>Backend: WebSocket: /api/voice-google/ws/text-to-speech<br/>?userId=user&language=de-DE&voice=de-DE-Wavenet-A

    Backend->>Backend: Accept WebSocket connection
    Backend-->>Frontend: {type: "connected", connection_id, message}
    Frontend-->>User: Show "Connected" status

    loop User sends text
        User->>Frontend: Enter text to speak
        User->>Frontend: Click "Speak" or send
        Frontend->>Backend: {type: "text_to_speak", text: "..."}

        alt Processing error
            Backend->>Backend: Log error
            Backend-->>Frontend: {type: "error", error: "..."}
            Frontend-->>User: Show error message
        else Success
            Backend->>Backend: Process text-to-speech
            Backend-->>Frontend: {type: "audio_data", audio: base64_audio, format: "mp3"}
            Frontend->>Frontend: Decode base64 audio
            Frontend->>Frontend: Create audio blob
            Frontend->>Frontend: Play audio or queue for playback
            Frontend-->>User: Play generated speech audio
        end
    end

    alt User sends ping
        User->>Frontend: Keep-alive ping
        Frontend->>Backend: {type: "ping", timestamp}
        Backend-->>Frontend: {type: "pong", timestamp}
    end

    alt User disconnects
        User->>Frontend: Close connection
        Frontend->>Backend: Close WebSocket
        Backend->>Backend: Cleanup connection
        Frontend-->>User: Show disconnected status
    end
```
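A minimal sketch of the client side of this exchange, again without a live socket: one helper builds the `text_to_speak` message and another decodes an `audio_data` reply into raw bytes for playback. The message types and fields follow the diagram. The helper names are hypothetical, and returning `None` for messages that carry no audio is an illustrative design choice.

```python
import base64
import json
from typing import Optional


def text_to_speak_message(text: str) -> str:
    """Build the "text_to_speak" message sent over the WebSocket."""
    if not text.strip():
        raise ValueError("Text to speak must not be empty")
    return json.dumps({"type": "text_to_speak", "text": text}, ensure_ascii=False)


def decode_audio_message(raw: str) -> Optional[bytes]:
    """Return decoded audio bytes, or None for messages that carry no audio."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "error":
        raise RuntimeError(message.get("error", "unknown error"))
    if kind != "audio_data":
        # "connected" and "pong" messages have nothing to play.
        return None
    # validate=True rejects payloads that are not clean base64.
    return base64.b64decode(message["audio"], validate=True)
```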