STT: Google Speech-to-Text v2 / Chirp Evaluation (follow-up)

Status: in progress — interactive benchmark page available for SysAdmin (2026-05-15).

Related: gateway connectorVoiceGoogle.py uses Speech v1 SpeechClient only.

Goal

Benchmark STT v2 (e.g. Chirp / Chirp 2) for de-DE vs current v1 latest_short / latest_long on:

Latency (time-to-first-token, final latency)
WER / subjective quality in meeting + coaching scenarios
Cost and quota
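
For the WER criterion, a minimal sketch of a word-level error rate that could be applied to the v1 and v2 transcripts against a hand-corrected reference. The function name and the normalization (lowercasing and whitespace splitting only) are assumptions, not part of the existing benchmark code; punctuation is not stripped, so it counts toward the error.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

WER can exceed 1.0 when the hypothesis contains many inserted words, so it should be reported alongside the subjective rating rather than as a percentage of "correct" words.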

Current State (2026-05-15)

Interactive benchmark page added: Administration > System > STT Benchmark (SysAdmin only).
- Upload or record audio; runs v1 and v2 (Chirp 2) simultaneously; shows transcription, confidence, latency side-by-side.
- Configurable: language, v1 model, v2 model, v2 region.
- Backend: routeAdminSttBenchmark.py using google.cloud.speech (v1) + google.cloud.speech_v2 (v2).
- Frontend: SttBenchmarkPage.tsx under /admin/stt-benchmark.
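
The side-by-side run described above can be sketched roughly as follows. This is an illustration only, not the code in routeAdminSttBenchmark.py: `timed`, `run_side_by_side`, and the `Transcriber` callables are hypothetical stand-ins for the real v1 and v2 calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

# Hypothetical signature: takes raw audio bytes, returns the transcript.
Transcriber = Callable[[bytes], str]


def timed(transcribe: Transcriber, audio: bytes) -> Tuple[str, float]:
    """Run one engine and return (transcript, wall-clock latency in ms)."""
    start = time.perf_counter()
    text = transcribe(audio)
    return text, (time.perf_counter() - start) * 1000.0


def run_side_by_side(
    audio: bytes, engines: Dict[str, Transcriber]
) -> Dict[str, Tuple[str, float]]:
    """Submit the same audio to every engine in parallel threads."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {
            name: pool.submit(timed, fn, audio) for name, fn in engines.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

Each engine is timed inside its own worker thread, so the latency of one request does not include time spent waiting for the other.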
Production switch not yet done — connectorVoiceGoogle.py still uses v1 only.

Next Steps

Run benchmark with real meeting/coaching audio samples across de-DE, de-CH, en-US.
Compare latency + quality. Document in this file.
If Chirp 2 wins: add v2 client path to connectorVoiceGoogle.py behind feature flag.
Run A/B on CommCoach streaming and Teamsbot batch paths with identical audio fixtures.
Document decision in wiki/b-reference/ and remove flag or make v2 default.
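
One possible shape for the feature flag, as a sketch only. The environment variable `STT_USE_V2` and both function names are hypothetical; the project may already have its own flag mechanism, in which case that should be used instead.

```python
import os
from typing import Mapping

_TRUTHY = {"1", "true", "yes", "on"}


def use_stt_v2(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when the (hypothetical) STT_USE_V2 flag is switched on."""
    return env.get("STT_USE_V2", "").strip().lower() in _TRUTHY


def select_stt_backend(env: Mapping[str, str] = os.environ) -> str:
    """v1 stays the default until the flag is set explicitly."""
    return "v2" if use_stt_v2(env) else "v1"
```

Defaulting to v1 keeps production behaviour unchanged when the flag is absent, which matches the current state where connectorVoiceGoogle.py uses v1 only.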

Notes

Streaming and batch config differ between v1 and v2; keep VoiceObjects as the single facade.
Billing hooks (calculateSttCostCHF) must use the measured audio duration (see result_end_time on streaming results), not a duration estimated from the compressed byte size.
google-cloud-speech==2.21.0 includes speech_v2 module — no dependency upgrade needed.
Chirp 2 is v2-only and requires regional endpoint ({location}-speech.googleapis.com).
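
The billing and endpoint notes above can be sketched as small helpers. All function names and the `chf_per_minute` parameter are assumptions for illustration; the real calculateSttCostCHF signature is not shown in this file. The sketch also assumes the client library exposes result_end_time as a `datetime.timedelta`, which should be verified against the installed version.

```python
from datetime import timedelta


def regional_endpoint(location: str) -> str:
    """Endpoint pattern from the note: {location}-speech.googleapis.com."""
    return f"{location}-speech.googleapis.com"


def default_recognizer(project_id: str, location: str) -> str:
    """Resource name of the implicit '_' recognizer in Speech v2."""
    return f"projects/{project_id}/locations/{location}/recognizers/_"


def make_v2_client(location: str):
    """Build a v2 client bound to the regional endpoint.

    Imports are local so the pure helpers work without the library installed.
    """
    from google.api_core.client_options import ClientOptions
    from google.cloud import speech_v2

    options = ClientOptions(api_endpoint=regional_endpoint(location))
    return speech_v2.SpeechClient(client_options=options)


def cost_from_duration_chf(
    result_end_time: timedelta, chf_per_minute: float
) -> float:
    """Cost from measured audio duration; the rate is supplied by the caller."""
    seconds = max(result_end_time.total_seconds(), 0.0)
    return seconds / 60.0 * chf_per_minute
```

For example, `regional_endpoint("europe-west4")` yields the endpoint for the region used in the result below. Any provider-side rounding of billed duration is deliberately left out and would need to be checked against the current pricing page.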

Results

Simple test

Result: 4b05925a-00df-41e0-8842-1ee587a3ca26.weba (282 KB), de-DE

v1 (latest_long)
- Model: latest_long
- Latency: 7303 ms
- Confidence: 64.5%
- Alternatives: 1
- Transcript (German, verbatim): "Recording auf Schwyzerdütsch, aber das schafft zu übersetzen will das Modell schlauer ist und welches Modell besser ist schon alles verschiedene Sprachen mache können wir auch auf Deutsch auf Hochdeutsch weiter sprechen und fassen wir doch zusammen, was ich jetzt alles erzählt habe."

v2 (chirp_2)
- Model: chirp_2
- Latency: 2426.9 ms
- Confidence: 75.4%
- Alternatives: 1
- Region: europe-west4
- Transcript (German, verbatim): "Mir teste heute mal Recording auf Schweizer Deutsch, ob er das schafft zum Übersetzen, ob das Modell schlauer ist und ob das Modell besser ist. Zum Haus finde aber verschiedene Sprachen kann machen, können wir auch hoch auf Hochdeutsch weiter sprechen und fassen wir doch zusammen, was ich jetzt alles erzählt habe."

Long meeting
