
# STT: Google Speech-to-Text v2 / Chirp Evaluation (follow-up)

Status: **in progress** — interactive benchmark page available for SysAdmin (2026-05-15).

Related: gateway `connectorVoiceGoogle.py` uses Speech v1 `SpeechClient` only.

## Goal

Benchmark STT v2 (e.g. Chirp / Chirp 2) for `de-DE` vs current v1 `latest_short` / `latest_long` on:

- Latency (time-to-first-token, final latency)
- Word error rate (WER) and subjective quality in meeting and coaching scenarios
- Cost and quota
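
For the WER criterion, a minimal sketch of how the metric could be computed against a hand-corrected reference transcript. The function name `word_error_rate` and the normalization (lowercasing and whitespace splitting only) are assumptions made for illustration, not part of the benchmark backend; punctuation is not stripped, so it counts toward errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the word-level edit distance between the reference
    # prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("fassen wir doch zusammen", "fassen wir zusammen"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.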

## Current State (2026-05-15)

- **Interactive benchmark page** added: `Administration > System > STT Benchmark` (SysAdmin only).
  - Upload or record audio; runs v1 and v2 (Chirp 2) simultaneously; shows transcription, confidence, latency side-by-side.
  - Configurable: language, v1 model, v2 model, v2 region.
  - Backend: `routeAdminSttBenchmark.py` using `google.cloud.speech` (v1) + `google.cloud.speech_v2` (v2).
  - Frontend: `SttBenchmarkPage.tsx` under `/admin/stt-benchmark`.
- **Production switch not yet done** — `connectorVoiceGoogle.py` still uses v1 only.
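
For orientation, a sketch of what a timed synchronous Chirp 2 call through `google.cloud.speech_v2` might look like. It is not the code in `routeAdminSttBenchmark.py`; the helper names (`regional_endpoint`, `recognizer_path`, `transcribe_chirp2`) and the `project_id` parameter are hypothetical, and the latency measured here is wall-clock time around the request only.

```python
import time


def regional_endpoint(location: str) -> str:
    """Build the regional API host, e.g. europe-west4-speech.googleapis.com."""
    return f"{location}-speech.googleapis.com"


def recognizer_path(project_id: str, location: str) -> str:
    """Resource name of the implicit default recognizer ("_") in a region."""
    return f"projects/{project_id}/locations/{location}/recognizers/_"


def transcribe_chirp2(audio: bytes, project_id: str,
                      location: str = "europe-west4",
                      language: str = "de-DE") -> tuple[str, float]:
    """Return (transcript, latency in ms) for one synchronous v2 request."""
    # Imported lazily so the pure helpers above work without the SDK installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech

    client = SpeechClient(
        client_options=ClientOptions(api_endpoint=regional_endpoint(location))
    )
    request = cloud_speech.RecognizeRequest(
        recognizer=recognizer_path(project_id, location),
        config=cloud_speech.RecognitionConfig(
            auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
            language_codes=[language],
            model="chirp_2",
        ),
        content=audio,
    )
    started = time.perf_counter()
    response = client.recognize(request=request)
    latency_ms = (time.perf_counter() - started) * 1000.0

    transcript = " ".join(
        result.alternatives[0].transcript
        for result in response.results
        if result.alternatives
    )
    return transcript, latency_ms
```

Synchronous recognition is only suitable for short clips; longer meeting recordings would likely need the batch or streaming path instead.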

## Next Steps

1. Run benchmark with real meeting/coaching audio samples across `de-DE`, `de-CH`, `en-US`.
2. Compare latency + quality. Document in this file.
3. If Chirp 2 wins: add v2 client path to `connectorVoiceGoogle.py` behind feature flag.
4. Run A/B on CommCoach streaming and Teamsbot batch paths with identical audio fixtures.
5. Document decision in `wiki/b-reference/` and remove flag or make v2 default.
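
Step 3 could be gated roughly as follows. This is only a sketch: the environment variable `STT_USE_V2`, the `Transcriber` type, and `select_transcriber` are hypothetical names, and the real flag mechanism in `connectorVoiceGoogle.py` may look quite different.

```python
import os
from typing import Callable, Mapping

# A transcriber takes raw audio bytes and a language code, returns a transcript.
Transcriber = Callable[[bytes, str], str]


def select_transcriber(v1: Transcriber, v2: Transcriber,
                       env: Mapping[str, str] = os.environ) -> Transcriber:
    """Return the v2 path only when the (hypothetical) flag is switched on."""
    flag = env.get("STT_USE_V2", "").strip().lower()
    return v2 if flag in {"1", "true", "yes"} else v1
```

Keeping v1 as the default when the flag is unset means that removing the flag later (step 5) is the only change needed to make v2 the default.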

## Notes

- Streaming and batch config differ between v1 and v2; keep `VoiceObjects` as the single facade.
- Billing hooks (`calculateSttCostCHF`) must use the measured audio duration (see the streaming `result_end_time`), not estimates derived from the compressed byte size.
- `google-cloud-speech==2.21.0` already includes the `speech_v2` module, so no dependency upgrade is needed.
- Chirp 2 is v2-only and requires a regional endpoint (`{location}-speech.googleapis.com`).
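
To illustrate the billing note, a sketch of deriving cost from measured duration. It assumes the `result_end_time` values of one v1 streaming session have been collected as `datetime.timedelta` objects (which is how the Python client exposes duration fields). `measured_duration`, `cost_chf`, and the per-minute rate are placeholders for illustration; this is not the actual `calculateSttCostCHF` and the rate is not a real price.

```python
from datetime import timedelta
from decimal import Decimal, ROUND_HALF_UP
from typing import Iterable


def measured_duration(result_end_times: Iterable[timedelta]) -> timedelta:
    """Latest end offset seen in the stream, i.e. the audio actually processed."""
    return max(result_end_times, default=timedelta(0))


def cost_chf(duration: timedelta, rate_chf_per_minute: Decimal) -> Decimal:
    """Cost for the measured duration at a caller-supplied CHF-per-minute rate."""
    minutes = Decimal(str(duration.total_seconds())) / Decimal(60)
    return (minutes * rate_chf_per_minute).quantize(
        Decimal("0.0001"), rounding=ROUND_HALF_UP
    )


if __name__ == "__main__":
    offsets = [timedelta(seconds=3.2), timedelta(seconds=12.5), timedelta(seconds=7)]
    # The rate below is a made-up placeholder value.
    print(cost_chf(measured_duration(offsets), Decimal("0.02")))
```

If a long session is split across several streams, the offsets restart with each stream, so the per-stream maxima would have to be summed.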


## Results

### Simple test

Result for `4b05925a-00df-41e0-8842-1ee587a3ca26.weba` (282 KB), language `de-DE`. The transcripts are verbatim STT output and are therefore left in German.

**v1, `latest_long`**

- Model: `latest_long`
- Latency: 7303 ms
- Confidence: 64.5%
- Alternatives: 1

> Recording auf Schwyzerdütsch, aber das schafft zu übersetzen will das Modell schlauer ist und welches Modell besser ist schon alles verschiedene Sprachen mache können wir auch auf Deutsch auf Hochdeutsch weiter sprechen und fassen wir doch zusammen, was ich jetzt alles erzählt habe.

**v2, `chirp_2`**

- Model: `chirp_2`
- Latency: 2426.9 ms
- Confidence: 75.4%
- Alternatives: 1
- Region: `europe-west4`

> Mir teste heute mal Recording auf Schweizer Deutsch, ob er das schafft zum Übersetzen, ob das Modell schlauer ist und ob das Modell besser ist. Zum Haus finde aber verschiedene Sprachen kann machen, können wir auch hoch auf Hochdeutsch weiter sprechen und fassen wir doch zusammen, was ich jetzt alles erzählt habe.

### Long meeting