Google Text To Speech Voice List

Mastering Google Text-to-Speech Voice List: Precision Voice Selection and Customization

Default settings in Google Cloud Text-to-Speech suffice for basic integrations. For production-grade, user-facing deployments—especially in accessibility, localization, or high-engagement domains—these defaults are rarely enough. The quality and context of digital speech matter; users subconsciously react to accent, cadence, and clarity.

Consider a concrete scenario: an audiobook service received negative feedback about robotic delivery and mismatched accents in its Spanish-language sections. The engineering fix was not in the frontend but deep in the voice configuration: selecting the right voice and tuning its parameters.


Voice Inventory: Technical Breakdown

Google Cloud TTS exposes a dynamic voice inventory queried via the API, not static enum values. As of mid-2024 (tested v2.14.0 of the google-cloud-texttospeech library), the system yields:

  • Language Codes: en-US, fr-FR, es-ES, etc.
  • Voice IDs: en-US-Wavenet-D, es-ES-Standard-C, etc.
  • Genders: Enum: FEMALE, MALE, NEUTRAL.
  • Synthesis Tech: Standard or WaveNet (WaveNet = significantly fewer prosody artifacts, at more than 3× the cost).

Table (sample, not exhaustive—fetch fresh for latest):

Language | Voice Name       | Gender | Technology
en-US    | en-US-Wavenet-A  | FEMALE | WaveNet
en-US    | en-US-Wavenet-D  | MALE   | WaveNet
es-ES    | es-ES-Wavenet-B  | MALE   | WaveNet
ja-JP    | ja-JP-Wavenet-C  | FEMALE | WaveNet
fr-FR    | fr-FR-Wavenet-A  | FEMALE | WaveNet

Note: As of 2024, there are more than 15 en-US voices, and the count is growing. The API also returns regional variants, so plan for change.


Dynamic Voice Listing (Python, 2024)

Hard-coding voice options is a maintenance bottleneck. Instead, cache the dynamic list and expose it in your UI.

from google.cloud import texttospeech

def list_voices():
    """Print every available voice: name, language codes, gender, native sample rate."""
    client = texttospeech.TextToSpeechClient()
    result = client.list_voices()
    for v in result.voices:
        langs = ', '.join(v.language_codes)
        gender = texttospeech.SsmlVoiceGender(v.ssml_gender).name
        print(f"{v.name} | {langs} | {gender} | {v.natural_sample_rate_hertz}Hz")

# Usage assumes GOOGLE_APPLICATION_CREDENTIALS is set
if __name__ == "__main__":
    list_voices()

Potential gotcha: The API rate-limits excessive calls (HTTP 429). Cache the list locally and refresh it on a schedule of roughly once every 24 hours rather than on every request.
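The caching advice above can be sketched as a small time-to-live wrapper. This is a minimal sketch, not part of the Google client library: `VoiceListCache`, its `fetch` callable, and the injectable `clock` are hypothetical names introduced here for illustration, and the 24-hour default simply mirrors the refresh interval mentioned above.

```python
import time
from typing import Callable, List, Optional


class VoiceListCache:
    """Caches the result of a voice-listing call for a fixed time-to-live.

    Hypothetical helper: `fetch` is any zero-argument callable that returns
    voice names, for example a thin wrapper around client.list_voices().
    """

    def __init__(self,
                 fetch: Callable[[], List[str]],
                 ttl_seconds: float = 24 * 60 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can fake the passage of time
        self._voices: Optional[List[str]] = None
        self._fetched_at = 0.0

    def get(self) -> List[str]:
        """Return cached voices, calling fetch only if the cache is empty or stale."""
        now = self._clock()
        if self._voices is None or now - self._fetched_at >= self._ttl:
            self._voices = list(self._fetch())
            self._fetched_at = now
        return self._voices

    def invalidate(self) -> None:
        """Force the next get() to refetch, e.g. after a voice-not-found error."""
        self._voices = None
```

In a real service, `fetch` could be something like `lambda: [v.name for v in client.list_voices().voices]`, with `invalidate()` called from the error path when a cached voice stops working.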


Selection Criteria in Practice

Not all voices are equal for every use case. Consider:

  • Language/Locale Concordance: Never trust a generic match—en-GB and en-US are not interchangeable for most users.
  • Voice Gender: Brand tone, target demographic, and content type often dictate consistent gender selection.
  • WaveNet vs. Standard: WaveNet is preferred for production; expensive, but essential where synthetic-sounding TTS undermines user trust.
  • Latency/Quota: WaveNet (~300ms to synthesize 10s of audio) can introduce user-facing delays; batch synthesize for scale.
  • Application Context: For example, public transit announcements demand clarity above naturalness; prioritize intelligibility settings.
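The first three criteria above lend themselves to a simple filter over the cached inventory. The sketch below is illustrative only: `VoiceInfo`, `technology_of`, and `pick_voices` are hypothetical names, and inferring the technology from the third segment of the voice name is an assumption based on names such as en-US-Wavenet-D, not a documented API field.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class VoiceInfo:
    """Hypothetical flattened record built from one entry of the voice list."""
    name: str                        # e.g. "en-US-Wavenet-D"
    language_codes: Tuple[str, ...]  # e.g. ("en-US",)
    gender: str                      # "FEMALE", "MALE" or "NEUTRAL"


def technology_of(voice: VoiceInfo) -> str:
    """Infer the technology from the voice name (naming-convention assumption)."""
    parts = voice.name.split("-")
    return parts[2] if len(parts) >= 4 else "Unknown"


def pick_voices(voices: List[VoiceInfo],
                locale: str,
                gender: Optional[str] = None,
                technology: Optional[str] = None) -> List[VoiceInfo]:
    """Return voices matching the locale exactly, optionally narrowed further."""
    matches = []
    for v in voices:
        if locale not in v.language_codes:
            continue  # exact locale only: en-GB never stands in for en-US
        if gender and v.gender != gender:
            continue
        if technology and technology_of(v).lower() != technology.lower():
            continue
        matches.append(v)
    return matches
```

Matching the locale exactly, rather than by language prefix, is a deliberate choice here: it keeps a regional variant from being silently substituted.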

Parameter Tuning

Beyond static voice choice, real-world deployments typically adjust rate, pitch, and more based on context or user profile.

Parameters:

  • speakingRate: 0.25–4.0 (default=1.0). Most natural human speech is 0.9–1.1.
  • pitch: -20.0 to +20.0 semitones. Over +10 or below -10 starts to sound unnatural.
  • volumeGainDb: Range -96.0 to +16.0.
  • audioEncoding: LINEAR16, MP3, OGG_OPUS, etc.

Example payload:

{
  "input": {"text": "Bienvenue sur notre plateforme."},
  "voice": {
    "languageCode": "fr-FR",
    "name": "fr-FR-Wavenet-A",
    "ssmlGender": "FEMALE"
  },
  "audioConfig": {
    "audioEncoding": "OGG_OPUS",
    "speakingRate": 1.0,
    "pitch": 0,
    "volumeGainDb": -3.0
  }
}

MP3 output is widely supported, but OGG_OPUS typically produces smaller files at comparable quality and is useful for streaming applications.
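When rate, pitch, and gain come from user profiles, it can help to clamp them to the documented ranges before building the audioConfig part of the payload. The following is a minimal sketch under that assumption; `sanitize_audio_params` and the constant names are hypothetical, and the bounds are the ones listed in the parameter list above.

```python
# Bounds taken from the parameter list in this article.
SPEAKING_RATE_RANGE = (0.25, 4.0)
PITCH_RANGE = (-20.0, 20.0)
VOLUME_GAIN_DB_RANGE = (-96.0, 16.0)


def clamp(value, bounds):
    """Limit value to the inclusive (low, high) bounds."""
    low, high = bounds
    return max(low, min(high, value))


def sanitize_audio_params(speaking_rate=1.0, pitch=0.0, volume_gain_db=0.0):
    """Return audioConfig-style keys with every value forced into range."""
    return {
        "speakingRate": clamp(float(speaking_rate), SPEAKING_RATE_RANGE),
        "pitch": clamp(float(pitch), PITCH_RANGE),
        "volumeGainDb": clamp(float(volume_gain_db), VOLUME_GAIN_DB_RANGE),
    }
```

Clamping silently is only one option; a stricter service might reject out-of-range values so that bad profile data surfaces early.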


Case Example: Multi-Region Training Platform

On a language learning product (backend in Python 3.11, google-cloud-texttospeech==2.14.0), user preferences are stored as:

{
  "language": "es-ES",
  "voice": "es-ES-Wavenet-D",
  "speakingRate": 0.95,
  "pitch": 2
}

Workflow:

  1. Fetch voice inventory at user session start.
  2. Present previews (audio/mpeg stream inline in web/mobile UI).
  3. Synthesize with cached user settings, or fall back to platform defaults.
  4. If the requested voice is temporarily unavailable (seen: 404 Not Found from API), degrade gracefully to es-ES-Standard-A—but log for future review.
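Steps 3 and 4 of the workflow can be sketched as a pure function that merges stored preferences over platform defaults and checks the requested voice against the cached inventory. The names `PLATFORM_DEFAULTS` and `resolve_settings` are hypothetical, and the default rate and pitch values are assumptions; only the es-ES-Standard-A fallback comes from the workflow above.

```python
import logging

log = logging.getLogger("tts.voice_selection")

# Assumed platform defaults; only the fallback voice is taken from the article.
PLATFORM_DEFAULTS = {
    "language": "es-ES",
    "voice": "es-ES-Standard-A",
    "speakingRate": 1.0,
    "pitch": 0,
}


def resolve_settings(prefs, available_names, defaults=PLATFORM_DEFAULTS):
    """Merge user prefs over defaults, degrading to the default voice if needed.

    Returns a new dict; the caller's prefs are left untouched.
    """
    settings = {**defaults, **(prefs or {})}
    if settings["voice"] not in available_names:
        # Log for future review, then degrade gracefully.
        log.warning("Voice %s not in inventory; falling back to %s",
                    settings["voice"], defaults["voice"])
        settings["voice"] = defaults["voice"]
    return settings
```

This sketch assumes a single-locale default; a multi-region deployment would likely need one fallback voice per language code.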

Sample code excerpt:

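import logging

from google.api_core import exceptions as gexc
from google.cloud import texttospeech

log = logging.getLogger(__name__)

# fallback_name is an optional parameter added here so the fallback actually runs.
def synthesize(text, lang, name, rate, pitch, output_file, fallback_name=None):
    """Write MP3 audio to output_file; return the voice used, or None on failure."""
    client = texttospeech.TextToSpeechClient()
    config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=rate,
        pitch=pitch)
    # Try the requested voice first, then the fallback if one was given.
    for voice_name in filter(None, (name, fallback_name)):
        voice = texttospeech.VoiceSelectionParams(
            language_code=lang, name=voice_name)
        try:
            resp = client.synthesize_speech(
                input=texttospeech.SynthesisInput(text=text),
                voice=voice,
                audio_config=config)
        except gexc.GoogleAPICallError as ex:
            # Log for later review and move on to the next candidate voice.
            log.warning("Voice %s unavailable: %s", voice_name, ex)
            continue
        with open(output_file, 'wb') as f:
            f.write(resp.audio_content)
        return voice_name
    return None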

Side note: a known issue is that when the output file is written to a networked filesystem, the bytes are sometimes truncated under I/O strain. Treat this as a deployment gotcha.
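One common way to reduce the impact of the truncation issue noted above is to write to a temporary file in the same directory, verify its size, and only then rename it into place. This is a hedged sketch, not a guaranteed fix: `write_audio_atomically` is a hypothetical helper, and rename semantics on networked filesystems vary, so it should be validated against the actual storage backend.

```python
import os
import tempfile


def write_audio_atomically(audio_content: bytes, output_file: str) -> None:
    """Write bytes to a temp file, verify the size, then rename into place."""
    directory = os.path.dirname(os.path.abspath(output_file))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio_content)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        if os.path.getsize(tmp_path) != len(audio_content):
            raise IOError("short write detected for " + output_file)
        os.replace(tmp_path, output_file)  # readers see the old or new file, not a partial one
    except BaseException:
        # Clean up the temp file on any failure, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader polling for the output file then either finds the complete audio or nothing, which is usually easier to handle than a partially written MP3.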


Best Practices and Practical Considerations

  • Voice List Caching: Refresh on deploy, after major Google Cloud TTS updates, or if error logs detect 404 on voice usage.
  • UI Previews: Always allow users to preview voices—precompute short audio samples (e.g., "Sample phrase.") for every supported option.
  • Fallbacks: Build voice fallback logic. The voice list can and does change; missing voices are a silent source of failure.
  • Budget Awareness: Monitor quota consumption—WaveNet voices are billable per character, costlier than Standard.
  • Monitoring: Instrument for API errors and synthesis latency; both can spike under load.
  • Continuous Updates: Subscribe to Google Cloud release notes—voices and quality frequently improve (or in rare cases, regress).
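The monitoring point above can be sketched with a thin wrapper that times each synthesis call and counts failures by exception type. `SynthesisMetrics` is a hypothetical in-memory helper for illustration; a production system would more likely forward these numbers to its existing metrics backend.

```python
import time
from collections import defaultdict


class SynthesisMetrics:
    """Collects per-call latency and error counts in memory."""

    def __init__(self):
        self.latencies_ms = []
        self.errors = defaultdict(int)

    def record(self, fn, *args, **kwargs):
        """Call fn, record how long it took, and count any exception by type."""
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception as ex:
            self.errors[type(ex).__name__] += 1
            raise  # the caller still sees the original error
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.latencies_ms.append(elapsed_ms)
```

Usage would look like `metrics.record(synthesize, text, lang, name, rate, pitch, output_file)`, assuming a `synthesize` function like the one in the case example.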

Where Voice Selection Really Matters

A well-chosen voice bridges accessibility and engagement. Inconsistent, robotic, or regionally mismatched TTS degrades user trust—and is a silent churn factor in multi-lingual markets.

Making voice selection and customization a first-class engineering concern pays off in accessibility, user satisfaction, and globalization. Treat the voice list as a living artifact, not a static table, and integrate quality review into release cycles.


Still using defaults? Revisit your synthesis pipeline with voice inventory awareness—and expect measurable user impact.