Your AI Voice Sounds WRONG! Here's Why 🤖 → 🗣️ - Thorsten-Voice, the free German AI voice.

Transform your Text-to-Speech output from robotic to natural-sounding with proper text preprocessing (cleaning / normalization). My step-by-step YouTube tutorial shows you how to handle numbers, abbreviations, and special characters to significantly improve your TTS quality. This works for any TTS system: not only AI-based text-to-speech models, but also classic engines such as eSpeak / MBROLA.

Video Tutorial

Why Text Cleaning Matters

When feeding text into a TTS system, certain elements can cause unnatural speech patterns:

Abbreviations like „Dr.“ or „Mr.“ can be interpreted as sentence endings
Numbers may be read digit by digit instead of naturally
Special characters and symbols may cause unexpected pauses
Time formats and dates might be misinterpreted
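To make the first point concrete, here is a minimal sketch of how a naive sentence splitter reacts to abbreviations. It is an illustration only: the `naive_sentences` and `expand_abbreviations` helpers and the tiny abbreviation table are hypothetical, and real TTS front ends usually split sentences in more sophisticated ways.

```python
import re

# Hypothetical, deliberately tiny abbreviation table for illustration.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}


def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace, as a simple front end might."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def expand_abbreviations(text):
    """Replace each known abbreviation with its spoken form."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text


raw = "Dr. Smith met Mr. Jones. They talked."

# The periods in "Dr." and "Mr." are mistaken for sentence endings.
print(naive_sentences(raw))

# After expansion only the real sentence boundaries remain.
print(naive_sentences(expand_abbreviations(raw)))
```

Each wrongly detected sentence ending typically becomes an audible pause, which is one reason the uncleaned text sounds choppy.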

„Bad“ text input to TTS: „Dr. Smith paid $1,234 for 2 items at 3pm after waiting outside at 72°F on may, 15th, 2024. While waiting for the train to arrive at 15:45 he called a support hotline at 1-800-555-0123.“

Text NOT cleaned / normalized and spoken with Piper TTS.

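This is hard for most TTS systems because it packs an abbreviation, a currency amount, a temperature, a date, two times, and a phone number into just two sentences, and each of them has to be interpreted correctly before it can be pronounced naturally.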

„Better“ text input keeping the same sentence: „Doctor Smith paid one thousand two hundred thirty-four dollars for two items at three p m after waiting outside at seventy-two degrees Fahrenheit on May fifteenth, twenty twenty-four. While waiting for the train to arrive at fifteen forty-five he called a support hotline at one eight hundred five five five zero one two three.“

Text CLEANED / NORMALIZED and spoken with Piper TTS.

The Solution: Text Preprocessing

Below you’ll find a Python script that handles common text cleaning tasks. It works with any TTS system, including Piper, Coqui, eSpeak, and others.

Features:

Converts numbers to words (e.g., „123“ → „one hundred twenty-three“)
Expands common abbreviations
Handles time formats
Processes dates naturally
Converts temperatures and units
Supports multiple languages (configurable)
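To give a feel for what a few of these features involve, here is a minimal, English-only sketch. It is not the script from the repository: the `number_to_words` and `normalize` functions, the abbreviation table, and the regular expressions are assumptions made for illustration, and dates, 24-hour times, and phone numbers are deliberately left out.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Hypothetical, deliberately tiny abbreviation table.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}


def number_to_words(n):
    """Spell out an integer from 0 to 999,999 in English."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def spell(digits):
    """Spell out a digit string; leave it unchanged if it is out of range."""
    value = int(digits.replace(",", ""))
    return number_to_words(value) if value < 1_000_000 else digits


def normalize(text):
    """Apply a few cleaning rules, most specific first."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Currency: "$1,234" -> "one thousand two hundred thirty-four dollars"
    text = re.sub(r"\$(\d+(?:,\d{3})*)", lambda m: spell(m.group(1)) + " dollars", text)
    # Temperature: "72°F" -> "seventy-two degrees Fahrenheit"
    text = re.sub(r"(\d+)°F", lambda m: spell(m.group(1)) + " degrees Fahrenheit", text)
    # 12-hour time: "3pm" -> "three p m"
    text = re.sub(r"\b(\d{1,2})\s?(am|pm)\b",
                  lambda m: spell(m.group(1)) + " " + " ".join(m.group(2)), text)
    # Any remaining standalone integer: "2" -> "two"
    text = re.sub(r"\b\d+\b", lambda m: spell(m.group(0)), text)
    return text


print(normalize("Dr. Smith paid $1,234 for 2 items at 3pm after waiting outside at 72°F."))
```

The order of the rules matters: the currency, temperature, and time patterns have to run before the generic number rule, otherwise the digits would already be gone when the more specific patterns look for them.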

Download the Script

The script is on my Thorsten-Voice GitHub repository.

Usage Example

I created a Jupyter notebook on Google Colab to show the concept of building your voice processing pipeline, including text cleaning / normalization.

It uses the NVIDIA NeMo framework for text cleaning and Piper for text-to-speech.

The notebook can be found here and is explained in my YouTube tutorial here.
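The overall shape of such a pipeline can be sketched in a few lines of Python. This is not the code from the notebook: `build_piper_command` and `speak` are hypothetical helpers, the `normalize` argument stands in for whatever cleaning step you use (for example a wrapper around NeMo), and the sketch assumes a `piper` executable on your PATH that reads text from standard input and accepts `--model` and `--output_file`.

```python
import subprocess


def build_piper_command(model_path, output_path):
    """Assemble the Piper command line; the text itself is passed via stdin."""
    return ["piper", "--model", model_path, "--output_file", output_path]


def speak(text, normalize, model_path, output_path, runner=subprocess.run):
    """Clean the text first, then hand the cleaned text to Piper.

    `normalize` is any callable that maps raw text to cleaned text.
    `runner` defaults to subprocess.run and can be swapped out for testing.
    """
    cleaned = normalize(text)
    runner(build_piper_command(model_path, output_path),
           input=cleaned.encode("utf-8"), check=True)
    return cleaned


# Example call (model and output file names are placeholders):
# speak("Dr. Smith arrived.", my_normalizer, "voice.onnx", "output.wav")
```

Keeping the cleaning step as a separate callable means you can swap the normalizer or the TTS engine independently of each other.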

Community & Support

Found a bug or have suggestions? Open an issue on GitHub
Questions? Comment below or on the YouTube video

Remember to subscribe to my Thorsten-Voice YouTube channel for more TTS tutorials and updates!