Transform your text-to-speech output from robotic to natural-sounding with proper text preprocessing (cleaning / normalization). My step-by-step YouTube tutorial shows you how to handle numbers, abbreviations, and special characters to significantly improve your TTS quality. This works for any TTS system: not only for fancy AI-based text-to-speech models, but also for eSpeak / MBROLA.
Video Tutorial
Why Text Cleaning Matters
When feeding text into a TTS system, certain elements can cause unnatural speech patterns:
- Abbreviations like “Dr.” or “Mr.” are interpreted as sentence endings
- Numbers are read digit by digit instead of naturally
- Special characters and symbols may cause unexpected pauses
- Time formats and dates might be misinterpreted
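To make the first point concrete, here is a minimal sketch of how an abbreviation can trip up a simple sentence splitter, and how expanding it beforehand avoids the problem. The splitter, the abbreviation table, and the function names are hypothetical illustrations; real TTS front ends split sentences in their own ways.

```python
import re

# Hypothetical abbreviation table; extend it for your own texts and language.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister", "Mrs.": "Missus"}


def naive_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace (deliberately naive)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its spoken form."""
    # (?<!\w) makes sure the abbreviation starts at a word boundary.
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")"
    )
    return pattern.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


text = "Dr. Smith arrived. He was late."
print(naive_sentences(text))
# ['Dr.', 'Smith arrived.', 'He was late.']
print(naive_sentences(expand_abbreviations(text)))
# ['Doctor Smith arrived.', 'He was late.']
```

Without the expansion, the splitter treats “Dr.” as a sentence of its own, which is where the unwanted pause comes from.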
“Bad” text input to TTS: “Dr. Smith paid $1,234 for 2 items at 3pm after waiting outside at 72°F on may, 15th, 2024. While waiting for the train to arrive at 15:45 he called a support hotline at 1-800-555-0123.”
This sentence is hard for most TTS systems because it packs in an abbreviation, a currency amount, a unit, a date, times, and a phone number, none of which are written the way they are spoken.
“Better” text input keeping the same sentence: “Doctor Smith paid one thousand two hundred thirty-four dollars for two items at three p m after waiting outside at seventy-two degrees Fahrenheit on May fifteenth, twenty twenty-four. While waiting for the train to arrive at fifteen forty-five he called a support hotline at one eight hundred five five five zero one two three.”
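The number and currency part of that rewrite can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the script from my repository: the function names are made up for this example, it only covers English integers below one million, and it ignores cents.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999,999 in American English."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0..999,999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def expand_dollars(text: str) -> str:
    """Turn '$1,234' into 'one thousand two hundred thirty-four dollars'."""
    def repl(m: re.Match) -> str:
        amount = int(m.group(1).replace(",", ""))
        if amount >= 1_000_000:
            return m.group(0)  # out of range for this sketch: returned unchanged here
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"
    # Either comma-grouped digits ("1,234") or a plain run of digits ("1234").
    return re.sub(r"\$(\d{1,3}(?:,\d{3})+|\d+)", repl, text)


def expand_integers(text: str) -> str:
    """Spell out remaining standalone integers of up to six digits."""
    return re.sub(r"\b\d{1,6}\b", lambda m: number_to_words(int(m.group(0))), text)


print(expand_integers(expand_dollars("He paid $1,234 for 2 items.")))
# He paid one thousand two hundred thirty-four dollars for two items.
```

The currency rule runs first on purpose: if the generic integer rule ran first, the “$” and the comma would be left behind in the text.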
The Solution: Text Preprocessing
Below you’ll find a Python script that handles common text cleaning tasks. It works with any TTS system, including Piper, Coqui, eSpeak, and others.
Features:
- Converts numbers to words (e.g., “123” → “one hundred twenty-three”)
- Expands common abbreviations
- Handles time formats
- Processes dates naturally
- Converts temperatures and units
- Supports multiple languages (configurable)
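One way to think about features like these is as a chain of small rules applied in a fixed order. The sketch below is my own illustration of that idea under simplifying assumptions, not the code from the script: all names are hypothetical, only English is covered, and the rules leave the digits in place so that a separate number-to-words step can spell them out afterwards.

```python
import re
from typing import Callable, Iterable

Rule = Callable[[str], str]


def expand_temperature(text: str) -> str:
    """'72°F' -> '72 degrees Fahrenheit', '20°C' -> '20 degrees Celsius'."""
    units = {"F": "Fahrenheit", "C": "Celsius"}
    return re.sub(
        r"(\d+)\s*°\s*([FC])\b",
        lambda m: f"{m.group(1)} degrees {units[m.group(2)]}",
        text,
    )


def expand_am_pm(text: str) -> str:
    """'3pm' -> '3 p m', so the letters are spoken one by one."""
    return re.sub(
        r"\b(\d{1,2})\s*([ap])m\b",
        lambda m: f"{m.group(1)} {m.group(2).lower()} m",
        text,
        flags=re.IGNORECASE,
    )


def split_clock_time(text: str) -> str:
    """'15:45' -> '15 45', which a later number step can read as 'fifteen forty-five'.

    Minutes with a leading zero (e.g. '15:05') would need their own rule.
    """
    return re.sub(r"\b([01]?\d|2[0-3]):([0-5]\d)\b", r"\1 \2", text)


def normalize(text: str, rules: Iterable[Rule]) -> str:
    """Apply each rule in order; order matters because rules consume symbols."""
    for rule in rules:
        text = rule(text)
    return text


# Symbol and unit rules run before any generic number conversion.
RULES = [expand_temperature, expand_am_pm, split_clock_time]

print(normalize("at 3pm after waiting outside at 72°F, train at 15:45", RULES))
# at 3 p m after waiting outside at 72 degrees Fahrenheit, train at 15 45
```

Keeping each rule as its own function makes it easy to switch rules on or off, or to swap in a different rule list per language.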
Download the Script
The script is on my Thorsten-Voice GitHub repository.
Usage Example
I created a Jupyter notebook on Google Colab that shows the concept of building a voice processing pipeline, including text cleaning / normalization.
It uses the NVIDIA NeMo framework for text cleaning and Piper for text-to-speech.
The notebook can be found here and is explained in my YouTube tutorial here.
Community & Support
- Found a bug or have suggestions? Open an issue on GitHub
- Questions? Comment below or on the YouTube video
Remember to subscribe to my Thorsten-Voice YouTube channel for more TTS tutorials and updates!