Myanmar Text-to-Speech with End-to-End Speech Synthesis

Htoo Pyae Lwin
Sep 21, 2019

This is the proposed mini-thesis for my B.E degree at Yangon Technological University. Since I was very new to the fields of AI, Machine Learning and Deep Learning, I had to learn a lot in a short amount of time over the past couple of months. My background is full-stack web development, so I'm not a professional in AI; pardon my mistakes if any come up. My supervisor was Dr. Yuzana Win, and she guided me throughout this development. In this article, I'm gonna skip the signal processing theory behind the preprocessing stage just for simplicity. Outline of this article:

  • Introduction to Speech Synthesis
  • Approaches: traditional vs. end-to-end speech synthesis
  • Tacotron
  • Developing a Burmese Text-to-Speech: speech corpus, model implementation, training and results
  • Notes & Thoughts

Introduction to Speech Synthesis

Text-to-Speech is a joint research area of digital signal processing (DSP) and natural language processing (NLP). The goal is to generate natural-sounding, human-like speech from input words or sentences, both in terms of intelligibility and quality, which is why it is a very difficult task for machines. From a technical point of view, speech synthesis can be viewed as a sequence-to-sequence mapping problem: a sequence of characters is converted into an output audio waveform, which is a time series. Due to this nature, it is possible to achieve this goal using sequence models.

Approaches

There are two main ways to approach the problem of synthesizing speech. The first one is the more traditional way used before deep learning. In this type of approach, there are essentially two main stages: text analysis and speech synthesis. Text analysis is responsible for extracting linguistic features such as phonemes, prosody, etc. The actual generation of the audio waveform happens in the second stage, and there are two common ways to do it: concatenative or parametric synthesis. (Refer to the figure below to see the differences.)

Traditional Text-to-Speech Approach

Traditional Speech Synthesis

When I first got the thesis title, I decided to go with this approach as it can generate the best-quality audio. I developed a sentence-level G2P (Grapheme-to-Phoneme) conversion by following Sayar Ye Kyaw Thu's myG2P dictionary. However, I got stuck at the synthesizer stage, as it requires domain expertise on many levels, and I had to change my plan. I had always wondered about generating speech directly from the input characters, and luckily, there was already research going on in that direction.
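Purely to illustrate the dictionary-based G2P idea (this is a simplified sketch, not my actual module), the lookup could look roughly like this. It assumes a tab-separated dictionary file in the spirit of myG2P, where each line maps a Burmese word to its phoneme sequence; the file name g2p_dict.txt and the greedy longest-match segmentation are my assumptions for illustration.

```python
# Minimal dictionary-based G2P sketch (illustrative only).
# Assumes a tab-separated file "g2p_dict.txt" with lines like: <word>\t<phonemes>

def load_g2p_dict(path="g2p_dict.txt"):
    g2p = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                g2p[parts[0]] = parts[1]
    return g2p

def sentence_to_phonemes(sentence, g2p):
    # Burmese is written without spaces between words, so this sketch uses a
    # greedy longest-match over the dictionary keys (a real system would use
    # a proper word segmenter).
    max_len = max(len(w) for w in g2p)
    phonemes, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in g2p:
                phonemes.append(g2p[sentence[i:j]])
                i = j
                break
        else:
            phonemes.append(sentence[i])  # unknown character: pass through
            i += 1
    return " ".join(phonemes)
```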

End-to-End Speech Synthesis

The idea of this approach is to synthesize speech directly from characters. It does not need phoneme-level alignment and can be trained completely from scratch given <text, audio> pairs. A number of generative models already exist for this purpose, but some of them are not truly end-to-end, as they usually have components developed and trained separately. Among these models, I decided to go with Tacotron from Google, as it is a truly end-to-end generative model that can fulfill our goal.

Tacotron

The backbone of Tacotron is a seq2seq model with attention. Tacotron takes a character sequence as input and outputs raw spectrograms, which are later reconstructed into a waveform using an algorithm called Griffin-Lim. It operates at the frame level, which is why it is faster than sample-level auto-regressive models like WaveNet and SampleRNN. It consists of an encoder, an attention-based decoder, and a post-processing net. The encoder is responsible for building a well-summarized representation of the input character sequence, while the decoder has to learn the alignment between that text representation and the output audio frames based on the context.

Tacotron
CBHG
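To give a feel for the last step of the pipeline, here is a small sketch of turning a predicted magnitude spectrogram into a waveform with Griffin-Lim using librosa. The STFT parameters (hop_length, win_length) and the sample rate are example values I'm assuming for illustration, not the exact ones from my setup.

```python
import librosa
import soundfile as sf

# Example STFT parameters (assumed for illustration; they must match training).
hop_length, win_length = 275, 1100

def spectrogram_to_wav(mag_spectrogram, out_path="sample.wav", sr=22050):
    # mag_spectrogram: linear magnitude spectrogram of shape (1 + n_fft/2, frames),
    # e.g. the de-normalized output of Tacotron's post-processing net.
    wav = librosa.griffinlim(
        mag_spectrogram,
        n_iter=60,               # more iterations -> better phase estimate
        hop_length=hop_length,
        win_length=win_length,
    )
    sf.write(out_path, wav, sr)
    return wav
```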

Developing a Burmese Text-to-Speech

Speech Corpus

In order to use Tacotron, we need a speech corpus, so I had to record myself reading Burmese sentences. The text corpus was provided by Base Technology (Expa.AI), and they also allowed me to use a web-based recording tool that records the audio and creates the mappings between the audio and the text. The final speech corpus contains over 5,000 utterances, which amounts to about 5 hours of speech in total. The maximum input text length is 167 characters and the maximum number of frames in the audio is 907.
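As a rough illustration of how such corpus statistics can be computed, here is a small sketch. It assumes a hypothetical metadata.csv where each line is `wav_filename|transcript` and the audio files live in a wavs/ folder; the actual tool and file layout I used may differ.

```python
import soundfile as sf

# Assumed layout (illustrative): metadata.csv with lines "wav_filename|transcript"
# and the audio files sitting in a "wavs/" folder next to it.
def corpus_stats(metadata_path="metadata.csv", wav_dir="wavs"):
    n_utterances, total_seconds, max_text_len = 0, 0.0, 0
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            wav_name, text = line.rstrip("\n").split("|", 1)
            n_utterances += 1
            max_text_len = max(max_text_len, len(text))
            total_seconds += sf.info(f"{wav_dir}/{wav_name}").duration
    print(f"{n_utterances} utterances, {total_seconds / 3600:.2f} hours, "
          f"max text length {max_text_len}")
```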

Model Implementation

Google did not release an official implementation of Tacotron, but there are some great implementations on GitHub. I referenced this repository and added some other necessary modules. I had to develop a module for converting numbers to words, as the corpus contains numbers in the text. This module is also available here in this repo.
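To show the flavor of that normalization step (a simplified sketch, not the actual module), here is a digit-by-digit spell-out of Myanmar and ASCII digits. A real converter would also handle place values such as ဆယ် (ten), ရာ (hundred) and ထောင် (thousand), which this sketch does not.

```python
import re

# Simplified illustration: spell out digits one by one so the model never
# sees raw numerals. (The real module handles full numbers with place values.)
DIGIT_WORDS = {
    "0": "သုည", "1": "တစ်", "2": "နှစ်", "3": "သုံး", "4": "လေး",
    "5": "ငါး", "6": "ခြောက်", "7": "ခုနစ်", "8": "ရှစ်", "9": "ကိုး",
}
# Map Myanmar digits ၀-၉ onto the same words.
MYANMAR_DIGITS = "၀၁၂၃၄၅၆၇၈၉"
DIGIT_WORDS.update({mm: DIGIT_WORDS[str(i)] for i, mm in enumerate(MYANMAR_DIGITS)})

def expand_digits(text):
    # Replace every digit with its Burmese word followed by a space.
    return re.sub(r"[0-9၀-၉]", lambda m: DIGIT_WORDS[m.group(0)] + " ", text).strip()

print(expand_digits("၂၀၁၉"))  # -> "နှစ် သုည တစ် ကိုး"
```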

Training

I trained this model for 150,000 steps, which took me about a week. Since I used Google Colab for training, I only had access to a runtime for 12 hours at a time, so I had to start over (mounting my Google Drive, preprocessing, and resuming training) every single time :(.
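For anyone in the same situation, each new Colab session looked roughly like this sketch. The checkpoint directory path is an assumption for illustration; the key idea is to keep checkpoints on Drive so they survive the 12-hour runtime limit.

```python
# Rough sketch of the start of each new Colab session (paths are assumptions).
from google.colab import drive
import tensorflow as tf

drive.mount("/content/drive")

# Keep checkpoints on Drive so they survive the 12-hour runtime limit.
checkpoint_dir = "/content/drive/My Drive/tacotron/logs"

# Find the most recent checkpoint (if any) and resume training from it.
latest = tf.train.latest_checkpoint(checkpoint_dir)
print("Resuming from:", latest if latest else "scratch")
```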

Synthesized Audio Samples

Every 1,000 steps, the model is automatically saved and synthesizes a test sentence. Here are some of the outputs, and you can find more here.

1. “မင်းတို့ ဘာကြောင့် မြန်မာ စကားပြော ကြတာလဲ။” (“Why do you all speak Burmese?”)

2. “ခင်ဗျား အမျိုးသား ကော နေကောင်းရဲ့လား။” (“How about your husband, is he doing well?”)

3. “ဒါဆို ရပြီ။” (“In that case, that’s fine.”)

Produced Loss

Loss

Alignment Plot

Alignment

Testing with Custom Inputs

I compared my model with the synthesized audio samples from UCSY to see how big the gap in quality and naturalness is. I knew that my model (with a very limited speech corpus and training time) wasn't gonna predict very well ;) but to my surprise, it did pretty well considering the limitations.

Notes & Thoughts

  • the speech corpus is very small compared to typical English speech corpora, which are over 24 hours long.
  • although I tried my best to avoid noise when recording, some of the audio recordings still contain noise: air blowing, background sounds, etc.
  • during training, the audio length is limited to about 12.5 sec, and longer clips can cause an error like “incompatible shape”. To solve this, you can tune the hyperparameter max_iters according to your maximum audio length (see the sketch after this list).
  • there is also another model called Tacotron 2, which incorporates ideas from both Tacotron and WaveNet. Instead of Griffin-Lim for reconstruction, it uses a WaveNet vocoder. I intend to extend the current Myanmar text-to-speech with this model.
  • as a final thought, the current Myanmar Text-to-Speech is not good enough in terms of quality. The problem lies mainly in the speech corpus, so with a higher-quality speech corpus, I believe this TTS system will be able to produce much better audio.
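To make that max_iters point concrete, here is a back-of-the-envelope sketch. The hyperparameter names and values (frame_shift_ms, outputs_per_step) are assumptions based on a typical Tacotron implementation; your settings may differ.

```python
# Back-of-the-envelope check: how many decoder iterations are needed
# for a given maximum audio length? (Names/values are assumptions.)
frame_shift_ms = 12.5     # hop between spectrogram frames, in milliseconds
outputs_per_step = 5      # frames the decoder emits per step (the "r" factor)
max_audio_seconds = 12.5  # longest utterance you want to support

frames_needed = max_audio_seconds * 1000 / frame_shift_ms      # 1000 frames
max_iters_needed = int(frames_needed / outputs_per_step + 1)   # ~201 decoder steps
print(f"Set max_iters to at least {max_iters_needed}")
```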
