A better way to create the full-context labels for HTS training data

One of my previous post describes my first attempt to generate training data for HTS system from recordings and transcripts: How to create full-context labels for your HTS system (update: not really worked); unfortunately, it did not work as expected. During eNTERFACE’14, I have learned that there is a tool named EHMM in festvox that can help to build the .utt files (and in turn the full-context labels easily).

To start, you can follow the steps in the following link, which is the full instruction to build a CLUSTERGEN voice: http://festvox.org/bsv/c3170.html.

You can actually stop after having obtained the .utt files and use it for your own purposes.

In general, the necessary steps are listed below:

Read More

How to create full-context labels for your HTS system (update: not really worked)

Update: I later found out that the method described below did not work as expected. Tricking Festival by simply providing it with a custom monophone transcript will generates invalid .utt files. Then creating the full-context labels from those .utt files will give you only quin-phone without any other linguistic context.

However, you would still be able to utilize the script in the first part as the front-end for the TTS system (label/.utt generation using Festival). To create .utt for training data, I have noted down better way here: A better way to create the full-context labels for HTS training data.


If you are familiar with the HTS demos, you probably know about their full-context label format. One full context labels looks like this:

The above line contain the phone identity and many of its linguistic context, including the 2 previous and 2 following phones, position of current phone in current syllable, position of current syllable in current words, stress, accent and so many other think. The detailed description of all those context is in lab_format.pdf inside the data folder of any HTS demo.

However, if you are building your own system, you may have problem getting all those context to create that long labels. In fact, HTS could still work with much shorter full-context labels containing much less information, but you should expect some degree of degradation in the quality of the synthesized speech due to the shrinking of the decision tree. Fortunately, all the text analysis can actually be done automatically by Festival. I will show all the steps in the sections below.

Continue reading the detailed steps

How to configure HTS for in-training synthesis with state-level alignment labels


Utilizing state-level alignment labels allows us to copy the prosody from one speaker and use it on another speaker’s acoustic model. This can be used to improve the synthesized results by using prosody from natural speech and phone features from a HMM-based acoustic models. Moreover, since this technique can create phone-aligned parallel sentences from different acoustic models, we can also use it to generate comparable sentences where the quality of the vocoders or the acoustic features in the training data can be compared separately from the duration models.

Continue reading more on the steps

How to configure HTS demo with STRAIGHT features for 16kHz training data

I have been using HTS for a while for my research on speech synthesis. Recently, I have had some problems when I tried to configure the HTS demo with STRAIGHT features to use 16k data instead of 48k. I finally figured out how to properly do that work, and it is really not as easy as changing one or two configurations like in other demoes without STRAIGHT, so I decided to note all the steps down here.

Read More