How to create full-context labels for your HTS system (update: it did not really work)

Update: I later found out that the method described below did not work as expected. Tricking Festival by simply feeding it a custom monophone transcript generates invalid .utt files, and creating full-context labels from those .utt files gives you only the quinphone identity without any other linguistic context.

However, you can still use the script in the first part as the front end of a TTS system (label/.utt generation using Festival). To create .utt files for training data, I have written up a better way here: A better way to create the full-context labels for HTS training data.

Introduction

If you are familiar with the HTS demos, you probably know about their full-context label format. One full-context label looks like this:
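Schematically, such a label packs the quinphone into its first field and the remaining contexts into the /A: through /J: blocks. An illustrative (not exact) line, with made-up phones and values:

```
sil^sil-hh+ax=l@1_2/A:0_0_0/B:1-0-2@1-2&1-7#.../C:.../D:.../E:.../F:.../G:.../H:.../I:.../J:...
```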

The line above contains the phone identity and much of its linguistic context, including the two previous and two following phones, the position of the current phone in the current syllable, the position of the current syllable in the current word, stress, accent, and many other things. A detailed description of all those contexts is in lab_format.pdf inside the data folder of any HTS demo.

However, if you are building your own system, you may have trouble gathering all those contexts to create such long labels. In fact, HTS can still work with much shorter full-context labels containing far less information, but you should expect some degradation in the quality of the synthesized speech due to the shrinking of the decision tree. Fortunately, all the text analysis can be done automatically by Festival. I will show all the steps in the sections below.

Create the full-context label from text

This looks very straightforward at first, but it needs a workaround if you want to use the labels as training data. Basically, you will:

1. Use Festival to create the .utt file

2. Use the scripts in the labels section of the Makefile in the data folder to convert the .utt files into monophone labels and full-context labels. I will not write about this step in this post.

The Scheme script to generate the .utt file from an arbitrary text is very simple:
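A minimal sketch of such a script (the text and file names are placeholders):

```scheme
;; Build an utterance from plain text, run the current voice's full
;; text-analysis/synthesis pipeline, then dump it as a .utt file.
(set! utt (Utterance Text "Hello world"))
(utt.synth utt)
(utt.save utt "hello.utt")
```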

You can type the script as commands into a Festival console, or run it directly with
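For example (assuming festival is on your PATH; -b runs Festival in batch mode):

```
festival -b '(begin
  (set! utt (Utterance Text "Hello world"))
  (utt.synth utt)
  (utt.save utt "hello.utt"))'
```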

Or save it in a .scm file and then call
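For example, with the script saved as make_utt.scm (a placeholder name):

```
festival -b make_utt.scm
```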

And here comes the problem. The generated .utt files come from an installed Festival voice and are almost certainly not aligned with your .wav files, so the full-context labels will be good for synthesis, but not for training.

The solution – Incorporating alignment information into the .utt file

After some searching on the Internet, I found that Festival also supports putting an xLabel file inside the utterance, a feature meant for research on prosody (http://www.festvox.org/festvox/x1973.html). An xLabel file looks like this (http://www.ee.columbia.edu/ln/labrosa/doc/HTKBook21/node83.html):
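An illustrative xLabel file (any header lines before the # are omitted; the times, color, and phones here are made up):

```
#
0.25000 26 pau ;
0.32000 26 hh ;
0.41000 26 ax ;
```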

with the format <end time> <color> <phone> ;. The color is not important here and can be any number; the end time is in seconds, and the start time is derived automatically from the previous end time. This format can be converted very easily from the HTK monophone label format. I assume here that you already have aligned monophone labels, so preparing the xLabel files should not be a problem.
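As a sketch of that conversion, a small Python helper (htk_to_xlabel is my own name, not part of any toolkit) could look like this:

```python
def htk_to_xlabel(htk_lines, color=26):
    """Convert HTK monophone label lines ('<start> <end> <phone>',
    times in 100 ns units) into xLabel lines
    ('<end time> <color> <phone> ;').
    The leading '#' terminates the (here empty) xLabel header."""
    out = ["#"]
    for line in htk_lines:
        _start, end, phone = line.split()
        end_sec = int(end) / 1e7  # HTK times are in 100 ns units
        out.append("%.5f %d %s ;" % (end_sec, color, phone))
    return "\n".join(out)

print(htk_to_xlabel(["0 2500000 pau", "2500000 3200000 hh"]))
# ->
# #
# 0.25000 26 pau ;
# 0.32000 26 hh ;
```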

Then we modify the Scheme script for Festival as below:
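A sketch of the modified script, assuming the aligned xLabel file is hello.lab (a placeholder name); utt.relation.load replaces the Segment relation with the timings from that file:

```scheme
;; As before, but overwrite the Segment relation produced by Festival's
;; own analysis with the aligned segments from the xLabel file.
(set! utt (Utterance Text "Hello world"))
(utt.synth utt)
(utt.relation.load utt 'Segment "hello.lab")
(utt.save utt "hello.utt")
```

Note that the phone names in the xLabel file have to match the phone set of the voice, and, as the update at the top says, even then the resulting .utt files turned out not to be fully valid for training.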

And now you have .utt files that contain all the necessary linguistic information and are aligned exactly as you want. Just use them as in any HTS demo and you are done!

5 thoughts to “How to create full-context labels for your HTS system (update: it did not really work)”

    1. Not yet. I am currently using the NIST tools to find the syllable boundaries, Festival to get the POS tags, and then my own scripts to combine those data and generate all the positional information.

      I still don’t know how to automatically extract the stress and accent information, which seems to be quite important for generating good prosody.

      1. Hello Nguyen,
        Thanks for the tips. They solved some of my problems.
        I was under the impression that making .utt files using Festival does compute syllable boundaries, POS, stress and accent info, etc. Aren’t these the contextual features?
        Keep up the good work,
        Hamid

        1. Yes, the labels created by Festival have all the necessary phonetic and linguistic contexts and can be used for synthesis. However, I believe those labels should not be used as training data because they might not match your training speech (e.g. the pronunciation chosen by Festival might not match the pronunciation in the input wav files). Using EHMM as in my updated post will give you better labels for training a voice model.
