How to configure HTS for in-training synthesis with state-level alignment labels

Purpose

State-level alignment labels let us copy the prosody from one speaker and impose it on another speaker’s acoustic model. This can improve the synthesized results by combining prosody from natural speech with phone features from an HMM-based acoustic model. Moreover, since this technique creates phone-aligned parallel sentences from different acoustic models, we can also use it to generate comparable sentences, in which the quality of the vocoders or of the acoustic features in the training data can be compared separately from the duration models.

Steps

The basic steps to get the state-level alignment are listed below:

1. Train HTS systems for both data sets.

2. Add the -f parameter to the HSMMAlign call in the “forced alignment for no-silent GV” step of the training script.
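A hedged sketch of the modified call in Training.pl is shown below. The variable names ($HSMMAlign, $cfg{'syn'}, $reclmmf and so on) follow typical HTS demo conventions and may differ in your version; the only actual change is the added -f flag, which makes HSMMAlign output state-level alignments instead of phone-level ones.

```perl
# "forced alignment for no-silent GV" in Training.pl -- only -f is new;
# every other option should stay exactly as in your unmodified script
# (the options shown here are illustrative, not authoritative).
shell( "$HSMMAlign -A -C $cfg{'syn'} -D -T 1 -f "
     . "-H $reclmmf -N $recdmmf "
     . "-I $mlf{'mon'}{'trn'} -S $scp{'trn'} "
     . "-m $gvfaldir $lst{'mon'} $lst{'mon'}" );
```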

3. Run the modified step. The state-level alignment results will be stored in gv/qst001/ver1/fal (the question set and version number may differ according to your training configuration). This folder contains the monophone state-level alignment labels for all sentences in the training data.

4. Convert the monophone state-level alignment labels into full-context state-level alignment labels. You should already have the monophone and full-context labels for all of these sentences from the training data, so converting is simply a matter of replacing each monophone name with the corresponding full-context one. I have written a small Perl script to do the job. (I have omitted the part that prepares the paths $monoDir, $fullDir, $monoStateDir and $fullStateDir, which correspond to the monophone and full-context label folders, the monophone state-level alignment label folder and the output full-context state-level alignment label folder.)
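The original script is not reproduced here; the sketch below shows one possible implementation, with the three folders taken from the command line instead of prepared in the script. It rests on assumptions not confirmed by the post: that the monophone state-level labels have one "start end name" line per state, that the full-context labels have one line per phone with the full-context name as the last field, and that every phone has exactly $STATES emitting states numbered from 2.

```perl
#!/usr/bin/perl
# Hedged sketch of the mono -> full-context conversion (not the author's
# actual script). Usage: convert.pl monoStateDir fullDir fullStateDir
use strict;
use warnings;
use File::Basename qw(basename);

my $STATES = 5;    # emitting states per phone (5 in the English HTS demo)
my ( $monoStateDir, $fullDir, $fullStateDir ) = @ARGV;

foreach my $stateFile ( glob("$monoStateDir/*.lab") ) {
    my $base = basename($stateFile);

    open my $MONO, '<', $stateFile       or die "$stateFile: $!";
    open my $FULL, '<', "$fullDir/$base" or die "$fullDir/$base: $!";
    my @states = grep { /\S/ } <$MONO>;              # one line per state
    my @full   = map { ( split ' ' )[-1] } <$FULL>;  # full-context names
    close $MONO; close $FULL;

    open my $OUT, '>', "$fullStateDir/$base" or die "$fullStateDir/$base: $!";
    # States of one phone are consecutive, so the n-th group of $STATES
    # lines corresponds to the n-th full-context label.
    while ( my @chunk = splice( @states, 0, $STATES ) ) {
        my $fullName = shift @full;
        foreach my $j ( 0 .. $#chunk ) {
            my ( $s, $e ) = split ' ', $chunk[$j];
            printf $OUT "%s %s %s[%d]\n", $s, $e, $fullName, $j + 2;
        }
    }
    close $OUT;
}
```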

Then, to use these labels inside another HTS system, a few more steps are needed:

1. Create all the unseen models for the sentences to be generated. The unseen models are created based on the labels listed in the full_all.list file, so we first need to make sure this file contains all the full-context labels used by our synthesis script. The easiest way is to copy the full-context labels from the source data set into the generation folder of the target data set and then run the makefile to recreate gen.scp and full_all.list. After that, run all the unseen-model creation steps in the training script.

2. Edit the gen.scp file so that it points to the full-context state-level alignment labels.
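After this change, gen.scp simply lists one state-level alignment label file per line; the paths below are purely illustrative:

```
labels/full_state/sentence001.lab
labels/full_state/sentence002.lab
```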

3. Change the HMGenS call in the Training.pl script to use state-level alignment labels by adding the -s parameter and removing the duration model from the command line.
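A hedged sketch of the modified call is shown below. As before, only the added -s and the removed duration model (the -N option and its model file) are the point; the remaining options and variable names are illustrative HTS demo conventions and should match your unmodified script.

```perl
# Parameter generation in Training.pl -- -s is added, and the duration
# model (-N $recdmmf and its list) is removed from the command line.
shell( "$HMGenS -A -C $cfg{'syn'} -D -T 1 -S $scp{'gen'} "
     . "-c $pgtype -s "
     . "-H $reclmmf "
     . "-M $dir $lst{'ful'} $lst{'ful'}" );
```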

4. Limit the $pgtype parameter to 0 and 1 only (remove 2). $pgtype = 2 (both the state and mixture sequences are hidden) does not work with state-level alignment labels.

5. Run the “generating speech parameter sequences” and “synthesizing waveforms” steps again.

Conclusion

Support for generating speech from state-level alignment labels gives us very fine-grained control over the prosody of the speech generated by the HTS system. Currently, I still have some problems with the quality of the generated state-level alignment labels, but I hope to fix them soon and make more use of this feature in the future.

2 thoughts to “How to configure HTS for in-training synthesis with state-level alignment labels”

    1. Sorry for the late reply as I was overseas for the last couple of days.
      $STATES here is the number of HMM states for each phone (excluding the opening and closing states). In the current HTS demo for English, $STATES = 5.
