How to configure HTS for in-training synthesis with state-level alignment labels


State-level alignment labels let us copy the prosody of one speaker and apply it to another speaker’s acoustic model. This can improve the synthesized results by combining prosody from natural speech with phone features from an HMM-based acoustic model. Moreover, since this technique can create phone-aligned parallel sentences from different acoustic models, we can also use it to generate comparable sentences in which the quality of the vocoders or of the acoustic features in the training data can be compared separately from the duration models.
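To make the idea of state-level alignment more concrete, here is a minimal Python sketch that reads such labels and collects the duration of every HMM state per phone. It assumes the HTK-style label layout (`start end name`, with times in units of 100 ns) and assumes that each state entry is named `<label>[<state index>]`. The function name `state_durations` is my own and is not part of HTS.

```python
import re

# Assumed naming of a state-level entry: "<label>[<state index>]".
STATE_RE = re.compile(r"^(.+)\[(\d+)\]$")


def state_durations(label_text):
    """Group state durations (in HTK 100 ns units) by phone.

    Returns a list of (label, [duration of each state]) tuples in file order.
    """
    phones = []        # each item is [label, [durations]]
    last_state = None
    for line in label_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue   # skip blank or malformed lines
        match = STATE_RE.match(fields[2])
        if match is None:
            continue   # not a state-level entry
        start, end = int(fields[0]), int(fields[1])
        name, state = match.group(1), int(match.group(2))
        # A new phone starts when the label changes or the state index resets.
        if not phones or phones[-1][0] != name or state <= last_state:
            phones.append([name, []])
        phones[-1][1].append(end - start)
        last_state = state
    return [(name, durations) for name, durations in phones]


if __name__ == "__main__":
    # Toy labels with short names instead of real full-context labels.
    example = "0 50000 a[2] a\n50000 150000 a[3]\n150000 200000 b[2] b\n"
    for name, durations in state_durations(example):
        # Divide by 10,000 to convert 100 ns units to milliseconds.
        print(name, [d / 10_000 for d in durations])
```

Per-state durations collected this way from one speaker's alignment are the timing information that gets imposed on another acoustic model, which is why the resulting sentences stay phone-aligned.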

How to configure HTS demo with STRAIGHT features for 16kHz training data

I have been using HTS for a while for my research on speech synthesis. Recently, I ran into some problems when I tried to configure the HTS demo with STRAIGHT features to use 16 kHz data instead of 48 kHz. I finally figured out how to do it properly, and it is really not as easy as changing one or two settings as in the other demos without STRAIGHT, so I decided to note all the steps down here.

