One of my previous posts describes my first attempt to generate training data for an HTS system from recordings and transcripts: How to create full-context labels for your HTS system (update: not really worked). Unfortunately, that approach did not work as expected. During eNTERFACE’14, I learned that festvox includes a tool named EHMM that can help build the .utt files (and, in turn, the full-context labels) easily.
To start, you can follow the steps at the following link, which gives the full instructions for building a CLUSTERGEN voice: http://festvox.org/bsv/c3170.html.
You can stop once you have obtained the .utt files and use them for your own purposes.
In general, the necessary steps are listed below:
- Prepare the folder structure with the command below (the institution, language, and speaker names, here cmu, us, and awb_arctic, can be changed as you want)
$FESTVOXDIR/src/clustergen/setup_cg cmu us awb_arctic
- Copy the wav files into the wav/ folder, or use the command below to copy them and automatically convert the .wav files to the appropriate format (16 kHz, 16-bit, mono, RIFF).
- Put all the transcriptions into the file etc/txt.done.data. The file should be formatted similarly to http://www.festvox.org/cmu_arctic/cmuarctic.data.
- Set up the phone sets appropriately by editing the files inside the folder festvox/. You can use the .scm files in any festvox voice as a reference.
- Run the following 3 commands.
- Generate the full-context labels from the .utt files using the make script inside the HTS demo.
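For the transcription step above, writing etc/txt.done.data by hand is error-prone, so here is a minimal sketch of a shell helper that produces lines in the same shape as cmuarctic.data, i.e. `( utt_id "text" )`. It assumes your transcripts are stored as one `utt_id<TAB>text` pair per line. That input layout and the function name `make_txt_done_data` are my own assumptions, not part of festvox. Escaping double quotes with a backslash is also an assumption about how the file is read.

```shell
#!/bin/sh
# Hypothetical helper (not part of festvox): reads "utt_id<TAB>transcript"
# lines from stdin and prints one festvox-style prompt line per utterance:
#   ( utt_id "transcript" )
make_txt_done_data() {
    # Split on tabs only, so spaces inside the transcript are preserved.
    while IFS="$(printf '\t')" read -r utt_id text || [ -n "$utt_id" ]; do
        # Skip empty lines.
        [ -n "$utt_id" ] || continue
        # Assumption: backslashes and double quotes inside the text must be
        # escaped. Backslashes are escaped first, then double quotes.
        text=$(printf '%s' "$text" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
        printf '( %s "%s" )\n' "$utt_id" "$text"
    done
}
```

After sourcing the function, it could be used as `make_txt_done_data < transcripts.tsv > etc/txt.done.data`, where transcripts.tsv is a placeholder name for your own transcript file. The utterance IDs should match the names of the .wav files (without the extension) in the wav/ folder.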
There are many interesting tools inside festival and festvox that can be used for both unit selection and HMM-based speech synthesis. You can read more about some tools and commands in the link below: