LING 575 - Spoken Dialog Systems Spring 2017
Homework #1 Due: April 11, 2017: 23:45


This assignment aims to help you familiarize yourself with Kaldi and to develop an improved understanding of automatic speech recognition through experimentation and analysis. Please review the readings and class lecture on speech recognition. Additional information on Kaldi may be found here.


Through this assignment, you will:

Q1: Setting up Kaldi

To configure your environment to run the Kaldi speech recognizer, you will need to copy the requisite files to a working directory and update some of the script files to reflect our directory structure. Follow the steps below:

Experiment 1: Assessing training sets

Description and Motivation

The size and composition of the data on which the speech recognizer is trained can have a significant impact on recognition accuracy. Test your digit recognizer on five different training data configurations: those defined by the *.txt files downloaded from the hw_asr subdirectory to your Kaldi work area, plus a 1000-instance sample selected from the full digits corpus. The *.txt files consist of speaker IDs, one per line, corresponding to the configurations specified below:
Configuration #    Description
1                  5 male speakers
2                  10 male speakers
3                  20 male speakers
4                  10 male and 10 female speakers
5                  1000 samples, spanning male and female speakers
You can specify the configuration to run by passing the configuration number as a parameter to the script, as in: 3


Run the script for all five training configurations described above and record the best Word Error Rate (WER) and Sentence Error Rate (SER) for each.
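For reference, the two metrics can be sketched as follows. This is a minimal illustration, not the scoring code Kaldi itself runs: WER is taken as the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words, and SER as the fraction of sentences containing at least one error. The function names are hypothetical, and the sketch assumes at least one non-empty reference.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions turning ref into hyp."""
    # prev[j] holds the distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_and_ser(pairs):
    """pairs: list of (reference, hypothesis) strings. Returns (WER %, SER %)."""
    errors = ref_words = bad_sentences = 0
    for ref, hyp in pairs:
        r, h = ref.split(), hyp.split()
        d = edit_distance(r, h)
        errors += d
        ref_words += len(r)
        bad_sentences += 1 if d > 0 else 0
    return 100.0 * errors / ref_words, 100.0 * bad_sentences / len(pairs)
```

For example, `wer_and_ser([("one two three", "one two three"), ("four five", "four nine five")])` counts one insertion against five reference words and one incorrect sentence out of two.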

Note: You must delete the data, exp, and mfcc directories between runs, in order to ensure correct results.
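If you prefer to automate this cleanup, one possible sketch is shown below. The helper name `clean_run_dirs` is hypothetical, and it assumes that data, exp, and mfcc sit directly under your working directory.

```python
import shutil
from pathlib import Path


def clean_run_dirs(work_dir, names=("data", "exp", "mfcc")):
    """Remove the per-run output directories under work_dir; return the names removed."""
    removed = []
    for name in names:
        path = Path(work_dir) / name
        if path.is_dir():
            shutil.rmtree(path)  # deletes the directory and everything in it
            removed.append(name)
    return removed
```

Call it between runs, for example `clean_run_dirs(".")` from your working directory. Deletion is permanent, so copy out any results you want to keep first.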

Experiment 2: Assessing normalization

Description and Motivation

To compensate for speech and speaker variation, suitable normalization of acoustic features is crucial in speech recognition. One form of normalization for MFCC features is cepstral mean and variance normalization (CMVN), which is implemented in the Kaldi system.
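As a rough illustration of the idea, and not of Kaldi's actual implementation, the sketch below shifts and scales each feature dimension so that it has zero mean and unit variance over a set of frames. It assumes the frames passed in are the set over which statistics should be pooled (for example, one speaker or one utterance); the function name and the `eps` guard are illustrative choices.

```python
import math


def cmvn(frames, eps=1e-10):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: non-empty list of equal-length lists of floats (one list per frame).
    """
    n = len(frames)
    dim = len(frames[0])
    # Per-dimension mean over all frames.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Per-dimension (population) standard deviation over all frames.
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    # eps guards against division by zero for constant dimensions.
    return [[(f[d] - means[d]) / max(stds[d], eps) for d in range(dim)]
            for f in frames]
```

After normalization, a constant offset or scaling of a feature dimension (such as one introduced by a different microphone or speaker) no longer affects the features the acoustic model sees.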

In the script, this normalization is invoked on line 39. However, it is called with the --fake flag, which creates a dummy file of the appropriate form, but does not actually perform the normalization.


To evaluate the utility of this normalization, edit the script to delete the --fake flag.

Rerun the script for all five training configurations described above, now with CMVN enabled, and record the best Word Error Rate (WER) and Sentence Error Rate (SER) for each.

Experiment 3: Assessing context models

So far, all of the digit recognizers you have built have used simple monophone models. As discussed in class and in the readings, the context in which a phone appears can have a significant effect on its acoustic characteristics, due to coarticulation with neighboring speech sounds in continuous speech. Thus, a monophone model, while simple, is not ideal.

To address this problem, speech recognizers frequently employ a triphone model to try to capture and model the effects of the two neighboring phones. The script includes the code for triphone modeling, which writes the results to the exp/tri1/decode directory.
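To make the notion of a triphone concrete, the sketch below expands a phone sequence into (left, center, right) context triples. It is purely illustrative: the function name, the phone symbols, and the use of "sil" as the boundary context are assumptions, not Kaldi's internal representation.

```python
def to_triphones(phones, boundary="sil"):
    """Return (left, center, right) context triples for a phone sequence."""
    # Pad both ends so the first and last phones also have two neighbors.
    padded = [boundary] + list(phones) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]
```

For a hypothetical transcription of "six" as `["s", "ih", "k", "s"]`, the two instances of "s" become distinct units, `("sil", "s", "ih")` and `("k", "s", "sil")`, whereas a monophone model would treat them as the same unit.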

Since the number of possible triphones is very large, many systems use a decision tree to cluster sets of triphones (aka senones), reducing the complexity of the system to a more manageable scale. This process, and the creation of the corresponding models, is controlled by a separate training program, which has two tunable numeric parameters:



For this assignment, you are not required to submit any code, though you are welcome to write scripts to support, for example, extraction or tabulation of experimental results. Instead, you will submit a report describing and analyzing the results of the three sets of experiments specified above.
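One possible shape for such an extraction script is sketched below. It rests on two assumptions you should check against your own output: that each scoring file in a decode directory contains lines beginning with `%WER` and `%SER` followed by a percentage, and that the scoring files match the pattern `wer_*`. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed line format, e.g. "%WER 8.25 [ 12 / 145, 1 ins, 2 del, 9 sub ]".
WER_RE = re.compile(r"%WER\s+([\d.]+)")
SER_RE = re.compile(r"%SER\s+([\d.]+)")


def parse_scores(text):
    """Return (wer, ser) as floats from scoring output text, or None if either is absent."""
    wer = WER_RE.search(text)
    ser = SER_RE.search(text)
    if not wer or not ser:
        return None
    return float(wer.group(1)), float(ser.group(1))


def best_scores(decode_dir, pattern="wer_*"):
    """Return (wer, ser, filename) for the file with the lowest WER, or None."""
    results = []
    for path in sorted(Path(decode_dir).glob(pattern)):
        scores = parse_scores(path.read_text())
        if scores is not None:
            results.append((scores[0], scores[1], path.name))
    return min(results) if results else None
```

For instance, `best_scores("exp/tri1/decode")` would report the lowest WER found in that directory, along with the SER from the same file and the file's name.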

report.pdf: Please write a brief report detailing the experiments above. Requirements:

Handing in your work

All homework should be handed in using the class CollectIt.