LING 575 - Spoken Dialog Systems
Spring 2017
Homework #1
Due: April 11, 2017: 23:45
Overview
This assignment aims to help you familiarize yourself with Kaldi
and develop improved understanding of automatic speech recognition through
experimentation and analysis. Please review the readings and class
lecture on speech recognition. Additional information on Kaldi may
be found here.
Goals:
Through this assignment, you will:
- Gain additional experience with speech recognition and the Kaldi ASR toolkit.
- Train and evaluate an automatic digit recognition system.
- Experiment with alternative feature extraction, training, and context
models for speech recognition.
- Analyze the effect of different configurations on recognition accuracy.
Q1: Setting up Kaldi
To configure your environment to run the Kaldi speech recognizer, you will
need to copy the requisite files to a working directory and update some
of the script files to reflect our directory structure. Follow the steps
below:
- Create a working directory for your assignment on the CL cluster (patas), e.g. work_dir
- Copy the /opt/kaldi/egs/tidigits directory to your working directory.
- In your own copy of the tidigits/s5 directory, update two symbolic links:
- rm utils; ln -s /opt/kaldi/egs/wsj/s5/utils
- rm steps; ln -s /opt/kaldi/egs/wsj/s5/steps
- Still in your tidigits/s5 directory, edit the shell scripts for your environment,
- path.sh: Change the first line to export KALDI_ROOT=/opt/kaldi
- cmd.sh: Replace the "queue.pl ...." section of each command with "run.pl"
- Download the final configuration scripts.
- Now run your baseline speech recognizer with run_575.sh 1. This command will:
- create training and test directories and files as well as lexicon and language model files (in the data directory),
- extract acoustic features, such as MFCCs, in the mfcc directory,
- and train, test, and evaluate the digit recognizer in the exp directories corresponding to the configuration.
- Note: Please run the run_575.sh script under
condor on the cluster. This is fairly compute-intensive processing and
should be distributed to the compute nodes; each run will take 5-10 minutes
of CPU time. Please see the CLMS wiki pages on the basics of using the condor cluster. Additional details may be found here.
- Results can be found with grep WER and grep SER in the exp/mono1a/decode/wer_* files.
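To run under condor, you submit a description file with condor_submit. A minimal sketch is below; the filename run_575.cmd and the output/error/log names are hypothetical, and you should check the CLMS wiki for any patas-specific settings before relying on it.

```
# run_575.cmd -- submit with: condor_submit run_575.cmd
executable = run_575.sh
arguments  = 1
output     = run_575_1.out
error      = run_575_1.err
log        = run_575_1.log
getenv     = true
queue
```

Changing the value of arguments (or passing it via a queue variable) lets you reuse the same file for the other configurations.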
Experiment 1: Assessing training sets
Description and Motivation
The size and composition of the data on which the speech recognizer is
trained can have a significant impact on recognition accuracy. You will
test your digit recognizer on five training data configurations:
four defined by the *.txt files downloaded from the hw_asr
subdirectory into your Kaldi work area, and a fifth consisting of a
1000-instance sample selected from the full digits corpus. The *.txt files
consist of speaker IDs, one per line, corresponding to the configurations
specified below:
Configuration # | Description
--------------- | -----------
1 | 5 male speakers
2 | 10 male speakers
3 | 20 male speakers
4 | 10 male and 10 female speakers
5 | 1000 samples, spanning male and female speakers
You can specify the configuration to run by passing the configuration number
as a parameter to the run_575.sh script, as in:
run_575.sh 3
Actions
Run the run_575.sh script for all five training configurations
described above and record the best Word Error Rate (WER) and Sentence
Error Rate (SER) for each.
Note: You must delete the data, exp, and
mfcc directories between runs, in order to ensure correct results.
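The runs and the required cleanup can be scripted. A possible driver loop is sketched below, assuming you are in your tidigits/s5 directory and that run_575.sh takes the configuration number as its only argument; the output file ../exp1_results.txt is a hypothetical name. Run it under condor rather than directly on the head node, per the note in Q1.

```shell
# Sketch of a driver loop for Experiment 1.
for cfg in 1 2 3 4 5; do
    rm -rf data exp mfcc    # stale files from the previous run skew results
    ./run_575.sh "$cfg"
    # keep the best (lowest) WER line for this configuration
    grep WER exp/mono1a/decode/wer_* | sort -k2 -n | head -1 >> ../exp1_results.txt
done
```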
Experiment 2: Assessing normalization
Description and Motivation
To compensate for speech and speaker variation, suitable normalization
of acoustic features is crucial in speech recognition. One form
of normalization for MFCC features is cepstral mean and variance
normalization, which is implemented by compute_cmvn_stats.sh
in the Kaldi system.
In the run_575.sh script, this normalization is invoked on
line 39. However, it is called with the --fake
flag, which creates a dummy file of the appropriate form, but does not actually
perform the normalization.
Actions
To evaluate the utility of this normalization, edit the run_575.sh
script to delete the --fake flag.
Re-run the run_575.sh script for all five training configurations
described above, now with CMVN enabled, and record the best Word Error Rate (WER) and Sentence
Error Rate (SER) for each.
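The flag can be removed with sed. The demonstration below applies the edit to a throwaway file containing a hypothetical version of line 39; your script's actual line may differ, so inspect run_575.sh before editing it.

```shell
# Demonstrate stripping the --fake flag on a throwaway copy;
# run the same sed command on run_575.sh itself.
printf 'steps/compute_cmvn_stats.sh --fake data/train exp/make_mfcc/train mfcc\n' > demo_line.sh
sed -i 's/ --fake//' demo_line.sh
result=$(cat demo_line.sh)
echo "$result"
rm demo_line.sh
```

On the real script, the equivalent one-liner would be sed -i 's/ --fake//' run_575.sh (after confirming --fake appears only on the line you intend to change).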
Experiment 3: Assessing context models
So far, all of the digit recognizers you have built have used simple
monophone models. As discussed in class and in the readings, the
context in which a phone appears can have a significant effect on
its acoustic characteristics, due to coarticulation with neighboring
speech sounds in continuous speech. Thus, a monophone model, while
simple, is not ideal.
To address this problem, speech recognizers frequently employ
a triphone model to try to capture and model the effects of
the two neighboring phones. The run_575.sh script
includes the code for triphone modeling, which writes its results to
the exp/tri1/decode directory.
Since the number of possible
triphones is very large, many systems use a decision tree to
cluster sets of triphones (aka senones) to reduce the complexity
of the system to a more manageable scale.
This process and creation of the corresponding models is controlled
by the train_deltas.sh script called by run_575.sh.
This program has two tunable numeric parameters:
- numleaves: the number of clustered sets of triphones,
corresponding to the leaves of the decision tree.
- totgauss: the total number of Gaussian mixture components used to model
them.
Actions
- Uncomment the code in the run_575.sh script delimited by
## Triphone models.
- Execute the run_575.sh 5 script. Tune the numleaves and totgauss parameters and record the best WER and SER, along with the corresponding parameter settings. Note: You are not expected to exhaustively explore the space, but to perform some investigation; exploring values up to an order of magnitude above the defaults for each parameter is sufficient.
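A simple grid search is one way to organize the tuning. The sketch below assumes run_575.sh accepts numleaves and totgauss as extra arguments after the configuration number; if it does not, edit the values passed to train_deltas.sh inside the script between runs instead. The specific values shown are illustrative starting points, not recommendations.

```shell
# Hypothetical tuning loop for the triphone parameters (Experiment 3).
for numleaves in 300 1000 3000; do
    for totgauss in 5000 15000 50000; do
        rm -rf data exp mfcc
        ./run_575.sh 5 "$numleaves" "$totgauss"   # argument passing is an assumption
        printf 'numleaves=%s totgauss=%s ' "$numleaves" "$totgauss"
        grep WER exp/tri1/decode/wer_* | sort -k2 -n | head -1
    done
done
```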
Submission
For this assignment, you are not required to submit any code, though you
are welcome to write scripts to support, for example, extraction or
tabulation of experimental results.
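If you do script the tabulation, the core step is picking the lowest WER out of the wer_* files. The snippet below demonstrates this on two fabricated files in the format Kaldi's compute-wer prints; the demo_decode directory and its contents are invented for illustration, while your real files live under exp/*/decode/.

```shell
# Create two fake wer_* files and select the best (lowest) WER line.
mkdir -p demo_decode
printf '%%WER 12.50 [ 25 / 200, 3 ins, 5 del, 17 sub ]\n' > demo_decode/wer_9
printf '%%WER 10.00 [ 20 / 200, 2 ins, 4 del, 14 sub ]\n' > demo_decode/wer_10
best=$(grep -h WER demo_decode/wer_* | sort -k2 -n | head -1)
echo "$best"
rm -r demo_decode
```

The same pipeline, pointed at exp/mono1a/decode/wer_* or exp/tri1/decode/wer_*, tabulates your real results.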
Instead you will submit a report describing and analyzing the
results of the three sets of experiments specified above.
report.pdf: Please write a brief report detailing the
experiments above. Requirements:
- The report should be 2-3 pages, in PDF format, no smaller than 11pt.
- For each set of experiments,
- Briefly summarize the experimental conditions.
- Tabulate the results, presenting best WER and SER for the different
configurations.
- Analyze and discuss the results. What impact do you observe? What do
you think drives the changes in score that you have found? Were you surprised
by any of the results? Which ones? Why?
- Conclude with a brief overall summary. Include problems you came across and how (or if) you were able to solve them, any insights, special features, and what you learned. Give examples if possible. If you were not able to complete parts of the project, discuss what you tried and/or what did not work.
Handing in your work
All homework should be handed in using the class CollectIt.