LING 575 - Spoken Dialog Systems
Spring 2017
Homework #1
Due: April 11, 2017: 23:45
Overview
This assignment aims to help you familiarize yourself with Kaldi
and develop improved understanding of automatic speech recognition through
experimentation and analysis. Please review the readings and class
lecture on speech recognition. Additional information on Kaldi may
be found here.
Goals:
Through this assignment, you will:
- Gain additional experience with speech recognition and the Kaldi ASR toolkit.
- Train and evaluate an automatic digit recognition system.
- Experiment with alternative feature extraction, training, and context
models for speech recognition.
- Analyze the effect of different configurations on recognition accuracy.
Q1: Setting up Kaldi
To configure your environment to run the Kaldi speech recognizer, you will
need to copy the requisite files to a working directory and update some
of the script files to reflect our directory structure. Follow the steps
below:
- Create a working directory for your assignment on the CL cluster (patas), e.g. work_dir
- Copy the /opt/kaldi/egs/tidigits directory to your working directory.
- In your own copy of the tidigits/s5 directory, update two symbolic links:
- rm utils; ln -s /opt/kaldi/egs/wsj/s5/utils
- rm steps; ln -s /opt/kaldi/egs/wsj/s5/steps
- Still in your tidigits/s5 directory, edit the shell scripts for your environment,
- path.sh: Change the first line to export KALDI_ROOT=/opt/kaldi
- cmd.sh: Replace the "queue.pl ...." section of each command with "run.pl"
- Download the final configuration scripts.
- Now run your baseline speech recognizer with run_575.sh 1. This command will:
- create training and test directories and files as well as lexicon and language model files (in the data directory),
- extract acoustic features, such as MFCCs, in the mfcc directory,
- and train, test, and evaluate the digit recognizer in the exp directories corresponding to the configuration.
- Note: Please run the run_575.sh script under
condor on the cluster. This is fairly compute-intensive processing and
should be distributed to the compute nodes; each run will take 5-10 minutes
of CPU time. Please see the CLMS wiki pages on the basics of using the condor cluster. Additional details may be found here.
- Results can be found with grep WER and grep SER in the exp/mono1a/decode/wer_* files.
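To run under condor, you submit a description file with condor_submit. A minimal sketch is below; the filename run_575.cmd and the output/error/log names are hypothetical, and you should check the CLMS wiki for any patas-specific settings before relying on it.

```
# run_575.cmd -- submit with: condor_submit run_575.cmd
executable = run_575.sh
arguments  = 1
output     = run_575_1.out
error      = run_575_1.err
log        = run_575_1.log
getenv     = true
queue
```

Changing the value of arguments (or passing it via a queue variable) lets you reuse the same file for the other configurations.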
Experiment 1: Assessing training sets
Description and Motivation
The size and composition of the data on which the speech recognizer is
trained can have a significant impact on recognition accuracy. You will
test your digit recognizer on five training data configurations:
four defined by the *.txt files downloaded from the hw_asr
subdirectory into your Kaldi work area, and a fifth consisting of a
1000-instance sample selected from the full digits corpus. The *.txt files
consist of speaker IDs, one per line, corresponding to the configurations
specified below:
Configuration # | Description
--------------- | -----------
1 | 5 male speakers
2 | 10 male speakers
3 | 20 male speakers
4 | 10 male and 10 female speakers
5 | 1000 samples, spanning male and female speakers
You can specify the configuration to run by passing the configuration number
as a parameter to the run_575.sh script, as in:
run_575.sh 3
Actions
Run the run_575.sh script for all five training configurations
described above and record the best Word Error Rate (WER) and Sentence
Error Rate (SER) for each.
Note: You must delete the data, exp, and
mfcc directories between runs, in order to ensure correct results.
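The runs and the required cleanup can be scripted. A possible driver loop is sketched below, assuming you are in your tidigits/s5 directory and that run_575.sh takes the configuration number as its only argument; the output file ../exp1_results.txt is a hypothetical name. Run it under condor rather than directly on the head node, per the note in Q1.

```shell
# Sketch of a driver loop for Experiment 1.
for cfg in 1 2 3 4 5; do
    rm -rf data exp mfcc    # stale files from the previous run skew results
    ./run_575.sh "$cfg"
    # keep the best (lowest) WER line for this configuration
    grep WER exp/mono1a/decode/wer_* | sort -k2 -n | head -1 >> ../exp1_results.txt
done
```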
Experiment 2: Assessing normalization
Description and Motivation
To compensate for speech and speaker variation, suitable normalization
of acoustic features is crucial in speech recognition. One form
of normalization for MFCC features is cepstral mean and variance
normalization, which is implemented by compute_cmvn_stats.sh
in the Kaldi system.
In the run_575.sh script, this normalization is invoked on
line 39. However, it is called with the --fake
flag, which creates a dummy file of the appropriate form, but does not actually
perform the normalization.
Actions
To evaluate the utility of this normalization, edit the run_575.sh
script to delete the --fake flag.
Re-run the run_575.sh script for all five training configurations
described above, now with CMVN enabled, and record the best Word Error Rate (WER) and Sentence
Error Rate (SER) for each.
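The flag can be removed with sed. The demonstration below applies the edit to a throwaway file containing a hypothetical version of line 39; your script's actual line may differ, so inspect run_575.sh before editing it.

```shell
# Demonstrate stripping the --fake flag on a throwaway copy;
# run the same sed command on run_575.sh itself.
printf 'steps/compute_cmvn_stats.sh --fake data/train exp/make_mfcc/train mfcc\n' > demo_line.sh
sed -i 's/ --fake//' demo_line.sh
result=$(cat demo_line.sh)
echo "$result"
rm demo_line.sh
```

On the real script, the equivalent one-liner would be sed -i 's/ --fake//' run_575.sh (after confirming --fake appears only on the line you intend to change).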
Experiment 3: Assessing context models
So far, all of the digit recognizers you have built have used simple
monophone models. As discussed in class and in the readings, the
context in which a phone appears can have a significant effect on
its acoustic characteristics, due to coarticulation with neighboring
speech sounds in continuous speech. Thus, a monophone model, while
simple, is not ideal.
To address this problem, speech recognizers frequently employ
a triphone model to try to capture and model the effects of
the two neighboring phones. The run_575.sh script
includes the code for triphone modeling, which writes its results to
the exp/tri1/decode directory.
Since the number of possible
triphones is very large, many systems use a decision tree to
cluster sets of triphones (aka senones) to reduce the complexity
of the system to a more manageable scale.
This process and creation of the corresponding models is controlled
by the train_deltas.sh script called by run_575.sh.
This program has two tunable numeric parameters:
- numleaves: the number of clustered sets of triphones,
corresponding to the leaves of the decision tree.
- totgauss: the total number of Gaussian mixture components used to model
them.
Actions
- Uncomment the code in the run_575.sh script delimited by
## Triphone models.
- Execute the run_575.sh 5 script. Tune the numleaves and totgauss parameters and record the best WER and SER, along with the corresponding parameter settings. Note: You are not expected to exhaustively explore the space, but to perform some investigation; exploring values up to an order of magnitude above the defaults for each parameter is sufficient.
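A simple grid search is one way to organize the tuning. The sketch below assumes run_575.sh accepts numleaves and totgauss as extra arguments after the configuration number; if it does not, edit the values passed to train_deltas.sh inside the script between runs instead. The specific values shown are illustrative starting points, not recommendations.

```shell
# Hypothetical tuning loop for the triphone parameters (Experiment 3).
for numleaves in 300 1000 3000; do
    for totgauss in 5000 15000 50000; do
        rm -rf data exp mfcc
        ./run_575.sh 5 "$numleaves" "$totgauss"   # argument passing is an assumption
        printf 'numleaves=%s totgauss=%s ' "$numleaves" "$totgauss"
        grep WER exp/tri1/decode/wer_* | sort -k2 -n | head -1
    done
done
```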
Submission
For this assignment, you are not required to submit any code, though you
are welcome to write scripts to support, for example, extraction or
tabulation of experimental results.
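If you do script the tabulation, the core step is picking the lowest WER out of the wer_* files. The snippet below demonstrates this on two fabricated files in the format Kaldi's compute-wer prints; the demo_decode directory and its contents are invented for illustration, while your real files live under exp/*/decode/.

```shell
# Create two fake wer_* files and select the best (lowest) WER line.
mkdir -p demo_decode
printf '%%WER 12.50 [ 25 / 200, 3 ins, 5 del, 17 sub ]\n' > demo_decode/wer_9
printf '%%WER 10.00 [ 20 / 200, 2 ins, 4 del, 14 sub ]\n' > demo_decode/wer_10
best=$(grep -h WER demo_decode/wer_* | sort -k2 -n | head -1)
echo "$best"
rm -r demo_decode
```

The same pipeline, pointed at exp/mono1a/decode/wer_* or exp/tri1/decode/wer_*, tabulates your real results.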
Instead you will submit a report describing and analyzing the
results of the three sets of experiments specified above.
report.pdf: Please write a brief report detailing the
experiments above. Requirements:
- The report should be 2-3 pages, in PDF format, no smaller than 11pt.
- For each set of experiments,
- Briefly summarize the experimental conditions.
- Tabulate the results, presenting best WER and SER for the different
configurations.
- Analyze and discuss the results. What impact do you observe? What do
you think drives the changes in score that you have found? Were you surprised
by any of the results? Which ones? Why?
- Conclude with a brief overall summary. Include problems you came across and how (or if) you were able to solve them, any insights, special features, and what you learned. Give examples if possible. If you were not able to complete parts of the project, discuss what you tried and/or what did not work.
Handing in your work
All homework should be handed in using the class CollectIt.