ICE-Can (Voices of the International Corpus of English (VOICE) CANADA) is a 70-recording corpus of speakers of Canadian English. It contains recordings of monologues and dialogues in scripted and unscripted contexts. This page describes the steps to treat this corpus so that it may be used with the Montreal Forced Aligner and imported for the SPADE project.
The International Corpus of English has traditionally been found here, but as of May 2017, this link is down. As such, the corpus must currently be downloaded manually from the ICE-Can link at the University of Alberta.
Each file in the corpus has associated with it an audio recording and a transcript, and the corpus itself contains one spreadsheet of metadata. Download these according to this structure:
ICE-Can
|   (metadata.xls file)
+- wav
|  |   (.wav files)
|  +- mp3
|  |     (.mp3 files)
+- txt
|  |   (.txt files)
The .txt file corresponding to S2A-004_1.wav is missing from the corpus, so put a placeholder .txt file with the same name in the
Once downloaded, put this corpus data on Havarti at
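The check for missing transcripts can be automated. The following is a minimal sketch, not part of the original workflow: the function name is hypothetical, and it assumes that placeholders belong in the txt directory and that an empty file is an acceptable placeholder. It only looks at .wav files, so recordings that exist only as .mp3 are not covered.

```python
from pathlib import Path


def create_missing_transcripts(wav_dir, txt_dir):
    """Create an empty .txt placeholder for every .wav that lacks a transcript.

    Returns the names of the placeholder files that were created.
    Assumption: placeholders live in txt_dir and may be empty.
    """
    wav_dir, txt_dir = Path(wav_dir), Path(txt_dir)
    created = []
    for wav_path in sorted(wav_dir.glob("*.wav")):
        txt_path = txt_dir / (wav_path.stem + ".txt")
        if not txt_path.exists():
            txt_path.touch()  # empty placeholder, no transcript content
            created.append(txt_path.name)
    return created
```

Called from inside the ICE-Can directory as create_missing_transcripts("wav", "txt"), it should report S2A-004_1.txt if that transcript is the only one missing.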
Converting to .wav
The .mp3 files included in the corpus must be converted to .wav. This is done with the mp3ToWav.py script (update with link).
mp3ToWav.py needs the following prerequisites to run:
Download the script into the wav directory and then run python mp3ToWav.py mp3.
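Since mp3ToWav.py is not linked yet, here is a rough sketch of what such a conversion could look like. It is not the actual script: it assumes that ffmpeg is installed and on the PATH, and that the .wav files should be written to the current directory (wav). The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(mp3_path, out_dir):
    """Build the ffmpeg call that converts one .mp3 into a .wav in out_dir."""
    mp3_path = Path(mp3_path)
    wav_path = Path(out_dir) / (mp3_path.stem + ".wav")
    return ["ffmpeg", "-i", str(mp3_path), str(wav_path)]


def convert_directory(mp3_dir, out_dir="."):
    """Convert each .mp3 in mp3_dir; return the names of the .wav files written."""
    written = []
    for mp3_path in sorted(Path(mp3_dir).glob("*.mp3")):
        command = ffmpeg_command(mp3_path, out_dir)
        wav_path = Path(command[-1])
        if wav_path.exists():
            continue  # keep any .wav that is already present
        subprocess.run(command, check=True)  # raises if ffmpeg fails
        written.append(wav_path.name)
    return written
```

Run from inside the wav directory, convert_directory("mp3") would mirror the python mp3ToWav.py mp3 call above.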
Many .wav files included in the corpus have a sampling rate that is too high and must be downsampled to 44.1 kHz. This is done with the resample.praat script (update with link).
resample.praat needs the following prerequisites to run:
In the original ICE-Can directory, create a new directory wavDownsample to be used for output. Then, in Praat, go to Praat > Open Praat Script, open resample.praat, set the parameters to output a .wav file at 44.1 kHz, and click
When both of these steps are done, put this treated audio data on Havarti at
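To see which files actually need downsampling, or to verify the Praat output afterwards, the sampling rates can be read with Python's standard wave module. This is only an illustrative sketch with a hypothetical function name. It assumes the files are uncompressed PCM .wav files, since wave.open raises wave.Error on formats it cannot read.

```python
import wave
from pathlib import Path

TARGET_RATE = 44100  # 44.1 kHz, the rate named in the text


def files_to_downsample(wav_dir, target_rate=TARGET_RATE):
    """Return {file name: sampling rate} for .wav files above target_rate."""
    too_high = {}
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate > target_rate:
            too_high[path.name] = rate
    return too_high
```

After resampling, files_to_downsample("wavDownsample") should return an empty dictionary.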
The transcripts included with the corpus are written in a markup language that is irrelevant for our purposes. To clean the transcripts and convert them into a TextGrid format, we use the translate_transcript.py script (update with link).
translate_transcript.py needs the following prerequisites to run:
- SciPy, see documentation for installation details
- textgrid, installed through pip install git+http://github.com/kylebgorman/textgrid.git
- xlrd, installed through pip install xlrd
Download the script into the original ICE-Can directory and create a new directory textgrid to be used for output. Then run python translate_transcript.py txt wav textgrid metadata.xls.
There is an undetermined but isolated problem with the resulting TextGrid S1B-043_2.TextGrid, so delete it.
The script will output a file corpuserrors.txt, which records all the fixes made to genuine errors and typos in the original transcripts.
Dictionary and Acoustic Model
The dictionary being used for this task is the lexicon from LibriSpeech.
The acoustic model is the pre-trained English model included with the Montreal Forced Aligner.
For the aligner to function, the .wav files and their corresponding TextGrids must be stored in one folder together. Create a new directory textgrid-wav inside the main ICE-Can directory and copy all the downsampled .wav files and all the .TextGrid files into it.
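The copy step can be scripted instead of done by hand. The sketch below is one possible way to do it, with a hypothetical function name. It assumes the directory names used on this page (wavDownsample, textgrid, textgrid-wav) and that any file of the same name already in textgrid-wav may be overwritten.

```python
import shutil
from pathlib import Path


def collect_for_alignment(corpus_dir):
    """Copy downsampled .wav files and .TextGrid files into textgrid-wav.

    Returns the names of the copied files, .wav files first.
    """
    corpus_dir = Path(corpus_dir)
    target = corpus_dir / "textgrid-wav"
    target.mkdir(exist_ok=True)
    copied = []
    sources = [("wavDownsample", "*.wav"), ("textgrid", "*.TextGrid")]
    for subdir, pattern in sources:
        for path in sorted((corpus_dir / subdir).glob(pattern)):
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

For example, collect_for_alignment("/path/to/ICE-Can") fills /path/to/ICE-Can/textgrid-wav.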
The working directory structure should now look like this:
ICE-Can
|   (metadata.xls file)
|   corpuserrors.txt
+- wav
|  |   (.wav files)
|  +- mp3
|  |     (.mp3 files)
+- txt
|  |   (.txt files)
+- wavDownsample
|  |   (downsampled .wav files)
+- textgrid
|  |   (.TextGrid files)
+- textgrid-wav
|  |   (downsampled .wav files)
|  |   (.TextGrid files)
Alignment uses the Montreal Forced Aligner.
For this corpus to align well, we need to make some configuration changes:
- Clone or download the MFA from source on GitHub.
- Download the binaries by navigating to thirdparty/ inside the directory the MFA was cloned into and running
- Navigate back into the directory the MFA was cloned into, navigate into aligner/, and open config.py in a text editor.
- Change the value of self.beam on line 60 to
Then, to run the aligner with the LibriSpeech dictionary and the pre-trained English model, navigate into the directory the aligner was cloned into, and run python3 -m aligner.command_line.align /path/to/ICE-Can/textgrid-wav /path/to/librispeech-lexicon.txt english /path/to/desired/output/directory.
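If the alignment is launched from a script rather than a shell, the same command can be assembled in Python. This is a small sketch with hypothetical helper names. The argument order is taken from the command above, and mfa_dir stands for the directory the aligner was cloned into.

```python
import subprocess


def align_command(corpus_dir, dictionary, output_dir, model="english"):
    """Build the aligner call: corpus, dictionary, acoustic model, output."""
    return [
        "python3", "-m", "aligner.command_line.align",
        corpus_dir, dictionary, model, output_dir,
    ]


def run_alignment(mfa_dir, corpus_dir, dictionary, output_dir):
    """Run the aligner from inside the cloned MFA directory."""
    command = align_command(corpus_dir, dictionary, output_dir)
    subprocess.run(command, cwd=mfa_dir, check=True)  # raises on failure
```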
When alignment is finished, put the aligned TextGrid data on Havarti at
The corpus can be imported with the MFA parser.
This corpus is imported in PolyglotDB under the name