ICE-Can

ICE-Can (Voices of the International Corpus of English (VOICE) CANADA) is a 70-recording corpus of speakers of Canadian English. It contains recordings of monologues and dialogues in scripted and unscripted contexts. This page describes the steps to treat this corpus so that it may be used with the Montreal Forced Aligner and imported for the SPADE project.

Get Dataset

The International Corpus of English has traditionally been found here, but as of May 2017, this link is down. As such, the corpus must currently be downloaded manually from the ICE-Can link at the University of Alberta.

Each item in the corpus consists of an audio recording and a transcript, and the corpus as a whole includes one spreadsheet of metadata. Download these according to this structure:

 ICE-Can
 | (metadata.xls file)
 +- wav
 |  | (.wav files)
 |  +- mp3
 |     | (.mp3 files)
 +- txt
 |  | (.txt files)

The .txt file corresponding to S2A-004_1.wav is missing from the corpus, so put a placeholder .txt file with the same name in the txt folder.
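The placeholder step can be sketched in Python as follows. This is a minimal illustration, not part of the project's scripts: the function name add_placeholder_transcripts is hypothetical, and it assumes that a recording and its transcript share the same base name. It creates an empty .txt for every .wav that lacks one, so review the returned list to confirm that only the expected file was affected.

```python
from pathlib import Path


def add_placeholder_transcripts(wav_dir, txt_dir):
    """Create an empty .txt for every .wav that has no transcript.

    Returns the names of the placeholder files that were created.
    (Hypothetical helper; assumes matching base names.)
    """
    wav_dir, txt_dir = Path(wav_dir), Path(txt_dir)
    created = []
    for wav in sorted(wav_dir.glob("*.wav")):
        txt = txt_dir / (wav.stem + ".txt")
        if not txt.exists():
            txt.touch()  # empty placeholder with the recording's base name
            created.append(txt.name)
    return created
```

From the top-level ICE-Can directory, this would be called as add_placeholder_transcripts("wav", "txt").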

Once downloaded, put this corpus data on Havarti at corpora/ICE-Can.

Treating Audio

Converting to .wav

The .mp3 files included in the corpus must be converted to .wav. This is done with the mp3ToWav.py script (update with link). mp3ToWav.py needs the following prerequisites to run:

  • PyDub, installed through pip install pydub
  • ffmpeg, see PyDub documentation for installation details

Download the script into the wav directory and then run python mp3ToWav.py mp3.
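For readers without access to mp3ToWav.py, the conversion can be approximated with PyDub as below. This is a sketch under stated assumptions and not the actual script: the function names wav_target and convert_directory are hypothetical, and the output naming (same base name, .wav extension) is assumed. AudioSegment.from_mp3 and export are PyDub calls and rely on ffmpeg being installed.

```python
from pathlib import Path


def wav_target(mp3_path, out_dir):
    """Return the .wav path that an .mp3 file would be converted to."""
    return Path(out_dir) / (Path(mp3_path).stem + ".wav")


def convert_directory(mp3_dir, out_dir="."):
    """Convert every .mp3 in mp3_dir to a .wav in out_dir."""
    # Imported here so the path helper above works without PyDub installed.
    from pydub import AudioSegment

    for mp3 in sorted(Path(mp3_dir).glob("*.mp3")):
        target = wav_target(mp3, out_dir)
        # from_mp3 decodes through ffmpeg; export re-encodes as WAV.
        AudioSegment.from_mp3(str(mp3)).export(str(target), format="wav")
```

Run from inside the wav directory, convert_directory("mp3") would write the converted files next to the existing .wav files.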

Downsampling

Many .wav files included in the corpus have a sampling rate that is too high and must be downsampled to 44.1 kHz. This is done with the resample.praat script (update with link). resample.praat needs Praat to run.

In the original ICE-Can directory, create a new directory wavDownsample to be used for output. Then, in Praat, go to Praat > Open Praat Script, open resample.praat, set the parameters to output a .wav file at 44.1 kHz, and click Run.
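To see which files need downsampling, or to confirm afterwards that none remain above 44.1 kHz, the sampling rates can be read with Python's standard wave module. This is only a checking aid, not a replacement for resample.praat. The function name files_above_target is hypothetical, and the wave module reads uncompressed PCM .wav files only.

```python
import wave
from pathlib import Path

TARGET_RATE = 44100  # Hz, the rate named in this section


def files_above_target(wav_dir, target=TARGET_RATE):
    """Return (file name, sampling rate) pairs for .wav files above target."""
    too_high = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
        if rate > target:
            too_high.append((path.name, rate))
    return too_high
```

After a successful run of the Praat script, files_above_target("wavDownsample") should return an empty list.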

When both of these steps are done, put this treated audio data on Havarti at corpora/ICE-Can/treated.

Treating Transcripts

The transcripts included with the corpus are written in a markup language that is irrelevant for our purposes. To clean the transcripts and convert them into a TextGrid format, we use the translate_transcript.py script (update with link). translate_transcript.py needs the following prerequisites to run:

Download the script into the original ICE-Can directory and create a new directory textgrid to be used for output. Then run python translate_transcript.py txt wav textgrid metadata.xls.

There is an undetermined but isolated problem with the resulting TextGrid S1B-043_2.TextGrid, so delete it.

The script will output a file corpuserrors.txt, which records every fix it applied to genuine errors and typos in the original transcripts.
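To give a sense of the kind of cleaning involved, here is a small Python sketch. It is not translate_transcript.py and covers far less. It assumes the transcripts use the angle-bracketed tags common to ICE spoken texts (for example, a speaker tag such as <$A>, a unit marker <#>, a pause <,>, and <O>...</O> around untranscribed material). The tag list and the function name strip_markup are assumptions for illustration.

```python
import re

# Spans assumed to hold commentary rather than speech: <O>, <&>, <X>.
EDITORIAL_SPAN = re.compile(r"<(O|&|X)>.*?</\1>", re.DOTALL)
# Any remaining angle-bracketed tag, opening or closing.
ANY_TAG = re.compile(r"<[^>]*>")


def strip_markup(line):
    """Remove editorial spans and tags, then collapse whitespace."""
    line = EDITORIAL_SPAN.sub(" ", line)
    line = ANY_TAG.sub(" ", line)
    return " ".join(line.split())
```

With these assumptions, strip_markup("<$A> <#> Well <,> I think <O>laugh</O> so") gives "Well I think so".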

Dictionary and Acoustic Model

The dictionary being used for this task is the lexicon from LibriSpeech.

The acoustic model is the pre-trained English model included with the Montreal Forced Aligner.

Alignment

Preparation

For the aligner to function, the .wav files and their corresponding TextGrids must be stored in one folder together. Create a new directory textgrid-wav inside the main ICE-Can directory and copy all the downsampled .wav files and all the .TextGrid files into it.
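The copy step can be scripted, for instance as in the Python sketch below. The function name gather_for_alignment is hypothetical. Besides copying, it returns the base names that have a .wav but no .TextGrid, or the reverse, which is a convenient way to spot recordings that will not be aligned (for example, one whose TextGrid was deleted earlier).

```python
import shutil
from pathlib import Path


def gather_for_alignment(wav_dir, textgrid_dir, out_dir):
    """Copy all .wav and .TextGrid files into out_dir.

    Returns the sorted base names that lack either a .wav or a .TextGrid.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    wavs = sorted(Path(wav_dir).glob("*.wav"))
    grids = sorted(Path(textgrid_dir).glob("*.TextGrid"))
    for src in wavs + grids:
        shutil.copy2(src, out / src.name)
    wav_stems = {p.stem for p in wavs}
    grid_stems = {p.stem for p in grids}
    # Symmetric difference: names present on one side only.
    return sorted(wav_stems ^ grid_stems)
```

From the main ICE-Can directory, this would be called as gather_for_alignment("wavDownsample", "textgrid", "textgrid-wav").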

The working directory structure should now look like this:

 ICE-Can
 | (metadata.xls file)
 | corpuserrors.txt
 +- wav
 |  | (.wav files)
 |  +- mp3
 |     | (.mp3 files)
 +- txt
 |  | (.txt files)
 +- wavDownsample
 |  | (downsampled .wav files)
 +- textgrid
 |  | (.TextGrid files)
 +- textgrid-wav
 |  | (downsampled .wav files)
 |  | (.TextGrid files)

Running

Alignment uses the Montreal Forced Aligner.

For this corpus to align well, we need to make some configuration changes:

  1. Clone or download the MFA from source on GitHub.
  2. Download the binaries by navigating to thirdparty/ inside the directory the MFA was cloned into and running python3 download_binaries.py.
  3. Navigate back into the directory the MFA was cloned into, navigate into aligner/, and open config.py in a text editor.
  4. Change the value of self.beam on line 60 to 20.

Then, to run the aligner with the LibriSpeech dictionary and the pre-trained English model, navigate into the directory the aligner was cloned into, and run python3 -m aligner.command_line.align /path/to/ICE-Can/textgrid-wav /path/to/librispeech-lexicon.txt english /path/to/desired/output/directory.
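If the alignment is launched from a script rather than typed by hand, the same command can be assembled in Python, as in the sketch below. The function names build_align_command and run_alignment are hypothetical. The arguments mirror the command above, and the working directory is set to the cloned MFA directory because the aligner module is run from there.

```python
import subprocess


def build_align_command(corpus_dir, dictionary, output_dir, model="english"):
    """Return the argument list for the align command shown above."""
    return [
        "python3", "-m", "aligner.command_line.align",
        str(corpus_dir), str(dictionary), model, str(output_dir),
    ]


def run_alignment(mfa_dir, corpus_dir, dictionary, output_dir):
    """Run the aligner from inside the cloned MFA directory."""
    command = build_align_command(corpus_dir, dictionary, output_dir)
    # check=True raises CalledProcessError if the aligner exits with an error.
    subprocess.run(command, cwd=mfa_dir, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.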

When alignment is finished, put the aligned TextGrid data on Havarti at corpora/ICE-Can/aligned.

Import

The corpus can be imported with the MFA parser.

This corpus is imported in PolyglotDB under the name ICE_Can.