ICE-Can

From MLML
Revision as of 19:42, 11 May 2017 by AColes (Talk | contribs)

ICE-Can (Voices of the International Corpus of English (VOICE) CANADA) is a 70-recording corpus of speakers of Canadian English. It contains recordings of monologues and dialogues in scripted and unscripted contexts. This page describes the steps to treat this corpus so that it may be used with the Montreal Forced Aligner and imported for the SPADE project.

Get Dataset

The International Corpus of English has traditionally been found here, but as of May 2017, this link is down. As such, the corpus must currently be downloaded manually from the ICE-Can link at the University of Alberta.

Each file in the corpus has associated with it an audio recording and a transcript, and the corpus itself contains one spreadsheet of metadata. Download these according to this structure:

 ICE-Can
 | (metadata.xls file)
 +- wav
 |  | (.wav files)
 |  +- mp3
 |     | (.mp3 files)
 +- txt
 |  | (.txt files)

The .txt file corresponding to S2A-004_1.wav is missing from the corpus, so put a placeholder .txt file with the same name in the txt folder.
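The placeholder step can be sketched in Python. This is a minimal illustration, not part of the project's scripts: the function name create_missing_placeholders is hypothetical, and it assumes an empty file is an acceptable placeholder, which this page does not confirm.

```python
from pathlib import Path


def create_missing_placeholders(wav_dir, txt_dir):
    """Create an empty .txt for every .wav without a transcript.

    Returns the list of placeholder paths that were created.
    Assumption: an empty file is a good enough placeholder.
    """
    wav_dir, txt_dir = Path(wav_dir), Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for wav in sorted(wav_dir.glob("*.wav")):
        placeholder = txt_dir / (wav.stem + ".txt")
        if not placeholder.exists():
            placeholder.touch()  # existing transcripts are never overwritten
            created.append(placeholder)
    return created
```

Called as create_missing_placeholders("wav", "txt") from the ICE-Can directory, this would be expected to create S2A-004_1.txt and to report any other recording that lacks a transcript.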

Once downloaded, put this corpus data on Havarti (yet to be done).

Treating Audio

Converting to .wav

The .mp3 files included in the corpus must be converted to .wav. This is done with the mp3ToWav.py script (update with link). mp3ToWav.py needs the following prerequisites to run:

  • PyDub, installed through pip install pydub
  • ffmpeg, see PyDub documentation for installation details

Download the script into the wav directory and then run python mp3ToWav.py mp3.
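Since mp3ToWav.py is not yet linked, the following is only a rough sketch of what such a conversion might look like with PyDub, not the script itself. The function names and the choice to write each .wav into the current directory are assumptions.

```python
from pathlib import Path


def wav_path_for(mp3_path, out_dir):
    """Return the .wav path in out_dir that corresponds to an .mp3 file."""
    return Path(out_dir) / (Path(mp3_path).stem + ".wav")


def convert_directory(mp3_dir, out_dir="."):
    """Convert every .mp3 in mp3_dir to a .wav of the same name in out_dir."""
    # Imported here so the path helper above works without PyDub installed.
    # PyDub needs ffmpeg on the PATH to decode .mp3 files.
    from pydub import AudioSegment

    for mp3 in sorted(Path(mp3_dir).glob("*.mp3")):
        target = wav_path_for(mp3, out_dir)
        AudioSegment.from_mp3(str(mp3)).export(str(target), format="wav")
        print(f"{mp3.name} -> {target.name}")
```

Run from inside the wav directory, convert_directory("mp3") would mirror the python mp3ToWav.py mp3 invocation above.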

Downsampling

Many .wav files included in the corpus have a sampling rate that is too high and must be downsampled to 44.1 kHz. This is done with the downsample.py script (update with link). downsample.py needs the following prerequisites to run:

  • PyDub, installed through pip install pydub
  • ffmpeg, see PyDub documentation for installation details

Download the script into the original ICE-Can directory and create a new directory wavDownsample to be used for output. Then run python downsample.py wav wavDownsample.
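As downsample.py is not yet linked, here is a hedged sketch of one way to do this step. The function names are hypothetical, and copying files that are already at or below 44.1 kHz unchanged is an assumption about the desired behaviour. The rate check uses only the standard library; the resampling uses PyDub.

```python
import wave
from pathlib import Path

TARGET_RATE = 44100  # 44.1 kHz, the target rate named above


def sample_rate(wav_path):
    """Read the sampling rate from a PCM .wav header."""
    with wave.open(str(wav_path), "rb") as handle:
        return handle.getframerate()


def files_above_target(directory, target=TARGET_RATE):
    """List the .wav files in directory whose rate exceeds target."""
    return [p for p in sorted(Path(directory).glob("*.wav"))
            if sample_rate(p) > target]


def downsample_directory(in_dir, out_dir, target=TARGET_RATE):
    """Write every .wav to out_dir, resampling those above target."""
    from pydub import AudioSegment  # imported lazily; only needed here

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(in_dir).glob("*.wav")):
        audio = AudioSegment.from_wav(str(wav))
        if audio.frame_rate > target:
            audio = audio.set_frame_rate(target)
        audio.export(str(out_dir / wav.name), format="wav")
```

After downsample_directory("wav", "wavDownsample"), files_above_target("wavDownsample") should return an empty list if every file was treated.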

When both of these steps are done, put this treated audio data on Havarti (yet to be done).

Treating Transcripts

The transcripts included with the corpus are written in a markup language that is not needed for our purposes. To clean the transcripts and convert them into a TextGrid format, we use the translate_transcript.py script (update with link). The prerequisites for translate_transcript.py are yet to be listed.

Download the script into the original ICE-Can directory and create a new directory textgrid to be used for output. Then run python translate_transcript.py txt wav textgrid metadata.xls.

The resulting S1B-043_2.TextGrid has a problem whose cause is undetermined but which is isolated to that file, so delete it.

The script also writes a file corpuserrors.txt, which records every fix it made to genuine errors and typos in the original transcripts.

Dictionary and Acoustic Model

The dictionary being used for this task is the lexicon from LibriSpeech. (Acoustic model still being determined.)

Alignment

Preparation

For the aligner to function, the .wav files and their corresponding TextGrids must be stored together in one folder. Create a new directory textgrid-wav inside the main ICE-Can directory and copy all the downsampled .wav files and all the .TextGrid files into it.
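The copy step might be scripted as follows. This is a sketch under stated assumptions: the function name build_alignment_dir is hypothetical, and unlike the instruction above it copies only recordings that have both a .wav and a .TextGrid, and skips S1B-043_2 entirely, so that unmatched files can be reviewed by hand.

```python
import shutil
from pathlib import Path

# The TextGrid this page says to delete; skipped here along with its .wav.
EXCLUDED = {"S1B-043_2"}


def build_alignment_dir(wav_dir, textgrid_dir, out_dir, excluded=EXCLUDED):
    """Copy matching .wav/.TextGrid pairs into out_dir.

    Returns the sorted names (without extension) of files that have
    no partner, so they can be checked manually.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    wavs = {p.stem: p for p in Path(wav_dir).glob("*.wav")}
    grids = {p.stem: p for p in Path(textgrid_dir).glob("*.TextGrid")}
    excluded = set(excluded)
    for stem in sorted((wavs.keys() & grids.keys()) - excluded):
        shutil.copy2(wavs[stem], out_dir / wavs[stem].name)
        shutil.copy2(grids[stem], out_dir / grids[stem].name)
    return sorted((wavs.keys() ^ grids.keys()) - excluded)
```

From the ICE-Can directory, build_alignment_dir("wavDownsample", "textgrid", "textgrid-wav") would populate the new folder and return any recordings that lack a partner file.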

The ultimate directory structure should now look like this:

 ICE-Can
 | (metadata.xls file)
 | corpuserrors.txt
 +- wav
 |  | (.wav files)
 |  +- mp3
 |     | (.mp3 files)
 +- txt
 |  | (.txt files)
 +- wavDownsample
 |  | (downsampled .wav files)
 +- textgrid
 |  | (.TextGrid files)
 +- textgrid-wav
 |  | (downsampled .wav files)
 |  | (.TextGrid files)

Running

Alignment uses the Montreal Forced Aligner. To run it with the LibriSpeech dictionary (the acoustic model is still being determined), first navigate to the aligner's directory, then run bin/mfa_train_and_align /path/to/ICE-Can/textgrid-wav /path/to/librispeech-lexicon.txt /path/to/desired/output/directory.
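For repeated runs, the command above could be wrapped in a small Python helper. This is only a convenience sketch: the function names are hypothetical, it passes exactly the three arguments shown above and nothing else, and all paths are placeholders to be replaced with real locations.

```python
import subprocess
from pathlib import Path


def build_mfa_command(aligner_dir, corpus_dir, dictionary, output_dir):
    """Return the argument list for the mfa_train_and_align call above."""
    # Resolved to an absolute path so it is found regardless of the cwd.
    executable = Path(aligner_dir).resolve() / "bin" / "mfa_train_and_align"
    return [str(executable), str(corpus_dir), str(dictionary), str(output_dir)]


def run_alignment(aligner_dir, corpus_dir, dictionary, output_dir):
    """Run the aligner from its own directory; raise if it exits non-zero.

    The corpus, dictionary, and output paths should be absolute,
    because the working directory changes to aligner_dir.
    """
    command = build_mfa_command(aligner_dir, corpus_dir, dictionary, output_dir)
    subprocess.run(command, cwd=aligner_dir, check=True)
```

Using check=True means a failed alignment raises subprocess.CalledProcessError instead of passing silently.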