[https://dataverse.library.ualberta.ca/dvn/dv/VOICE ICE-Can] (Voices of the International Corpus of English (VOICE) CANADA) is a 70-recording corpus of speakers of Canadian English. It contains recordings of monologues and dialogues in scripted and unscripted contexts. This page describes the steps to treat this corpus so that it may be used with the [http://montreal-forced-aligner.readthedocs.io/en/latest/index.html Montreal Forced Aligner] and imported for the [[SPADE]] project.

== Get Dataset ==
The International Corpus of English has traditionally been available [http://ice-corpora.net here], but as of May 2017 this link is down. As such, the corpus must currently be downloaded manually from the [https://dataverse.library.ualberta.ca/dvn/dv/VOICE ICE-Can dataverse at the University of Alberta].

Each recording in the corpus has an associated audio file and a transcript, and the corpus itself contains one spreadsheet of metadata. Download these according to this structure:
 ICE-Can
 | (metadata.xls file)
 +- wav
 |  | (.wav files)
 |  +- mp3
 |      | (.mp3 files)
 +- txt
 |  | (.txt files)

The .txt file corresponding to <code>S2A-004_1.wav</code> is missing from the corpus, so put a placeholder .txt file with the same name in the <code>txt</code> folder.

Once downloaded, put this corpus data on Havarti ''(yet to be done)''.
== Treating Audio ==
=== Converting to .wav ===
The .mp3 files included in the corpus must be converted to .wav. This is done with the <code>mp3ToWav.py</code> script ''(update with link)''. <code>mp3ToWav.py</code> needs the following prerequisites to run:
* [https://github.com/jiaaro/pydub PyDub], installed through <code>pip install pydub</code>
* [https://ffmpeg.org ffmpeg], see the PyDub documentation for installation details

Download the script into the <code>wav</code> directory and then run <code>python mp3ToWav.py mp3</code>.
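The script itself is not yet linked above, so, as an illustration only, here is a minimal sketch of what the conversion step might look like with PyDub, assuming the script is run from inside the <code>wav</code> directory with the name of the .mp3 folder as its only argument (as in the command above) and that it writes its output into the current directory:

 # mp3ToWav.py -- minimal sketch only, not the actual project script.
 # Converts every .mp3 in the directory named on the command line to a
 # .wav file written into the current working directory (the wav folder).
 import os
 import sys
 from pydub import AudioSegment  # decoding .mp3 requires ffmpeg
 mp3_dir = sys.argv[1]  # e.g. "mp3"
 for name in sorted(os.listdir(mp3_dir)):
     if not name.lower().endswith(".mp3"):
         continue
     audio = AudioSegment.from_mp3(os.path.join(mp3_dir, name))
     wav_name = os.path.splitext(name)[0] + ".wav"
     audio.export(wav_name, format="wav")
     print(name + " -> " + wav_name)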
 
=== Downsampling ===
Many of the .wav files included in the corpus have a sampling rate that is too high and must be downsampled to 44.1 kHz. This is done with the <code>downsample.py</code> script ''(update with link)''. <code>downsample.py</code> needs the following prerequisites to run:
* [https://github.com/jiaaro/pydub PyDub], installed through <code>pip install pydub</code>
* [https://ffmpeg.org ffmpeg], see the PyDub documentation for installation details

Download the script into the original <code>ICE-Can</code> directory and create a new directory <code>wavDownsample</code> to be used for output. Then run <code>python downsample.py wav wavDownsample</code>.
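As with the conversion script, <code>downsample.py</code> is not yet linked; the following is a minimal sketch of the downsampling step, assuming the script takes an input directory and an already created output directory as its two arguments, as in the command above:

 # downsample.py -- minimal sketch only, not the actual project script.
 # Writes a copy of every .wav in the input directory to the output
 # directory, downsampling to 44.1 kHz where the sampling rate is higher.
 import os
 import sys
 from pydub import AudioSegment  # requires ffmpeg
 in_dir, out_dir = sys.argv[1], sys.argv[2]  # e.g. "wav" and "wavDownsample"
 for name in sorted(os.listdir(in_dir)):
     if not name.lower().endswith(".wav"):
         continue
     audio = AudioSegment.from_wav(os.path.join(in_dir, name))
     if audio.frame_rate > 44100:
         audio = audio.set_frame_rate(44100)
     audio.export(os.path.join(out_dir, name), format="wav")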
 
When both of these steps are done, put this treated audio data on Havarti ''(yet to be done)''.

== Treating Transcripts ==
The transcripts included with the corpus are written in a markup language that is irrelevant for our purposes. To clean the transcripts and convert them into the [http://www.fon.hum.uva.nl/praat/manual/TextGrid.html TextGrid] format, we use the <code>translate_transcript.py</code> script ''(update with link)''. <code>translate_transcript.py</code> needs the following prerequisites to run:
* [https://github.com/scipy/scipy#installation SciPy], see its documentation for installation details
* [https://github.com/kylebgorman/textgrid textgrid], installed through <code>pip install git+http://github.com/kylebgorman/textgrid.git</code>
* [https://github.com/python-excel/xlrd xlrd], installed through <code>pip install xlrd</code>

Download the script into the original <code>ICE-Can</code> directory and create a new directory <code>textgrid</code> to be used for output. Then run <code>python translate_transcript.py txt wav textgrid metadata.xls</code>.
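The <code>translate_transcript.py</code> script itself is not yet linked either. As an illustration of the output step only, here is a minimal sketch of how a cleaned transcript could be written out as a TextGrid with the textgrid package, using SciPy to read the duration of the matching .wav file. The <code>write_textgrid</code> helper and the <code>utterances</code> structure are hypothetical, and the actual cleaning of the corpus markup and the speaker lookup in <code>metadata.xls</code> (via xlrd) are omitted:

 # Minimal sketch of the TextGrid-writing step only. The real
 # translate_transcript.py also strips the corpus markup and reads
 # speaker metadata from metadata.xls with xlrd.
 from scipy.io import wavfile
 from textgrid import TextGrid, IntervalTier
 def write_textgrid(wav_path, utterances, out_path, tier_name="speaker"):
     # utterances: list of (start_sec, end_sec, cleaned_text) tuples
     rate, samples = wavfile.read(wav_path)  # sampling rate and samples
     duration = len(samples) / float(rate)   # total duration in seconds
     tg = TextGrid(maxTime=duration)
     tier = IntervalTier(name=tier_name, minTime=0.0, maxTime=duration)
     for start, end, text in utterances:
         tier.add(start, end, text)          # one labelled interval per utterance
     tg.append(tier)
     tg.write(out_path)                      # e.g. "textgrid/<name>.TextGrid"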
 
There is an undetermined but isolated problem with the resulting TextGrid <code>S1B-043_2.TextGrid</code>, so delete it.

The script also outputs a file <code>corpuserrors.txt</code>, which records all of the fixes made to actual errors/typos in the original transcripts.

== Dictionary and Acoustic Model ==
The dictionary being used for this task is the lexicon from [http://www.openslr.org/12/ LibriSpeech]. ''(Acoustic model still being determined.)''
 
== Alignment ==
=== Preparation ===
For the aligner to function, the .wav files and their corresponding TextGrids must be stored together in one folder. Create a new directory <code>textgrid-wav</code> inside the main <code>ICE-Can</code> directory and copy all the downsampled .wav files and all the .TextGrid files into it; a minimal sketch of this copying step is given after the directory listing below.

The final directory structure should now look like this:
 
 ICE-Can
 | (metadata.xls file)
 | corpuserrors.txt
 +- wav
 |  | (.wav files)
 |  +- mp3
 |      | (.mp3 files)
 +- txt
 |  | (.txt files)
 +- wavDownsample
 |  | (downsampled .wav files)
 +- textgrid
 |  | (.TextGrid files)
 +- textgrid-wav
 |  | (downsampled .wav files)
 |  | (.TextGrid files)
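Here is the minimal sketch of the copying step described under Preparation, assuming it is run from inside the <code>ICE-Can</code> directory with the layout shown above:

 # Collect the downsampled audio and the TextGrids into textgrid-wav so
 # that the aligner sees each .wav/.TextGrid pair in a single folder.
 import os
 import shutil
 if not os.path.isdir("textgrid-wav"):
     os.mkdir("textgrid-wav")
 for src_dir, ext in [("wavDownsample", ".wav"), ("textgrid", ".TextGrid")]:
     for name in os.listdir(src_dir):
         if name.endswith(ext):
             shutil.copy(os.path.join(src_dir, name), "textgrid-wav")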
 
=== Running ===
Alignment uses the [http://montreal-forced-aligner.readthedocs.io/en/latest/index.html Montreal Forced Aligner]. To run it with the LibriSpeech dictionary and ''(appropriate acoustic model)'', first navigate to the directory of the aligner, then run <code>bin/mfa_train_and_align /path/to/ICE-Can/textgrid-wav /path/to/librispeech-lexicon.txt /path/to/desired/output/directory</code>.
