Raleigh Corpus

From MLML

Revision as of 16:02, 30 May 2017

The Raleigh Corpus is a corpus ...fill corpus information here. This page describes how to prepare the corpus so that it can be used with the Montreal Forced Aligner (MFA) and imported for the SPADE project.

Get Dataset

This data can be found on Havarti at corpora/Raleigh.

Treating Audio

None needed.

Treating Transcripts

Short to Long TextGrids

The TextGrid files included with the corpus are in Praat's Short TextGrid format, which is not compatible with Python's TextGrid-reading library. A Praat script, make_long_textgrid.praat (link here), can be used to convert these files. It needs the following prerequisites to run:

In the original Raleigh directory, create a new directory long_textgrids to be used for output. Then, in Praat, go to Praat > Open Praat Script, open make_long_textgrid.praat, and click Run. This will convert each file into a normal Long TextGrid text file and put it in the long_textgrids directory.
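Before or after running the Praat script, it can be handy to check which format a given file is in. The following is a minimal sketch, not part of the corpus tooling: the function name is_short_textgrid is hypothetical, and it relies only on the fact that a long TextGrid labels its first value as xmin = ..., while a short TextGrid gives the bare number. It takes the file contents as a string, so reading the file with the right encoding is left to the caller.

```python
def is_short_textgrid(text):
    """Return True if `text` looks like a Praat short-format TextGrid.

    Hypothetical helper: after the two header lines, the long format
    writes `xmin = <number>`, whereas the short format writes only
    the number itself.
    """
    # Drop blank lines so the header is always lines[0] and lines[1].
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3 or "ooTextFile" not in lines[0]:
        raise ValueError("not a Praat text file")
    # lines[2] is the first data line after the header.
    return not lines[2].startswith("xmin")
```

A file for which this returns True after conversion probably was not written to long_textgrids correctly.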

Fixing TextGrid Contents

The contents of the TextGrids are still not in the format required by the MFA. The script raleigh_tg_fix.py (link here) re-orders the required tiers and deletes superfluous ones. It needs the following prerequisites to run:

In the original Raleigh directory, create a new directory fixed_textgrids to be used for output. Then, run python raleigh_tg_fix.py long_textgrids fixed_textgrids. This fixes the TextGrids that are Praat-readable.

Alignment

Already done. (Will be redone eventually.)

Import

To import, move the files from the directory fixed_textgrids into a directory that also contains all the .wav files. Call this new directory textgrid-wav.

The following files have a misformatted TextGrid still not readable by the Python library, and thus they and their corresponding .wav files should be left out of this directory:

 ral1550d.TextGrid
 ral3450d.TextGrid
 ral3590d.TextGrid
 ral3740d.TextGrid
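Assembling textgrid-wav by hand is error-prone, so here is a minimal sketch of one way to do it. The function build_import_dir and its parameters are hypothetical, and it makes two assumptions that are not from this page: it copies files rather than moving them, so the originals stay intact, and it also skips any TextGrid that has no .wav file of the same name.

```python
import shutil
from pathlib import Path

# File stems of the TextGrids listed above as unreadable.
EXCLUDED = {"ral1550d", "ral3450d", "ral3590d", "ral3740d"}

def build_import_dir(fixed_dir, wav_dir, out_dir, excluded=EXCLUDED):
    """Copy matching TextGrid/.wav pairs into out_dir.

    Hypothetical helper. Returns the sorted list of stems copied.
    """
    fixed_dir, wav_dir, out_dir = Path(fixed_dir), Path(wav_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for tg in sorted(fixed_dir.glob("*.TextGrid")):
        wav = wav_dir / (tg.stem + ".wav")
        # Leave out excluded files and TextGrids without audio.
        if tg.stem in excluded or not wav.is_file():
            continue
        shutil.copy2(tg, out_dir / tg.name)
        shutil.copy2(wav, out_dir / wav.name)
        copied.append(tg.stem)
    return copied
```

For example, build_import_dir("fixed_textgrids", "wav", "textgrid-wav") would populate textgrid-wav, where "wav" stands in for wherever the audio files actually live.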

The corpus can then be imported using the MFA parser.