Raleigh Corpus


The Raleigh Corpus is a corpus ...fill corpus information here. This page describes the steps to treat this corpus so that it may be used with the Montreal Forced Aligner and imported for the SPADE project.

Get Dataset

This data can be found on Havarti at corpora/Raleigh.

Treating Audio

None needed.

Treating Transcripts

Short to Long TextGrids

The TextGrid files included with the corpus are in Praat's Short TextGrid format, which is not compatible with Python's TextGrid-reading library. A Praat script, make_long_textgrid.praat (link here), can be used to convert these files. It needs the following prerequisites to run:

In the original Raleigh directory, create a new directory long_textgrids to be used for output. Then, in Praat, go to Praat > Open Praat Script, open make_long_textgrid.praat, and click Run. This will convert each file into a normal Long TextGrid text file and put it in the long_textgrids directory.
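To confirm the conversion worked, the output files can be spot-checked from Python. The following is a minimal sketch, not part of the corpus tooling: the function names are hypothetical, and it relies only on the fact that a long-format TextGrid labels its values (the first line after the two-line header reads "xmin = ..."), while the short format stores a bare number there. The UTF-16 fallback is an assumption about how Praat may have saved the files.

```python
from pathlib import Path


def read_textgrid_text(path):
    """Read a TextGrid as text, trying UTF-8 first and then UTF-16 (assumed fallback)."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("utf-16")


def is_long_textgrid(text):
    """Return True if the text looks like a long-format Praat TextGrid."""
    # Drop blank lines; the first two remaining lines are the file header.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3:
        return False
    # Long format writes "xmin = 0"; short format writes just "0".
    return lines[2].startswith("xmin")


def find_unconverted(directory):
    """List files in `directory` that are still not in long format."""
    return [
        p.name
        for p in sorted(Path(directory).iterdir())
        if p.is_file() and not is_long_textgrid(read_textgrid_text(p))
    ]
```

Calling find_unconverted("long_textgrids") after running the Praat script should return an empty list if every file was converted.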

Fixing TextGrid Contents

The contents of the TextGrids are still not in the format required by the MFA. The script raleigh_tg_fix.py (link here) re-orders the required tiers and deletes superfluous ones. It needs the following prerequisites to run:

In the original Raleigh directory, create a new directory fixed_textgrids to be used for output. Then run python raleigh_tg_fix.py long_textgrids fixed_textgrids. This fixes those TextGrids that are Praat-readable.
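The idea of re-ordering required tiers and dropping superfluous ones can be illustrated as follows. This is only a sketch of the general technique and not the contents of raleigh_tg_fix.py: the function name, the tier names in the usage note, and the plain dictionary representation of tiers are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# One interval: (start time, end time, label)
Interval = Tuple[float, float, str]


def select_tiers(
    tiers: Dict[str, List[Interval]], wanted: Sequence[str]
) -> List[Tuple[str, List[Interval]]]:
    """Return only the tiers named in `wanted`, in that order.

    Any tier not listed in `wanted` is dropped. A KeyError is raised if a
    required tier is absent, so a misformatted file is noticed early.
    """
    missing = [name for name in wanted if name not in tiers]
    if missing:
        raise KeyError(f"missing required tiers: {missing}")
    return [(name, tiers[name]) for name in wanted]
```

For example, with hypothetical tiers named "comments", "speaker_b" and "speaker_a", select_tiers(tiers, ["speaker_a", "speaker_b"]) keeps the two speaker tiers in that order and discards "comments".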


Already done. (Will be redone eventually.)


To import, the files from the directory fixed_textgrids need to be moved into a directory containing all the .wav files as well. Call this new directory textgrid-wav.
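Assembling the combined directory can be scripted. The sketch below is one possible approach under stated assumptions: the function name and the skip parameter are hypothetical, the files are assumed to end in .TextGrid and .wav with matching base names, and files are copied rather than moved so that the originals stay in place. Base names of files that should be left out can be passed in skip.

```python
import shutil
from pathlib import Path


def assemble_import_dir(textgrid_dir, wav_dir, out_dir, skip=()):
    """Copy each TextGrid and its matching .wav into `out_dir`.

    Pairs whose base name is in `skip`, or whose .wav file is missing,
    are left out. Returns the base names that were copied.
    """
    textgrid_dir, wav_dir, out_dir = Path(textgrid_dir), Path(wav_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for tg in sorted(textgrid_dir.glob("*.TextGrid")):
        wav = wav_dir / (tg.stem + ".wav")
        if tg.stem in skip or not wav.exists():
            continue
        shutil.copy2(tg, out_dir / tg.name)
        shutil.copy2(wav, out_dir / wav.name)
        copied.append(tg.stem)
    return copied
```

A call such as assemble_import_dir("fixed_textgrids", ".", "textgrid-wav") assumes the .wav files sit in the current directory; adjust the second argument to wherever the audio actually lives.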

The following files have a misformatted TextGrid still not readable by the Python library, and thus they and their corresponding .wav files should be left out of this directory:


The corpus can then be imported using the MFA parser.

This corpus is imported in PolyglotDB under the name Raleigh.