SCOTS Corpus

The Scottish Corpus Of Texts and Speech (SCOTS) is a corpus containing over 1300 written and spoken texts; written texts make up 77% of it and spoken texts 23%. This page describes the steps needed to prepare the corpus so that it can be used with the Montreal Forced Aligner and imported for the SPADE project.

Get Dataset

This data can be found on Havarti at corpora/SCOTS.

Treating Audio

Non-English input

The following files are in Gaelic rather than any dialect of English. They should be moved out of the top-level directory containing the wav files and into their own sub-directory.

  1406_-_Catherine_Laing_Darren_Laing_GAELIC.wav
  1407_-_Donald_Ewen_Morrison_Margaret_MacDonald_GAELIC.wav
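
The move can be scripted. The following is a minimal sketch, assuming the wav files sit in a single top-level directory. The helper name move_to_subdir and the sub-directory name gaelic are illustrative choices, not part of the corpus or of any SPADE tooling.

```python
import shutil
from pathlib import Path

# File names taken from the list above.
GAELIC_WAVS = [
    "1406_-_Catherine_Laing_Darren_Laing_GAELIC.wav",
    "1407_-_Donald_Ewen_Morrison_Margaret_MacDonald_GAELIC.wav",
]


def move_to_subdir(top_dir, filenames, subdir_name="gaelic"):
    """Move the named files from top_dir into top_dir/subdir_name.

    Returns the names that were actually moved; names not present
    in top_dir are skipped. "gaelic" is a hypothetical default.
    """
    top = Path(top_dir)
    target = top / subdir_name
    target.mkdir(exist_ok=True)
    moved = []
    for name in filenames:
        src = top / name
        if src.is_file():
            shutil.move(str(src), str(target / name))
            moved.append(name)
    return moved
```

Calling move_to_subdir on the wav directory with GAELIC_WAVS moves both files. The same helper could be reused for the Gaelic TextGrid files by passing their names and the TextGrid directory.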

Treating Transcripts

Encoding

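Before treatment, the following files, whose encoding in the corpus is ambiguous, should be re-encoded as UTF-8. This can be done by opening each file in a text editor and re-saving it with UTF-8 encoding. Only transcripts can be re-encoded this way; where an entry below carries a .wav extension, the TextGrid of the same name is presumably meant.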

  0598John_lecture_1.TextGrid
  0799_Niall_Rob.TextGrid
  1387_-_Doreen_Waugh_Linda_Riddell.TextGrid
  1428_-_LESSER_CONTENT_scotland-glasgow.TextGrid
  1450_-_kyle_bettley_cheryl_campbell.wav
  1470_Mairead_Mackechnie.wav
  1471_Three_Shetlandic_Ladies.wav
  1485_-_David_and_Rosalyn_Sweeney.wav
  1521_-_Jim_and_June_Anderson.wav
  1544_-_Matthew_Fitt.wav
  1545_Christine_and_Greg_-_CHILD_disk_1.wav
  1546_Ross_and_Shona_-_CHILD_disk_1.wav
  1547_Andrea_and_Marcus_disk_2.wav
  1548_Jamie_and_Linda_-_CHILD_disk_1.wav
  1549_Fiona_and_Liam_-_CHILD_disk_1.wav
  1550_Kelly_and_Abbey_-_CHILD_disk_3.wav
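
Re-saving by hand can also be replaced by a short script. The following is a sketch only: the source encodings are not documented here, so it assumes a file is either UTF-16 with a byte-order mark, valid UTF-8, or otherwise Latin-1. That fallback is a guess, and since Latin-1 accepts any byte sequence, the output should be checked by eye. The function names are illustrative. It is meant for TextGrid transcripts and is best run on a copy of the data.

```python
import codecs
from pathlib import Path


def decode_textgrid_bytes(raw, fallback="latin-1"):
    """Decode raw bytes to text, guessing the encoding.

    Order: UTF-16 if a byte-order mark is present, then UTF-8
    (with or without BOM), then the fallback (an assumption).
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode(fallback)


def reencode_to_utf8(path, fallback="latin-1"):
    """Rewrite the file at path in place as UTF-8 (no BOM)."""
    path = Path(path)
    text = decode_textgrid_bytes(path.read_bytes(), fallback)
    # Write bytes directly so line endings are left untouched.
    path.write_bytes(text.encode("utf-8"))
```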

The transcripts of the SCOTS audio already come in TextGrid format, but they need some treatment before they can be used with the Montreal Forced Aligner.

Non-English input

The following files are not correctly formatted as TextGrids and are in Gaelic rather than any dialect of English. They should be moved out of the top-level directory containing the TextGrid files and into their own sub-directory.

  1406_-_Catherine_Laing_Darren_Laing_GAELIC_badformat.TextGrid
  1407_-_Donald_Ewen_Morrison_Margaret_MacDonald_GAELIC_badformat.TextGrid

Treatment

The treatment needed before alignment is done by clean_scots.py (add link). Before running it, ensure that all the TextGrids are in their own directory and that an output directory for the fixed transcripts exists. Then run:

  python clean_scots.py textgrid_input_dir textgrid_output_dir

The fixed TextGrid transcripts are written to textgrid_output_dir.
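
The prerequisites and the call can be wrapped in a small driver. The following is a sketch under stated assumptions: clean_scots.py is taken to accept exactly the two positional arguments shown on this page and to run under the same Python interpreter as the driver. The function names build_clean_command and run_clean_scots are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_clean_command(input_dir, output_dir, script="clean_scots.py"):
    """Return the argument list for: python clean_scots.py input_dir output_dir."""
    return [sys.executable, str(script), str(input_dir), str(output_dir)]


def run_clean_scots(input_dir, output_dir, script="clean_scots.py"):
    """Check the input directory, create the output directory, run the script."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    if not input_dir.is_dir():
        raise NotADirectoryError(f"TextGrid input directory not found: {input_dir}")
    # Create the output directory if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(build_clean_command(input_dir, output_dir, script), check=True)
```

Creating the output directory up front means a missing directory never has to be handled by the cleaning script itself.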

Dictionary

Not yet done.

Alignment

Pre-Alignment

The following file has no corresponding audio file and should therefore be left out of the directory used for alignment:

  1379_-_Lorraine_Peacock_Martin_Aitken_Simon_Marsh_.TextGrid

The following files have typos in their names, and should be renamed accordingly:

  0579_james_austinDOUBLE.TextGrid 
     (to 0579_james_austin.TextGrid)
  0598John_lecture_1.TextGrid
     (to 0598_john_lecture_1.TextGrid)
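
Both pre-alignment steps can be checked with a short script. The following is a sketch, assuming that each TextGrid is paired with its audio by sharing the same file name apart from the extension, and that the renamed 0598 transcript keeps its .TextGrid extension. The function names are illustrative.

```python
from pathlib import Path

# Renames from the list above. The 0598 target keeps the .TextGrid
# extension here, on the assumption that the transcript should end up
# with the same stem as its audio file.
RENAMES = {
    "0579_james_austinDOUBLE.TextGrid": "0579_james_austin.TextGrid",
    "0598John_lecture_1.TextGrid": "0598_john_lecture_1.TextGrid",
}


def apply_renames(textgrid_dir, renames=RENAMES):
    """Rename the listed TextGrids if present; return the new names."""
    textgrid_dir = Path(textgrid_dir)
    renamed = []
    for old, new in renames.items():
        src = textgrid_dir / old
        if src.is_file():
            src.rename(textgrid_dir / new)
            renamed.append(new)
    return renamed


def textgrids_without_audio(textgrid_dir, wav_dir):
    """Return sorted names of TextGrids with no same-stem .wav in wav_dir."""
    wav_stems = {p.stem for p in Path(wav_dir).glob("*.wav")}
    return sorted(
        p.name
        for p in Path(textgrid_dir).glob("*.TextGrid")
        if p.stem not in wav_stems
    )
```

Running apply_renames first and textgrids_without_audio afterwards gives a list of transcripts to exclude before alignment; any name it reports beyond the one listed above would point to a further naming mismatch.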