The dataset was already hosted on the MLML server, with the audio already in .wav format.
Santa Barbara transcription files (.trn) are aligned only at the utterance level. A script was written to parse these files into Praat .TextGrid format; utterances separated by less than 0.15 seconds of pause were collapsed into a single utterance. The collapsed utterances were then force-aligned with MFA using the LibriSpeech English dictionary, and a new acoustic model was trained.
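The pause-collapsing rule above can be sketched as follows. This is a minimal illustration, not the original script: the function name and the (start, end, text) tuple representation are assumptions, and a real implementation would read the .trn timestamps and write TextGrid tiers.

```python
def collapse_utterances(utterances, max_pause=0.15):
    """Merge consecutive (start, end, text) utterances whose gap is
    shorter than max_pause seconds into a single utterance."""
    merged = []
    for start, end, text in utterances:
        # If the pause since the previous utterance is below threshold,
        # extend the previous interval instead of starting a new one.
        if merged and start - merged[-1][1] < max_pause:
            prev_start, _, prev_text = merged[-1]
            merged[-1] = (prev_start, end, prev_text + " " + text)
        else:
            merged.append((start, end, text))
    return merged
```

For example, two utterances separated by a 0.1 s pause are joined, while a 0.5 s pause keeps them apart.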
Since the data was converted to TextGrids, the existing TextGrid importer for PolyglotDB could be used directly.