Speech Corpora for Different Languages: A Systematic Review
- 1 Department of Complex Information Security of Computer Systems, Tomsk State University of Control Systems and Radio Electronics, Tomsk, Russia
Abstract
The study of speech signals relies on carefully curated audio recordings, which are compiled and stored in specialized speech corpora. This article provides a comprehensive overview of such corpora across multiple languages, with particular focus on Russian, English, and Arabic. It notes that Russian and Arabic are represented by fewer corpora than English, for which more extensive resources are available. The discussion examines typical speech corpus structures, describes the standard parameters used to characterize corpora, and outlines common metrics used to describe the speech signal itself.
DOI: https://doi.org/10.3844/jcssp.2026.9.24
Copyright: © 2026 Vladimir Igorevich Fedoseev, Anton Aleksandrovich Konev and Natalia Sergeevna Repyuk. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Dataset
- Pronunciation
- Speech Corpora
- Transcript
- Speech Recognition