Speech Corpora for Different Languages: A Systematic Review

Vladimir Igorevich Fedoseev; Anton Aleksandrovich Konev; Natalia Sergeevna Repyuk

doi:10.3844/jcssp.2026.9.24

Review Article Open Access

Vladimir Igorevich Fedoseev¹, Anton Aleksandrovich Konev¹ and Natalia Sergeevna Repyuk¹

¹ Department of Complex Information Security of Computer Systems, Tomsk State University of Control Systems and Radio Electronics, Tomsk, Russia

Abstract

The study of speech signals relies on carefully curated audio recordings, which are compiled and stored in specialized speech corpora. This article provides a comprehensive overview of such corpora across multiple languages, with particular focus on Russian, English, and Arabic. It notes that Russian and Arabic are represented by fewer corpora than English, for which more extensive resources are available. The discussion examines typical speech corpus structures, describes the standard parameters used to characterize corpora, and outlines common metrics used to describe the speech signal itself.

Journal of Computer Science

Volume 22 No. 1, 2026, 9-24

DOI: https://doi.org/10.3844/jcssp.2026.9.24

Submitted On: 29 July 2024 Published On: 2 February 2026

How to Cite: Fedoseev, V. I., Konev, A. A. & Repyuk, N. S. (2026). Speech Corpora for Different Languages: A Systematic Review. Journal of Computer Science, 22(1), 9-24. https://doi.org/10.3844/jcssp.2026.9.24

Copyright: © 2026 Vladimir Igorevich Fedoseev, Anton Aleksandrovich Konev and Natalia Sergeevna Repyuk. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Dataset
Pronunciation
Speech Corpora
Transcript
Speech Recognition