Computing, Telecommunication and Control

Информатика, телекоммуникации и управление

2687-0517

10.18721/JCSTCS.18203

Dataset creation for comprehensive performance evaluation of automatic speech recognition systems

Создание набора данных для комплексной оценки производительности систем автоматического распознавания речи

Andrusenko

Andrei

andrusenkoau@gmail.com

0000-0003-1116-7765

56049610600

Drobintsev

Pavel

drobintsev_pd@spbstu.ru

Peter the Great St. Petersburg Polytechnic University

09 06 2025

Vol. 18, No. 2, pp. 33-44

The performance evaluation of Automatic Speech Recognition (ASR) systems depends heavily on the availability of diverse and representative test datasets that cover a wide range of complexity across various domains. This work introduces a novel methodology for collecting and preparing datasets for comprehensive ASR system evaluation. The proposed dataset incorporates a modern vocabulary enriched with numerous unique terms and proper nouns, facilitating an in-depth evaluation of both overall ASR performance and the effectiveness of context-biasing techniques in the computer science domain. Additionally, the dataset retains critical text features such as Punctuation and Capitalization (P&C), enabling a rigorous evaluation of P&C prediction algorithms. We present a detailed account of the dataset creation process, along with statistical and qualitative analyses of the dataset. Furthermore, we benchmark state-of-the-art ASR models, context-biasing approaches, and P&C prediction techniques on the proposed dataset, providing insight into their relative performance.

automatic speech recognition; test dataset; large language models; punctuation and capitalization; context-biasing