Computing, Telecommunication and Control

Информатика, телекоммуникации и управление

ISSN 2687-0517

DOI: 10.18721/JCSTCS.15406

Multi-channel transformer: A transformer-based model for multi-speaker speech recognition

Многоканальный трансформер: Модель для распознавания многоголосной речи, основанная на архитектуре трансформер

Ekaterina Fadeeva, rediska@yandex-team.ru

Vasily Ershov, noxoomo@yandex-team.ru

Yandex LLC

30 December 2022

Vol. 15, No. 4, pp. 73-85

Most modern approaches to multi-speaker speech recognition either cannot handle overlapping speech or take too long to run, which can be critical, for example, in real-time speech recognition. This paper presents a transformer-based end-to-end model for overlapping speech recognition, implemented as a generalization of the standard approach to speech recognition. The model achieves quality comparable to modern state-of-the-art models but requires fewer model calls, which speeds up inference. The paper also describes a procedure for generating synthetic training data, which compensates for the lack of real multi-speaker speech data by creating a stream of data from the initial collection.

Keywords: speech recognition, multi-speaker speech recognition, diarization, speech separation, voice technologies