<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "https://jats.nlm.nih.gov/publishing/1.3/JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xml:lang="en">
  <front xmlns:xlink="http://www.w3.org/1999/xlink">
    <journal-meta>
      <journal-title-group>
        <journal-title>Computing, Telecommunication and Control</journal-title>
        <trans-title-group xml:lang="ru">
          <trans-title>Информатика, телекоммуникации и управление</trans-title>
        </trans-title-group>
      </journal-title-group>
      <issn pub-type="epub">2687-0517</issn>
    </journal-meta>
    <article-meta xmlns:xlink="http://www.w3.org/1999/xlink">
      <article-id pub-id-type="publisher-id">6</article-id>
      <article-id pub-id-type="doi">10.18721/JCSTCS.15406</article-id>
      <title-group>
        <article-title>Multi-channel transformer: A transformer-based model for multi-speaker speech recognition</article-title>
        <trans-title-group xml:lang="ru">
          <trans-title>Многоканальный трансформер: Модель для распознавания многоголосной речи, основанная на архитектуре трансформер</trans-title>
        </trans-title-group>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Fadeeva</surname>
            <given-names>Ekaterina</given-names>
          </name>
          <xref ref-type="aff" rid="aff1"/>
          <email>rediska@yandex-team.ru</email>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Ershov</surname>
            <given-names>Vasily</given-names>
          </name>
          <xref ref-type="aff" rid="aff1"/>
          <email>noxoomo@yandex-team.ru</email>
        </contrib>
      </contrib-group>
      <aff id="aff1">Yandex LLC</aff>
      <pub-date publication-format="electronic" date-type="pub" iso-8601-date="2022-12-30">
        <day>30</day>
        <month>12</month>
        <year>2022</year>
      </pub-date>
      <volume>15</volume>
      <issue>4</issue>
      <fpage>73</fpage>
      <lpage>85</lpage>
      <self-uri xmlns:xlink="http://www.w3.org/1999/xlink" content-type="pdf" xlink:href="https://infocom.spbstu.ru/userfiles/files/articles/2022/4/73-85.pdf"/>
      <abstract xml:lang="en">
        <p>Most modern approaches to multi-speaker speech recognition either cannot handle overlapping speech or take a long time to run, which can be critical, for example, in real-time speech recognition. In this paper, a transformer-based end-to-end model for overlapping speech recognition is presented. It is built on a generalization of the standard approach to speech recognition. The introduced model achieves results comparable in quality to modern state-of-the-art models but requires fewer model calls, which speeds up inference. In addition, a procedure for generating synthetic data for model training is described. This procedure makes it possible to compensate for the lack of real multi-speaker speech training data by creating a stream of data from the initial collection.</p>
      </abstract>
      <kwd-group xml:lang="en">
        <kwd>speech recognition</kwd>
        <kwd>multi-speaker speech recognition</kwd>
        <kwd>diarization</kwd>
        <kwd>speech separation</kwd>
        <kwd>voice technologies</kwd>
      </kwd-group>
    </article-meta>
  </front>
</article>
