ResNet-SV: Fast and accurate speaker verification with a multi-layer cascade attention mechanism

Circuits and Systems for Receiving, Transmitting and Signal Processing
Authors:
Abstract:

A central challenge in the rapid development of voice biometrics is designing methods that combine speed and accuracy. Traditional solutions tend to trade one for the other, which either complicates the speaker verification process or reduces accuracy, especially under real-world conditions where background noise and variability in speech are substantial obstacles. This paper examines modern approaches and their architectural features. The proposed architecture is based on ResNet, originally designed for computer vision tasks, which was modified and adapted for speech processing. The proposed modification, based on a multi-layer cascade attention mechanism for extracting features from the convolutional blocks, is described in detail. It allows fewer layers to be used for feature extraction, thereby increasing the speed of the model, and handles noise in the audio signal more effectively. The paper concludes with the model parameters used in the training process and with the key metrics, EER and minDCF, computed on the VoxCeleb1 dataset. The results are compared with solutions built on other architectures. In the experiments, the authors achieved a high level of accuracy with a smaller number of neural network parameters. This work brings voice biometric systems closer to wider application in various scenarios.
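
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how the equal error rate (EER) and the minimum detection cost function (minDCF) are commonly computed from verification trial scores. It is not the authors' evaluation code. The function names, the toy scores, and the cost parameters (p_target, c_miss, c_fa) are assumptions chosen for illustration; the abstract does not state which values were used.

```python
import numpy as np

def error_rates(scores, labels):
    """False-acceptance and false-rejection rates over all thresholds.

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Thresholds are placed between consecutive sorted scores; tied scores
    are not treated specially in this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    labels = labels[np.argsort(scores)]
    n_tar = labels.sum()
    n_non = len(labels) - n_tar
    # Entry i corresponds to rejecting the i lowest-scoring trials.
    frr = np.concatenate(([0.0], np.cumsum(labels) / n_tar))
    far = np.concatenate(([1.0], 1.0 - np.cumsum(1 - labels) / n_non))
    return far, frr

def eer(scores, labels):
    """Operating point where FAR and FRR are (approximately) equal."""
    far, frr = error_rates(scores, labels)
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

def min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost (parameter values are assumed)."""
    far, frr = error_rates(scores, labels)
    dcf = c_miss * p_target * frr + c_fa * (1.0 - p_target) * far
    # Normalize by the cost of the best trivial accept-all/reject-all system.
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return dcf.min() / c_default

# Hypothetical toy trials: two target and two non-target scores.
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [1, 1, 0, 0]
toy_eer = eer(toy_scores, toy_labels)
toy_min_dcf = min_dcf(toy_scores, toy_labels, p_target=0.5)
```

Lower values are better for both metrics. EER summarizes a single balanced operating point, while minDCF weights misses and false alarms according to the assumed target prior and costs.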