A study of the vision transformer architecture using explainability methods

Information Technologies

The article addresses the explainability of a machine learning model's operating principles. The model considered is a transformer variant, the vision transformer, applied to image classification on the popular ImageNet-1000 dataset; this architecture can serve either as a standalone model or as a component of a more complex one. The explainability methods produce class activation maps, computed by algorithms based on forward and backward propagation of image tensors through the transformer's components: multi-head attention layers and fully connected multilayer networks. The aim of the work is to improve the explainability of the vision transformer's internal processes by analyzing the obtained activation maps and computing a metric that evaluates their explanatory quality. The results reveal patterns that reflect how the vision transformer operates when solving the image classification problem, and the explainability metric is used to assess the importance of the identified classification features.
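To illustrate the kind of forward-propagation-based map the abstract describes, the sketch below implements attention rollout, a common technique for aggregating a vision transformer's per-layer multi-head attention into a single token relevance map. This is a minimal sketch on synthetic attention matrices, not the article's actual algorithm; the shapes and helper name are assumptions for this example.

```python
import numpy as np

def attention_rollout(attentions):
    """Combine per-layer multi-head attention maps into one
    token-to-token relevance map (attention rollout).

    attentions: list of arrays of shape (heads, tokens, tokens),
    one per transformer layer, ordered first to last.
    Returns an array of shape (tokens, tokens).
    """
    rollout = None
    for layer_attn in attentions:
        # Average over heads, then add the identity to account for
        # the residual (skip) connections in each transformer block.
        attn = layer_attn.mean(axis=0)
        attn = attn + np.eye(attn.shape[-1])
        # Re-normalize rows so each stays a probability distribution.
        attn = attn / attn.sum(axis=-1, keepdims=True)
        rollout = attn if rollout is None else attn @ rollout
    return rollout

# Synthetic example: 3 layers, 4 heads, 1 [CLS] token + 9 patches.
rng = np.random.default_rng(0)
attns = [rng.random((4, 10, 10)) for _ in range(3)]
attns = [a / a.sum(axis=-1, keepdims=True) for a in attns]  # row-stochastic

rollout = attention_rollout(attns)
# Relevance of each image patch to the [CLS] token, as a 3x3 map
# that can be upsampled and overlaid on the input image.
cls_map = rollout[0, 1:].reshape(3, 3)
```

In a real setting the attention matrices would be captured from the trained vision transformer (e.g. via forward hooks), and gradient-weighted variants of this aggregation yield class-specific activation maps of the kind evaluated in the article.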