This article presents a comparative analysis of software and hardware implementations of two operations: summing transposed matrices and its modified form, transposing the sum of matrices. A distinguishing feature of the study is that the hardware implementation is obtained with high-level synthesis tools. The study is motivated by the widespread use of matrix operations across many classes of problems, the polynomial asymptotic complexity of matrix computations, and the lack of published data on applying high-level synthesis tools to the design of hardware devices for matrix computations. A step-by-step method for synthesizing and optimizing a hardware device is proposed, and software and hardware implementations of the two computational tasks are compared. The results show that the large performance gain of the hardware implementations comes from increasing the degree of parallelism of the computations. It is also concluded that attempts to reach high clock frequencies are ineffective, and that the speedup obtained through parallelization comes at the cost of increased resource usage.
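
The two operations compared in the study can be illustrated with a minimal software sketch. This is not the article's implementation: the function names and the sample matrices are assumptions chosen for illustration, and the code only shows that both forms yield the same result, with the second form performing one transposition instead of two.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def add(a, b):
    """Return the element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def sum_of_transposes(a, b):
    """Original form: transpose each operand, then add (two transpositions)."""
    return add(transpose(a), transpose(b))


def transpose_of_sum(a, b):
    """Modified form: add first, then transpose once."""
    return transpose(add(a, b))


# Hypothetical 2x3 sample inputs, not taken from the article.
A = [[1, 2, 3], [4, 5, 6]]
B = [[10, 20, 30], [40, 50, 60]]

result_original = sum_of_transposes(A, B)
result_modified = transpose_of_sum(A, B)
print(result_original == result_modified)
```

Every output element depends on exactly one element of each input, so the element-wise additions are independent of one another. That independence is what makes parallelizing the computation possible.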