Fixing 1-bit Adam and 1-bit LAMB algorithms
Today, many neural network models are trained with distributed learning in order to reduce training time. The most common approach is data-parallel training: the data are split into shards and distributed, together with a copy of the model, to different devices; each device computes an update for the model, the updates are aggregated on a server, and the server updates the model weights and sends the new version back to the devices. Slow network communication between devices can significantly reduce the efficiency of this scheme. Recent studies propose one-bit versions of the Adam and LAMB algorithms, which substantially reduce the amount of transmitted data and thus improve the scalability of training. However, it turns out that these algorithms diverge on some neural network architectures. The goal of this work is to study these algorithms empirically, to find a solution to the discovered divergence problem, and to propose new aspects of testing gradient descent algorithms.
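To illustrate the communication scheme these methods build on, the sketch below simulates data-parallel training in which each worker transmits only the signs of its update (plus one scale value) and keeps a local error-feedback buffer, which is the compression idea underlying 1-bit Adam and 1-bit LAMB. This is a minimal illustration, not the algorithms studied in this work: the quadratic loss, worker count, and learning rate are hypothetical choices made only to keep the example self-contained.

```python
# Minimal sketch (not the paper's implementation): data-parallel updates where
# each worker sends sign(update) plus a single scale, with error feedback.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr = 4, 10, 0.1
w = rng.normal(size=dim)                                     # shared model weights
targets = [rng.normal(size=dim) for _ in range(n_workers)]   # each worker's data shard
errors = [np.zeros(dim) for _ in range(n_workers)]           # per-worker error-feedback buffers

for step in range(100):
    compressed = []
    for k in range(n_workers):
        grad = w - targets[k]              # gradient of 0.5 * ||w - target_k||^2
        update = grad + errors[k]          # add the residual that was not sent last round
        scale = np.abs(update).mean()      # one float per tensor
        sign = np.sign(update)             # one bit per coordinate
        errors[k] = update - scale * sign  # keep what the compression lost
        compressed.append(scale * sign)
    # "Server": average the compressed updates and broadcast the new weights.
    w -= lr * np.mean(compressed, axis=0)

print("distance to average target:", np.linalg.norm(w - np.mean(targets, axis=0)))
```

Despite sending roughly one bit per coordinate instead of a 32-bit float, the error-feedback buffers ensure that information lost to compression is re-injected in later steps, which is why such schemes can preserve convergence in many, but, as this work shows, not all, settings.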