Using an Ensemble of Transforming Autoencoders to Represent 3D Objects

System Analysis and Control

One of the key goals of computer vision-related machine learning is to obtain high-quality representations of visual data that are resistant to changes in viewpoint, area, lighting, object pose or texture. Current state-of-the-art convolutional networks, such as GoogLeNet or AlexNet, can successfully produce invariant representations sufficient to perform complex multiclass classification. Some researchers (Hinton, Krizhevsky, et al.), however, suggest that this approach, while quite suitable for classification tasks, is misguided in terms of what an efficient visual system should be capable of doing: namely, reflecting spatial transformations of learned objects in a predictable way. The key concept of their research is equivariance rather than invariance: the model's ability to change its representation parameters in response to different poses and transformations of a model-specific visual entity. This paper employs Hinton's architecture of transforming autoencoder neural networks to identify low-level spatial feature descriptors. By applying a supervised SVM classifier to these detectors, one can then represent a sufficiently complex object, such as a geometric shape or a human face, as a composition of spatially related features. Using the equivariance property, one can also distinguish between different object poses, e.g., a frontal face image and a profile image, and then train another, higher-level transforming autoencoder using the same architecture. To obtain initial data for first-level feature learning, we use sequences of frames, or movies, and apply computer vision algorithms to detect regions of maximum interest and track their image patches across the movie. We argue that this way of learning features represents a more realistic approach to vision than general naive feature learning from a supervised dataset.
The initial idea came from the concept of one-shot learning (Fei-Fei et al.), which suggests that meaningful features can be obtained from just one image (or, as in this study, from a rather limited set of images supervised by time and order).
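
To make the notion of equivariance used above more concrete, the following is a minimal sketch of the forward pass of a transforming autoencoder built from capsules, in the spirit of the architecture attributed to Hinton et al. It is an illustration under stated assumptions, not the implementation used in this paper: the class name Capsule, the layer sizes, the random weight initialisation and the restriction to 2D translations are all hypothetical choices. Each capsule estimates a presence probability p and a pose (x, y) from the input patch, adds the known shift (dx, dy) to the pose, and generates its contribution to the output patch, scaled by p.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


class Capsule:
    """One capsule of a transforming autoencoder (illustrative sketch only).

    All sizes and the weight initialisation are assumptions made for this
    example; the weights are random and untrained.
    """

    def __init__(self, n_pixels, n_rec, n_gen, rng):
        self.W_rec = rng.normal(0.0, 0.1, (n_pixels, n_rec))  # pixels -> recognition units
        self.w_p = rng.normal(0.0, 0.1, n_rec)                # recognition -> presence logit
        self.W_xy = rng.normal(0.0, 0.1, (n_rec, 2))          # recognition -> pose (x, y)
        self.W_gen = rng.normal(0.0, 0.1, (2, n_gen))         # shifted pose -> generation units
        self.W_out = rng.normal(0.0, 0.1, (n_gen, n_pixels))  # generation units -> pixels

    def forward(self, image, shift):
        """Return (p, pose, contribution to the reconstructed output)."""
        h = sigmoid(image @ self.W_rec)       # recognition units
        p = sigmoid(h @ self.w_p)             # probability that the entity is present
        pose = h @ self.W_xy                  # estimated (x, y) of the entity
        shifted_pose = pose + shift           # apply the known transformation
        g = sigmoid(shifted_pose @ self.W_gen)
        out = p * (g @ self.W_out)            # contribution gated by presence
        return p, pose, out


def transforming_autoencoder(capsules, image, shift):
    """Sum the gated contributions of all capsules into one output patch."""
    return sum(c.forward(image, shift)[2] for c in capsules)
```

For example, with rng = np.random.default_rng(0), three capsules created as Capsule(64, 10, 20, rng) and a flattened 8x8 patch, transforming_autoencoder(capsules, patch, np.array([1.0, -1.0])) returns a 64-element prediction of the shifted patch. Training (not shown) would minimise the difference between this prediction and the actually shifted patch, which is what pushes the pose outputs to vary predictably with the transformation.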