A hybrid non-uniform loop tiling for supercomputers with deep memory hierarchies
Non-uniform nature of multi-level memory architectures of modern supercomputers represents a notably underestimated issue in design of loop transformation algorithms. The imperfect compiler support for architectural features of deep memory hierarchies results in insufficient data locality, which in turn is an obstacle to achieving performance portability for a wide range of important computational kernels like iterated stencils or sparse-matrices. The major contribution of this paper is an algorithmic skeleton for hybrid non-uniform loop tiling. The proposed approach combines locality-enhancing features of polyhedral compilation framework with capability of non-uniformity effects modeling via hierarchical parameterized tiling strategy performed in a canonical syntactic manner. The polyhedral stage focuses on spatial and temporal locality prioritization along with end-to-end optimization pipeline. At the syntactic stage the parameterized loop tiling strategy allows an automatic definition of tiled loop characteristics to map it according to the hierarchical memory architecture. The tiles with various parameters like size, shape and form can be generated through the novel permutational target-specific algorithms. As a result, the variants of acoherent non-uniform loop tiling algorithm were designed on the basis of the proposed approach to evaluate the permutational techniques of tile size and tile shape selection. The early-stage experimental results are presented to show the effects of hybrid non-uniform tiling on data locality to deliver near-optimal performance portability for representative benchmarks across deepening memory hierarchies of multi-machine macronodes.