Christian Böhm
Prof. Dr.
Principal Investigator
* Former Principal Investigator
The LU decomposition is an essential element used in many linear algebra applications. Furthermore, it is used in LINPACK to benchmark the performance of modern multi-core processor environments. These processors offer a large memory hierarchy including multiple registers and various levels of cache. Registers or L1 data cache are small in size but also very fast. The L2 or L3 cache memory is usually shared among other cores and larger but slower. For the LU decomposition, the latency of fetching data from the main memory to the registers to perform a calculation also depends on the input matrix's memory access pattern. Here, we look at the block factorization algorithm, where the LU decomposition performance depends on the performance of the matrix multiplication. In both cases, the LU decomposition and the matrix multiplication, such a matrix is traversed by three nested loops. In this paper, we propose to traverse such loops in an order defined by a space-filling curve. This traversal dramatically improves data locality and offers effective exploitation of the memory hierarchy. Besides the canonical (or line-by-line) access pattern, we demonstrate the traversal in Hilbert-, Peano and Morton order. Our extensive experiments show that the Morton order (or Z -order) and the inverse Morton order (or Z-order) have a better runtime performance compared to the others.
inproceedings
BibTeXKey: PPB20