Efficient Non-negative Least Squares on Multi-core Architectures

We parallelize a version of the active-set iterative algorithm derived from the original works of Lawson and Hanson (1974) on multi-core architectures. This algorithm requires the solution of an unconstrained least squares problem in every step of the iteration for a matrix composed of the passive columns of the original system matrix. To achieve improved performance, we use parallelizable procedures to efficiently update and {\em downdate} the QR factorization of the matrix at each iteration, to account for inserted and removed columns. We use a reordering strategy of the columns in the decomposition to reduce computation and memory access costs. We consider graphics processing units (GPUs) as a new mode for efficient parallel computations and compare our implementations to that of multi-core CPUs. Both synthetic and non-synthetic data are used in the experiments.

SIAM SISC 2011 Paper

CUDA Source-code

Below, we included our second implementation of the updating/downdating NNLS codes in C and accelerated using OpenMP and MKL Blas routines. We also integrated Matlab mex support in the compilation for ease of use and benchmarking.

OpenMP/MKL Source-code w/ Matlab Interface