TY - GEN
T1 - Efficient and Portable ALS Matrix Factorization for Recommender Systems
AU - Chen, J.
AU - Fang, J.
AU - Liu, Weifeng
AU - Tang, T.
AU - Chen, X.
AU - Yang, C.
PY - 2017/6/30
Y1 - 2017/6/30
N2 - Alternating least squares (ALS) has been proven to be an effective solver of matrix factorization for recommender systems. To speed up factorization, various parallel ALS solvers have been proposed to leverage modern multi-core CPUs and many-core GPUs/MICs. Existing implementations are limited in either speed or portability (constrained to certain platforms). In this paper, we present an efficient and portable ALS solver for recommender systems. On the one hand, we diagnose the baseline implementation and observe that it lacks awareness of the hierarchical thread organization on modern hardware. To achieve high performance, we apply the thread batching technique and three architecture-specific optimizations. On the other hand, we implement the ALS solver in OpenCL so that it can run on various platforms (CPUs, GPUs, and MICs). Based on the architectural specifics, we select a suitable code variant for each platform to efficiently map it to the underlying hardware. The experimental results show that our implementation performs 5.5× faster on a 16-core CPU and 21.2× faster on a K20c GPU than the baseline implementation. Our implementation also outperforms cuMF on various datasets.
AB - Alternating least squares (ALS) has been proven to be an effective solver of matrix factorization for recommender systems. To speed up factorization, various parallel ALS solvers have been proposed to leverage modern multi-core CPUs and many-core GPUs/MICs. Existing implementations are limited in either speed or portability (constrained to certain platforms). In this paper, we present an efficient and portable ALS solver for recommender systems. On the one hand, we diagnose the baseline implementation and observe that it lacks awareness of the hierarchical thread organization on modern hardware. To achieve high performance, we apply the thread batching technique and three architecture-specific optimizations. On the other hand, we implement the ALS solver in OpenCL so that it can run on various platforms (CPUs, GPUs, and MICs). Based on the architectural specifics, we select a suitable code variant for each platform to efficiently map it to the underlying hardware. The experimental results show that our implementation performs 5.5× faster on a 16-core CPU and 21.2× faster on a K20c GPU than the baseline implementation. Our implementation also outperforms cuMF on various datasets.
U2 - 10.1109/IPDPSW.2017.91
DO - 10.1109/IPDPSW.2017.91
M3 - Article in proceedings
SN - 9780769561493
T3 - IEEE International Symposium on Parallel and Distributed Processing Workshops
SP - 409
EP - 418
BT - 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
CY - Lake Buena Vista, FL, USA
ER -