Incorporating MAGMA into the 'fields' spatial statistics package / by John Paige, Isaac Lyngaas, Vinay Ramakrishnaiah, Dorit Hammerling, Raghu Kumar, and Douglas Nychka

Series: NCAR Technical Notes
Publication details: Boulder, CO : National Center for Atmospheric Research (NCAR), 2015
Content type:
  • text
Media type:
  • unmediated
Carrier type:
  • volume
ISSN:
  • 2153-2397
  • 2153-2400
Abstract: In this report we describe how to incorporate the Cholesky decomposition from the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library into some of the calculations of the 'fields' spatial statistics package in R. We provide MAGMA installation instructions as well as demonstrations of performance on simulated datasets and on the CO2 dataset available in fields. While other spatial statistics packages in R use parallelism, such as bigGP and parspatstat, none to our knowledge directly incorporates GPUs or other coprocessors. Our code is timed on Caldera computational nodes in the National Center for Atmospheric Research's Yellowstone supercomputing environment. We find that for 40,000 x 40,000 matrices the MAGMA-accelerated decomposition achieves speedups of 30.7 and 46.2 times for the 1 and 2 GPU implementations respectively over chol, the standard Cholesky decomposition function in R (with settings that allow R programmers to use our accelerated function as they would chol). The speedups are greater for in-place calculations, where the original matrix is overwritten rather than copied: in that case, the equivalent speedups are 41.8 and 54.4 times on 1 and 2 GPUs respectively. We also time a simple spatial analysis workflow with maximum likelihood estimation at problem sizes of up to more than 23,000 observations, where the accelerated workflows achieved speedups of approximately 4.2 and 4.3 times with 1 and 2 GPUs respectively over a corresponding unaccelerated workflow. As problem size increases, speedups improve, and the 2 GPU decompositions perform increasingly well compared to their corresponding 1 GPU implementations. In some cases the 2 GPU decompositions are slower than the 1 GPU decompositions because of additional communication overheads and data dependencies in the Cholesky decomposition algorithm; this behavior will be explored further in Ramakrishnaiah et al. (2015).
Holdings
Item type: REPORT
Current library: NCAR Library Mesa Lab
Call number: 03721
Copy number: 1
Status: Available
Barcode: 50583020003889
Total holds: 0

2015-08

Technical Report

