

The biggest questions when it comes to GPU use are: is the model taking a lot of time because the gradient evaluation is slow, or because we are doing a ton of gradient evaluations (the tree depth numbers are big)? If you are doing a ton of iterations with gradient evaluations in the range of a few milliseconds, the GPU will not be of much use. The work in the backend was finished a while ago, but not everything has made it upstream to the Stan level just yet, as that requires a bit more careful consideration and testing.

This is a bit annoying, I understand, but that is how we were able to do it most reliably at this time. I should also note that as of now, the Stan model will use OpenCL only for lpdf/lpmf calls that are not inside user-defined functions (so in the transformed parameters and model blocks). Most of the current gains come from using a GLM lpdf/lpmf function (which obviously is not applicable everywhere).
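As a concrete illustration (a minimal sketch, not a model from this thread), this is the kind of setup where the OpenCL backend can currently help: the GLM lpdf is called directly in the model block rather than inside a user-defined function. The data and parameter names are placeholders.

```stan
data {
  int<lower=1> N;          // observations
  int<lower=1> K;          // predictors
  matrix[N, K] x;
  vector[N] y;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  // normal_id_glm fuses the linear predictor and the normal lpdf;
  // these fused GLM lpdfs are where most of the current GPU gains come from.
  y ~ normal_id_glm(x, alpha, beta, sigma);
}
```

With cmdstanr, for instance, OpenCL is usually switched on at compile time (something like `cpp_options = list(stan_opencl = TRUE)` plus an `opencl_ids` argument when sampling); check your interface's documentation for the exact options in your version.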

Yeah, I would not expect much speedup for 500-1000 observations, but that varies with where most of the time is spent. At the moment (version 2.29), you should expect to see GPU speedup mainly if most of the gradient evaluation time is spent in lpdf functions.

I'm just wondering about others' experiences with Stan on the GPU. I run most of my models threaded via reduce_sum and find truly substantial slowdowns with more complicated models when I enable OpenCL on model trials with smaller data - 500-1000 observation subsets used for model development. It's mainly hierarchical normal_lpdf stuff, with the odd latent GP. I can see the GPU working at around 50-60%, but the model runs maybe 5x slower when I enable OpenCL, with or without threading, and whether I leave the reduce_sum in place or flatten it. Having seen the write-ups of GPU performance, I'm thinking it's down to my setup - an old 32GB AMD FirePro W9100, rated at ~5200 GFLOPS for single precision, though it has an uncapped ~2600 GFLOPS for double precision and still gets solid double-precision performance in benchmarks. Does Stan really use double precision? Theoretically this card should handily beat all the (double-precision-capped) consumer NVIDIA cards in FP64, unless they do some software emulation using single precision that I don't know about. Perhaps I'm expecting too much? Perhaps I'll only see speedups with 50k-100k datasets? Just thought I'd ask for others' experiences with GPU speedups.
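For reference, a minimal sketch of the reduce_sum threading pattern described above (the `partial_sum` function and the toy non-hierarchical model are illustrative, not taken from the post). It also shows why reduce_sum and OpenCL don't combine well at the moment: the normal_lpdf call ends up inside a user-defined function, which the OpenCL backend currently skips.

```stan
functions {
  // Partial log density over a slice of y; start/end index the slice
  // within the full data vector.
  real partial_sum(array[] real y_slice, int start, int end,
                   vector mu, real sigma) {
    return normal_lpdf(y_slice | mu[start:end], sigma);
  }
}
data {
  int<lower=1> N;
  array[N] real y;
}
parameters {
  vector[N] mu;
  real<lower=0> sigma;
}
model {
  int grainsize = 1;  // let the scheduler pick slice sizes
  mu ~ normal(0, 1);
  sigma ~ exponential(1);
  // reduce_sum splits y into slices and evaluates partial_sum in parallel
  // across threads (requires compiling with threading enabled).
  target += reduce_sum(partial_sum, y, grainsize, mu, sigma);
}
```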
I sketched out plans for the project in a new GitHub repository: metal-float64. I don't think 35-bit floating-point arithmetic is worth my time, and fully IEEE-compliant arithmetic is not as slow as I thought. The initial back-of-the-envelope calculation assumed all 8 vector registers of an Intel AVX CPU were utilized (i7, 256-bit vectors). It computed total FLOPS on the Intel CPU as 32 GFLOPS, then used some ratios to arrive at 24 GFLOPS for the Apple GPU. Looking at SoftFloat's code, none of it can be vectorized this way. It's also conflated with function calls, so I can't get a good measure. I'd give it a 2x speedup from out-of-order execution, but not 8x. This means the initial estimate of performance is 96-192 GFLOPS on the M1 Max, not 24 GFLOPS 😄. That's around the 1/64 FP64:FP32 throughput ratio of recent NVIDIA GPUs - very impressive for pure emulation.
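My reading of the arithmetic, written out; the ~10.4 TFLOPS FP32 peak used for the 32-core M1 Max GPU is an assumed reference figure, not a number from the post:

```latex
% Rough reconstruction of the back-of-the-envelope estimate.
\begin{align*}
\text{original estimate:}\quad & 24\ \text{GFLOPS}
  && \text{(assumes SoftFloat vectorizes 8-wide on the CPU baseline)} \\
\text{revised estimate:}\quad & 24 \times 4 \ \text{to}\ 24 \times 8
  = 96\text{--}192\ \text{GFLOPS}
  && \text{(scalar SoftFloat; at most $2\times$ from OoOE, not $8\times$)} \\
\text{relative to FP32 peak:}\quad & \frac{96\text{--}192\ \text{GFLOPS}}{\approx 10{,}400\ \text{GFLOPS}}
  \approx \frac{1}{108}\ \text{to}\ \frac{1}{54}
  && \text{(close to the $\tfrac{1}{64}$ FP64:FP32 ratio of consumer NVIDIA GPUs)}
\end{align*}
```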
