
Comment by sorenjan

15 days ago

> I think Cuda's GPU offloading model is very boring.

> That is, I think the more general architecture amdgpu is shipping is better than the specialised one cuda implements, despite the developer experience being rather gnarlier.

This reminded me of that Twitter thread that was linked on HN yesterday, specifically the part about AMD's "true" dual core compared to Intel's "fake" dual core.

> We did launch a “true” dual core, but nobody cared. By then Intel’s “fake” dual core already had AR/PR love. We then started working on a “true” quad core, but AGAIN, Intel just slapped 2 dual cores together & called it a quad-core. How did we miss that playbook?! AMD always launched w/ better CPUs but always late to mkt. Customers didn’t grok what is fake vs real dual/quad core. If you do cat /proc/cpu and see cpu{0-3} you were happy.

https://news.ycombinator.com/item?id=40696384

What is currently the best available way to write GPGPU code such that you can ship a single install.exe to end users, containing compiled code that runs on their consumer-class AMD, Nvidia, and Intel graphics cards? Would AdaptiveCpp work?

Shipping compiled code works fine if you have a finite set of kernels. Just build everything for every target, gzip the result and send it out. People are a bit reluctant to do that because there are lots of copies of essentially the same information in the result.
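To make that concrete, here is a minimal sketch of the "build everything for every target" shape, assuming an NVIDIA-only table for brevity (the architecture list, blob table, and function names are illustrative, not anything the comment specifies; amdgpu has the analogous per-gfx-arch build):

```cpp
// Offline, the build produces one code object per target, e.g. (real nvcc
// flags, illustrative architecture list):
//   nvcc --cubin -arch=sm_70 kernel.cu -o kernel_sm70.cubin
//   nvcc --cubin -arch=sm_86 kernel.cu -o kernel_sm86.cubin
// The installer bundles all of them and picks one at run time.
#include <cuda_runtime.h>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical table of precompiled code objects, keyed by architecture name.
static const std::map<std::string, std::vector<char>> g_kernel_blobs = {
    /* {"sm_70", {...}}, {"sm_86", {...}}, ... filled in by the build */
};

const std::vector<char>* pick_blob_for_local_gpu() {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return nullptr;
    std::string arch = "sm_" + std::to_string(prop.major) + std::to_string(prop.minor);
    auto it = g_kernel_blobs.find(arch);
    if (it == g_kernel_blobs.end()) {
        // A GPU newer than the build lands here -- the gap an IR like PTX/SPIR-V fills.
        std::fprintf(stderr, "no prebuilt kernel for %s\n", arch.c_str());
        return nullptr;
    }
    return &it->second;
}
```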

I suspect every solution you'll find that involves sending a single copy of the code will have a patched copy of llvm embedded in said install.exe, which ideally compiles the kernels to whatever is around locally at install time, but otherwise does so at application run time. It's not loads of fun deriving a program from llvm, but it has been done a lot of times now.
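For the compile-on-the-user's-machine variant, a minimal sketch using NVRTC rather than an embedded patched llvm (the kernel source string and compile options are placeholders; hiprtc gives the same shape on the AMD side):

```cpp
// Compile a kernel from source on the end user's machine with NVRTC, so it
// targets whatever GPU generation is actually installed.
#include <nvrtc.h>
#include <cstdio>
#include <string>

// Placeholder kernel source, shipped as text inside the install.exe.
static const char* kKernelSrc = R"(
extern "C" __global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
})";

std::string jit_to_ptx_for_local_gpu() {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kKernelSrc, "scale.cu", 0, nullptr, nullptr);

    const char* opts[] = {"--std=c++14"};  // illustrative options
    if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
        size_t logSize = 0;
        nvrtcGetProgramLogSize(prog, &logSize);
        std::string log(logSize, '\0');
        nvrtcGetProgramLog(prog, &log[0]);
        std::fprintf(stderr, "runtime compile failed:\n%s\n", log.c_str());
        return {};
    }

    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::string ptx(ptxSize, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);
    // Hand `ptx` to the driver (cuModuleLoadData) to finish compilation for
    // the GPU that is present right now.
    return ptx;
}
```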

  • > Shipping compiled code works fine if you have a finite set of kernels. Just build everything for every target, gzip the result and send it out. People are a bit reluctant to do that because there are lots of copies of essentially the same information in the result.

    That's kind of the point: you have to build everything for a lot of different targets. And what happens a year from now when the client has bought the latest GPU and wants to run the same program on that? Not having an intermediary compile target like PTX is a big downside (sketched below), although I guess it didn't matter for Frontier.

    I can't find a good solution. AdaptiveCpp seems like the best option, but they say Windows support is highly experimental because they depend on a patched llvm, and they only mention OpenMP and Cuda backends anyway. Seems like Cuda is still the best option on Windows.
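    A rough sketch of what that intermediary target buys you on the Cuda side, assuming the CUDA driver API and placeholder file/entry names: one architecture-neutral PTX file is built offline, and the driver installed on the user's machine finishes compilation for whatever GPU is present, including hardware newer than the build.

    ```cpp
    // Ship one PTX file built offline (e.g. `nvcc -ptx kernel.cu -o kernel.ptx`)
    // and let the locally installed driver JIT it for the GPU that is present.
    #include <cuda.h>
    #include <fstream>
    #include <sstream>
    #include <string>

    CUfunction load_kernel_from_ptx(const char* ptx_path, const char* entry) {
        std::ifstream f(ptx_path);              // e.g. "kernel.ptx" (placeholder)
        std::stringstream ss;
        ss << f.rdbuf();
        std::string ptx = ss.str();

        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);

        CUmodule mod;
        // The driver compiles the PTX for the local architecture here; a newer
        // driver can target GPUs that did not exist when the PTX was built.
        cuModuleLoadData(&mod, ptx.c_str());

        CUfunction fn;
        cuModuleGetFunction(&fn, mod, entry);   // e.g. "scale" (placeholder)
        return fn;
    }
    ```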

    • There's a degree of moving the goalposts there.

      Shipping some machine code today to run on a GPU released tomorrow doesn't work anywhere. Cuda looks like it does, provided someone upgrades the cuda installation on the machine after the new GPU is released, because the ptx is handled by the cuda runtime. HSAIL was meant to do that on amdgpu, but people didn't like it.

      That same trick would work on amdgpu: compile to spir-v, wait for a new GPU, upgrade the compiler on the local machine, and now you can run that spir-v. The key part is installing a new JIT which knows what the new hardware is, even if you're not willing to update the program itself (the general shape is sketched below). Except that compiling to spir-v is slightly off in the weeds for compute kernels at present.

      It's tempting to view that as a non-issue: if someone can upgrade the cuda install on the machine, they could upgrade whatever program was running on cuda as well. In practice this seems to annoy people, though, which is why there's a gradual move toward spir-v, or toward shipping llvm IR and rolling the dice on the auto-upgrade machinery handling it. Alternatively, ship source code and compile it on site; that'll work for people who are willing to apt install a new clang even if they aren't willing to update your program.
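      To make the spir-v version of that trick concrete, a sketch using OpenCL's clCreateProgramWithIL, one consumer of SPIR-V today (this is not the amdgpu path described above, just the same ship-IR-and-JIT-locally shape; the entry point name is a placeholder): the SPIR-V is built once offline, and the locally installed driver turns it into machine code at run time, so a driver/JIT update is what adds support for new hardware.

      ```cpp
      // Ship one SPIR-V binary built offline (e.g. with clspv or a SYCL
      // toolchain) and let the local OpenCL driver finish compilation.
      #include <CL/cl.h>

      // Requires an OpenCL 2.1+ driver (or the cl_khr_il_program extension).
      cl_kernel load_spirv_kernel(const void* spirv, size_t len, const char* entry) {
          cl_platform_id plat;
          cl_device_id dev;
          clGetPlatformIDs(1, &plat, nullptr);
          clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);

          cl_int err;
          cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);

          // The driver JITs the IL for whatever GPU is actually installed; a
          // newer driver extends that to hardware released after the build.
          cl_program prog = clCreateProgramWithIL(ctx, spirv, len, &err);
          clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);

          return clCreateKernel(prog, entry, &err);  // entry e.g. "scale" (placeholder)
      }
      ```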
