
Comment by pyaamb

12 days ago

I'd love to know if any domain experts have a write-up on the talent + time + financial investment it would take for AMD to come up with something that is a worthy rival to CUDA. Very curious to understand what the obstacles are.

~5 years. Medium-sized team in-house, plus hordes (hundreds, even thousands) of engineers in the field helping clients on-site, writing code for them that gets upstreamed directly into drivers, core libs, etc. (iteratively optimized in-house, ship the feature, rinse and repeat). It's the story of the PlayStation SDKs, of DX too, but above all of CUDA (they really outdid this strategy), and now of cuDNN and so much more.

It takes incompressible time because you have to explore the whole space and cover most bases, and it takes an industry several years (about one "gen" / hardware cycle) to do that meaningfully. It helps when your platform is disruptive and customers move fast.

Maybe 3 years at best if you start on a new ideal platform designed for it from scratch, and can throw ungodly amounts of money at it fast (think 5K low-level engineers roaming your installed base).

Maybe 10+ yrs (or never) if you're alone, poor, and Radeon (j/k but to mean it's non-trivial).

I’d say it mainly needs persistence and good execution (library support). NVIDIA has co-developed CUDA with their hardware, and largely stayed compatible with it, since around 2009, and around 2012 it first started taking off in the HPC space. Years later this enabled first their boom in crypto and then an even bigger one in AI. I don’t think this amount of R&D would be out of reach of today’s AMD (as NVIDIA wasn’t any bigger back then), but the backing of it needs to come from the very top.

First, they need to work with kernel devs to finally fix their drivers. Like, Nvidia used to be a "pain in the ass" here as well (that's a literal quote from Torvalds), so simply by contributing more than nothing, they could have taken the lead. But they definitely screwed this one up.

Second, they need to fix their userspace stack. ROCm being open source and all is great in principle, but simply dropping your source to the masses doesn't make it magically work. They need to stop letting it linger by either working with the open source community (huge time investment) or doing it themselves (huge money investment).

  • The code is all on GitHub, the ISA docs are public, the driver is in upstream Linux with the work in progress on Gitlab. You can build whatever you want on AMD's hardware with total disregard for their software if you're so inclined. One or two companies seem to be doing so.

    This has been true since roughly the OpenCL days, when the community could have chosen open standards over subservience to team green. Then again for the HSA movement, a really solid heterogeneous programming model initially supported by a bunch of companies. Also broadly ignored.

    Today the runtime code is shipping in Linux distributions. There's a decent chance your laptop has an AMD CPU in it; that will have a built-in GPU that can run ROCm with the kernel you're already using and packages your distribution ships (a quick check is sketched below).

    I'm not sure what more AMD could be doing here. What more do you want them to do?
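
    For what it's worth, a minimal (untested) HIP sketch of that claim: on a distro that ships the ROCm runtime packages, this should build with the packaged hipcc and list whatever GPUs the runtime can see. The gcnArchName string is the ISA target (gfx...) that the ROCm support matrix is keyed on.

    ```cpp
    // Minimal device-query sketch: enumerate HIP-visible GPUs.
    // Build with the distro's hipcc, e.g. `hipcc query.cpp -o query`.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
            std::printf("No HIP-capable device visible to the runtime.\n");
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            hipDeviceProp_t prop{};
            hipGetDeviceProperties(&prop, i);
            // gcnArchName is the ISA target (e.g. gfx90a, gfx1100).
            std::printf("Device %d: %s (%s)\n", i, prop.name, prop.gcnArchName);
        }
        return 0;
    }
    ```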

    • > the community could have chosen open standards over subservience to team green

      I think most people would rather have proprietary software that works than open source that doesn't.

    • >The code is all on GitHub, the ISA docs are public, the driver is in upstream Linux with the work in progress on Gitlab

      That's exactly what I meant by dumping the source and hoping that someone else turns it into plug-and-play magic - for free. This simply doesn't work.


I wonder if they really need a CUDA rival.

This AI stuff has progressed a bit. Intel has been working on interesting stuff with oneAPI. It might be that things have progressed to the point where the primitives are well enough understood that you need something more like a good library than a good compiler.

In the end, more people seem to love BLAS than Fortran, after all.
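
To make the "library, not compiler" point concrete, here's a rough, untested sketch on the AMD side, assuming the hipBLAS API (ROCm's cuBLAS lookalike; the header path has moved between ROCm releases). The caller never writes or compiles a GPU kernel; the whole GPU interaction is one tuned GEMM routine behind a BLAS-shaped interface.

```cpp
// Sketch only: one SGEMM through hipBLAS (C = alpha*A*B + beta*C).
// No user-written kernels, no custom compiler work; just a library call.
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>   // may be <hipblas.h> on older ROCm
#include <vector>

int main() {
    const int n = 512;
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    float *dA, *dB, *dC;
    hipMalloc((void**)&dA, n * n * sizeof(float));
    hipMalloc((void**)&dB, n * n * sizeof(float));
    hipMalloc((void**)&dC, n * n * sizeof(float));
    hipMemcpy(dA, hA.data(), n * n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dB, hB.data(), n * n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dC, hC.data(), n * n * sizeof(float), hipMemcpyHostToDevice);

    hipblasHandle_t handle;
    hipblasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major GEMM; same calling convention as cublasSgemm.
    hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N,
                 n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    hipMemcpy(hC.data(), dC, n * n * sizeof(float), hipMemcpyDeviceToHost);
    hipblasDestroy(handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
```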

  • That library (Triton) sits on top of the compiler and drivers (ROCm). If the driver kernel panics, no high-level library can fix that.

    • I don’t have direct experience, so I could be wrong. But I believe a lot of the nice stuff that CUDA brings along is profiling- and performance-related, i.e. most useful if you are writing the code yourself. Plus, if the ecosystem is not quite as stable but it’s mostly AMD’s own engineers writing the library who have to deal with it, they have more latitude to simply avoid the buggy or badly-performing code paths.

I don't want a CUDA rival. I want to get the entire pile of CUDA code that is already written and run it on AMD GPUs without any kind of tweak or rewrite, and have it just work every time

Compatibility with existing code is very important. People can't afford to rewrite their stuff just to support AMD, and thus they don't

AMD is kind of trying to do this with ROCm and HIP, but whatever they're doing, it's not enough.
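
For reference, here's roughly what the HIP path looks like in practice (a sketch, not tied to any particular ROCm release): the kernel source is identical to the CUDA version, and the host side differs only in the cuda*-to-hip* prefixes, which is the mechanical renaming the hipify tools automate. The catch the parent comment is pointing at: you still need to recompile (or translate) the source; existing CUDA binaries and closed libraries don't just run.

```cpp
// SAXPY via HIP. The kernel body is line-for-line the CUDA kernel;
// only the host API prefixes change (cudaMalloc -> hipMalloc, etc.).
#include <hip/hip_runtime.h>   // CUDA version: <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // identical to CUDA
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    hipMalloc((void**)&x, n * sizeof(float));   // CUDA: cudaMalloc
    hipMalloc((void**)&y, n * sizeof(float));
    // (data upload via hipMemcpy elided; CUDA: cudaMemcpy)

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // same launch syntax
    hipDeviceSynchronize();                          // CUDA: cudaDeviceSynchronize

    hipFree(x);                                      // CUDA: cudaFree
    hipFree(y);
    std::printf("done\n");
    return 0;
}
```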

My theory is that someone came up with the bright idea that opening up more of the stack would let them get it all done via crowdsourcing, on the cheap. If true, it was quite a naive view of how that works.

If instead they had said "let's take the money we would have invested in internal development and use it to build an open developer community that leverages our hardware into a world-class software stack," it might have gone a little better.

  • AMD has just never had good developer software. For ages the best BLAS on AMD was… Intel MKL, as long as you figured out how to dispatch the right kernels.

    Actually, it could be really cool if everybody acted like AMD. The fact that Intel and Nvidia put out the best numerical libraries for free means you can't sell a number-crunching library!

I spotted this recent post https://www.reddit.com/r/LocalLLaMA/comments/1deqahr/comment... that was pretty interesting:

> When I was working on TVM at Qualcomm to port it to Hexagon a few years ago we had 12 developers working on it and it was still a multiyear long and difficult process.

> This is also ignoring the other 20 or so developers we had working on Hexagon for LLVM, which did all of the actual hardware enablement; we just had to generate good LLVM IR. You have conveniently left out all of the LLVM support that this all requires as AMD also uses LLVM to support their GPU architectures.

> Funny enough, about a half dozen of my ex coworkers left Qualcomm to go do ML compilers at AMD and they're all really good at it; way better than I am, and they haven't magically fixed every issue

> It's more like "hire 100 additional developers to work on the ROCM stack for a few years"

This last statement sounds about right. Note that ROCm has over 250 repos on GitHub, a lot of them pretty active: https://github.com/orgs/ROCm/repositories?type=all - I'm sure an enterprising analyst who was really interested could look at the projects active over the past year and count unique committers. I'd guess it's in the hundreds already.

I think if you click through the ROCm docs https://rocm.docs.amd.com/en/latest/ (and maybe compare to the CUDA docs https://docs.nvidia.com/cuda/ ) you might get a good idea of the differences. ROCm has made huge strides over the past year, but to me, the biggest fundamental problem is still that CUDA basically runs OOTB on every GPU that Nvidia makes (with impressive backwards and in some cases even forwards compatibility to boot https://docs.nvidia.com/deploy/cuda-compatibility/ ) on both Linux and Windows, and... ROCm simply doesn't.

I think AMD's NPUs complicate things a bit as well. It looks like they currently run on their own ONNX/Vitis (Xilinx) stack https://github.com/amd/RyzenAI-SW , and really that should either get folded into ROCm, or a new SYCL/oneAPI-ish layer needs to be adopted to cover everything.

> is a worthy rival to CUDA

Vulkan Compute already exists

But developers still continue buying NVIDIA for CUDA, and developers only target CUDA for their applications, so it's a chicken-and-egg scenario, similar to Linux vs Windows.