
Comment by cepth

12 days ago

A couple of thoughts here.

* AMD's traditional target market for its GPUs has been HPC as opposed to deep learning/"AI" customers.

For example, look at the supercomputers at the national labs. AMD has won quite a few high profile bids with the national labs in recent years:

- Frontier (deployment began in 2021) (https://en.wikipedia.org/wiki/Frontier_(supercomputer)) - used at Oak Ridge for modeling nuclear reactors, materials science, biology, etc.

- El Capitan (2023) (https://en.wikipedia.org/wiki/El_Capitan_(supercomputer)) - Lawrence Livermore National Laboratory

AMD GPUs are pretty well represented on the TOP500 list (https://top500.org/lists/top500/list/2024/06/), which tends to feature computers used by major national-level labs for scientific research. AMD CPUs are even better represented.

* HPC workloads lean heavily on FP64 computation, since rounding errors in that kind of use-case are a much bigger deal than in DL (see for example https://hal.science/hal-02486753/document). NVIDIA innovations like TensorFloat, mixed precision, and custom silicon (e.g., the "transformer engine") are of limited interest to HPC customers. It's no surprise that AMD didn't pursue similar R&D, given who they were selling GPUs to.
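To make the FP64 point concrete, here's a minimal sketch (pure stdlib; the iteration count is arbitrary) showing how the same naive accumulation drifts far more in FP32 than in FP64:

```python
# Why HPC codes insist on FP64: identical naive accumulation, two
# precisions. FP32 is emulated by round-tripping through a 4-byte float.
import struct

def as_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

n = 100_000
acc32, acc64 = 0.0, 0.0
for _ in range(n):
    acc32 = as_f32(acc32 + as_f32(0.1))  # every step rounded to FP32
    acc64 += 0.1                         # native FP64 accumulation

err32 = abs(acc32 - 0.1 * n)
err64 = abs(acc64 - 0.1 * n)
print(err64 < err32)  # True: FP64 error is orders of magnitude smaller
```

A DL training run shrugs off this kind of drift; a long-running climate or reactor simulation cannot, which is why the HPC market never asked for tensor-core-style reduced precision.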

* People tend to forget that less than a decade ago, AMD was a few quarters of cash away from bankruptcy. When Lisa Su took over as CEO in 2014, AMD's market share across all CPUs was 23.4% (even lower in the more lucrative datacenter market). This would bottom out at 17.8% in 2016 (https://www.trefis.com/data/companies/AMD,.INTC/no-login-req...).

AMD's "Zen moment" didn't arrive until March 2017. And it wasn't until Zen 2 (July 2019), that major datacenter customers began to adopt AMD CPUs again.

* In interviews with key AMD figures like Mark Papermaster and Forrest Norrod, they've mentioned how in the years leading up to the Zen release, all other R&D was slashed to the bone. You can see (https://www.statista.com/statistics/267873/amds-expenditure-...) that AMD R&D spending didn't surpass its previous peak (on a nominal dollar, not even inflation-adjusted, basis) until 2020.

There was barely enough money to fund the CPUs that would stop the company from going bankrupt, much less fund GPU hardware and software development.

* By the time AMD could afford to spend on GPU development, CUDA was the entrenched leader. CUDA was first released in 2007; ROCm not until 2016. AMD is playing from behind, and has had to make various concessions. The ROCm API is designed around CUDA API verbs/nouns. AMD funded ZLUDA, intended to be a "translation layer" so that CUDA programs can run as a drop-in on ROCm.
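The verb/noun mirroring is close enough that much of a port is mechanical renaming, which is what AMD's hipify scripts exploit. A toy sketch of the idea (the mapping below is a tiny illustrative subset, not the real tool):

```python
# Toy hipify: HIP deliberately mirrors CUDA's API names, so a first-pass
# port is largely a rename. (Real hipify handles headers, kernel launch
# syntax, and far more entry points.)
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
}

def hipify(source: str) -> str:
    # Replace longest names first so cudaMemcpyHostToDevice isn't
    # clobbered by the shorter cudaMemcpy substitution.
    for cuda_name in sorted(CUDA_TO_HIP, key=len, reverse=True):
        source = source.replace(cuda_name, CUDA_TO_HIP[cuda_name])
    return source

print(hipify("cudaMalloc(&d, n); cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);"))
# → hipMalloc(&d, n); hipMemcpy(d, h, n, hipMemcpyHostToDevice);
```

Following CUDA's vocabulary lowers the porting barrier, but it also cements CUDA as the reference semantics AMD has to chase.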

* There's a chicken-and-egg problem here.

1) There's only one major cloud (Azure) that has ready access to AMD's datacenter-grade GPUs (the Instinct series).

2) I suspect a substantial portion of their datacenter revenue still comes from traditional HPC customers, who have no need for the ROCm stack.

3) The lack of a ROCm developer ecosystem means that development and bug fixes come much slower than they would for CUDA. For example, the mainline TensorFlow release was broken on ROCm for a while (you had to install the nightly release).

4) But, things are improving (slowly). ROCm 6 works substantially better than ROCm 5 did for me. PyTorch and TensorFlow benchmark suites will run.

Trust me, I share the frustration around the semi-broken state that ROCm is in for deep learning applications. As an owner of various NVIDIA GPUs (from consumer laptop/desktop cards to datacenter accelerators), in 90% of cases things just work on CUDA.

On ROCm, as of today it definitely doesn't "just work". I put together a guide for Framework laptop owners to get ROCm working on the AMD GPU that ships as an optional add-in (https://community.frame.work/t/installing-rocm-hiplib-on-ubu...). This took a lot of head-banging, and parsing of obscure blogs and GitHub issues.

TL;DR, if you consider where AMD GPUs were just a few years ago, things are much better now. But, it still takes too much effort for the average developer to get started on ROCm today.

Summary: AMD works if you spend US$500M+ with them. Then they'll throw an army of their own software engineers into the contract who will hold your hand every step of the way, and remove all the jank for you. By contrast, for at least the last 10 years, I could buy any GTX card and CUDA worked out of the box, and that applied right down to a $99 Jetson Nano.

AMD's strategy looks a lot like IBM's mainframe strategy of the 80s. And that didn't go well.

  • No, not really?

    The customers at the national labs are not going to be sharing custom HPC code with AMD engineers, if for no other reason than security clearances. Nuclear stockpile modeling code, or materials science simulations are not being shared with some SWE at AMD. AMD is not “removing jank” for these customers. It’s that these customers don’t need a modern DL stack.

    Let’s not pretend like CUDA works/has always worked out of the box. There’s forced obsolescence (“CUDA compute capability”). CUDA didn’t even have backwards compatibility for minor releases (x.1, x.2, etc.) until version 11.0. The distinction between CUDA, the CUDA toolkit, cuDNN, and the actual driver is still inscrutable to many new devs (see the common questions asked on r/localLlama and r/StableDiffusion).

    Directionally, AMD is trending away from your mainframe analogy.

    The first consumer cards got official ROCm support in 5.0. And you have been able to run real DL workloads on budget laptop cards since 5.4 (I’ve done so personally). Developer support is improving (arguably too slowly), but it’s improving. Hugging Face, Cohere, MLIR, Lamini, PyTorch, TensorFlow, DataBricks, etc all now have first party support for ROCm.

    • > customers at the national labs are not going to be sharing custom HPC code with AMD engineers

      There are several co-design projects in which AMD engineers are interacting on a weekly basis with developers of these lab-developed codes as well as those developing successors to the current production codes. I was part of one of those projects for 6 years, and it was very fruitful.

      > I suspect a substantial portion of their datacenter revenue still comes from traditional HPC customers, who have no need for the ROCm stack.

      HIP/ROCm is the prevailing interface for programming AMD GPUs, analogous to CUDA for NVIDIA GPUs. Some projects access it through higher-level libraries (e.g., Kokkos and Raja are popular at labs). OpenMP target offload is less widespread, and there are some research-grade approaches, but the vast majority of DOE software for Frontier and El Capitan relies on the ROCm stack. Yes, we have groaned at some choices, but it has been improving, and I would say the experience on MI250X machines (Frontier, Crusher, Tioga) is now similar to large A100 machines (Perlmutter, Polaris). Intel (Aurora) remains a rougher experience.

    • > The customers at the national labs are not going to be sharing custom HPC code with AMD engineers, if for no other reason than security clearances. Nuclear stockpile modeling code, or materials science simulations are not being shared with some SWE at AMD. AMD is not “removing jank”, for these customers.

      I work closely with OLCF and Frontier (I have a job running on Frontier right now). This is incorrect. The overwhelming majority of compute and resource allocation are not "nuclear stockpile modeling code" projects or anything close to it. AMD often gets directly involved with various issues (OLCF staff has plenty of stories about this). I know because I've spoken with them and AMD.

      Speaking of Frontier, you get fun things like compiling an AWS project just to get RCCL to kind of work decently with Slingshot interconnect via libfabric[0] vs NCCL that "just works", largely due to Nvidia's foresight with their acquisition of Mellanox over five years ago.

      > Let’s not pretend like CUDA works/has always worked out of the box.

      It is and has been miles beyond the competition, and that's clearly all you need: Nvidia has > 90% market share and is worth ~10x AMD. 17 years of focus and investment (30% of their R&D spend is software) while your competitors wander all over the place in fits and starts will do that. I'm also of the personal opinion that AMD just doesn't have software in its DNA and doesn't seem to understand that people don't want GPUs; they want solutions that happen to work best on GPUs, and that entails broad and significant investment in the accompanying software stacks.

      AMD has truly excellent hardware that is significantly limited by their lack of investment in software.

      > There’s forced obsolescence (“CUDA compute capability”).

      Compute capability is why code targeting a given lineage of hardware just works. You can target 8.0 (for example) and, as long as your hardware is 8.0, it will run on anything with Nvidia stamped on it from laptop to Jetson to datacenter, and the higher-level software doesn't know the difference (less VRAM, which is what it is). Throw in "+PTX" when building and it will run on anything newer, too (albeit without taking full advantage of the new hardware). All of this is officially supported, with no need for the environment-variable and compiler hacks that end up producing code that often randomly crashes (I know from personal experience). It is extremely common for projects to target SM 7.x, 8.x and 9.x. The stack just figures it out from there.

      It's the PTX intermediate representation, shipped in the binary and JIT-compiled by the driver, that makes this possible. In AMD land, by contrast, you have some pretty drastic differences within the CDNA and RDNA families, not to mention between CDNA and RDNA in the first place.
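      The dispatch logic described above can be sketched as a toy model (an assumption-laden illustration, not NVIDIA's actual loader code):

```python
# Toy model of fat-binary dispatch: prefer native SASS matching the
# device's compute capability; otherwise JIT the newest embedded PTX
# the device can run. Names and structure are illustrative only.
def select_binary(device_cc: tuple, fatbin: dict) -> str:
    if device_cc in fatbin.get("sass", []):
        return f"sass_{device_cc[0]}{device_cc[1]}"   # run native code
    usable_ptx = [cc for cc in fatbin.get("ptx", []) if cc <= device_cc]
    if usable_ptx:
        cc = max(usable_ptx)
        return f"jit_ptx_{cc[0]}{cc[1]}"              # driver JITs the PTX
    raise RuntimeError("no compatible binary for this GPU")

# A build targeting SM 7.0/8.0/9.0 with 9.0 PTX (the "+PTX" case):
fatbin = {"sass": [(7, 0), (8, 0), (9, 0)], "ptx": [(9, 0)]}
print(select_binary((8, 0), fatbin))   # → sass_80 (exact match)
print(select_binary((10, 0), fatbin))  # → jit_ptx_90 (newer GPU, PTX fallback)
```

      The key property is the fallback path: a binary built before a GPU existed can still run on it, which is exactly what ROCm's per-architecture (gfx-target) builds lack.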

      IMO it's an elegant solution that works and keeps things simple, even more so than CPUs (AVX, etc). How would you suggest they divide something like eight-year-old Pascal vs Blackwell? In terms of obsolescence, Pascal is a great example: it's supported up to and including the latest drivers, CUDA 12, and everything in their frameworks support matrix[1], of which AMD doesn't have an equivalent. Like we saw with CUDA 11, CUDA 12 will be supported by major projects for years, resulting in at least a decade of support for Pascal. Please show me an AMD GPU with even eight years of support. Back to focus: ROCm isn't even that old, and AMD is infamous for removing support for GPUs, often within five years if not less.

      > CUDA didn’t even have backwards compatibility for minor releases (.1,.2, etc.) until version 11.0.

      Yes, but they have it now, and CUDA 11 is four years old. They also do nice things like adding Hopper support in 11.8, so on the day of release it "just worked" with whatever you were already running (PTX again). Same for their consumer GPUs: it "just works" on launch day. AMD took over a year to officially support their current flagship desktop GPU (7900 XTX), and even that is dicey in practice due to CDNA vs RDNA. Even when they did, they were doing bizarre things like supporting Python 3.10 in ROCm 5.7 docker containers but Python 3.9 in ROCm 6 docker containers for the first few months.

      Python 3.10 is pretty much the de facto standard for these stacks, so cue my surprise when I was excited for ROCm 6 only to find Python code in popular projects blowing up all over the place because of 3.9. It just screams "we don't get this".

      > The distinction between CUDA, CUDA toolkit, CUDNN, and the actual driver is still inscrutable to many new devs (see the common questions asked on r/localLlama and r/StableDiffusion).

      Yes, and AMD has direct equivalents that are even less clear. The reddit communities you mention are not the best examples (I would not call those users "devs"). Even so, look at any post of someone coming along asking what hardware to buy. The responses are overwhelmingly "AMD is a world of pain, if you want for it to just work buy Nvidia". IMO the only "AMD is fine, don't believe the FUD" responses are an effect of the cult-like "team red vs team green" bleeding over from hobbyist/gamer subs on Reddit, because it's just not accurate. I don't know a single dev or professional in the space (whose livelihood depends on it) who agrees.

      They will also often point out that due to significantly better software AMD hardware is often bested by previous generation Nvidia hardware with dramatically inferior paper specs [2]. I like to say that AMD is at the "get it to work" stage while Nvidia and the broader CUDA ecosystem has been at the "squeeze every last penny out of it" stage for many years.

      > And you have been able to run real DL workloads on budget laptop cards since 5.4 (I’ve done so personally).

      Depends on what you mean by "real DL workloads". Vanilla torch? Yes. Then start looking at flash attention, triton, xformers, and production inference workloads...

      > Developer support is improving (arguably too slowly), but it’s improving.

      Generally agree, but back to focus and discipline: it's a shame that it took a massive "AI" goldrush over the past ~18 months for them to finally take it even vaguely seriously. Throw in the fact that Nvidia has absurdly more resources, and their 30% R&D spend on software is going to continue to rocket CUDA ahead of ROCm.

      For Frontier and elsewhere I really want AMD to succeed, I just don't think it does them (or anyone) any favors by pretending that all is fine in ROCm land.

      [0] - https://www.olcf.ornl.gov/wp-content/uploads/OLCF_AI_Trainin...

      [1] - https://docs.nvidia.com/deeplearning/frameworks/support-matr...

      [2] - https://www.reddit.com/r/LocalLLaMA/comments/191srof/amd_rad...


Small correction: CUDA was first released in 2007 and of course Nvidia was also aiming at HPC before the AlexNet moment.

Good summary. There were also the 2010s multivendor HSA and OpenCL efforts, which lost the other vendors along the way; many customers ended up accepting the proprietary CUDA.