Comment by logicchains

12 days ago

And yet they still can't solve the problem of their GPU driver/software stack for ML being much worse than NVidia's. It seems like the first step is easy: pay more for engineers. AMD pays engineers significantly less than NVidia, and it's presumably quite hard to build a competitive software stack while paying so much less. You get what you pay for.

Everyone does software poorly, hardware companies more so.

  • Well, this is glaringly obvious to the whole world, and Nvidia managed to get it right. Surely it's a feat that can be repeated elsewhere with enough will applied over enough time. And it would make them grow massively, something no shareholder ever frowns upon.

    • > Nvidia managed to get it right

      I don't think they did. If you work in the space and have watched it develop over the years, you can see that there have been (and still are) plenty of jank and pain points to be found. Their lead mostly comes from their earlier dominant market position, afaik.

      3 replies →

Exactly. ROCm is only available for the top tier RX 7900 GPUs and you’re expected to run Linux.

AMD was (is) tinkering with a translation layer for CUDA, much like how WINE translates DirectX. Great idea, but it's been taking a while in this fast-paced market.

  • > AMD was (is) tinkering with a translation layer for CUDA

    From what I understand, they dropped the contract with the engineer who was working on it.

    Fortunately, as part of the contract, said engineer stipulated that the project would become open source, so now it is, and is still being maintained by that engineer.

  • > ROCm is only available for the top tier RX 7900 GPUs and you’re expected to run Linux.

    Fixed it for you: ROCm is only officially supported for the top tier RX 7900 GPUs and you’re expected to run Linux.

    Desktop-class cards work if you apply an "HSA version override".
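
    For anyone curious what that looks like in practice, here's a minimal sketch (the GFX version string is an assumption; 10.3.0 is commonly reported for RDNA2 cards and 11.0.0 for RDNA3):

        import os

        # Must be set before the ROCm runtime initializes, i.e. before importing torch.
        # "10.3.0" is an assumed value for an RDNA2 card; RDNA3 cards typically use "11.0.0".
        os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

        import torch  # a ROCm build of PyTorch exposes the familiar torch.cuda API

        print(torch.cuda.is_available())      # True if the runtime accepts the override
        print(torch.cuda.get_device_name(0))  # should report the Radeon card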

    • Cool. I was thinking of getting a 7800 XT over a 4070. Hope I can get llama 70B working nearly as well.

Came here to say this. They only just recently got an AMD GPU onto MLPerf, thanks to a different company: George Hotz's Tinycorp. I guess basic ML performance is too hard a problem.

  • I dunno, a world where hardware companies, like, sold hardware, and then software companies wrote and sold the software could be pretty nice. It is cool that Hotz is doing something other than contributing to an anticompetitive company's moat.

A couple of thoughts here.

* AMD's traditional target market for its GPUs has been HPC as opposed to deep learning/"AI" customers.

For example, look at the supercomputers at the national labs. AMD has won quite a few high profile bids with the national labs in recent years:

- Frontier (deployment began in 2021) (https://en.wikipedia.org/wiki/Frontier_(supercomputer)) - used at Oak Ridge for modeling nuclear reactors, materials science, biology, etc.

- El Capitan (2023) (https://en.wikipedia.org/wiki/El_Capitan_(supercomputer)) - Livermore national lab

AMD GPUs are pretty well represented on the TOP500 list (https://top500.org/lists/top500/list/2024/06/), which tends to feature computers used by major national-level labs for scientific research. AMD CPUs are even better represented.

* HPC tends to focus almost exclusively on FP64 computation, since rounding errors in that kind of use case are a much bigger deal than in DL (see for example https://hal.science/hal-02486753/document). NVIDIA innovations like TensorFloat-32, mixed precision, and custom silicon (e.g., the "transformer engine") are of limited interest to HPC customers. It's no surprise that AMD didn't pursue similar R&D, given who they were selling GPUs to.
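
To make the precision point concrete, here is a small illustration (plain NumPy, nothing GPU-specific) of why long-running FP64 simulations can't tolerate the lower precisions that DL happily uses:

    import numpy as np

    def naive_sum(values, dtype):
        """Accumulate sequentially in a fixed precision, like a long-running simulation loop."""
        total = dtype(0)
        for v in values:
            total = dtype(total) + dtype(v)
        return total

    steps = [1e-5] * 100_000   # 100k tiny increments that should sum to exactly 1.0

    print(naive_sum(steps, np.float64))  # ~1.0000000000 -- effectively exact
    print(naive_sum(steps, np.float32))  # visibly off in the later decimal places
    print(naive_sum(steps, np.float16))  # stalls far below 1.0 once increments round away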

* People tend to forget that less than a decade ago, AMD as a company had a few quarters of cash left before the company would've been bankrupt. When Lisa Su took over as CEO in 2014, AMD market share for all CPUs was 23.4% (even lower in the more lucrative datacenter market). This would bottom out at 17.8% in 2016 (https://www.trefis.com/data/companies/AMD,.INTC/no-login-req...).

AMD's "Zen moment" didn't arrive until March 2017. And it wasn't until Zen 2 (July 2019), that major datacenter customers began to adopt AMD CPUs again.

* In interviews with key AMD figures like Mark Papermaster and Forrest Norrod, they've mentioned how in the years leading up to the Zen release, all other R&D was slashed to the bone. You can see (https://www.statista.com/statistics/267873/amds-expenditure-...) that AMD R&D spending didn't surpass its previous peak (on a nominal dollar, not even inflation-adjusted, basis) until 2020.

There was barely enough money to fund the CPUs that would stop the company from going bankrupt, much less fund GPU hardware and software development.

* By the time AMD could afford to spend on GPU development, CUDA was the entrenched leader. CUDA was first released in 2003(!), ROCm not until 2016. AMD is playing from behind, and had to make various concessions. The ROCm API is designed around CUDA API verbs/nouns. AMD funded ZLUDA, intended to be a "translation layer" so that CUDA programs can run as a drop-in on ROCm.
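
One concrete example of that "designed around CUDA" reality, at the framework level rather than the driver level: ROCm builds of PyTorch reuse the torch.cuda namespace, so (at least in principle) a script written for an NVIDIA card should run unchanged. A rough sketch:

    import torch

    # On an NVIDIA machine this is real CUDA; on a ROCm build of PyTorch the same
    # "cuda" device string is routed to HIP/ROCm, so the code needs no changes.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    x = torch.randn(2048, 2048, device=device)
    y = x @ x                                  # matmul on the GPU via either stack
    print(y.mean().item())
    print("cuda:", torch.version.cuda, "| hip:", getattr(torch.version, "hip", None))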

* There's a chicken-and-egg problem here.

1) There's only one major cloud (Azure) that has ready access to AMD's datacenter-grade GPUs (the Instinct series).

2) I suspect a substantial portion of their datacenter revenue still comes from traditional HPC customers, who have no need for the ROCm stack.

3) The lack of a ROCm developer ecosystem means that development and bug fixes come much slower than they would for CUDA. For example, the mainline TensorFlow release was broken on ROCm for a while (you had to install the nightly release).

4) But, things are improving (slowly). ROCm 6 works substantially better than ROCm 5 did for me. PyTorch and TensorFlow benchmark suites will run.

Trust me, I share the frustration around the semi-broken state that ROCm is in for deep learning applications. As an owner of various NVIDIA GPUs (from consumer laptop/desktop cards to datacenter accelerators), in 90% of cases things just work on CUDA.

On ROCm, as of today it definitely doesn't "just work". I put together a guide for Framework laptop owners to get ROCm working on the AMD GPU that ships as an optional add-in (https://community.frame.work/t/installing-rocm-hiplib-on-ubu...). This took a lot of head-banging and the parsing of obscure blogs and GitHub issues.

TL;DR, if you consider where AMD GPUs were just a few years ago, things are much better now. But, it still takes too much effort for the average developer to get started on ROCm today.

  • Summary: AMD works if you spend 500m USD+ with them. Then they'll throw an army of their own software engineers into the contract, who will hold your hand every step of the way and remove all the jank for you. By contrast, for at least the last 10 years I could buy any GTX card and CUDA worked out of the box, and that applied right down to a $99 Jetson Nano.

    AMD's strategy looks a lot like IBM's mainframe strategy of the 80s. And that didn't go well.

    • No, not really?

      The customers at the national labs are not going to be sharing custom HPC code with AMD engineers, if for no other reason than security clearances. Nuclear stockpile modeling code, or materials science simulations are not being shared with some SWE at AMD. AMD is not “removing jank”, for these customers. It’s that these customers don’t need a modern DL stack.

      Let’s not pretend like CUDA works/has always worked out of the box. There’s forced obsolescence (“CUDA compute capability”). CUDA didn’t even have backwards compatibility for minor releases (.1,.2, etc.) until version 11.0. The distinction between CUDA, CUDA toolkit, CUDNN, and the actual driver is still inscrutable to many new devs (see the common questions asked on r/localLlama and r/StableDiffusion).
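
      (If it helps untangle those pieces, here's a quick sketch of how each component reports its own, separate version from Python; the attribute names assume a reasonably recent PyTorch build on an NVIDIA machine:)

          import subprocess
          import torch

          # Four different version numbers for four different components.
          driver = subprocess.run(
              ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
              capture_output=True, text=True,
          ).stdout.strip()
          print("kernel-mode driver:", driver)
          print("CUDA runtime PyTorch was built against:", torch.version.cuda)
          print("cuDNN:", torch.backends.cudnn.version())
          print("compute capability of the card:", torch.cuda.get_device_capability(0))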

      Directionally, AMD is trending away from your mainframe analogy.

      The first consumer cards got official ROCm support in 5.0. And you have been able to run real DL workloads on budget laptop cards since 5.4 (I’ve done so personally). Developer support is improving (arguably too slowly), but it’s improving. Hugging Face, Cohere, MLIR, Lamini, PyTorch, TensorFlow, DataBricks, etc all now have first party support for ROCm.

      8 replies →

  • Small correction: CUDA was first released in 2007 and of course Nvidia was also aiming at HPC before the AlexNet moment.

  • Good summary. There were also the 2010s multi-vendor HSA and OpenCL software directions, which lost the other vendors along the way; many customers turned out to accept the proprietary CUDA instead.

And yet people seem to work just fine with ML on AMD GPUs when they aren’t thinking about Jensen.

  • I have a 7900 XTX. There's a known firmware crash issue with ComfyUI. It was reported about a year ago. Every ROCm patch release I check the notes, and every release it goes unfixed. That's not even getting into the intense jank that is the ROCm Debian repo. If we need DL at work, I'll recommend Nvidia, no question.

  • Which AMD GPUs? Most consumer AMD GPUs don't even support ROCm.

    • Debian, Arch and Gentoo have ROCm built for consumer GPUs. Thus so do their derivatives. Anything gfx9 or later is likely to be fine and gfx8 has a decent chance of working. The https://github.com/ROCm/ROCm source has build scripts these days.

      At least some of the internal developers largely work on consumer hardware. It's not as solid as the enterprise gear but it's also very cheap so overall that seems reasonable to me. I'm using a pair of 6900XT, with a pair of VII's in a backup machine.

      For turn key proprietary stuff where you really like the happy path foreseen by your vendor, in classic mainframe style, team green is who you want.

      9 replies →

    • ROCm 6.0 and 6.1 list RDNA3 (gfx1100) and RDNA2 (gfx1030) in their supported architectures list: https://rocm.docs.amd.com/en/latest/compatibility/compatibil...

      Although "official" / validated support^ is only for the PRO W6800/V620 (RDNA2) and the consumer RX 7900 series (RDNA3). Based on lots of reports, you can probably just use an HSA_OVERRIDE_GFX_VERSION override for other RDNA2/3 cards and it'll probably just work. I can get GPU-accelerated ROCm for LLM inferencing on my Radeon 780M iGPU, for example, w/ ROCm 6.0 and HSA_OVERRIDE_GFX_VERSION=11.0.0.

      (In the past some people also built custom versions of ROCm for older architectures (eg ROC_ENABLE_PRE_VEGA=1) but I have no idea if those work still or not.)

      ^ https://rocm.docs.amd.com/projects/install-on-linux/en/lates...