Comment by wordofx

16 days ago

And yet people seem to work just fine with ML on AMD GPUs when they aren’t thinking about Jensen.

I have a 7900 XTX. There's a known firmware crash issue with ComfyUI that was reported about a year ago. Every ROCm patch release I check the notes, and every release it goes unfixed. And that's before getting into the intense jank that is the ROCm Debian repo. If we need DL at work, I'll recommend Nvidia, no question.

Which AMD GPUs? Most consumer AMD GPUs don't even support ROCm.

  • Debian, Arch and Gentoo have ROCm built for consumer GPUs, and so do their derivatives. Anything gfx9 or later is likely to be fine, and gfx8 has a decent chance of working. The https://github.com/ROCm/ROCm source has build scripts these days.

    At least some of the internal developers largely work on consumer hardware. It's not as solid as the enterprise gear, but it's also very cheap, so overall that seems reasonable to me. I'm using a pair of 6900 XTs, with a pair of Radeon VIIs in a backup machine.

    For turn key proprietary stuff where you really like the happy path foreseen by your vendor, in classic mainframe style, team green is who you want.
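For anyone wondering which of those gfx generations their card falls into, here's a quick sketch (my assumption: the rocminfo utility from a ROCm install is on PATH):

```shell
#!/bin/sh
# Sketch: list the gfx ISA target(s) the ROCm runtime reports for your GPUs,
# e.g. gfx900 (Vega / "gfx9") or gfx1030 (RDNA2). rocminfo ships with ROCm.
rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u
```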

    • > For turn key proprietary stuff where you really like the happy path foreseen by your vendor

      there really was no way for AMD to foresee that people might want to run GPGPU workloads on their Polaris cards? Isn't that a little counterfactual to the whole OpenCL and HSA Framework push predating that?

      Example: it's not that things like Bolt didn't exist to try and compete with Thrust... it's that the NVIDIA one has had three updates in the last month and Bolt was last updated 10 years ago.

      You're literally reframing "having working runtime and framework support for your hardware" as being some proprietary turnkey luxury for users, as well as an unforeseeable eventuality for AMD. It wasn't a development priority, but users do like to actually build code that works etc.

      That's why you got kicked to the curb by Blender - your OpenCL wasn't stable even after years of work from them and you. That's why you got kicked to the curb by Octane - your Vulkan Compute support wasn't stable enough to even compile their code successfully. That's the story that's related by richg42 about your OpenGL driver implementation too - that it's just paper features and resume-driven development by developers 10 years departed all the way down.

      The issues discussed by geohotz aren't new, and they aren't limited to ROCm or to deep learning. This is, broadly speaking, the same level of quality that AMD has applied to all its software for decades. And the social-media "red team" loyalism strategy doesn't really work here; you can't push this into "AMD drivers have been good for like 10 years now!!!" fervor when the understanding of the problems is that broad and that collectively shared. Every GPGPU developer who's tried has bounced off this AMD experience for literally an entire generation running now. The shared collective experience is that AMD is not serious in the field, and it's difficult to believe it's a good-faith change and interest in advancing the field rather than just a cashgrab.

      It's also completely foreseeable that users want broad, official support for all their architectures, and not one or two specifics etc. Like these aren't mysteries that AMD just accidentally forgot about, etc. They're basic asks that you are framing as "turnkey proprietary stuff", like a working opencl runtime or a working spir-v compiler.

      What was it Linus said about the experience of working with NVIDIA? That's been the experience of the GPGPU community working with AMD, for decades. Shit is broken and doesn't work, and there's no interest in making it otherwise. And the only thing that changed it is a cashgrab, and a working compiler/runtime is still "turnkey proprietary stuff" they have to be arm-twisted into doing by Literally Being Put On Blast By Geohotz Until It's Fixed. "Fuck you, AMD" is a sentiment that there are very valid reasons to feel, given the amount of needless suffering you have generated - but we just don't do that to red team, do we?

      But you guys have been more intransigent about just supporting GPGPU (no matter what framework - please just pick one, get serious, and start working already) than NVIDIA ever was about Wayland etc. You've blown decades refusing to ever shit or get off the pot, without even giving enough documentation for the community to just do it themselves. And that's not an exaggeration - I bounced off the AMD stack in 2012, and it wasn't a new problem then either. It's too late for "we didn't know people wanted a working runtime or to develop on gaming cards" to work as an excuse; after decades of overt, willing neglect it's just patronizing.

      Again, sorry, this is ranty, and it's not that I'm upset at you personally, but my advice as a corporate posture here is: don't go looking for a ticker-tape parade for finally delivering a working runtime that you've literally been advertising support for, for more than a decade, like it's some favor to the community. These aren't "proprietary turnkey features", they're literally the basics of the specs you're advertising compliance with - and it's not even just one, it's like 4+ different APIs that have this problem with you guys, widely known and discussed in tech blogs etc for more than a decade (richg42). I've been saying it for a long time, and so has everyone else who's ever interacted with AMD hardware in the GPGPU space. Nobody there cared until it was a cashgrab; actually, half the time you get the AMD fan there to tell you the drivers have been good for a decade now (AMD cannot fail, only be failed). It's frustrating. You've poisoned the well with generations of developers, with decades of corporate obstinance that would make NVIDIA blush. Please at least have a little contrition about the whole experience and the feelings on the other side here.

  • ROCm 6.0 and 6.1 list RDNA3 (gfx1100) and RDNA2 (gfx1030) in their supported architectures list: https://rocm.docs.amd.com/en/latest/compatibility/compatibil...

    Although "official" / validated support^ is only for the PRO W6800/V620 for RDNA2, and for RDNA3 only the consumer RX 7900 series. Based on lots of reports, you can probably just set HSA_OVERRIDE_GFX_VERSION for other RDNA2/3 cards and it'll probably just work. I can get GPU-accelerated ROCm LLM inferencing on my Radeon 780M iGPU, for example, with ROCm 6.0 and HSA_OVERRIDE_GFX_VERSION=11.0.0.

    (In the past some people also built custom versions of ROCm for older architectures (e.g. with ROC_ENABLE_PRE_VEGA=1), but I have no idea whether those still work.)

    ^ https://rocm.docs.amd.com/projects/install-on-linux/en/lates...
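A minimal sketch of the override described above (my assumptions: ROCm 6.x installed and a HIP-enabled app to launch; 11.0.0 is the gfx1100 target, which the RX 7900 series uses):

```shell
#!/bin/sh
# Sketch, not official guidance: run a ROCm workload on a card outside the
# validated list by telling the HSA runtime to treat it as a supported target.
# 11.0.0 corresponds to gfx1100 (RX 7900 series); a Radeon 780M iGPU
# (gfx1103) reportedly works when spoofed this way.
export HSA_OVERRIDE_GFX_VERSION=11.0.0

# Then launch whatever HIP/ROCm app you use - illustrative example only:
# ./llama-server -m model.gguf -ngl 99
```

Note the override only papers over the target-ID check; kernels still have to be compatible with the real hardware, which is why it tends to work within the same RDNA generation.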