Comment by kkielhofner

14 days ago

> The customers at the national labs are not going to be sharing custom HPC code with AMD engineers, if for no other reason than security clearances. Nuclear stockpile modeling code, or materials science simulations are not being shared with some SWE at AMD. AMD is not “removing jank”, for these customers.

I work closely with OLCF and Frontier (I have a job running on Frontier right now). This is incorrect. The overwhelming majority of compute and resource allocation are not "nuclear stockpile modeling code" projects or anything close to it. AMD often gets directly involved with various issues (OLCF staff has plenty of stories about this). I know because I've spoken with them and AMD.

Speaking of Frontier, you get fun things like compiling an AWS project just to get RCCL to kind of work decently with Slingshot interconnect via libfabric[0] vs NCCL that "just works", largely due to Nvidia's foresight with their acquisition of Mellanox over five years ago.

> Let’s not pretend like CUDA works/has always worked out of the box.

It is and has been miles beyond the competition and that's clearly all you need. Nvidia has > 90% market share and is worth ~10x AMD. 17 years of focus and investment (30% of their R&D spend is software) when your competitors are wandering all over the place in fits and starts will do that. I'm also of the personal opinion that AMD just doesn't have software in their DNA and don't seem to understand that people don't want GPUs, they want solutions that happen to work best on GPUs and that entails broad and significant investment in the accompanying software stacks.

AMD has truly excellent hardware that is significantly limited by their lack of investment in software.

> There’s forced obsolescence (“CUDA compute capability”).

Compute capability is why code targeting a given lineage of hardware just works. You can target 8.0 (for example) and as long as your hardware is 8.0 it will run on anything with Nvidia stamped on it from laptop to Jetson to datacenter and the higher-level software doesn't know the difference (less VRAM, which is what it is). Throw in "+PTX" when building and it will run on anything up too (albeit not taking full advantage of new hardware). With official support, without setting various environment variable and compiler hacks to end up with code that often randomly crashes (I know from personal experience). It is extremely common for projects to target SM 7.x, 8.x and 9.x. The stack just figures it out from there.

This is the PTX intermediary available with CUDA and the driver that makes this possible, where in AMD land you have some pretty drastic differences within CDNA or RDNA families not to mention CDNA vs RDNA in the first place.

IMO it's an elegant solution that works and makes it simple, even more so than CPUs (AVX, etc). How would you suggest they divide something like eight year old Pascal vs Blackwell? In terms of obsolescence, Pascal is a great example - it's supported by up to and including latest drivers, CUDA 12, and everything in their frameworks support matrix[1] of which AMD doesn't have an equivalent. Like we saw with CUDA 11, CUDA 12 will be supported by major projects for years, resulting in at least a decade of support for Pascal. Please show me an AMD GPU with even eight years of support. Back to focus, ROCm isn't even that old and AMD is infamous for removing support for GPUs, often within five years if not less.

> CUDA didn’t even have backwards compatibility for minor releases (.1,.2, etc.) until version 11.0.

Yes but they have it and CUDA 11 is four years old. They also do nice things like when they added Hopper support in 11.7 so on the day of release it "just worked" with whatever you were already running (PTX again). Same for their consumer GPUs, it "just works" the day of release. AMD took over a year to officially support their current flagship desktop GPU (7900 XTX) and even that is dicey in practice due to CDNA vs RDNA. Even when they did they were doing bizarre things like supporting Python 3.10 with ROCm 5.7 docker containers and Python 3.9 in ROCm 6 docker containers for the first few months.

Python 3.10 is pretty much the de-facto standard for these stacks, cue my surprise when I was excited for ROCm 6 only to find out Python code with popular projects was blowing up all over the place because 3.9. It just screams "we don't get this".

> The distinction between CUDA, CUDA toolkit, CUDNN, and the actual driver is still inscrutable to many new devs (see the common questions asked on r/localLlama and r/StableDiffusion).

Yes, and AMD has direct equivalents that are even less clear. The reddit communities you mention are not the best examples (I would not call those users "devs"). Even so, look at any post of someone coming along asking what hardware to buy. The responses are overwhelmingly "AMD is a world of pain, if you want for it to just work buy Nvidia". IMO the only "AMD is fine, don't believe the FUD" responses are an effect of the cult-like "team red vs team green" bleeding over from hobbyist/gamer subs on Reddit because it's just not accurate. I don't know a single dev or professional in the space (who's livelihood depends on it) who agrees.

They will also often point out that due to significantly better software AMD hardware is often bested by previous generation Nvidia hardware with dramatically inferior paper specs [2]. I like to say that AMD is at the "get it to work" stage while Nvidia and the broader CUDA ecosystem has been at the "squeeze every last penny out of it" stage for many years.

> And you have been able to run real DL workloads on budget laptop cards since 5.4 (I’ve done so personally).

Depends on what you mean by "real DL workloads". Vanilla torch? Yes. Then start looking at flash attention, triton, xformers, and production inference workloads...

> Developer support is improving (arguably too slowly), but it’s improving.

Generally agree but back to focus and discipline it's a shame that it took a massive "AI" goldrush over the past ~18 months for them to finally take it vaguely seriously. Now you throw in the fact that Nvidia has absurdly more resources, their 30% R&D spend on software is going to continue to rocket CUDA ahead of ROCm.

For Frontier and elsewhere I really want AMD to succeed, I just don't think it does them (or anyone) any favors by pretending that all is fine in ROCm land.

[0] - https://www.olcf.ornl.gov/wp-content/uploads/OLCF_AI_Trainin...

[1] - https://docs.nvidia.com/deeplearning/frameworks/support-matr...

[2] - https://www.reddit.com/r/LocalLLaMA/comments/191srof/amd_rad...

5 comments

kkielhofner

cepth 14 days ago

(Split into two parts due to comment length restrictions)

> I work closely with OLCF and Frontier (I have a job running on Frontier right now). This is incorrect. The overwhelming majority of compute and resource allocation are not "nuclear stockpile modeling code" projects or anything close to it. AMD often gets directly involved with various issues (OLCF staff has plenty of stories about this). I know because I've spoken with them and AMD.

I don't have any experience running a job on one of these national supercomputers, so I'll defer to you on this. (Atomic Canyon looks very cool!)

Just two follow-ups then: is it the case that any job, small or large, enjoys this kind of AMD optimization/debugging support? Does your typical time-grant/node-hour academic awardee get that kind of hands-on support?

And, for nuclear modeling (be it weapons or civilian nuclear), do you know if AMD engineers can get involved? (https://insidehpc.com/2023/02/frontier-pushes-boundaries-86-... this article claims "86% of nodes" were used on at least one modeling run, which I imagine is among the larger jobs)

> It is and has been miles beyond the competition and that's clearly all you need. Nvidia has > 90% market share and is worth ~10x AMD. 17 years of focus and investment (30% of their R&D spend is software) when your competitors are wandering all over the place in fits and starts will do that.

No dispute here that NVIDIA is the market leader today, deservedly so. NVIDIA to its credit has invested in CUDA for many years, even when it wasn't clear there was an immediate ROI.

But, I bristle at the narrative fallacy that it was some divine inspiration and/or careful planning (“focus”) that made CUDA the perfect backbone for deep learning.

In 2018, NVIDIA was chasing crypto mining, and felt the need to underplay (i.e., lie) to investors about how large that segment was (https://wccftech.com/nvidia-sued-cryptocurrency-mining-reven...). As late as 2022, NVIDIA was diverting wafer supply from consumer, professional, and datacenter GPUs to produce crippled "LHR" mining cards.

Jensen has at various points pumped (during GTC and other high profile events):

- Ray tracing (2018) (https://www.youtube.com/watch?v=95nphvtVf34)

- More ray tracing (2019) (https://youtu.be/Z2XlNfCtxwI)

- "Omniverse" (2020) https://youtu.be/o_XeGyg2NIo?list=PLZHnYvH1qtOYOfzAj7JZFwqta...)

- Blockchain, NFTs, and the metaverse (2021) (https://cointelegraph.com/news/nvidia-ceo-we-re-on-the-cusp-...) (https://blockonomi.com/nvidiz-ceo-talks-crypto-nfts-metavers...)

- ETH (2021) (https://markets.businessinsider.com/currencies/news/nvidia-c...)

- "Omniverse"/digital twins (2022) (https://www.youtube.com/watch?v=PWcNlRI00jo)

- Autonomous vehicles (2022) (https://www.youtube.com/watch?v=PWcNlRI00jo)

Most of these predictions about use cases have not panned out at all. The last GTC keynote prior to the "ChatGPT moment" took place just 2 months before the general availability of ChatGPT. And, if you click through to the video, you'll see that LLMs got under 7 minutes of time at the very end of a 90 minute keynote. Clearly, Jensen + NVIDIA leadership had no idea that LLMs would get the kind of mainstream adoption/hype that they have.

On the business side, it hasn't exactly always been a smooth ride for NVIDIA either. In Q2 2022 (again right before the "ChatGPT moment"), the company missed earnings estimates by 18%(!) due to inventory writedowns (https://www.pcworld.com/article/828754/nvidia-preannounces-l...).

The end markets that Jensen forecasts/predicts on quarterly earnings calls (I’ve listened to nearly every one for the last decade) are comically disconnected from what ends up happening.

It's a running joke among buy-side firms that there'll always be an opportunity to buy the NVDA dip, given the volatility of the company's performance + stock.

NVIDIA's "to the moon" run as a company is due in large part to factors outside of its design or control. Of course, how large is up for debate.

If/when it turns out that most generative products can't turn a profit, and NVIDIA revenues decline as a result, it wouldn't be fair to place the blame for the collapse of those end markets at NVIDIA’s feet. Similarly, the fact that LLMs and generative AI turned out to be hit use cases has little to do with NVIDIA's decisions.

AMD is a company that was on death’s door until just a few years ago (2017). It made one of the most incredible corporate comebacks in the history of capitalism on the back of its CPUs, and is now dipping its toes into GPUs again.

NVIDIA had a near-monopoly on non-console gaming. It parlayed that into a dominant software stack.

It’s possible to admire both, without papering over the less appealing aspects of each’s history.

> Depends on what you mean by "real DL workloads". Vanilla torch? Yes. Then start looking at flash attention, triton, xformers, and production inference workloads...

As I mentioned above, this is a chicken-and-egg phenomenon with the developer ecosystem. I don't think we really disagree.

CUDA is an "easy enough" GPGPU backbone that due to incumbency and the lack of real competition from AMD and Intel for a decade led to the flourishing of a developer ecosystem.

Tri Dao (sensibly) decided to write his original Flash Attention paper with an NVIDIA focus, for all the reasons you and I have mentioned. Install base size, ease of use of ROCm vs CUDA, availability of hardware on-prem & in the cloud, etc.

Let's not forget that Xformers is a Meta project, and that non-A100 workloads (i.e., GPUs without 8.0 compute capability) were not officially supported by Meta for the first year of Xformers (https://github.com/huggingface/diffusers/issues/2234) (https://github.com/facebookresearch/xformers/issues/517#issu...). This is the developer ecosystem at work.

AMD right now is forced to put in the lion's share of the work to get a sliver of software parity. It took years to get mainline PyTorch and Tensorflow support for ROCm. The lack of a ROCm developer community (hello chicken and egg), means that AMD ends up being responsbile for first-party implementations of most of the hot new ideas coming from research.

Flash Attention for ROCm does exist (https://github.com/ROCm/flash-attention) (https://llm-tracker.info/howto/AMD-GPUs#flash-attention-2), albeit only on a subset of cards.

Triton added (initial) support for ROCm relatively recently (https://github.com/triton-lang/triton/pull/1983).

Production-scale LLM inference is now entirely possible with ROCm, via first-party support for vLLM (https://rocm.blogs.amd.com/artificial-intelligence/vllm/READ...) (https://community.amd.com/t5/instinct-accelerators/competiti...).

> Compute capability is why code targeting a given lineage of hardware just works. You can target 8.0 (for example) and as long as your hardware is 8.0 it will run on anything with Nvidia stamped on it from laptop to Jetson to datacenter and the higher-level software doesn't know the difference (less VRAM, which is what it is).

This in theory is the case. But, even as an owner of multiple generations of NVIDIA hardware, I find myself occasionally tripped up.

Case in point:

RAPIDS (https://rapids.ai/) is one of the great non-deep learning success stories to come out of CUDA, a child of the “accelerated computing” push that predates the company’s LLM efforts. The GIS and spatial libraries are incredible.

Yet, I was puzzled when earlier this year I updated cuSpatial to the newest available version (24.02) (https://github.com/rapidsai/cuspatial/releases/tag/v24.02.00) via my package manager (Mamba/Conda), and started seeing pretty vanilla functions start breaking on my Pascal card. Logs indicated I needed a Volta card (7.0 CC or newer). They must've reimplemented certain functions altogether.

There’s nothing in the release notes that indicates this bump in minimum CC. The consumer-facing page for RAPIDS (https://rapids.ai/) has a mention under requirements.

So I’m led to wonder, did the RAPIDS devs themselves not realize that certain dependencies experienced a bump in CC?

cepth 14 days ago
(Part 2 of 2)
> Please show me an AMD GPU with even eight years of support. Back to focus, ROCm isn't even that old and AMD is infamous for removing support for GPUs, often within five years if not less.
As you yourself noted, CDNA vs RDNA makes things more complicated in AMD land. I also think it’s unfair to ask about “eight years of support” when the first RDNA card didn’t launch until 2019, and the first CDNA “accelerator” in 2020.
The Vega and earlier generation is so fundamentally different that it would’ve been an even bigger lift for the already small ROCm team to maintain compatibility.
If we start seeing ROCm removing support for RDNA1 and CDNA1 cards soon, then I’ll share your outrage. But I think ROCm 6 removing support for Radeon VII was entirely understandable.
> Generally agree but back to focus and discipline it's a shame that it took a massive "AI" goldrush over the past ~18 months for them to finally take it vaguely seriously. Now you throw in the fact that Nvidia has absurdly more resources, their 30% R&D spend on software is going to continue to rocket CUDA ahead of ROCm.
> For Frontier and elsewhere I really want AMD to succeed, I just don't think it does them (or anyone) any favors by pretending that all is fine in ROCm land.
The fact is that the bulk of AMD profits is still coming from CPUs, as it always has. AMD wafer allotment at TSMC has to first go towards making its hyperscaler CPU customers happy. If you promise AWS/Azure/GCP hundreds of thousands of EPYC CPUs, you better deliver.
I question how useful it is to dogpile (not you personally, but generally) on AMD, when the investments in people and dollars are trending in the right decision. PyTorch and TensorFlow were broken on ROCm until relatively recently. Now that they work, you (not unreasonably) ask where the other stuff is.
The reality is that NVIDIA will likely forever be the leader with CUDA. I doubt we’ll ever see PhD students and university labs making ROCm their first choice when having to decide where to conduct career-making/breaking research.
But, I don’t think it’s really debatable that AMD is closing the relative gap, given the ROCm ecosystem didn’t exist until at all relatively recently. I’m guessing the very credible list of software partners now at least trying ROCm (https://www.amd.com/en/corporate/events/advancing-ai.html#ec...) are not committing time + resources to an ecosystem that they see as hopeless.
---
Final thoughts:
A) It was completely rational for AMD to focus on devoting the vast majority of R&D spend to its CPUs (particularly server/EPYC), particularly after the success of Zen. From the day that Lisa Su took over (Oct 8, 2014), the stock is up 50x+ (even more earlier in 2024), not that share price is reflective of value in the short term. AMD revenue for calendar year 2014 was $5.5B, operating income negative 155 million. Revenue for 2023 was $22.68B, operating income $401 million. Operating income was substantially higher in 2022 ($1.2B) and 2021 ($3.6B), but AMD has poured that money into R&D spending (https://www.statista.com/statistics/267873/amds-expenditure-...), as well as the Xilinx acquisition.
B) It was completely rational for NVIDIA to build out CUDA, as a way to make it possible to do what they initially called "scientific computing" and eventually "GPU-accelerated computing". There's also the reality that Jensen, the consummate hype man, had to sell investors a growth story. The reality is that gaming will always be a relatively niche market. Cloud gaming (GeForce Now) never matched up to revenue expectations.
C) It’s difficult for me to identify any obvious “points of divergence” that in an alternate history would’ve led to better outcomes with AMD. Without the benefit of “future knowledge”, at what point should AMD have ramped up ROCm investment? Given, as I noted above, in the months before ChatGPT went viral, Jensen’s GTC keynote gave only a tiny mention to LLMs.
D) If anything, the company that missed out was Intel. Beyond floundering on the transition from 14nm to 10nm (allowing TSMC and thus AMD to surpass them), Intel wasted its CPU-monopoly years and the associated profits. Projects like Larrabee (https://www.anandtech.com/show/3738/intel-kills-larrabee-gpu...) and Xe (doomed in part by internal turf wars) (https://www.tomshardware.com/news/intel-axes-xe-hp-gpus-for-...) were killed off. R&D spending was actually comparable to the amount spent on share buybacks in 2011 (14.1B in buybacks vs 8.3B in R&D spending), 2014 (10.7B vs 11.1B), 2018 (10.8B vs 13.B), 2019 (13.5B vs 13.3B) and 2020 (14.1B vs 13.55B). (See https://www.intc.com/stock-info/dividends-and-buybacks and https://www.macrotrends.net/stocks/charts/INTC/intel/researc...).
- vegabook 14 days ago
  
  lol AMD flogged its floundering foundry waaay before Intel ran into any problems.
  in fact most of your points about AMD's lack of dough can be traced back to that disaster. The company wasn't hit by some meteorite. It screwed up all by itself.
  Then lucky it had that duopolistic X86 licence to lean on or it would have gone the way of Zilog or Motorola. 'Cos it sure can't rely on its janky compute offering.
  
  1 reply →