GPU Ecosystems: NVIDIA and AMD — Brief ☧
"The hearing ear, and the seeing eye, the LORD hath made even both of them."
— Proverbs 20:12 (KJV)
Q: A CPU processes instructions one at a time (or a handful in
parallel). But training a neural network means multiplying enormous
matrices — millions of multiplications that are all independent of each
other. What kind of hardware is built for that?
A: A GPU (Graphics Processing Unit). Originally designed to
compute millions of pixel colors simultaneously for video games, GPUs
turned out to be perfect for the massive parallel math in neural
networks. Where a CPU has 8-64 cores, a modern GPU has thousands.
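To make the "independent multiplications" point concrete, here is a minimal Rust sketch (the function and names are illustrative, not from our codebase) of a naive matrix multiply. Each output element C[i][j] depends only on row i of A and column j of B, so all n*m of them could in principle be computed at the same time, which is exactly the parallelism a GPU exploits.

```rust
/// Naive dense matrix multiply: C = A * B, with A (n x k) and B (k x m)
/// stored row-major as flat slices. Each C[i][j] is an independent dot
/// product that shares no state with any other (i, j) pair.
fn matmul(a: &[f32], b: &[f32], n: usize, k: usize, m: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; n * m];
    for i in 0..n {
        for j in 0..m {
            // Touches only row i of A and column j of B.
            let mut sum = 0.0f32;
            for p in 0..k {
                sum += a[i * k + p] * b[p * m + j];
            }
            c[i * m + j] = sum;
        }
    }
    c
}

fn main() {
    // 2x2 example: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let b = vec![5.0, 6.0, 7.0, 8.0];
    println!("{:?}", matmul(&a, &b, 2, 2, 2));
}
```

On a CPU the three loops run one step at a time; on a GPU each (i, j) pair can be handed to its own thread.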
Q: So who makes these GPUs?
A: Two main players:
- NVIDIA built a complete ecosystem — the CUDA programming platform, cuDNN for neural network primitives, TensorRT for deployment. After 20 years of development, nearly every ML framework supports CUDA first. When researchers say "GPU," they usually mean NVIDIA.
- AMD competes with raw hardware power (the MI300X has 192 GB of memory vs. NVIDIA H100's 80 GB) and an open software stack called ROCm, whose HIP layer is designed so that CUDA code can be ported with minimal changes. AMD is the challenger, gaining ground especially where memory capacity or cost matters.
Q: If both can do matrix math, why does NVIDIA dominate?
A: Software ecosystem, not just hardware. CUDA has two decades of
libraries, tutorials, and community support. Most ML code "just works"
on NVIDIA. AMD's ROCm is catching up but still has gaps. It is like the
difference between a city with established roads and a newer city with
wider streets but fewer shops — the infrastructure matters as much as
the raw capacity.
Q: Where does an FPGA fit alongside these GPU giants?
A: Different tool for a different job. GPUs excel at dense
floating-point math (matrix multiply). FPGAs excel at sparse
bit-parallel operations (like bitmask
AND/OR for constraint solving). Neither replaces the other — the right
choice depends on whether your problem is dense arithmetic or sparse
Boolean logic.
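For contrast, here is a minimal Rust sketch of the kind of bit-parallel work described above: intersecting candidate sets encoded as bitmasks. The encoding and helper names are illustrative, not taken from rust_chirho/.

```rust
/// Candidate sets encoded as bitmasks: bit i set means "entity i satisfies
/// this constraint". Intersecting two constraints is a word-wise AND:
/// 64 Boolean operations per instruction on a CPU, far wider on an FPGA.
fn intersect(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().zip(b).map(|(x, y)| x & y).collect()
}

/// How many entities survive both constraints.
fn count(set: &[u64]) -> u32 {
    set.iter().map(|w| w.count_ones()).sum()
}

fn main() {
    // Two constraints over 128 entities (2 x 64-bit words each).
    let c1 = vec![0b1010_1100u64, u64::MAX];
    let c2 = vec![0b0110_1010u64, 0];
    let both = intersect(&c1, &c2);
    println!("{} entities satisfy both constraints", count(&both));
}
```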
Quick Comparison
| Feature | NVIDIA (H100) | AMD (MI300X) | Our FPGA (F2) |
|---|---|---|---|
| Compute | 990 TFLOPS (FP16) | 1300 TFLOPS (FP16) | N/A (bit-parallel) |
| Memory | 80 GB HBM3 | 192 GB HBM3 | 16 GB HBM2 |
| Bandwidth | 3.35 TB/s | 5.3 TB/s | 460 GB/s |
| Software | CUDA (dominant) | ROCm/HIP (growing) | Custom RTL |
| ML ecosystem | PyTorch, TF, JAX | PyTorch (via HIP) | Our rust_chirho/ |
| Best for | Dense matrix ops | Large model inference | Sparse constraint solving |
| Price | ~$30K | ~$15K | ~$2/hr (cloud) |
There is a lot packed into this comparison table, so let us unpack the key insight. NVIDIA dominates through software ecosystem, not raw hardware specs. AMD actually has more memory (192 GB vs 80 GB) and more raw compute (1300 vs 990 TFLOPS), but CUDA's two decades of libraries, tutorials, and community support create a moat that raw performance numbers cannot bridge. This is a powerful lesson about technology adoption: the best hardware does not always win -- the best ecosystem does.
The choice of hardware does not change an algorithm's asymptotic time complexity, but it can change wall-clock time dramatically: an algorithm that is linear-time on paper may run 1000x faster on a GPU because thousands of operations execute simultaneously rather than in sequence. This speedup only applies to workloads that are inherently parallel and dense -- like matrix multiplication. Sparse, irregular workloads often leave most of a GPU's thousands of cores idle.
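As a small illustration of "same complexity, different wall-clock time", this Rust sketch uses the rayon crate (assumed as a dependency) on CPU cores as a stand-in for GPU-style data parallelism: the work stays O(n), but independent elements are processed concurrently.

```rust
use rayon::prelude::*; // assumes the `rayon` crate in Cargo.toml

fn main() {
    let xs: Vec<u64> = (0..1_000_000u64).collect();

    // Sequential: O(n) work, one element at a time.
    let seq: u64 = xs.iter().map(|x| x * x).sum();

    // Parallel: still O(n) work, but independent elements are processed
    // concurrently across all cores. The complexity class is unchanged;
    // only the wall-clock constant factor shrinks, and only because
    // every element is independent of the others.
    let par: u64 = xs.par_iter().map(|x| x * x).sum();

    assert_eq!(seq, par);
    println!("sum of squares = {par}");
}
```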
Connection to our project: GPUs are the competition for neurosymbolic
workloads, and understanding where they excel versus where our FPGA excels is essential for making good architectural decisions. Our FPGA wins when the problem is fundamentally bit-parallel (bitmask AND/OR) rather than floating-point (matrix multiply). A GPU is designed to multiply enormous matrices of floating-point numbers; our FPGA is designed to AND enormous bitmasks of Boolean values. The SemMedDB benchmark
shows 307K constraint intersections per second on FPGA — competitive with GPU for
this specific workload type, and at a fraction of the cost ($2/hour for an F2 instance versus $30,000 for an H100 card). The ideal neurosymbolic system uses both: a GPU for the neural components that need floating-point math, and an FPGA for the symbolic components that need bit-parallel logic.
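A rough back-of-the-envelope on those cost figures: $30,000 / $2 per hour ≈ 15,000 hours, or about 21 months of continuous F2 rental before the cloud cost alone matches the purchase price of a single H100 card (and that ignores the server, power, and hosting around the card).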
Soli Deo Gloria