
NVIDIA Blackwell GPU architecture: Unleashing next‑gen AI performance

Blackwell GPU: NVIDIA’s next-gen architecture with multi-die design, FP4 precision, NVLink-5, and GB200 Superchip powering unparalleled AI training and real-time inference.
NVIDIA’s Blackwell GPU architecture is heralded as the engine behind “AI factories” in the new era of AI reasoning. Succeeding the 2022 Hopper generation (H100) and its 2024 refresh (H200), Blackwell introduces a sweeping set of technological breakthroughs aimed at dramatically accelerating AI model training and real-time inference for generative AI and large language models (LLMs).
Named after mathematician David Blackwell, this architecture defines NVIDIA’s next chapter with unparalleled performance, efficiency, and scale for enterprise AI workloads. In this article, we’ll dive deep into Blackwell’s key features and technologies, how it improves AI training and inference efficiency compared to the H100, the industries and applications poised to benefit, and the role of the new Grace-Blackwell “GB200” Superchip. We’ll also explore Blackwell’s support for generative AI and LLMs, innovations like NVLink-5, FP4 precision, and RAS, along with technical specs and performance metrics – including comparisons to NVIDIA’s Hopper H100/H200 GPUs and competitor accelerators like AMD’s Instinct MI300 and Google TPUs.

Key innovations of the NVIDIA Blackwell architecture

NVIDIA Blackwell introduces six breakthrough technologies that together enable training and real-time LLM inference for models scaling up to 10 trillion parameters. The core innovations include:
  1. World’s Most Powerful Multi-Die GPU – A multi-die design packing 208 billion transistors in total. Blackwell GPUs use two reticle-limit dies (≈104B transistors each) made on a custom TSMC 4NP process, linked by a 10 TB/s chip-to-chip interface to function as one unified GPU. This effectively breaks the single-die size barrier to deliver massive on-chip resources.
  2. Second-Generation Transformer Engine (FP4 Precision) – New micro-tensor scaling techniques and advanced dynamic range management (integrated into TensorRT-LLM and NeMo frameworks) allow 4-bit floating point (FP4) AI operations. This doubles the effective model size and compute throughput for inference, accelerating large language model training/inference while maintaining accuracy. The Blackwell Ultra variant pushes this further, with 2× faster attention-layer processing and 1.5× more AI FLOPS than the base Blackwell GPUs.
  3. Fifth-Generation NVLink and NVSwitch – An upgraded NVLink® interconnect delivers 1.8 TB/s of bidirectional bandwidth per GPU, enabling seamless high-speed communication across up to 576 GPUs in a cluster. The NVLink Switch chip achieves an aggregate 130 TB/s GPU bandwidth within a single 72-GPU pod (NVL72) and supports advanced in-network computing (SHARP™) for efficient scaling.
  4. Reliability, Availability, Serviceability (RAS) Engine – A dedicated RAS engine infuses intelligent resiliency. It uses AI-powered predictive analytics to monitor thousands of hardware/software data points, diagnosing faults early and forecasting issues to maximize uptime. This preventative maintenance capability allows massive-scale AI deployments to run uninterrupted for weeks or months.
  5. Secure AI and Confidential Computing – Blackwell is the first GPU with TEE-I/O support, enabling end-to-end encryption of data in use. It introduces advanced confidential computing features that secure models and data with virtually no performance penalty. This hardware-based trusted execution environment (TEE) is critical for privacy-sensitive industries (e.g. healthcare, finance) that require protection of AI intellectual property and sensitive data during processing.
  6. Decompression Engine – A new on-die decompression engine accelerates data loading and analytics. It supports popular compression codecs (LZ4, Snappy, Deflate), offloading work that traditionally burdened CPUs. Paired with the high-speed Grace CPU memory link (up to 900 GB/s), this engine speeds up end-to-end data pipelines – databases, Spark analytics, ETL – improving performance and reducing costs for data science workloads.
These innovations collectively make Blackwell a game-changer for large-scale AI. Let’s explore each in more detail and examine how Blackwell GPUs (B100/B200) differ from Hopper GPUs (H100/H200) in architecture and performance.

A multi-die “Superchip” GPU with massive memory and compute

One of Blackwell’s hallmark features is its multi-die GPU design. NVIDIA has effectively doubled the GPU silicon by combining two large dies into one package that acts as a single CUDA GPU with full coherence. Each Blackwell die contains ~104 billion transistors, about 30% more than the 80B-transistor GH100 (Hopper) die. Together, the dual-die B200 GPU packs a staggering 208B transistors, fabricated on TSMC’s custom 4NP (4nm) node. Notably, NVIDIA did not move to a smaller process node this generation (enhancing 4N into 4NP rather than jumping to 3nm), so the huge performance gains rely on architectural improvements and scaled-up die area rather than a process shrink.

How do two dies act as one?

NVIDIA developed a high-bandwidth internal link (dubbed NV-HBI – NV High-Bandwidth Interface) providing 10 TB/s connectivity between the dies. This enormous interposer bandwidth (approximately 5 TB/s each direction) ensures that the two-die GPU behaves as a unified device with no significant performance compromises. It’s a level of chiplet/bridge bandwidth far exceeding any previous multi-chip implementations (for comparison, even advanced multi-chip bridges like Apple’s UltraFusion were on the order of ~2.5 TB/s). While NVIDIA hasn’t detailed the exact packaging method (whether using CoWoS passive interposer or a base-die approach akin to AMD), the end result is a single Blackwell B200 accelerator that transparently doubles the compute resources and memory of a single-die GPU.
Memory capacity and bandwidth see a major leap. Each Blackwell die interfaces with 4 stacks of next-gen HBM3E memory (for 8 stacks total on B200), widening the memory bus to 8192 bits. With HBM3E, NVIDIA offers up to 192 GB of VRAM on a B200 (8× 24 GB stacks) – more than double the 80–96 GB on H100 GPUs and roughly a third more than the 141 GB H200. This caters directly to growing model sizes, as memory capacity was a constraining factor for giant models. Equally important, Blackwell’s memory bandwidth soars to an aggregate 8 TB/s (1 TB/s per stack). That is roughly 2.4× the H100’s bandwidth (3.35 TB/s) and about two-thirds more than the interim H200 (~4.8 TB/s with HBM3E). In practical terms, Blackwell GPUs can feed data to the cores far more quickly, reducing bottlenecks especially in memory-bound ML workloads.
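To put those bandwidth figures in perspective, here is a rough back-of-envelope sketch in Python. It estimates how long it takes simply to stream a model’s weights through each GPU’s memory system once, which roughly bounds per-token latency for memory-bound LLM decoding. The numbers are the published peak specs quoted above; real workloads achieve only a fraction of peak, so treat this as an illustration of why bandwidth matters, not a benchmark.

```python
# Back-of-envelope: time to stream a model's weights once through HBM.
# Memory-bound decoding reads (roughly) every weight once per generated token,
# so weight_bytes / bandwidth approximates a lower bound on per-token latency.

GPUS = {
    "H100 (HBM3)":  3.35e12,   # bytes/s, published peak memory bandwidth
    "H200 (HBM3e)": 4.8e12,
    "B200 (HBM3e)": 8.0e12,
}

def per_token_floor_ms(params: float, bytes_per_param: float, bw: float) -> float:
    """Lower-bound milliseconds per token if decoding is purely bandwidth-bound."""
    return params * bytes_per_param / bw * 1e3

# Example: a 70B-parameter dense model, weights resident on a single GPU.
# (FP4 hardware support exists only on Blackwell; it is shown for every row
# purely to isolate the bandwidth effect.)
for name, bw in GPUS.items():
    fp8 = per_token_floor_ms(70e9, 1.0, bw)   # FP8: 1 byte per weight
    fp4 = per_token_floor_ms(70e9, 0.5, bw)   # FP4: 0.5 bytes per weight
    print(f"{name:14s}  FP8 >= {fp8:5.1f} ms/token   FP4 >= {fp4:5.1f} ms/token")
```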
This beefed-up memory subsystem, combined with the multi-die compute, positions Blackwell to handle models of unprecedented scale on a single board. For instance, each B200 GPU module (in SXM form) draws up to 1000W TDP (vs 700W on H100) to power its doubled resources. Liquid cooling is expected for dense deployments, though NVIDIA indicates air-cooling is possible at that 1000W level in certain configurations. The higher power envelope underscores that NVIDIA is pushing the silicon to its limits to maximize performance. As we’ll see next, much of that power drives new tensor core capabilities tailored for AI transformers and generative models.



Transformer Engine 2.0 – FP8 and new FP4 precision for LLMs

Training and serving today’s generative AI models (GPT, PaLM, etc.) demand massive compute – but not all of it needs full 16- or 32-bit precision. NVIDIA’s answer, introduced with Hopper, was the Transformer Engine for automated mixed precision, using FP8 matrix math to accelerate transformer layers. Blackwell takes this further with a second-generation Transformer Engine that introduces even lower 4-bit precision support and smarter scaling for minimal accuracy loss.
At the heart of this engine are NVIDIA Blackwell Tensor Cores with added support for new 8-bit and 4-bit floating point formats. Blackwell’s tensor cores can operate on FP16/BF16, FP8, and now FP4 data, dynamically adjusting precision for different phases of training or inference. To maintain accuracy at these ultra-low precisions, Blackwell implements fine-grained “micro‑tensor scaling” techniques. Essentially, it can apply per-tensor (or even sub-tensor) scaling factors and use “microscaling” formats defined by the AI community to preserve significant bits of information. NVIDIA pairs this with advanced range management algorithms in their software (within TensorRT-LLM and the NeMo framework) so that developers can leverage FP4 acceleration with minimal manual tuning.
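The exact FP4 format and scaling granularity inside Blackwell’s tensor cores are not public in detail, but the publicly described idea of micro-tensor (block-wise) scaling is easy to sketch. The toy NumPy example below simulates quantizing a tensor onto an FP4-style E2M1 value grid with one shared scale per 32-element block – the block size used by the open MX “microscaling” formats. It is an illustration of the concept only, not NVIDIA’s implementation.

```python
import numpy as np

# FP4 (E2M1) can represent these magnitudes; the sign bit adds the negatives.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_microscaled(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Simulate micro-tensor scaling: one scale per `block` values, then snap
    each scaled value to the nearest FP4-representable number.
    (Real MX formats constrain the shared scale to a power of two; this toy
    version uses an unconstrained float scale for simplicity.)"""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]  # block max -> 6.0
    scales[scales == 0] = 1.0                                     # avoid divide-by-zero
    scaled = x / scales
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]                           # nearest grid point
    return q * scales                                             # dequantized view

rng = np.random.default_rng(0)
w = rng.normal(size=(4096,)).astype(np.float32)
w_q = quantize_fp4_microscaled(w).reshape(-1)
print("mean abs quantization error:", np.abs(w - w_q).mean())
```

The key point is that the scale factors live at a very fine granularity (per small block rather than per whole tensor), which is what lets 4-bit values track the local dynamic range of weights and activations.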
The payoff is substantial: Blackwell’s 4-bit mode doubles the effective throughput and model size that a GPU can handle at a given memory footprint. In fact, NVIDIA states that enabling FP4 can double the performance and size of next-gen models that memory can support, while maintaining high accuracy. This means a model that might have required 2× more GPUs or memory to run in FP8 could potentially be served on half the hardware using FP4 – a huge win for deploying trillion-parameter models economically.
Even without FP4, the raw tensor core muscle of Blackwell gets a big boost. Compared to H100, the B200 GPU more than doubles per-GPU throughput in key AI precisions. For example, one B200 can reach 4.5 petaFLOPS of tensor processing in FP16/BF16 (vs ~2 PFLOPS on H100) and up to 9 PFLOPS in FP8 (vs ~4 PFLOPS on H100). And with FP4 it peaks around 18 PFLOPS of tensor compute – a capability Hopper lacked entirely. NVIDIA deliberately sacrificed some 64-bit floating point (FP64) throughput (which matters mainly for HPC scientific sims) in order to allocate more silicon to AI math where 32-bit and lower precision dominate. As an NVIDIA blog notes, training AI rarely needs full 64-bit precision; by reducing FP64 tensor performance relative to H100, Blackwell “squeezes more juice” into the data types that deep learning uses most.
Concretely, Blackwell’s Transformer Engine 2.0 can automatically mix FP4, FP8, and higher precisions in one workflow. During inference, less sensitive parts of the model (e.g. intermediate activations) can use 4-bit, while more sensitive portions (e.g. final layers) use 8-bit or 16-bit, all handled by hardware and software libraries. For training, FP8 and BF16 combinations speed up matrix ops while keeping weight updates stable. This flexible precision support supercharges both throughput and memory utilization. NVIDIA reports that for extremely large models like a 1.8 trillion parameter GPT-MoE, a Blackwell GPU can achieve 15× higher inference throughput than an H100 GPU when generating output tokens in real time.
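Developers generally don’t manage these formats by hand. On Hopper today, the open-source Transformer Engine library wraps mixed low-precision execution behind a “recipe” plus an autocast context, and NVIDIA positions the same TensorRT-LLM / NeMo / Transformer Engine stack as the home of FP4 support on Blackwell. The sketch below shows the documented FP8 pattern; the idea that an analogous lower-precision recipe plugs in on Blackwell is an assumption here, not a confirmed API.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Documented Hopper-era pattern: a scaling "recipe" plus an autocast context.
# The library selects FP8 kernels and manages the per-tensor scaling factors.
fp8_recipe = recipe.DelayedScaling(
    fp8_format=recipe.Format.HYBRID,   # E4M3 for forward pass, E5M2 for gradients
    amax_history_len=16,
    amax_compute_algo="max",
)

layer = te.Linear(4096, 4096, bias=True).cuda()                     # requires Hopper+ GPU
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)   # the matmul runs through low-precision tensor cores

# On Blackwell, NVIDIA points to this same stack as where FP4/micro-scaling
# recipes are exposed; the exact FP4 recipe classes are not shown here.
print(y.shape)
```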


This massive inference speedup directly translates to better user experiences (snappier responses from LLM-based services) and much lower serving cost per query. As CoreWeave notes, Blackwell’s FP4-enabled Transformer Engine is a “massive leap forward” for accelerating inference of the largest models. And it’s not just inference – training big models also sees up to 3–4× speedups in clusters, which we’ll discuss further below. In short, Blackwell was purpose-built to enhance generative AI and LLM workloads, making training more efficient and high-throughput inference at scale finally feasible.
Achieving exascale AI performance isn’t just about one supercharged GPU – it requires many GPUs working in unison. That is why NVIDIA Blackwell places heavy emphasis on multi-GPU scaling via its fifth-generation NVLink and the new NVLink Switch fabric. As models grow into the multi-trillion parameter range, fast interconnects between GPUs (and between nodes) become as critical as the compute itself for parallel training and serving.
NVLink 5 doubles the per-GPU communication bandwidth over the previous generation. Each Blackwell GPU now has up to 1.8 TB/s of bidirectional NVLink bandwidth to connect with others. In practical terms, within an 8-GPU server (like an HGX baseboard), every GPU can push 900 GB/s in each direction into the NVLink fabric, ensuring model shards and activation gradients can sync with minimal delay. Compared to an H100’s 900 GB/s of total NVLink bandwidth, Blackwell offers twice the in-node GPU interconnect throughput.
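To see why doubling NVLink bandwidth matters for training, consider the gradient all-reduce that every data-parallel step performs. A ring all-reduce moves roughly 2·(N−1)/N times the gradient payload through each GPU’s links, so the communication floor of a step scales directly with per-GPU interconnect bandwidth. The Python sketch below uses the per-direction figures implied by the published specs (900 GB/s on Blackwell, 450 GB/s on Hopper) and ignores latency and compute/communication overlap, so it is a rough lower bound rather than a benchmark.

```python
def ring_allreduce_seconds(payload_bytes: float, n_gpus: int, per_dir_bw: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce.
    Each GPU sends/receives ~2*(N-1)/N of the payload over its links."""
    traffic = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    return traffic / per_dir_bw

# Example: all-reducing BF16 gradients of a 70B-parameter model across 8 GPUs.
grad_bytes = 70e9 * 2   # 2 bytes per BF16 gradient
for name, bw in [("Hopper NVLink 4 (450 GB/s per direction)", 450e9),
                 ("Blackwell NVLink 5 (900 GB/s per direction)", 900e9)]:
    t = ring_allreduce_seconds(grad_bytes, n_gpus=8, per_dir_bw=bw)
    print(f"{name}: ~{t * 1e3:.0f} ms per full gradient all-reduce")
```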
More impressively, NVIDIA has introduced the NVLink Switch System (NVSwitch) for Blackwell, enabling large rack-scale GPU clusters with full bandwidth. A single NVLink Switch chip can connect 8 GPUs with all-to-all bandwidth. By integrating multiple switch chips in a modular tray, NVIDIA composes a 72-GPU NVLink domain called NVL72, where all 72 GPUs communicate at full NVLink speed. Within this 72-GPU pool, NVLink Switch provides a staggering 130 TB/s of bisection bandwidth. This is a 4× improvement in bandwidth efficiency over prior generations, helped by in-network computing features like NVLink-SHARP which offload reduction operations to the switch hardware (supported at FP8 precision).
In simpler terms, Blackwell can scale up to 72 GPUs as if they were one giant GPU – each GPU can reach any other with high-speed links, critical for synchronizing large model training. NVIDIA notes that a 72-GPU Blackwell cluster (one NVL72 pod) can operate as a single unit with 1.4 exaFLOPs of AI compute and 30 TB of pooled memory. This is the foundation of NVIDIA’s DGX SuperPOD design. In fact, Blackwell’s NVLink allows multi-node connectivity too: multiple 72-GPU pods can be joined via NVLink bridges or future NVLink Switch extensions to reach the platform limit of 576 GPUs in one seamlessly connected system. Unlocking trillions of parameters truly demands this kind of “every GPU to every other GPU” fabric.
Beyond NVLink, Blackwell systems use NVIDIA Quantum InfiniBand and Spectrum-X Ethernet for scaling to the data center level. For instance, the Blackwell platform supports up to 800 Gb/s of networking per node with the latest 800G InfiniBand/Ethernet NICs. In the cloud context, CoreWeave’s Blackwell instances attach 400 Gb/s InfiniBand per GPU, allowing them to cluster up to 110,000 GPUs over InfiniBand with efficient RDMA and in-network reductions. This ensures that even beyond a single NVLink domain, multi-rack or multi-data-center deployments can train enormous models with good scaling efficiency.
All these networking advancements boil down to one thing: reducing communication bottlenecks. As NVIDIA describes, swift, seamless GPU-to-GPU communication is the key to unleashing performance on trillion-parameter models. Blackwell tackles this with NVLink-5 inside the server and fast cluster interconnects beyond, so that whether you have 8 GPUs or 500+ working together, the data flows freely. The result is near-linear scalability on complex AI workloads. In fact, NVIDIA claims that a full GB200 NVL72 rack (72 Blackwell GPUs + Grace CPUs) can train a 1.8T-parameter model 4× faster than an equivalent H100 cluster. For inference, the gain is even larger – as noted, up to 30× faster LLM inference at scale with dramatically lower latency. This makes Blackwell the platform of choice for AI supercomputers that need extreme throughput, such as national labs and hyperscalers building next-gen chatbots, search AI, and recommendation systems.

The Grace-Blackwell superchip (GB200) and HGX platforms

While the Blackwell architecture GPUs (B100, B200) can be deployed in traditional x86 servers (via PCIe or HGX boards), NVIDIA is also pushing a more integrated solution: the Grace Hopper style “superchip” combining an Arm-based Grace CPU with Blackwell GPUs. In this generation, that comes as the NVIDIA GB200 Grace Blackwell Superchip.
Each GB200 Superchip packages together 1 Grace CPU (72-core Arm Neoverse V2) and 2 Blackwell B200 GPUs in a single module, connected via NVLink-C2C at 900 GB/s. This forms a tight CPU-GPU coupling with a unified memory address space. The Grace CPU brings up to 480 GB of LPDDR5X memory to the table, which is cache-coherent with the GPUs’ HBM memory. In effect, a GB200 superchip has a large pooled memory (close to 900 GB, combining CPU LPDDR5X and GPU HBM3e) accessible by both CPU and GPU, and extremely fast CPU-GPU communication (far beyond PCIe). This design is ideal for giant models that demand both high compute (GPUs) and large memory, or that are bound by CPU pre/post-processing.
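From a programmer’s perspective, this coherent CPU-GPU pool is exposed through the ordinary CUDA unified (managed) memory model rather than a GB200-specific API. The sketch below uses generic CuPy managed-memory allocation to illustrate the idea of one allocation touched by both sides; on Grace-Blackwell, NVLink-C2C is what makes this style of sharing fast, but nothing in the code is specific to that hardware.

```python
import cupy as cp

# Generic CUDA managed (unified) memory: one allocation that both the host and
# the GPU can touch, with the driver/hardware handling placement. The API here
# is plain CUDA/CuPy, not a GB200-specific interface.
cp.cuda.set_allocator(cp.cuda.malloc_managed)

x = cp.zeros(1_000_000, dtype=cp.float32)   # lives in managed memory
x += 1.0                                    # GPU kernels operate on it directly

host_view = cp.asnumpy(x)                   # copy into a host NumPy array
print(host_view[:3], float(x.sum()))
```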
A single GB200 superchip is an immensely powerful unit on its own – delivering roughly 40 petaFLOPS of low-precision (FP4) AI performance thanks to the two B200 GPUs. But the real magic is when they are used as building blocks for larger systems. NVIDIA’s reference design GB200 NVL72 is a rack-scale solution comprising 36 GB200 Superchips (which equates to 72 Blackwell GPUs + 36 Grace CPUs) all interconnected via NVLink Switch at the rack level. In addition, these racks integrate BlueField-3 DPUs for high-performance networking, storage offload, and security, enabling multi-rack clusters with efficient RDMA and virtualization.
The GB200 NVL72 rack essentially behaves like a single massive 72-GPU system with 30 TB of unified HBM + CPU memory and ~1.4 EFLOPs of AI compute. NVIDIA positions this as the cornerstone of its DGX SuperPOD (the turnkey AI data center solution). According to NVIDIA, one such rack delivers up to 30× the LLM inference performance of a same-sized H100 cluster, while cutting energy use by 25×. That huge efficiency gain (30× more performance at 1/25 the energy per inference) stems from Blackwell’s FP4 capability and architectural improvements that make it far more cost-effective for giant model inference. In other words, fewer Blackwell GPUs can do what would have required dozens of Hopper-based systems, saving on power and total cost of ownership.
Not every deployment will use the Grace CPU, of course. For customers who prefer x86 servers, NVIDIA offers the HGX B200 platform. HGX B200 is an 8-GPU baseboard (similar to the HGX H100 8-GPU) that links eight B200 GPUs via NVLink in a server tray, typically to dual x86 CPUs. This supports standard PCIe Gen5/6 and up to 400 Gb/s networking per GPU for integration into existing datacenters. The HGX B200 still benefits from Blackwell’s features (FP4, NVLink5 between the 8 GPUs, etc.) but without the Grace CPU’s coherent memory. CoreWeave describes HGX B200 as designed for the most demanding AI and data processing on x86, delivering up to 15× faster real-time inference on massive models compared to Hopper HGX systems.
Whether via GB200 (Grace+Blackwell) or HGX B200 (x86+Blackwell), enterprises can choose how to deploy the new GPUs. Interestingly, NVIDIA has also introduced smaller form factors like DGX Station (a deskside AI supercomputer with Grace + Blackwell GPUs for researchers) and even DGX Spark – a compact developer workstation powered by a single GB10 Grace-Blackwell Superchip for models up to 200B parameters. This indicates the flexibility of the Blackwell platform: it scales from a personal AI developer machine all the way to multi-rack superpods.

Blackwell B200 vs Hopper H100/H200: Specs and performance

With the architecture covered, let’s compare the technical specs and performance metrics of Blackwell against its Hopper predecessors. The table below summarizes key specs of NVIDIA’s flagship data center GPUs: the Blackwell B200 Tensor Core GPU, the interim Hopper H200, and the original Hopper H100 (all in SXM form factors):
| GPU model | NVIDIA B200 (Blackwell) | NVIDIA H200 (Hopper refresh) | NVIDIA H100 (Hopper) |
|---|---|---|---|
| Architecture | Blackwell (2025) | Hopper (2024) | Hopper (2022) |
| Process node | TSMC 4NP (enhanced 4N) | TSMC 4N (optimized) | TSMC 4N |
| Transistors | 208 billion (2× 104B) | ~80 billion | 80 billion |
| SMs / CUDA cores | Not publicly disclosed | 16,896 CUDA cores | 16,896 CUDA cores |
| Tensor Cores | Count undisclosed; FP4/FP8 support | 528 (FP8 support) | 528 (FP16/FP8) |
| Peak FP64 (TFLOPS) | 30–40 (Tensor: 40) | 34 (Tensor: 67) | 34 (Tensor: 67) |
| Peak FP32 (TFLOPS) | 60–80 (TF32 Tensor: ~2.2 PFLOPS) | 67 (TF32 Tensor: ~0.99 PFLOPS) | 67 (TF32 Tensor: ~0.99 PFLOPS) |
| Peak BF16/FP16 | 4.5 PFLOPS | ~2.0 PFLOPS | ~2.0 PFLOPS |
| Peak FP8 | 9 PFLOPS | ~4 PFLOPS | ~4 PFLOPS |
| Peak FP4 | 18 PFLOPS | N/A | N/A |
| GPU memory (HBM) | 192 GB HBM3e (8 stacks) | 141 GB HBM3e (6 stacks) | 80 GB HBM3 (5–6 stacks) |
| Memory bandwidth | Up to 8 TB/s | ~4.8 TB/s | ~3.35 TB/s |
| NVLink bandwidth | 1.8 TB/s per GPU | 900 GB/s | 900 GB/s |
| Max MIG instances | 7 (≈27 GB each) | 7 (≈16.5 GB each) | 7 (≈10 GB each) |
| Form factor | SXM (1000 W module) | SXM (700 W / 600 W options) | SXM (700 W / 500 W options) |
| Notable features | Dual-die design, FP4, NVLink 5 | HBM3e upgrade, GH200 variant | First FP8, Transformer Engine |

As shown, the Blackwell B200 Tensor Core GPU outclasses the Hopper H100/H200 in nearly every dimension relevant to AI:
  • Compute: In raw tensor operations, B200 offers about 2.3× the throughput of H100 at FP16/BF16 and FP8, plus the new FP4 mode for another 2× boost. This yields up to 15× inference performance for giant models and ~3–4× training speedups in many cases. (Notably, FP64 tensor throughput is lower on B200 than on H100, reflecting the AI-first design tradeoff.)
  • Memory: B200 carries 192 GB of HBM3e memory – more than double the 80 GB on H100. This huge capacity means a single Blackwell GPU can hold much larger models or batch sizes without resorting to multi-GPU sharding. Moreover, at 8 TB/s, its memory bandwidth is roughly 2.4× the H100’s. This alleviates bottlenecks on memory-intensive workloads (e.g. large sparse models, recommender embeddings).
  • Interconnect: With NVLink-5, Blackwell doubles the GPU interconnect speed to 1.8 TB/s, improving multi-GPU training efficiency (less communication overhead). In multi-node setups, Blackwell systems leverage NVLink Switch and 800G networking to scale further, whereas H100 clusters relied more on external InfiniBand (slower for internode GPU comms).
In summary, Blackwell’s spec advantages translate to significantly higher throughput per accelerator and better scaling. Real-world benchmarks underscore these gains. For instance, in MLPerf-style tests, an HGX B200 system can train large language models roughly 2–3× faster than HGX H100, and perform inference on models with trillions of parameters that would previously be infeasible to serve in real-time. Blackwell essentially enables a new class of use cases – like interactive 1+ trillion parameter chatbots – by combining raw horsepower with smarter precision and memory usage.
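A quick capacity calculation makes the memory advantage concrete. Ignoring activations, KV cache, and optimizer state, just holding a model’s weights dictates a minimum GPU count, and precision matters as much as per-GPU capacity. The sketch below uses the capacities quoted above and NVIDIA’s 1.8T-parameter GPT-MoE example; it is a floor, not a deployment recommendation.

```python
import math

def min_gpus_for_weights(params: float, bytes_per_param: float, gpu_mem_gb: float) -> int:
    """Minimum GPUs needed just to hold the weights (no KV cache, activations,
    or optimizer state)."""
    return math.ceil(params * bytes_per_param / (gpu_mem_gb * 1e9))

MODEL = 1.8e12   # 1.8T parameters, as in NVIDIA's GPT-MoE example

print("H100 80 GB,  FP8 weights:", min_gpus_for_weights(MODEL, 1.0, 80))    # ~23 GPUs
print("B200 192 GB, FP8 weights:", min_gpus_for_weights(MODEL, 1.0, 192))   # ~10 GPUs
print("B200 192 GB, FP4 weights:", min_gpus_for_weights(MODEL, 0.5, 192))   # ~5 GPUs
```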
It’s worth noting that H200 (mentioned above) refers to a Hopper-based GPU that NVIDIA introduced in late 2023/2024, often associated with the GH200 Grace-Hopper superchip. The H200 provided a memory upgrade (HBM3e) and slightly higher clocks, but its compute architecture remained Hopper. Some sources informally called the Blackwell generation “H200” as a successor to H100, but NVIDIA’s official naming uses B100 and B200 for Blackwell accelerators. Regardless of naming, Blackwell represents the true next-generation leap beyond Hopper’s capabilities.

Advancing AI across industries and applications

The NVIDIA Blackwell GPUs aren’t just about raw specs – they are poised to impact numerous industries and AI applications by enabling new possibilities in scale and performance. Here are some key domains that will benefit from Blackwell:
  • Generative AI and LLMs – This is the obvious one. Blackwell was designed from the ground up to handle massive generative models. Enterprises developing large language models (e.g. GPT-style chatbots, code generators, content creation AI) will see training cycles shorten and be able to experiment with models having ten trillion+ parameters. Just as importantly, inference of these models can be done in real-time. For instance, Meta’s CEO Mark Zuckerberg highlighted that Meta is “looking forward to using NVIDIA’s Blackwell to help train our open-source Llama models and build the next generation of Meta AI”. The 15–30× inference boost means interactive AI services (virtual assistants, customer support bots, generative search engines) can serve much larger models to users with low latency, improving quality of results.
  • AI “Reasoning” and Agentic AI – Beyond traditional neural networks, there is a trend toward AI agents that perform reasoning by orchestrating multiple models and tools. These workloads involve iterative prompting, planning, and chaining of LLMs – which are token generation intensive. Blackwell’s strengths (fast generation, huge memory, fast interconnect) make it ideal here. As an NVIDIA director noted, AI reasoning involves many model inferences and “demands infrastructure with high-speed communication, memory and compute for real-time, high-quality results” – exactly what Blackwell’s NVLink superclusters provide. We can expect research labs and enterprises to use Blackwell for advanced AI systems that simulate reasoning, do autonomous research, or run complex workflows of models (e.g. an AI agent that can browse the web, code, and compose answers using multiple LLM calls).
  • Cloud AI Services – Cloud providers (AWS, Azure, Google Cloud, CoreWeave, etc.) will leverage Blackwell GPUs to offer more powerful AI instances. Andy Jassy, CEO of AWS, emphasized their collaboration with NVIDIA, stating that “the new NVIDIA Blackwell GPU will run so well on AWS,” and mentioning a joint engineering project “combining NVIDIA’s next-gen Grace Blackwell Superchips with the AWS Nitro System… for NVIDIA’s own AI research”. This indicates cloud platforms will optimize their infrastructure (networking, virtualization) to fully exploit Blackwell’s capabilities. Customers of these clouds will soon rent Blackwell instances to train large models faster or serve more queries per second for AI applications, all with potentially lower cost per query due to Blackwell’s efficiency.
  • Enterprise AI & Analytics – Many industries like finance, healthcare, manufacturing, and retail are adopting AI at scale. Blackwell’s introduction of Confidential Computing on GPU is a boon here – banks or hospitals can fine-tune and deploy models on sensitive data with hardware-level security (data encrypted in memory and along I/O). For example, financial services can use Blackwell to run fraud detection or algorithmic trading models on encrypted data, meeting compliance needs. Similarly, healthcare organizations can do privacy-preserving medical image analysis or patient data modeling. All of this happens without slowing down the models – Blackwell’s secure enclave yields “nearly identical throughput performance compared to unencrypted modes” while protecting data and AI IP. This combination of performance and security will drive AI adoption in sensitive domains that were previously hesitant.
  • Data Science and Big Data Analytics – Blackwell’s decompression engine and large CPU-GPU memory synergy (in Grace-Blackwell systems) can accelerate data analytics workflows that were bottlenecked by I/O. Consider a large data warehouse or Spark cluster: using Blackwell, compressed datasets can be decompressed on-GPU at up to 900 GB/s and processed in-memory, dramatically speeding up SQL queries or ETL jobs. NVIDIA reports Blackwell systems perform database query benchmarks 2× faster than H100 GPUs and 6× faster than CPU-only systems by leveraging the new decompression offload. Industries like retail (for real-time analytics on sales data), logistics (supply chain optimizations), and telecommunications (network analytics) could use such GPU-accelerated data pipelines to get insights faster and at lower cost (a short GPU-dataframe sketch follows this list).
  • Scientific Computing and HPC – Although Blackwell tilts toward AI, it’s still a top-tier compute engine for HPC workloads, especially those that can use mixed precision or AI acceleration. Applications in climate modeling, astrophysics, or genomics that use AI surrogates or lower precision can run extremely fast. For pure FP64 HPC codes, Blackwell’s FP64 throughput is on par with H100 (and still many times higher than previous-gen A100), so traditional supercomputing centers will welcome the upgrade. The difference is Blackwell HPC systems will likely integrate Grace CPUs, offering a hybrid model of CPU+GPU computing with enormous memory – beneficial for memory-bound simulations. Furthermore, the improved RAS engine means HPC centers can trust Blackwell for long-running simulations with less downtime. NVIDIA’s emphasis on proactive failure prediction and quick fault isolation at the chip level aligns well with the needs of HPC installations that demand high availability over months of continuous operation.
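For a flavor of what GPU-accelerated data loading looks like from the user’s side (as promised in the data-analytics bullet above), the sketch below uses the open-source RAPIDS cuDF library to read a Snappy-compressed Parquet file and run a group-by entirely on the GPU. The file path and column names are illustrative placeholders; cuDF runs on current GPUs today, with Blackwell’s decompression engine aimed at accelerating exactly this kind of pipeline underneath.

```python
import cudf

# Hypothetical sales dataset; Parquet with Snappy compression is decoded on-GPU.
df = cudf.read_parquet("sales_2024.parquet")          # illustrative path

# Typical analytics step: revenue by region, computed without touching the CPU.
summary = (
    df.assign(revenue=df["units"] * df["unit_price"])
      .groupby("region")["revenue"]
      .sum()
      .sort_values(ascending=False)
)
print(summary.head())
```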
In essence, Blackwell GPUs will enable AI models and systems that were previously out of reach due to size or latency constraints. As Demis Hassabis (CEO of Google DeepMind) observed, “Blackwell’s breakthrough capabilities will provide the critical compute needed to help the world’s brightest minds chart new scientific discoveries”. From powering open-source AI research (Meta, OpenAI, DeepMind) to delivering enterprise-grade AI services (AWS, Azure, Oracle), Blackwell is set to become the workhorse of AI’s next era.

Reliability and security for enterprise-grade AI

As AI moves from experimental to mission-critical, aspects like reliability and security become just as important as raw performance. NVIDIA Blackwell makes significant strides in these areas with its RAS engine and Secure AI features, giving enterprise IT leaders confidence to deploy large AI clusters.
The dedicated RAS (Reliability, Availability, Serviceability) engine in Blackwell continuously monitors the health of the GPU. It tracks telemetry from thousands of signals on the chip and system, using AI algorithms to detect anomalies or drift that could indicate an upcoming failure. For example, if a certain memory module starts showing correctable errors at an increasing rate, the RAS engine can flag it for replacement before it causes an uncorrectable fault.
This predictive maintenance approach is akin to having a “digital twin” of the GPU’s health, forecasting reliability issues rather than just reacting. NVIDIA mentions that the Blackwell RAS can localize issues and guide effective remediation, thereby minimizing downtime in large clusters. For enterprises running AI services 24/7 (think of a global chatbot service or an AI-powered SaaS product), this means outages due to GPU faults can be vastly reduced. The GPUs themselves can report rich diagnostic info to management software, allowing planned swaps or resets with minimal impact. Overall, this RAS innovation builds “intelligent resilience” into Blackwell-powered infrastructure, saving time, energy and cost by avoiding unexpected halts.
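The RAS engine itself lives in hardware and firmware, but the kind of signal it watches is easy to illustrate from the host side. The sketch below polls correctable ECC error counts and temperature through NVML (via the pynvml bindings) – a greatly simplified, software-only analogue of the predictive monitoring described above, not the RAS engine’s actual interface.

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def snapshot():
    """Read a few health signals of the kind a RAS-style pipeline would track."""
    corrected = pynvml.nvmlDeviceGetTotalEccErrors(
        handle,
        pynvml.NVML_MEMORY_ERROR_TYPE_CORRECTED,
        pynvml.NVML_VOLATILE_ECC,
    )
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    return corrected, temp

baseline, _ = snapshot()
for _ in range(3):
    time.sleep(5)
    corrected, temp = snapshot()
    # A rising correctable-error rate is a classic early-warning signal.
    if corrected - baseline > 0:
        print(f"warning: {corrected - baseline} new correctable ECC errors, temp {temp} C")
    baseline = corrected

pynvml.nvmlShutdown()
```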
On the security front, Blackwell is a trailblazer as the industry’s first GPU with TEE-I/O capability. This means it can function as a Trusted Execution Environment, not only securing data while it’s stored or in transit, but also while it’s being processed in the GPU’s memory. Concretely, Blackwell supports native encryption of the PCIe/NVLink interfaces and memory, so data moving to/from the GPU or residing in HBM can be encrypted using keys that even system software can’t access.
Sensitive models (e.g. proprietary LLM weights) can be kept confidential even if the hardware is at a cloud provider or in a multi-tenant environment. And sensitive input data (like patient records or financial transactions) can be processed by the model without exposing the raw values to the system. NVIDIA claims Blackwell’s confidential computing features impose virtually no performance hit, delivering “nearly identical throughput” as normal mode. This is crucial – in the past, encryption overhead or inability to use encryption on GPU memory meant companies had to choose between security and speed. Blackwell offers both.
These RAS and security enhancements target the needs of enterprise IT. For example, a bank adopting generative AI can train on encrypted datasets (compliant with regulations) and deploy the model in a secure enclave, all the while trusting that the underlying GPU cluster is proactively self-monitoring for faults. As Satya Nadella (Microsoft CEO) noted, bringing Grace-Blackwell systems into their datacenters aligns with the goal to make AI reliable and “real for organizations everywhere”. Likewise, industries like telecommunications or defense, which require uptime and security, will appreciate that Blackwell was built with these concerns in mind, not as afterthoughts but as core features.

NVIDIA Blackwell vs competitors: MI300, TPUs, and custom silicon

Given Blackwell’s impressive specs, it’s worth comparing it briefly to the competition in high-end AI accelerators. The landscape includes other GPU makers (like AMD), specialized AI chips (Google TPUs, AWS Trainium), and custom silicon from hyperscalers.
  • AMD Instinct MI300 Series – AMD’s answer to Hopper/Blackwell is the MI300 family, including the MI300X which is an AI-optimized GPU. Like Blackwell, MI300X uses a multi-die approach and packs huge memory (AMD announced 192 GB HBM3 for MI300X, matching Blackwell’s 192 GB) to target large models. MI300 also comes in an APU variant (MI300A) that combines an EPYC CPU with GPU dies in one package, conceptually similar to NVIDIA’s Grace superchip approach. Where Blackwell likely leads is in raw compute density and precision flexibility – AMD has mentioned FP8 support, but FP4 is unique to NVIDIA currently. Blackwell’s 208B transistors and 2-die full GPU design push the envelope, whereas MI300X’s design (MCM with many chiplets) might not act as a single unified GPU in the same way. In terms of performance, MI300X is expected to significantly improve on AMD’s MI250, but NVIDIA’s aggressive claims (like 30× H100 inference) set a high bar. One analysis noted that NVIDIA kept Blackwell pricing more modest than the Hopper jump, due to competition from “AMD’s MI300X and Intel’s Gaudi” and even hyperscalers’ own chips, implying NVIDIA wants to undercut any TCO argument from rivals. In practice, enterprises will consider MI300 if they seek alternatives, but NVIDIA’s software ecosystem (CUDA, AI libraries) and the sheer maturity of its stack remain big advantages.
  • Google TPUs (v4/v5) – Google’s TPUs have powered many large-scale AI projects internally (e.g. PaLM, Bard). TPUv4, which is roughly contemporary with H100, offered strong BF16/INT8 performance but didn’t support high precision or as much memory per device. Google likely has TPUv5 running in 2023–2024 with higher performance, but details are limited publicly. It’s clear Google’s strategy is custom silicon for its own cloud and research; however, for enterprises at large, TPUs are only accessible via Google Cloud (and require using TensorFlow or JAX). NVIDIA Blackwell will be more broadly available and programmable with the common frameworks, so in that sense it “competes” with TPUs by continuing the trend that many customers prefer the flexibility of GPUs. That said, Google’s latest TPUs might attempt low-precision modes too. Blackwell’s introduction of 4-bit could spur others to follow suit. It will be interesting to see if TPU systems can match a Blackwell pod’s performance on LLMs; Google will certainly optimize its software to try.
  • AWS Trainium / Inferentia, Intel Gaudi, etc. – A number of players have AI chips aimed at either training or inference. AWS’s Trainium (training) and Inferentia (inference) chips offer cost-effective performance for specific models on AWS Cloud. Intel’s Gaudi 2 (and the upcoming Gaudi 3) are GPU alternatives that have shown decent performance on ResNet and some transformers at lower price points. However, none of these have the sheer scale or memory of Blackwell – they often rely on smaller memory (tens of GB) and haven’t demonstrated operation at the ultra-high-end model sizes. NVIDIA’s move with Blackwell essentially forces any competitor to also consider multi-die, high-memory designs if they want to stay relevant for frontier AI models. The abundance of H100 deployments also means Blackwell will enter a market where NVIDIA already has software dominance; competitors must match not only the hardware but also the integration and developer ecosystem. NVIDIA’s strategy of “aggressive, perhaps even benevolent pricing” for Blackwell suggests it aims to lock in customers before challengers can gain footing. In short, while AMD’s MI300 and others will certainly vie for specific wins, Blackwell B200 appears set to hold the crown for highest general-purpose AI throughput per accelerator when it’s fully released in 2025.

Early adoption: CoreWeave’s Grace-Blackwell Cloud and beyond

Given Blackwell’s potential, it’s no surprise that leading AI infrastructure providers raced to deploy it first. CoreWeave, a specialized cloud provider focused on GPU compute, achieved a notable milestone by becoming the first cloud service provider to make NVIDIA Blackwell generally available. In February 2025, CoreWeave launched GB200 NVL72-based instances on its platform, effectively bringing the Grace-Blackwell Superchip and NVLink 72-GPU architecture to the cloud for any enterprise to use.
CoreWeave built a 72-GPU Blackwell cluster (one rack) with full NVLink and Quantum-2 InfiniBand integration, and then exposed it in their cloud with tools to make usage easy. In their announcement, CoreWeave touted this as “another first-to-market milestone” delivering the world’s most advanced AI infrastructure to help organizations train, deploy, and scale the most complex AI models up to 30× faster. They highlight that a single GB200 NVL72 instance provides 1.4 ExaFLOPS of AI compute per rack, enabling up to 4× faster training and 30× faster real-time inference for trillion-parameter models compared to previous gen (H100). Essentially, CoreWeave is offering an on-demand AI supercomputer. This is particularly attractive to startups and research groups who want cutting-edge performance without managing physical clusters.
To optimize these instances, CoreWeave integrated them with their Kubernetes-based scheduling. They expose NVLink topology information so that multi-GPU jobs get intelligently placed within the same 72-GPU rack (for maximum bandwidth). They also leverage Slurm on Kubernetes with custom topology plugins to distribute workloads across multiple racks when needed, ensuring jobs can scale to hundreds of GPUs efficiently. CoreWeave’s observability tooling gives users real-time insight into NVLink performance, GPU utilization, etc., to fine-tune their distributed training runs. All these details show how cloud providers are tailoring their software stack to fully exploit Blackwell’s capabilities – it’s not just a drop-in GPU, but part of an AI-oriented cloud service.
CoreWeave’s Chief Strategy Officer, Brian Venturo, said “this launch represents a force multiplier for businesses to drive innovation while maintaining efficiency at scale. CoreWeave’s portfolio of services – Kubernetes, Slurm, observability – is purpose-built to make it easier for customers to run and scale AI workloads on cutting-edge hardware”. This underscores that simply having the hardware isn’t enough; you need the right cloud infrastructure around it. CoreWeave is essentially packaging Blackwell in a user-friendly way so enterprises can rent massive clusters by the hour, accelerating their AI projects without a CapEx investment.
Notably, CoreWeave has been working closely with NVIDIA and marquee customers even before general availability. They were among the first to offer H200 (Hopper) GPUs last year for fast GPT-3 training, and one of the first to demo GB200 systems in late 2024. They even announced delivering a Grace-Blackwell supercomputer to IBM Research for training IBM’s next-gen Granite foundation models. IBM’s team expressed that partnering with CoreWeave for cutting-edge compute (with IBM’s own Spectrum Scale storage) will advance their hybrid cloud AI strategy. These early collaborations hint at how Blackwell will be employed: training state-of-the-art models (like IBM’s Granite LLMs, Meta’s Llama, OpenAI’s future GPT iterations, etc.), as well as supporting inference for AI products that need fast, scalable infrastructure.
Other cloud and hyperscale players are not far behind. Microsoft Azure has indicated plans to roll out GB200-based AI infrastructure across its datacenters globally, presumably for both internal use (e.g. OpenAI, Bing AI) and external Azure customers. Google Cloud, while having TPUs, might still offer NVIDIA Blackwell for customers preferring that ecosystem – Google’s DeepMind CEO explicitly recognized Blackwell’s potential for scientific breakthroughs. Amazon’s AWS, as mentioned, co-developed a custom solution (Project Ceiba) with NVIDIA involving Grace-Blackwell superchips integrated with AWS’s networking stack. Although AWS has their own silicon, this partnership shows they value NVIDIA’s top-end tech for certain cutting-edge workloads.
In sum, the first wave of Blackwell adoption is happening via cloud providers and select large-scale users who push the limits of AI. They are reporting incredible results: up to 30× faster LLM inference, 25× lower cost and energy per inference, and 4× faster training on giant models, compared to prior-gen GPUs. These are exactly the kind of improvements enterprises and researchers have been waiting for, as AI model sizes exploded in the past two years. By mid-2025, we can expect Blackwell GPUs to be powering many of the new AI services we interact with, from more fluent chatbots to advanced recommendation engines, all delivered at scale by the likes of CoreWeave, AWS, Azure, and others who quickly embraced this architecture.

Conclusion: Pioneering the next era of AI computing

The NVIDIA Blackwell GPU architecture represents a major inflection point in accelerated computing, tailored for the age of generative AI and beyond. By combining an innovative dual-die design (breaking the reticle limit) with massive memory, new low-precision math, and ultra-fast connectivity, Blackwell GPUs provide the performance needed to train and deploy AI models of unprecedented size. The architecture’s 208 billion-transistor might and clever engineering (like FP4 precision and NVLink Switch networking) translate to real-world impacts – 4× faster training throughput at the cluster level and an astounding 30× jump in inference performance for the largest LLMs, all while improving energy efficiency by an order of magnitude.
For enterprise IT leaders, Blackwell offers a path to scale AI initiatives without the prohibitive costs and lag of previous hardware. Tasks that once required entire server farms of GPUs might be accomplished with a single Blackwell rack or even a single node, thanks to the Grace-Blackwell superchip’s unified memory and compute power. The incorporation of robust RAS and security features means these GPUs are ready for production workloads where uptime and data privacy are non-negotiable. As AI models permeate products and services (from customer support bots to medical imaging assistants), having infrastructure that is both fast and trustworthy is key – and Blackwell delivers on both fronts.
AI researchers, too, gain a potent tool. Ideas that were bottlenecked by computation – be it exploring 10-trillion parameter models, running massive Mixture-of-Experts networks, or doing AI-driven scientific simulations – are now more within reach. The excitement from the AI community is evident in stakeholder reactions. Sam Altman, CEO of OpenAI, said that “Blackwell offers massive performance leaps, and will accelerate our ability to deliver leading-edge models”. And NVIDIA’s own Ian Buck (VP of Hyperscale and HPC) summed it up: “Scaling for inference and training is one of the largest challenges… NVIDIA is collaborating with CoreWeave to enable fast, efficient generative and agentic AI with the NVIDIA GB200 Grace Blackwell Superchip, to empower organizations of all sizes to push the boundaries of AI”.
Competition in the AI chip space is heating up, but NVIDIA has set a high bar with Blackwell’s blend of performance, software support, and early availability through cloud partners. AMD’s MI300, Google’s TPUs, and other contenders will drive further innovation, yet for now NVIDIA Blackwell B200 GPUs stand as the cutting-edge platform to beat in enterprise AI acceleration.
In the coming years, we will likely see Blackwell GPUs underpin everything from real-time LLM-powered assistants and creative generative apps, to advanced analytics platforms and autonomous machines that rely on fast edge AI reasoning. By delivering leaps in speed and efficiency, Blackwell isn’t just an incremental update – it’s a foundational technology enabling the next wave of AI breakthroughs and business applications. For IT leaders strategizing their AI infrastructure, Blackwell GPUs (whether via on-prem HGX systems or cloud instances) should be on the radar as a catalyst for both innovation and ROI in AI projects. In summary, NVIDIA Blackwell marks a turning point where AI at scale becomes considerably more achievable, unlocking new possibilities across industries and keeping NVIDIA at the forefront of the AI computing revolution.
