NVIDIA A100 versus H100: how do they compare?

Emmanuel Ohiri

Graphics Processing Units (GPUs) have become an important technology for building the most advanced artificial intelligence (AI) models, high-performance computing (HPC) applications, and handling complex graphics workloads.

When we talk about GPUs, NVIDIA is usually the first name that comes to mind. The company has been driving innovation in this area for years, and two of its most notable data-centre GPUs are the A100 and the H100.

The A100, with its Ampere architecture, really set the bar high for data centers. But then, NVIDIA released the H100, built on the newer Hopper architecture, promising even greater leaps in AI and HPC. So, naturally, the question is: how do these two stack up?

The maths has shifted dramatically since the H100 launched. Cloud pricing for H100s has dropped 64-75% from 2023 peak levels, with on-demand rates now ranging from $1.50-$4.00/hour depending on provider. This pricing collapse has largely eliminated the A100's former cost advantage, making raw performance the deciding factor for most workloads.

That said, the A100 isn't obsolete. For memory-bound inference workloads, batch processing, and teams optimising for cost-per-token rather than time-to-completion, the A100 still delivers strong value at roughly half the hourly rate.

In this article, we compare the NVIDIA A100 and H100 GPUs, highlighting their architectures, performance benchmarks, AI capabilities, and power efficiency. We will not discuss the components of GPUs; for that, read our breakdown of NVIDIA GPUs here: A beginner's guide to NVIDIA GPUs.

A100 vs H100 architecture

The A100 and H100 GPUs have been designed specifically for AI and HPC workloads, driven by distinct architectural philosophies. Here is how they compare against each other:

A100's Ampere architecture

The NVIDIA A100 GPU is part of the Ampere architecture line-up, which builds on the capabilities of the previous Volta architecture, adding numerous new features and a substantial performance uplift. Ampere significantly advanced GPU technology, particularly for HPC, AI, and data analytics tasks.


Key features of the A100 and its Ampere architecture include:

Third-generation tensor cores:

The A100's Tensor Cores significantly improve throughput compared to its predecessor, the V100. This enhanced performance comes from comprehensive support for a wide range of data types used in deep learning and high-performance computing (HPC). The A100 also introduces fine-grained structured sparsity, which can double effective throughput for eligible matrix operations, further accelerating computation.

Furthermore, the new TensorFloat-32 (TF32) mode lets the tensor cores accelerate FP32-style math, a common pattern in deep learning applications. The A100 also supports Bfloat16 (BF16) mixed-precision operations, which can improve performance and efficiency in many scenarios.
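To make this concrete, here is a minimal PyTorch sketch showing how TF32 and BF16 are typically enabled in practice. The flags and autocast API are standard PyTorch features rather than A100-specific switches, and the matrix sizes are arbitrary:

```python
import torch

# Allow matmuls and cuDNN ops to use TF32 on Ampere-class tensor cores
# (TF32 keeps FP32's dynamic range with a reduced mantissa).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# FP32 inputs and outputs, but the multiply itself can run as TF32.
c = a @ b

# Bfloat16 mixed precision via autocast.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    d = a @ b  # executed in BF16 on the tensor cores
```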

Advanced fabrication process:

The GA100 GPU is the foundation of the NVIDIA A100, and it was built using TSMC's 7nm N7 process node. TSMC is the company that actually fabricates the chip, and the "7nm N7" part refers to the process generation, which determines how densely the chip's tiny components, called transistors, can be packed.

"nm" stands for nanometres, an incredibly small unit of measurement. A smaller number means those components can be packed much closer together. Because they are so small, the A100 fits 54.2 billion transistors on a single die. This massive transistor count translates directly to capability: more transistors mean the chip can handle more calculations and more data than older chips.

Specifically, this increased transistor count enables more complex processing units, larger and faster caches for faster data access, and improved memory bandwidth, allowing data to flow in and out of the chip's memory much faster.

These factors combine to give the A100 its impressive performance, especially when handling demanding tasks such as artificial intelligence and high-performance computing, which require massive amounts of data and computation.

Enhanced memory and cache:

The A100 features a large L1 cache and shared memory unit, providing 1.5x the aggregate capacity per streaming multiprocessor (SM) compared to the V100. It also includes 40 GB of high-speed HBM2 memory and a 40 MB Level 2 cache, both substantially larger than those of its predecessor, ensuring high computational throughput.

Multi-instance GPU (MIG):

This feature allows the A100 to be partitioned into up to seven separate GPU instances for CUDA applications, giving multiple users dedicated GPU resources, improving GPU utilization, and delivering quality of service and isolation between different clients, such as virtual machines, containers, and processes.
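As a rough illustration of how software discovers MIG instances, here is a sketch using the nvidia-ml-py (pynvml) bindings. It assumes an administrator has already enabled MIG mode and created instances via nvidia-smi; treat the calls as a best-effort example rather than an official recipe:

```python
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMaxMigDeviceCount, nvmlDeviceGetMigDeviceHandleByIndex,
    nvmlDeviceGetUUID,
)

nvmlInit()
try:
    parent = nvmlDeviceGetHandleByIndex(0)  # the physical A100
    # Up to 7 MIG instances can exist on one A100.
    for i in range(nvmlDeviceGetMaxMigDeviceCount(parent)):
        try:
            mig = nvmlDeviceGetMigDeviceHandleByIndex(parent, i)
        except Exception:
            continue  # this slot has no instance configured
        # The MIG UUID can be passed to CUDA_VISIBLE_DEVICES to pin
        # a process or container to one isolated instance.
        print(i, nvmlDeviceGetUUID(mig))
finally:
    nvmlShutdown()
```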

Third-generation NVIDIA NVLink:

NVLink interconnect technology enhances multi-GPU scalability, performance, and reliability by significantly increasing GPU-to-GPU communication bandwidth while improving error detection and recovery.


The NVIDIA A100 is available to use on-demand and on-reserve on CUDO Compute today. We offer the most capable GPUs reliably and affordably. Contact us to learn more.

Compatibility with NVIDIA Magnum IO and Mellanox Solutions:

The A100's extensive compatibility with multi-GPU and multi-node systems significantly boosts its overall I/O performance. This enhanced capability allows the A100 to efficiently manage and process large volumes of input and output data, making it well-suited to handle a wide range of demanding workloads, including those that require high levels of data throughput and parallel processing.

PCIe gen 4 support with SR-IOV:

By supporting PCIe Gen 4, the A100 doubles the bandwidth of PCIe 3.0/3.1, which benefits connections to modern CPUs and fast network interfaces. It also supports Single Root I/O Virtualization (SR-IOV), allowing a single PCIe connection to be shared and virtualized across multiple processes or virtual machines.
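The practical effect of that extra link bandwidth is easy to probe. The sketch below uses PyTorch to time a pinned host-to-device copy; exact numbers depend on the platform, and the 1 GiB buffer size is an arbitrary choice:

```python
import time
import torch

# Rough host-to-device bandwidth probe. Pinned memory is needed
# for full-speed DMA transfers over PCIe.
n_bytes = 1 << 30  # 1 GiB
host = torch.empty(n_bytes, dtype=torch.uint8, pin_memory=True)

torch.cuda.synchronize()
start = time.perf_counter()
on_gpu = host.to("cuda", non_blocking=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"~{n_bytes / elapsed / 1e9:.1f} GB/s host-to-device")
# A PCIe Gen4 x16 link tops out around 32 GB/s per direction;
# the 64 GB/s figure usually quoted is bidirectional.
```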

Asynchronous copy and barrier features:

The A100 includes new asynchronous copy and barrier instructions that optimize data transfers and synchronization and reduce power consumption. These features improve the efficiency of data movement and overlap with computations.
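At the CUDA C++ level these are exposed as async-copy and barrier instructions; from Python, the closest everyday analogue is overlapping transfers with compute on separate streams. A hedged sketch of that pattern, with arbitrary sizes:

```python
import torch

copy_stream = torch.cuda.Stream()

batch_cpu = torch.randn(8192, 8192, pin_memory=True)
weights = torch.randn(8192, 8192, device="cuda")
current = torch.randn(8192, 8192, device="cuda")

# Stage the next batch on a side stream while the default
# stream is busy computing on the current one.
with torch.cuda.stream(copy_stream):
    next_batch = batch_cpu.to("cuda", non_blocking=True)

result = current @ weights                            # overlaps with the copy
torch.cuda.current_stream().wait_stream(copy_stream)  # barrier between streams
out = next_batch @ weights                            # safe to use after the wait
```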

Task graph acceleration:

CUDA, NVIDIA's parallel computing platform and programming model, uses task graphs within the A100 architecture to optimize work submission. By breaking a workload into smaller units and mapping them onto a graph, applications can submit entire sequences of dependent kernels at once, letting the A100 better manage dependencies, execute tasks concurrently, maximize resource utilization, and minimize idle time.
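CUDA task graphs are surfaced in PyTorch as CUDA Graphs. The sketch below captures a small forward pass and replays it with a single launch; the model and tensor sizes are illustrative only:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).cuda()
static_in = torch.randn(64, 1024, device="cuda")

# Warm up before capture so allocator and kernels are settled.
for _ in range(3):
    model(static_in)
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

# Replay submits every kernel in the graph with one launch,
# amortizing per-kernel CPU launch overhead.
static_in.copy_(torch.randn(64, 1024, device="cuda"))
g.replay()
print(static_out.sum())
```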

Enhanced HBM2 DRAM subsystem:

The A100 GPU brings a major upgrade to HBM2, a type of High Bandwidth Memory. This memory technology is important for handling the huge amounts of data used in HPC, AI, and data analytics.

The A100's HBM2 subsystem delivers substantially higher bandwidth than the V100's, moving data faster and accommodating even bigger datasets. These improvements matter for a wide range of computationally intensive applications.

The NVIDIA A100, with its Ampere architecture, is a sophisticated, powerful GPU solution tailored to meet the demanding requirements of modern AI, HPC, and data analytics applications.

Featured Snippet:
How much faster is H100 vs A100?
The H100 GPU is up to nine times faster for AI training and thirty times faster for inference than the A100. The NVIDIA H100 80GB SXM5 is twice as fast as the NVIDIA A100 80GB SXM4 when running FlashAttention-2 training.

NVIDIA H100's Hopper architecture

NVIDIA's H100 uses the innovative Hopper architecture, explicitly designed for AI and HPC workloads. This architecture is characterized by its focus on efficiency and high performance in AI applications. Key features of the Hopper architecture include:

Fourth-generation tensor cores:

The NVIDIA Hopper architecture, and specifically the H100 GPU built on it, delivers significantly higher performance than its predecessor, the A100. This boost comes primarily from the enhanced Transformer Engine and the new FP8 precision format, which together yield up to roughly 6x the A100's peak throughput across a range of AI workloads.

Transformer engine:

The H100's dedicated transformer engine accelerates AI training and inference, delivering substantial speedups when working with large language models. The transformer engine is purpose-built to optimize the specific architecture and operations commonly found in transformer models, making it easier to build and deploy generative AI applications.
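NVIDIA exposes the Transformer Engine through its transformer-engine library. The sketch below shows the usual FP8 autocast pattern on an H100; the layer sizes and recipe settings are illustrative defaults, not tuned values:

```python
# Sketch using NVIDIA's Transformer Engine library
# (pip install transformer-engine); values are illustrative.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(512, 4096, device="cuda")

# fp8_autocast runs supported ops in FP8 with automatic scaling;
# on an A100 (no FP8 hardware) this path is unavailable.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```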

HBM3 memory:

The H100 is the first GPU with HBM3 memory. The newer memory technology raises bandwidth to 3.35 TB/s, about 67% more than the roughly 2 TB/s of the A100 80GB, enhancing data throughput. With HBM3 improving data access, the H100 processes complex calculations with greater speed and efficiency, resulting in an overall boost in GPU performance.

Enhanced processing rates:

The H100 delivers robust computational power with 3x faster IEEE FP64 and FP32 rates than the A100.


You can rent or reserve the NVIDIA H100 on CUDO Compute today. Our extensive roster of cutting-edge GPUs is powering AI and HPC for diverse projects. Contact us to learn more.

DPX instructions:

The NVIDIA H100 introduces dynamic programming extension (DPX) instructions, a significant architectural enhancement designed to dramatically accelerate dynamic programming algorithms. These algorithms are essential components within various applications, including AI models, genomics (e.g., sequence alignment), and robotics (e.g., path planning).

Dynamic programming involves complex, data-dependent computations with irregular memory access patterns. DPX instructions provide specialized hardware acceleration to improve the performance of these algorithms, leading to faster execution times in applications across these fields.
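DPX itself is a hardware instruction set, not something you call from Python, but the pattern it accelerates is easy to show. The classic edit-distance kernel below fills a DP table where each cell is a min/add over its neighbours, exactly the recurrence shape DPX targets:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming table fill: each cell depends on its
    left, upper, and upper-left neighbours, the min/add recurrence
    pattern that DPX instructions accelerate in hardware."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(edit_distance("GATTACA", "GCATGCU"))  # sequence-alignment flavour
```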

How DPX works:

  • Optimized table fill operations: Dynamic programming relies heavily on filling tables with computed values, where each cell's value depends on neighboring cells. DPX instructions are specifically tailored to efficiently execute these table-fill operations. They do this by providing specialized hardware support for common dynamic programming recurrence relations, reducing the number of individual instructions required and streamlining data flow.
  • Enhanced parallelism and data locality: DPX instructions leverage the H100's streaming multiprocessors and shared memory architecture to maximize parallelism. By optimizing data movement and keeping frequently accessed data close to the processing units, DPX minimizes memory latency, a major bottleneck in dynamic programming.
  • Specialized arithmetic operations: DPX includes optimized instructions for common arithmetic operations used in dynamic programming, such as minimum/maximum selection, addition, and comparisons. These are implemented with higher throughput and lower latency than general-purpose arithmetic instructions.

Why DPX matters:

  • Significant performance gains: By directly accelerating the core operations of dynamic programming, DPX delivers substantial performance improvements compared to traditional GPU implementations. This translates to faster analysis of genomic data, more responsive robotic control, and increased efficiency in other applications relying on these algorithms.
  • Increased energy efficiency: The specialized hardware in DPX allows for more efficient execution of dynamic programming tasks, reducing the overall energy consumption. This is particularly important for large-scale deployments in data centers.
  • Expanded application scope: The performance boost provided by DPX enables the use of more complex and computationally intensive dynamic programming algorithms, opening up new possibilities in areas like AI, drug discovery, and materials science.
  • Reduced development complexity: By providing hardware-level acceleration, DPX simplifies the development of high-performance dynamic programming applications. Developers can focus on the algorithmic logic rather than low-level optimizations.

In essence, DPX instructions on the H100 represent a targeted hardware acceleration strategy that directly addresses the computational bottlenecks of dynamic programming, leading to significant performance, efficiency, and usability improvements.

Multi-instance GPU technology:

The H100's second-generation MIG technology partitions the GPU more securely and efficiently than the A100's first-generation implementation, catering to diverse workload requirements.

Advanced interconnect technologies:

The H100 incorporates fourth-generation NVIDIA NVLink and NVSwitch, ensuring superior connectivity and bandwidth in multi-GPU setups.

Asynchronous execution and thread block clusters:

Asynchronous execution allows the GPU to overlap data transfers and kernel execution, minimizing idle time. Thread block clusters group related thread blocks, enabling them to share on-chip resources and reducing global memory access latency.

Okay, let's break it down simply:

  • Asynchronous execution: Imagine doing two things simultaneously, like reading a recipe while preheating the oven. The GPU does this with data and calculations, saving time.
  • Thread block clusters: Think of grouping similar workers together in a factory. They can share tools and work faster because they're close. The GPU groups similar tasks, allowing them to share resources and access data quickly.

These features make the GPU work smarter by doing more in parallel and keeping related tasks close together, which is essential for handling large, complex jobs.

Distributed shared memory:

Distributed shared memory creates a fast, on-chip communication network, allowing streaming multiprocessors (SMs) to efficiently exchange data without relying on slower off-chip memory access. This streamlined communication enhances overall data processing speed.

The H100, with its Hopper architecture, marks a significant advancement in GPU technology. It reflects the continuous evolution of hardware designed to meet the growing demands of AI and HPC applications.

To learn more about streaming multiprocessors and NVIDIA GPU architecture, read: A beginner's guide to NVIDIA GPUs.

Performance benchmarks

Performance benchmarks can provide valuable insights into the capabilities of GPU accelerators like NVIDIA's A100 and H100. These benchmarks, which include Floating-Point Operations Per Second (FLOPS) for different precisions and AI-specific metrics, can help us understand where each GPU excels, particularly in real-world applications such as scientific research, AI modeling, and graphics rendering.

NVIDIA A100 performance benchmarks

NVIDIA's A100 GPU delivers impressive performance across a variety of benchmarks. In terms of Floating-Point Operations, the A100 provides up to 19.5 teraflops (TFLOPS) for double-precision (FP64) and up to 39.5 TFLOPS for single-precision (FP32) operations. This high computational throughput is essential for HPC workloads, such as scientific simulations and data analysis, that require high precision.


Moreover, the A100 excels in tensor operations, which are crucial for AI computations. The tensor cores deliver up to 312 TFLOPS for FP16 precision and 156 TFLOPS for tensor float 32 (TF32) operations. This makes the A100 a formidable tool for AI modeling and deep learning tasks, which often require large-scale matrix operations and benefit from tensor-core acceleration.
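You can sanity-check headline TFLOPS figures yourself with a timed matmul. The sketch below measures effective FP16 throughput in PyTorch; real results will land below the peak spec and vary with matrix size and clocks:

```python
import time
import torch

n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.half)
b = torch.randn(n, n, device="cuda", dtype=torch.half)

for _ in range(3):  # warm-up so clocks and kernels settle
    a @ b
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# One n x n matmul is ~2*n^3 floating-point operations.
tflops = 2 * n**3 * iters / elapsed / 1e12
print(f"~{tflops:.0f} TFLOPS FP16")  # compare against the spec table below
```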

| Specification | A100 | H100 |
| --- | --- | --- |
| Form factor | SXM | SXM |
| FP64 | 9.7 TFLOPS | 34 TFLOPS |
| FP64 Tensor Core | 19.5 TFLOPS | 67 TFLOPS |
| FP32 | 19.5 TFLOPS | 67 TFLOPS |
| TF32 Tensor Core | 312 TFLOPS | 989 TFLOPS |
| BFLOAT16 Tensor Core | 624 TFLOPS | 1,979 TFLOPS |
| FP16 Tensor Core | 624 TFLOPS | 1,979 TFLOPS |
| FP8 Tensor Core | Not applicable | 3,958 TFLOPS |
| INT8 Tensor Core | 1,248 TOPS | 3,958 TOPS |
| GPU memory | 80 GB HBM2e | 80 GB HBM3 |
| GPU memory bandwidth | 2,039 GB/s | 3.35 TB/s |
| Max thermal design power | 400W | Up to 700W (configurable) |
| Multi-Instance GPU | Up to 7 MIGs @ 10 GB each | Up to 7 MIGs @ 10 GB each |
| Interconnect | NVLink: 600 GB/s; PCIe Gen4: 64 GB/s | NVLink: 900 GB/s; PCIe Gen5: 128 GB/s |
| Server options | NVIDIA HGX A100; partner and NVIDIA-Certified Systems with 4, 8, or 16 GPUs | NVIDIA HGX H100; partner and NVIDIA-Certified Systems with 4 or 8 GPUs |
| NVIDIA AI Enterprise | Included | Add-on |

Real-world inference and training benchmarks

Spec sheets tell part of the story. Here's how these GPUs perform on actual workloads, drawing from MLPerf submissions, independent testing, and production deployments.

LLM inference throughput

Token generation speed determines how many concurrent users your inference deployment can serve. For models in the 13B-70B parameter range:

| GPU | Tokens/second | Daily throughput (1,024 tokens/request) | First-token latency |
| --- | --- | --- | --- |
| A100 80GB (FP16) | ~130 t/s | ~11,000 requests/day | Baseline |
| H100 80GB (FP16) | ~250-300 t/s | ~22,000-26,000 requests/day | ~30% lower |
| H100 80GB (FP8) | ~400+ t/s | ~35,000+ requests/day | ~30% lower |


The H100's 2x throughput advantage at FP16 means you need half as many GPUs to serve the same inference load. With FP8 quantisation enabled, that gap widens to nearly 3x, often more than offsetting the H100's higher hourly cost.
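The arithmetic behind these claims is straightforward. The helper functions below reproduce the table's requests-per-day figures and add a cost-per-million-tokens comparison; the hourly rates are illustrative, taken from the ranges quoted earlier:

```python
def daily_requests(tokens_per_second: float,
                   tokens_per_request: int = 1024) -> float:
    """Requests served per day at a steady token generation rate."""
    return tokens_per_second * 86_400 / tokens_per_request

def cost_per_million_tokens(hourly_rate: float,
                            tokens_per_second: float) -> float:
    """Dollars per million generated tokens at a given rental rate."""
    return hourly_rate / (tokens_per_second * 3_600) * 1_000_000

# Illustrative rates from the table above; actual numbers vary by
# model, batch size, and serving stack.
print(daily_requests(130))                 # A100 FP16: ~11,000 requests/day
print(cost_per_million_tokens(1.50, 130))  # A100 at $1.50/hr
print(cost_per_million_tokens(3.00, 400))  # H100 FP8 at $3.00/hr: cheaper per token
```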

NVIDIA's TensorRT-LLM benchmarks show even starker differences: the H100 achieves up to 4.5x inference speedup over the A100 in MLPerf submissions. Using identical data types, the speedup is approximately 2x; switching to FP8 adds another 2x on top.

Training performance by workload type

Training speedups vary significantly based on model architecture and optimisation level:

| Workload | A100 | H100 | Notes |
| --- | --- | --- | --- |
| Mixed-precision training (general) | Baseline | 2-2.4x faster | Out-of-the-box performance |
| Transformer models (optimised) | Baseline | 3-4x faster | Transformer Engine + FP8 |
| GPT-3 175B (512 GPUs) | ~60 min | ~22 min (2.7x) | MLPerf Training v4.0 |
| GPT-3 175B (11,616 GPUs) | N/A | 3.4 min | Near-linear scaling; NVIDIA max-scale submission |
| BERT NLP | Baseline | Up to 6.7x faster | MLPerf v2.1; Transformer Engine advantage |
| LLaMA 70B fine-tuning | Baseline | 2-3x faster | Independent testing (Databricks/CoreWeave) |
| ResNet-50 (vision) | Baseline | 1.5-2x faster | Smaller gains on CNNs |
| FlashAttention-2 | Baseline | 2x faster | Memory bandwidth advantage |

The pattern is clear: transformer-heavy workloads achieve the largest gains (3-6x), while traditional CNNs and smaller models achieve more modest ones (1.5-2x).

Why the H100 pulls ahead on transformers

Three architectural features drive the H100's advantage on modern AI workloads:

  • Transformer Engine: Purpose-built hardware that dynamically switches between FP8 and FP16 precision during training, automatically selecting the optimal format for each layer. The A100 lacks this capability entirely.
  • FP8 precision support: Native 8-bit floating-point operations reduce memory bandwidth requirements and double arithmetic throughput. In practice, FP8 delivers a 2x+ speedup over FP16 on the H100 with minimal accuracy loss, a feature unavailable on the A100.
  • Memory bandwidth: The H100's 3.35 TB/s bandwidth (versus 2.0 TB/s on the A100) reduces bottlenecks during weight updates and enables larger batch sizes. This 67% improvement is particularly impactful for autoregressive generation, where models are memory-bound rather than compute-bound; the sketch below puts rough numbers on this.
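That memory-bound behaviour can be quantified with a simple roofline-style bound: each generated token must stream the full set of weights from memory, so bandwidth divided by model size caps single-stream decode speed. A back-of-envelope sketch (real systems batch requests, so achieved throughput differs):

```python
def max_tokens_per_second(params_billions: float, bytes_per_param: float,
                          bandwidth_tb_s: float) -> float:
    """Upper bound for single-stream autoregressive decoding:
    every generated token must read all weights from memory."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes

# A 70B-parameter model:
print(max_tokens_per_second(70, 2, 2.0))   # A100, FP16 weights: ~14 tokens/s
print(max_tokens_per_second(70, 2, 3.35))  # H100, FP16 weights: ~24 tokens/s
print(max_tokens_per_second(70, 1, 3.35))  # H100, FP8 weights:  ~48 tokens/s
```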

Independent validation

MosaicML (now Databricks) benchmarked LLM training across both GPUs without vendor optimisation. Their findings:

  • Smaller, unoptimised models: 2.2x speedup on H100
  • 30B parameter models with H100 optimisations: 3.3x speedup
  • Cost efficiency: With H100 cloud pricing now only 30-40% higher than A100, the 2-3x throughput advantage delivers approximately 40-60% lower cost per unit of work

MLPerf Training v4.0 confirms these results at scale. The H100 holds performance records across all eight benchmark categories, including LLMs, recommenders, computer vision, medical imaging, and speech recognition, demonstrating a consistent advantage rather than narrow optimisation for specific workloads.

NVIDIA H100 performance benchmarks

The NVIDIA H100 GPU showcases exceptional performance across benchmarks. In terms of floating-point operations, it delivers up to 34 TFLOPS of double-precision (FP64) and 67 TFLOPS of single-precision (FP32) compute (see the specification table above), a significant uplift in the computational throughput that HPC applications such as scientific simulations and data analytics depend on.

Tensor operations are vital for AI computations, and the H100's fourth-generation Tensor Cores deliver substantial performance improvements over previous generations. These advancements make the H100 an extremely capable AI modeling and deep learning tool, enabling enhanced efficiency and speed for large-scale matrix operations and AI-specific tasks.

AI and Machine Learning capabilities

AI and machine learning capabilities are critical components of modern GPUs, with NVIDIA's A100 and H100 offering distinct features that enhance their performance in AI workloads.

Tensor cores:

The NVIDIA A100 GPU, powered by the Ampere architecture, delivers significant AI and machine learning advancements. The A100 incorporates third-generation Tensor Cores, which deliver up to 20x the performance of NVIDIA's Volta architecture (the prior generation) on certain operations. These Tensor Cores support various mixed-precision computations, such as TensorFloat-32 (TF32), thereby enhancing the efficiency of AI model training and inference.


On the other hand, the NVIDIA H100 GPU also represents a significant leap in AI and HPC performance. It features new fourth-generation Tensor Cores, which are up to 6x faster than those in the A100. These cores deliver double the matrix multiply-accumulate (MMA) computational rate per SM compared to the A100, and even greater gains when using the new FP8 data type. Additionally, H100's Tensor Cores are designed for a broader array of AI and HPC tasks and feature more efficient data management.

Multi-instance GPU (MIG) technology:

The A100 introduced MIG technology, enabling a single A100 GPU to be partitioned into up to 7 independent instances. This technology optimizes GPU resource utilization, enabling the concurrent operation of multiple networks or applications on a single A100 GPU. The A100 40GB variant can allocate up to 5GB per MIG instance, while the 80GB variant doubles this capacity to 10GB per instance.

However, the H100 incorporates second-generation MIG technology, offering approximately 3x the compute capacity and nearly 2x the memory bandwidth per GPU instance compared to the A100. This advancement further enhances the utilization of GPU-accelerated infrastructure.

New features in H100:

The H100 GPU includes a new transformer engine that uses FP8 and FP16 precisions to enhance AI training and inference, particularly for large language models. This engine can deliver up to 9x faster AI training and 30x faster AI inference than the A100. The H100 also introduces DPX instructions, providing up to 7x faster performance for dynamic programming algorithms than Ampere GPUs.


Collectively, these improvements provide the H100 with approximately 6x the peak compute throughput of the A100, marking a substantial advancement for demanding compute workloads. The NVIDIA A100 and H100 GPUs represent significant advancements in AI and machine learning, with each generation introducing innovative features such as advanced Tensor Cores and MIG technology. The H100 builds upon the foundations laid by the A100's Ampere architecture, offering further enhancements in AI processing capabilities and overall performance.

Featured Snippet:
Is the A100 or H100 worth purchasing?
Whether the A100 or H100 is worth purchasing depends on the user's specific needs. Both GPUs are highly suitable for high-performance computing (HPC) and artificial intelligence (AI) workloads. However, the H100 is significantly faster in AI training and inference tasks. While the H100 is more expensive, its superior speed might justify the cost for specific users.

Which GPU should you choose?

Choose the H100 when:

  • Training large language models (7B+ parameters) where time-to-completion matters
  • Running high-throughput inference that's latency-sensitive
  • Your workload benefits from FP8 precision or the Transformer Engine
  • You need to minimise the GPU count in production deployments

Choose the A100 when:

  • Running memory-bound workloads on small batches (A100s can be more cost-effective here)
  • Fine-tuning smaller models where the H100's extra performance is overkill
  • Budget is the primary constraint, and you can tolerate longer training times
  • Working with legacy codebases not yet optimised for H100 features

The cost-per-performance calculation:

If an H100 costs $3/hour and completes a job in 10 hours, the total cost is $30. If an A100 costs $1.50/hour but takes 20 hours, the total cost is also $30-but you've lost 10 hours of iteration time. For teams where speed drives business value, the H100's higher hourly rate often breaks even or wins.
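In code, the break-even logic looks like this; the rates and speedup are the illustrative figures from the example above:

```python
def job_cost(hourly_rate: float, hours: float) -> float:
    """Total rental cost for a job at a given hourly rate."""
    return hourly_rate * hours

# The worked example from above, plus the effect of a 2.5x speedup.
print(job_cost(3.00, 10))        # H100: $30, done in 10 hours
print(job_cost(1.50, 20))        # A100: $30, done in 20 hours
print(job_cost(3.00, 20 / 2.5))  # H100 at a 2.5x speedup: $24, done in 8 hours
```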

Power efficiency and environmental impact

The Thermal Design Power (TDP) ratings of GPUs like NVIDIA's A100 and H100 provide valuable insights into power consumption, with implications for both performance and environmental impact.

GPU TDP:

The TDP of the A100 GPU varies by model. The A100 PCIe with 40 GB of HBM2 memory has a TDP of 250W, rising to 300W for the 80 GB PCIe variant, while the SXM variants are rated at 400W. This means the A100 requires a robust cooling solution and consumes significant power, which varies by model and workload.

The TDP for the H100 PCIe version is 350W, close to the 300W TDP of its predecessor, the A100 80GB PCIe. The H100 SXM5, however, supports up to a 700W TDP. Despite this high ceiling, the H100 is more power-efficient than the A100, with roughly a 4x and nearly 3x increase in FP8 FLOPS/W over the A100 80GB PCIe and SXM4, respectively. In other words, the H100 may draw more power in absolute terms, but it delivers considerably more performance per watt.

Comparison of power efficiency:

The comparison comes down to absolute draw versus work done per watt. The A100 consumes less power outright, roughly 250-400W depending on the variant and workload, while the H100 can draw up to 700W in its SXM form. Because the H100 completes the same work substantially faster, however, it typically delivers better performance per watt, and often lower total energy for a given job.

In short, the NVIDIA A100 and H100 GPUs have different TDP and power-efficiency profiles. The A100's lower absolute power draw eases cooling and facility demands, while the H100's higher TDP buys markedly better performance per watt, especially in AI and deep learning tasks. These differences are essential to consider, particularly regarding environmental impact and the need for robust cooling solutions.

Whether you choose the A100's proven efficiency or the H100's advanced capabilities, we provide the resources you need for exceptional computing performance. Get started now!
