13 minute read

NVIDIA A40 GPUs: everything you need to know

Emmanuel Ohiri

Apr 12, 2024, 4:40 PM

The NVIDIA A40 is a versatile GPU for various high-performance computing (HPC) tasks. It is designed to tackle demanding workloads like AI acceleration, data science, simulation, 3D design, and virtual production.

The A40 is built on the NVIDIA Ampere architecture, enhancing its capabilities to handle the above-mentioned workloads efficiently and making it a powerful tool for professionals in these fields. Understanding its specifications, performance across various applications, and price point is crucial for determining if the A40 is the right fit for your specific HPC needs.


In this article, we will discuss the NVIDIA A40's specifications, how it performs across various HPC use cases, its price, and more. This comprehensive analysis will equip you with the knowledge to make informed decisions about incorporating the A40 into your workflow.

NVIDIA A40 specification

The NVIDIA A40 is a powerful GPU specifically designed for data center visual computing and is built on the Ampere GA10x architecture. Its architecture is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPs), and memory controllers. The full A40 GPU contains 7 GPCs, 42 TPCs, and 84 SMs.

The GPC is the primary structural unit in NVIDIA GPU architecture, responsible for a significant portion of graphics and compute processing. GPCs house all essential graphics processing elements.

Each GPC includes a dedicated Raster Engine and multiple Texture Processing Clusters (TPCs), and each TPC includes two Streaming Multiprocessors (SMs). Each TPC also includes a PolyMorph Engine, which handles vertex processing tasks such as tessellation and geometry shading. These tasks are important for creating detailed 3D images from basic geometric shapes. The Raster Engine handles rasterization, the process of converting vectors into pixels or dots for display on the screen, which is fundamental for rendering 2D and 3D graphics.
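The hierarchy described above reduces to simple arithmetic. The short Python sketch below derives the per-GPC figures from the totals quoted in this section. The variable names are illustrative, and the even split of TPCs across GPCs is an assumption about the full chip rather than something stated here.

```python
# Unit counts for the full A40 GPU, as quoted in the text above.
GPCS = 7
TPCS = 42
SMS_PER_TPC = 2  # each TPC includes two SMs

# Assumption: TPCs are spread evenly across the GPCs on the full chip.
tpcs_per_gpc = TPCS // GPCS
sms_per_gpc = tpcs_per_gpc * SMS_PER_TPC
sms_total = TPCS * SMS_PER_TPC

print(f"TPCs per GPC: {tpcs_per_gpc}")
print(f"SMs per GPC:  {sms_per_gpc}")
print(f"SMs in total: {sms_total}")
```

The total of 84 SMs matches the figure given for the full A40 GPU.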

What is the NVIDIA A40 used for?

The NVIDIA A40 is a powerful data center GPU designed for visual computing and for workloads such as deep learning and artificial intelligence, scientific simulations, high-end rendering (e.g., animation, special effects), and other HPC tasks.

SMs perform the calculations necessary for graphics rendering and general compute tasks. Each SM on the A40 contains the following:

256 KB Register File: This component stores data that is immediately accessible to the CUDA cores, improving data handling efficiency during processing tasks.
4 Texture Units: These units are involved in processing texture data for rendering images, which is crucial for graphics rendering to handle various surface textures in a scene.
128 KB of L1/Shared Memory: This configurable memory can be utilized either as an L1 cache or as shared memory among the threads within an SM, optimizing data sharing and cache usage depending on the workload requirements.
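The per-SM figures above scale up to the GPU-wide totals listed in the specification table. As a quick check, the sketch below multiplies each per-SM resource by the 84 SMs of the full A40. The dictionary keys are illustrative names, not NVIDIA terminology.

```python
SM_COUNT = 84  # SMs in the full A40 GPU

# Per-SM resources listed above.
per_sm = {
    "register_file_kb": 256,
    "texture_units": 4,
    "l1_shared_memory_kb": 128,
}

# GPU-wide totals are the per-SM figures multiplied by the SM count.
gpu_totals = {name: value * SM_COUNT for name, value in per_sm.items()}

for name, total in gpu_totals.items():
    print(f"{name}: {total}")
```

The results (21504 KB of register file, 336 texture units, and 10752 KB of L1/shared memory) agree with the corresponding rows of the specification table.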


Each SM contains three types of compute resources:

Tensor Cores: Tensor Cores are designed to accelerate deep learning processes. They significantly speed up neural network training and inference phases by efficiently performing large matrix operations, a common requirement in AI workloads.

The NVIDIA A40 features 4 third-generation Tensor Cores per SM. They introduce the Tensor Float 32 (TF32) precision format, which delivers up to 5 times the training throughput of the previous generation without requiring any code modifications to existing models.
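To make the TF32 format concrete: it keeps the 8-bit exponent of FP32 but only 10 mantissa bits. The sketch below emulates that precision in plain Python by clearing the 13 lowest mantissa bits of an FP32 value. The helper name to_tf32 is hypothetical, and simple truncation is an assumption made for brevity, so the hardware's rounding behavior may differ.

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 precision by clearing the low 13 mantissa bits of an FP32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign (1 bit), exponent (8 bits), top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 3.14159265
print(f"input: {x!r}")
print(f"TF32:  {to_tf32(x)!r}")
print(f"relative error: {abs(to_tf32(x) - x) / x:.2e}")
```

The value keeps the full FP32 range while carrying roughly three decimal digits of precision, which is why existing FP32 models can run on TF32 without code changes.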

The Tensor Cores also have hardware support for structural sparsity, doubling the inference throughput compared to previous-generation GPUs. Furthermore, they enable Deep Learning Super Sampling (DLSS) for improved image quality, AI denoising for faster rendering, and enhanced editing capabilities in select applications.
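Structural sparsity on Ampere refers to a 2:4 pattern, in which two of every four consecutive weights are zero so the Tensor Cores can skip them. The sketch below shows one simple way to produce such a pattern, by keeping the two largest-magnitude values in each group of four. The function name prune_2_of_4 is hypothetical, and magnitude-based pruning is an assumption for illustration, not a description of NVIDIA's tooling.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned

row = [0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.0]
print(prune_2_of_4(row))
```

In practice a pruned model is usually fine-tuned afterwards to recover accuracy. The sketch only shows the layout that the hardware exploits.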

Programmable Shading Cores: These are primarily composed of CUDA Cores, which are fundamental to general-purpose computing on graphics processing units (GPGPU). CUDA Cores are highly effective for tasks that require parallel processing, such as simulations and complex computations.

Each SM has 128 CUDA Cores. Compared to the previous (Turing) generation, they deliver double-speed processing for single-precision floating-point (FP32) operations and improved power efficiency, which provides significant performance improvements for graphics and simulation workflows such as complex 3D computer-aided design (CAD) and computer-aided engineering (CAE).

RT Cores: These cores are specialized for ray tracing operations. Since ray tracing simulates how light behaves in the real world, the A40's RT Cores are built to excel at two key tasks:
Bounding Volume Hierarchy (BVH) traversal: Imagine a complex 3D scene being broken down into simpler shapes like boxes. This hierarchy helps the GPU quickly identify which areas of the scene a light ray might interact with instead of checking every single object.
Intersection of scene geometry: Once promising areas are identified (through BVH traversal), these cores precisely calculate where the light ray actually hits the object within that area. By excelling at these tasks, the A40 can rapidly determine how light interacts with objects in the scene, leading to highly realistic lighting and shadows in the final render.
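The first of these two tasks can be sketched in software. The example below builds a tiny two-leaf hierarchy and uses the standard slab test to decide which bounding boxes a ray crosses, skipping whole subtrees when the ray misses their box. All names (Node, ray_hits_box, traverse) and the scene itself are hypothetical. RT Cores perform this work in dedicated hardware, and the exact ray-geometry intersection step is omitted here.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple   # minimum corner of the bounding box (x, y, z)
    hi: tuple   # maximum corner of the bounding box (x, y, z)
    children: list = field(default_factory=list)
    objects: list = field(default_factory=list)  # only leaves hold objects

def ray_hits_box(origin, direction, lo, hi):
    """Slab test: does the ray origin + t * direction (t >= 0) cross the box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, a, b in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < a or o > b:  # parallel to this slab and outside it
                return False
            continue
        t0, t1 = (a - o) / d, (b - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def traverse(node, origin, direction):
    """Return the objects in every leaf whose box the ray crosses."""
    if not ray_hits_box(origin, direction, node.lo, node.hi):
        return []  # the whole subtree is skipped
    if not node.children:
        return list(node.objects)
    hits = []
    for child in node.children:
        hits.extend(traverse(child, origin, direction))
    return hits

left = Node((0, 0, 0), (1, 1, 1), objects=["teapot"])
right = Node((4, 0, 0), (5, 1, 1), objects=["lamp"])
root = Node((0, 0, 0), (5, 1, 1), children=[left, right])

candidates = traverse(root, origin=(0.5, 0.5, -1.0), direction=(0.0, 0.0, 1.0))
print(candidates)
```

Only the objects returned by the traversal need the more expensive exact intersection test, which is where the savings over checking every object come from.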

With Second-generation RT Cores, the NVIDIA A40 delivers a significant leap in performance, boasting up to twice the throughput of the previous generation. This translates to massive speedups for workloads that rely on ray tracing, such as photorealistic rendering of movie content, architectural design evaluations, and virtual prototyping of product designs.

Specification	NVIDIA A40
GPU Architecture	NVIDIA Ampere
GPCs	7
TPCs	42
SMs	84
CUDA Cores / SM	128
CUDA Cores / GPU	10752
Tensor Cores / SM	4 (3rd Gen)
Tensor Cores / GPU	336 (3rd Gen)
RT Cores	84 (2nd Gen)
GPU Boost Clock (MHz)	1740
Peak FP32 TFLOPS (non-Tensor)	37.4
Peak INT8 TOPS (Tensor)	299.8
Peak FP16 TFLOPS (non-Tensor)	18.7
Peak INT4 TOPS (Tensor)	599.7
Peak TF32 Tensor TFLOPS	74.8/149.6
Peak FP16 Tensor TFLOPS	149.7/299.4
Peak INT8 Tensor TOPS	299.8/599.6
Peak INT4 Tensor TOPS	599.7/1199.4
Frame Buffer Memory Size and Type	49152 MB GDDR6
Memory Interface	384-bit
Memory Clock (Data Rate)	14.5 Gbps
Memory Bandwidth	696 GB/sec
ROPs	112
Pixel Fill-rate (Gigapixels/sec)	194.9
Texture Fill-rate (Gigatexels/sec)	584.6
Texture Units	336
L1 Data Cache/Shared Memory	10752 KB
L2 Cache Size	6144 KB
Register File Size	21504 KB
TGP (Total Graphics Power)	300 W
Transistor Count	28.3 Billion
Die Size	628.4 mm²
Manufacturing Process	Samsung 8 nm NVIDIA Custom Process
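Several peak figures in the table follow directly from the unit counts and clocks. The sketch below reproduces two of them. It assumes each CUDA core completes one fused multiply-add (two floating-point operations) per clock at the boost clock, which is the usual convention for peak FP32 figures. The constant names are illustrative.

```python
SMS = 84
CUDA_CORES_PER_SM = 128
BOOST_CLOCK_GHZ = 1.740
MEMORY_BUS_BITS = 384
MEMORY_DATA_RATE_GBPS = 14.5  # per pin

cuda_cores = SMS * CUDA_CORES_PER_SM

# Assumption: one fused multiply-add (2 FLOPs) per CUDA core per clock.
peak_fp32_tflops = cuda_cores * 2 * BOOST_CLOCK_GHZ / 1000

# Bandwidth = bus width in bytes multiplied by the per-pin data rate.
memory_bandwidth_gbs = MEMORY_BUS_BITS / 8 * MEMORY_DATA_RATE_GBPS

print(f"CUDA cores:       {cuda_cores}")
print(f"Peak FP32:        {peak_fp32_tflops:.1f} TFLOPS")
print(f"Memory bandwidth: {memory_bandwidth_gbs:.0f} GB/sec")
```

These reproduce the 10752 CUDA cores, 37.4 TFLOPS, and 696 GB/sec rows of the table.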


Furthermore, these enhanced RT Cores can run ray tracing concurrently with shading or denoising, further accelerating the rendering pipeline. They also accelerate the rendering of ray-traced motion blur, delivering faster results with greater visual accuracy.

These features together enhance the capability of each SM to handle diverse and demanding tasks in graphics rendering and general-purpose computing, making GPUs like the A40 highly effective for a variety of high-performance computing applications.

Additionally, the A40 includes new features in the ROP (Raster Operations Pipeline) units. ROP units handle pixel output by performing tasks like pixel blending and writing to memory. Unlike in previous generations of GPUs, the ROPs are no longer tied to the L2 cache. They are now integrated within each GPC.

This change allows for a more direct data flow within the GPC, potentially reducing latency and increasing throughput. The redesign improves the efficiency of raster operations by increasing the number of ROPs and minimizing the mismatch in throughput between the scan conversion front end and the raster operations back end.

The inclusion of two ROP partitions per GPC, each containing eight ROP units, is a specific enhancement in the Ampere architecture, which helps improve efficiency and performance in rendering tasks.

With seven GPCs and 16 ROP units per GPC, the full GA102 GPU contains 112 ROPs, up from the 96 ROPs available in the prior generation's GPU with a 384-bit memory interface. This higher ROP count directly translates to improvements in key rendering techniques:

Multisample Anti-Aliasing (MSAA): With more ROPs, the GA102 can handle more samples per pixel during MSAA, leading to smoother edges and reduced aliasing artifacts.
Pixel Fillrate: The increased ROP count translates to a higher rate at which the GPU can process and output pixels to the framebuffer, enhancing overall rendering performance.
Blending Performance: The additional ROPs improve the efficiency of blending operations, which are crucial for combining textures and effects within a rendered scene.
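The ROP count and its effect on pixel fill rate can be checked with a few lines of arithmetic. The sketch below assumes each ROP writes one pixel per clock at the 1740 MHz boost clock. The 96-ROP comparison reuses the same clock purely to isolate the effect of ROP count, so it is a hypothetical figure rather than a measured one.

```python
GPCS = 7
ROP_PARTITIONS_PER_GPC = 2
ROPS_PER_PARTITION = 8
BOOST_CLOCK_GHZ = 1.740

rops = GPCS * ROP_PARTITIONS_PER_GPC * ROPS_PER_PARTITION

# Assumption: each ROP outputs one pixel per clock at the boost clock.
pixel_fill_gpix = rops * BOOST_CLOCK_GHZ

# Hypothetical 96-ROP design at the same clock, for comparison only.
pixel_fill_96_rops = 96 * BOOST_CLOCK_GHZ

print(f"ROPs: {rops}")
print(f"Pixel fill rate: {pixel_fill_gpix:.1f} Gigapixels/sec")
print(f"Gain over 96 ROPs at the same clock: {pixel_fill_gpix / pixel_fill_96_rops - 1:.1%}")
```

The 194.9 Gigapixels/sec result matches the pixel fill-rate row in the specification table.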


You can rent NVIDIA A40 Cloud GPUs for AI and HPC acceleration on CUDO Compute today. Contact us to learn more.

Other features of the NVIDIA A40 include:

48GB of GDDR6 Memory: Provides substantial, high-bandwidth memory for efficient data access in computationally intensive tasks.
Third-Generation NVIDIA NVLink: Connects two A40 GPUs, scaling the total memory from 48GB to 96GB in a single system configuration. This benefits workloads with massive datasets.
Virtualization-Ready with vGPU Software: Creates larger and more powerful virtual workstation instances for remote users, enabling high-performance remote work in design, AI, and demanding compute tasks.
PCI Express Gen 4 Interface: Doubles the data transfer speed between the CPU's memory and the A40 compared to PCIe Gen 3. This benefits data-intensive applications in AI, data science, and 3D design. Faster PCIe performance also accelerates GPU direct memory access (DMA) transfers, improving video data communication for live broadcast workflows. The A40 maintains backward compatibility with PCI Express Gen 3 systems for deployment flexibility.
Data Center Efficiency and Security: The A40 prioritizes power efficiency, offering up to 2x the power efficiency of the previous generation. It also features secure and measured boot with a hardware root of trust to ensure system integrity.
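The PCIe claim in the list above can be illustrated with the published link rates: 8 GT/s per lane for Gen 3 and 16 GT/s for Gen 4, both using 128b/130b encoding. The sketch below computes the theoretical one-direction bandwidth. The x16 link width is an assumption, since the list does not state it, and real transfers achieve less than these theoretical peaks.

```python
def pcie_bandwidth_gbs(transfer_rate_gt: float, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in GB/s with 128b/130b encoding."""
    return transfer_rate_gt * lanes * (128 / 130) / 8

gen3 = pcie_bandwidth_gbs(8.0)   # PCIe Gen 3: 8 GT/s per lane
gen4 = pcie_bandwidth_gbs(16.0)  # PCIe Gen 4: 16 GT/s per lane

print(f"Gen 3 x16: {gen3:.2f} GB/s")
print(f"Gen 4 x16: {gen4:.2f} GB/s")
print(f"Ratio:     {gen4 / gen3:.1f}x")
```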

Is NVIDIA A40 single precision?

The NVIDIA A40 supports both single-precision and double-precision floating-point operations. However, it offers improved performance and power efficiency for single-precision operations, making it well-suited for tasks that primarily rely on single-precision calculations.

NVIDIA A40 performance

Given the versatility of the NVIDIA A40, we can compare its performance for different use cases, but we will focus on how it performs in scientific applications:

Performance Evaluation of the NVIDIA A40 GPU in Scientific Applications

The NVIDIA A40 GPU has been evaluated across multiple scientific computing applications to ascertain its computational efficacy in replacing traditional CPU-only servers. The benchmarking was conducted on applications pertinent to geoscience, molecular dynamics, physics, and other scientific fields.

The primary metrics used to measure the A40 GPU's performance include:

Total Time (Seconds): The duration required to complete a given task.
Node Replacement Factor (NRF): A measure indicating how many CPU-only nodes can be replaced by a single GPU-accelerated node.

Applications and performance

1. Geoscience (SPECFEM3D):

SPECFEM3D is a software package designed to simulate seismic wave propagation in three dimensions. It is commonly used in geophysics and seismology to model how seismic waves travel through different types of geological structures.

The A40 significantly reduced the total computation time for seismic wave propagation simulations, decreasing the total time as more GPUs were utilized. With the A40, the number of CPU-only nodes replaced varied from 2x to 13x, illustrating the A40's scalability and efficiency.

Application	Metric	Bigger is better	CPU-Only	1x A40	2x A40	4x A40	8x A40
SPECFEM3D	Total Time (Sec)	no	386	203	103	53	34
SPECFEM3D	NRF	yes	1x	2x	3x	8x	13x
Source: NVIDIA
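The total-time row can be turned into speedup and scaling figures with a short calculation. The sketch below divides the CPU-only time by each GPU time and also reports scaling efficiency relative to a single A40. The function name scaling_report is hypothetical, and the plain time ratio is not the same as NVIDIA's NRF, whose exact methodology is not given here.

```python
# Total time in seconds from the SPECFEM3D table above.
CPU_ONLY_SECONDS = 386
A40_SECONDS = {1: 203, 2: 103, 4: 53, 8: 34}

def scaling_report(cpu_seconds, gpu_seconds):
    """Return {gpu_count: (speedup_vs_cpu, scaling_efficiency)}."""
    single_gpu = gpu_seconds[1]
    report = {}
    for gpus, seconds in gpu_seconds.items():
        speedup_vs_cpu = cpu_seconds / seconds
        # Efficiency of 1.0 would mean perfectly linear scaling across GPUs.
        efficiency = (single_gpu / seconds) / gpus
        report[gpus] = (speedup_vs_cpu, efficiency)
    return report

report = scaling_report(CPU_ONLY_SECONDS, A40_SECONDS)
for gpus, (speedup, efficiency) in report.items():
    print(f"{gpus}x A40: {speedup:5.2f}x vs CPU-only, {efficiency:.0%} scaling efficiency")
```

On these numbers, eight A40 GPUs finish about 11.4 times faster than the CPU-only run, at roughly 75% scaling efficiency relative to one GPU.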

2. Molecular dynamics (AMBER, GROMACS, and NAMD):

AMBER:

Assisted Model Building with Energy Refinement (AMBER) is a suite of programs designed to simulate molecular dynamics, particularly focused on biomolecules like proteins and nucleic acids. It is used in biochemical and biophysical research communities to study biological molecules' structure, dynamics, and energetics.

For AMBER simulations using the Cellulose NPT benchmark, a single A40 delivered 97 ns/day and replaced 10 CPU-only nodes, with performance scaling up to 819 ns/day on 8x A40 GPUs.


GROMACS:

The A40 GPU substantially enhanced molecular dynamics simulations using the GROMACS ADH Dodec benchmark. Throughput rose from 314 ns/day with a single A40 to 2,534 ns/day with 8x A40 GPUs, demonstrating the GPU's scaling capabilities. Furthermore, the Node Replacement Factor (NRF) shows that a node with 8x A40 GPUs could replace up to 13 CPU-only nodes, indicating significant cost and energy savings.

Application	Metric	Bigger is better	CPU-Only	1x A40	2x A40	4x A40	8x A40
GROMACS	ns/day	yes	189	314	625	1,113	2,534
GROMACS	NRF	yes	1x	2x	3x	6x	13x

Source: NVIDIA

NAMD:

Nanoscale Molecular Dynamics (NAMD) is a software application designed for high-performance simulation of large biomolecular systems. In the NAMD apoa1_npt_cuda benchmark, the A40 delivered 105 ns/day with a single GPU, rising to 845 ns/day with 8x A40 GPUs, a roughly 8-fold increase.

Application	Metric	Bigger is better	CPU-Only	1x A40	2x A40	4x A40	8x A40
NAMD apoa1_npt_cuda	ns/day	yes	64.49	105	211	423	845
NAMD apoa1_npt_cuda	NRF	yes	1x	2x	3x	7x	13x
NAMD apoa1_nptsr_cuda	ns/day	yes	65.19	109	221	441	885
NAMD apoa1_nptsr_cuda	NRF	yes	1x	2x	3x	7x	14x
NAMD apoa1_nve_cuda	ns/day	yes	71.14	146	295	593	1,187
NAMD apoa1_nve_cuda	NRF	yes	1x	2x	4x	8x	17x
NAMD stmv_nve_cuda	ns/day	yes	6.97	11	21	42	85
NAMD stmv_nve_cuda	NRF	yes	1x	2x	3x	6x	12x

Source: NVIDIA
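For throughput metrics such as ns/day, scaling is the multi-GPU figure divided by the single-GPU figure. The sketch below applies that to the four NAMD benchmarks in the table. The function name throughput_scaling is hypothetical. Because the table values are rounded, ratios slightly above 8 more likely reflect rounding than truly better-than-linear scaling.

```python
# ns/day from the NAMD table above, keyed by number of A40 GPUs.
NAMD_NS_PER_DAY = {
    "apoa1_npt_cuda":   {1: 105, 2: 211, 4: 423, 8: 845},
    "apoa1_nptsr_cuda": {1: 109, 2: 221, 4: 441, 8: 885},
    "apoa1_nve_cuda":   {1: 146, 2: 295, 4: 593, 8: 1187},
    "stmv_nve_cuda":    {1: 11, 2: 21, 4: 42, 8: 85},
}

def throughput_scaling(ns_per_day):
    """Throughput at each GPU count divided by the single-GPU throughput."""
    base = ns_per_day[1]
    return {gpus: value / base for gpus, value in ns_per_day.items()}

scaling = {name: throughput_scaling(values) for name, values in NAMD_NS_PER_DAY.items()}

for name, ratios in scaling.items():
    print(f"{name}: 8x A40 delivers {ratios[8]:.2f}x the single-GPU throughput")
```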

3. Physics (MILC):

A single A40 achieved an NRF of 5x, meaning one A40 GPU can replace five CPU-only nodes. The NRF rose with GPU count, peaking at 27x with 8x A40 GPUs.

Application	Metric	Test Modules	Bigger is better	CPU-Only	1x A40	2x A40	4x A40	8x A40
MILC	Total Time (sec)	Apex Medium	no	31,577	6,005	3,094	1,701	1,034
MILC	NRF	Apex Medium	yes	1x	5x	9x	17x	27x

Source: NVIDIA

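Across all applications, performance improved as more GPUs were added. The molecular dynamics codes (AMBER, GROMACS, NAMD) scaled close to linearly from one to eight GPUs, while SPECFEM3D and MILC ran roughly six times faster on eight GPUs than on one. The NVIDIA A40 GPU accelerates scientific computing software by targeting specific functionalities for hardware acceleration.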

In molecular dynamics simulations (AMBER, NAMD), this includes:
PMEMD (Particle Mesh Ewald Molecular Dynamics) for efficient electrostatic interaction calculations.
GB (Generalized Born) implicit solvent model for faster simulation of solvent effects on biomolecules.
For SPECFEM3D, the A40 leverages OpenCL and CUDA hardware accelerators to improve performance.
In lattice quantum chromodynamics (MILC), the A40 accelerates:
Staggered fermions calculations.
Krylov solvers for solving large systems of equations.
Gauge-link fattening technique for improved simulation accuracy.
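To give a sense of what a Krylov solver does, the sketch below implements the conjugate gradient method, one of the best-known Krylov methods, for a small symmetric positive-definite system. It is a plain-Python illustration only. MILC's GPU solvers operate on vastly larger lattice operators and are far more elaborate, and the function name and test matrix here are hypothetical.

```python
def conjugate_gradient(matrix, rhs, tolerance=1e-10, max_iterations=100):
    """Solve matrix @ x = rhs for a symmetric positive-definite matrix."""
    x = [0.0] * len(rhs)
    residual = list(rhs)  # r = b - A x, with the initial guess x = 0
    direction = list(residual)
    rs_old = sum(r * r for r in residual)
    for _ in range(max_iterations):
        if rs_old ** 0.5 < tolerance:
            break  # residual norm is small enough
        a_dir = [sum(a * d for a, d in zip(row, direction)) for row in matrix]
        step = rs_old / sum(d * ad for d, ad in zip(direction, a_dir))
        x = [xi + step * d for xi, d in zip(x, direction)]
        residual = [r - step * ad for r, ad in zip(residual, a_dir)]
        rs_new = sum(r * r for r in residual)
        # The new search direction is conjugate to the previous ones.
        direction = [r + (rs_new / rs_old) * d for r, d in zip(residual, direction)]
        rs_old = rs_new
    return x

solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(solution)
```

Each iteration is dominated by a matrix-vector product and a few dot products, which are the kind of highly parallel operations that map well onto a GPU.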

The NVIDIA A40 GPU demonstrates substantial computational advantages across various scientific applications. Its ability to scale and replace multiple CPU-only nodes with fewer GPU-accelerated nodes proves its high performance and energy efficiency. These attributes render it a powerful solution for complex scientific computations, offering a cost-effective and performance-boosting upgrade to traditional CPU-based systems.

NVIDIA A40 price

The NVIDIA A40 GPU is primarily designed for data centers, but you don't necessarily need to own one to take advantage of its capabilities. Cloud service providers like CUDO Compute offer rental options, making the A40 accessible for various use cases.

Here's a breakdown of CUDO Compute's pricing for NVIDIA A40 GPUs. Pricing starts at:

$577.10 per month
$0.79 per hour

This makes the A40 a cost-effective option for different applications. You can start using the NVIDIA A40 GPU now.
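To estimate what a job might cost at these starting prices, the sketch below multiplies GPU count, hours, and the hourly rate, and works out how many hours of hourly billing equal the monthly price. It assumes the quoted prices are per GPU and ignores storage, networking, and any other charges. The function name hourly_cost is hypothetical, and actual pricing may differ.

```python
HOURLY_RATE_USD = 0.79      # starting hourly price quoted above
MONTHLY_RATE_USD = 577.10   # starting monthly price quoted above

def hourly_cost(gpus: int, hours: float) -> float:
    """Cost in USD of renting `gpus` A40s for `hours` at the hourly rate."""
    return gpus * hours * HOURLY_RATE_USD

# Hours of hourly billing that add up to the monthly price.
break_even_hours = MONTHLY_RATE_USD / HOURLY_RATE_USD

print(f"4 GPUs for 24 hours: ${hourly_cost(4, 24):.2f}")
print(f"Hourly billing reaches the monthly price after {break_even_hours:.1f} hours")
```

The break-even point of about 730 hours is roughly the length of an average month, so on these figures the monthly price corresponds to running a GPU continuously at the hourly rate.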

