
Designing AI Factories: Power, cooling & layout


AI workloads are forcing data center (DC) operators to rethink the fundamentals of facility design. GPU-dense clusters create concentrated power draw, reject far more heat per node, and push airflow systems beyond what conventional layouts were built to handle. Facilities engineered in the mid-2010s for 5–10 kW racks now strain under the 30–100 kW densities common in large-scale training and inference.

NVIDIA's AI factory concept—purpose-built facilities designed around GPU-dense clusters—treats these power densities as baseline, with next-generation architectures like GB200 NVL72 pushing toward 120 kW per rack.

These pressures have shifted the design focus from incremental optimization to grid-to-rack engineering. Hardware that draws several times more power than legacy servers and depends on liquid cooling requires new electrical and thermal paths rather than retrofits.

The new generation of AI-ready facilities starts with physics. Power delivery efficiency and thermal management now define computational performance as much as the GPUs themselves. Modern designs employ medium-voltage distribution to minimize transmission losses, 48V rack-level conversion for improved efficiency, and direct liquid cooling to handle heat densities that would overwhelm traditional air systems.

These aren't optional upgrades; inefficient power delivery and inadequate cooling directly translate to higher costs per token and slower model convergence. Building an AI-ready data center means aligning the entire environment—power, cooling, and layout—with the realities of accelerated compute.

Design principles: what “AI-ready” really means


Designing a facility for AI compute starts with the simple premise that training throughput is the priority metric. Every subsystem (electrical, thermal, mechanical) either preserves or erodes it. An AI-ready hall is one where power and cooling never become the bottleneck.

Throughput as the constraint: Large training runs are highly sensitive to small fluctuations in the hardware supporting them. A minor voltage sag, a small rise in coolant temperature, or a few seconds of pump instability all increase tail latency at scale. When GPUs hit thermal or power limits, they don't fail; they throttle, silently stretching step times. Facilities therefore have to be resilient to faults at every level, and transient events must not reach the workload.

Thermal limits define hardware limits: Legacy data centers were built around 5–10 kW racks; AI clusters now routinely push 30–80 kW, and rack-scale liquid-cooled solutions are now specified for 100 kW per rack, with some vendors already demonstrating configurations of 150 kW and above.

At these densities, the heat flux simply outruns air. Air systems still shape the boundary conditions, but liquid cold plates become the primary thermal path. Rising processor TDPs make this shift inevitable: without direct-to-chip cooling, facilities hit diminishing returns where fans consume disproportionately more power for marginal cooling gains, while the chips themselves leak more current and waste more energy as temperatures rise.

When GPUs exceed their designed operating temperatures, they automatically reduce clock speeds to protect themselves, which is immediately reflected in longer convergence times and higher cost per token.

Serviceability without disruption: High-density liquid-cooled racks must be serviceable without coolant drainage or compute interruptions. Modern cold plate deployments rely on quick-disconnect couplings — connectors that separate without leaking — and rear-of-rack manifolds that distribute coolant to individual servers.

That said, component quality varies widely across vendors—quick-disconnect reliability, cold plate thermal performance, and manifold durability all differ enough to affect long-term serviceability. That's why our infrastructure partnerships prioritize vendors with proven high-density deployments, because the operational costs of unreliable components far exceed any savings at procurement.

These designs let operators isolate and service a single node without draining the entire cooling loop or affecting neighboring hardware. This approach has become the industry standard for maintaining high-density racks in production.

To turn maintenance into a hot-swap operation rather than a disruptive event, physical design is key: front-accessible components where rack geometry allows (liquid manifolds often route to the rear to avoid cable conflicts), precisely planned cable management to keep pathways clear, and strategically placed isolation valves.

Modularity as a reliability tool: Modern AI deployments are organized into pods, bounded units of racks, power, and cooling that act as scaling increments. Vendor architectures now formalize this with scalable units that can be added, isolated, or serviced as discrete blocks.

Industry reference designs and data center guidance treat these modular blocks as a practical way to contain the impact of faults while simplifying phased rollouts. Done well, this keeps localized problems and maintenance work from affecting neighboring pods.

For inference workloads, container orchestrators like Kubernetes can autoscale around failures with minimal disruption. Training is less forgiving, as any node failure forces a rollback to the last checkpoint and workload migration. Emerging orchestration platforms are beginning to address training resilience more intelligently, but robust infrastructure remains the primary defense.

Before deploying production hardware, facilities must pass integrated testing that simulates actual training workloads. This means running real power draws, thermal loads, and network patterns, not just checking against rated capacities. Typical acceptance suites start with CPU/GPU burn-in for 24–48 hours to verify hardware health, then move to performance validation: NCCL for network throughput, HPL for GPU compute, and storage I/O testing across a range of read/write workloads using both small and large files to confirm the system meets design specifications.
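To make the sequencing concrete, here is a minimal sketch of such an acceptance run in Python. The command lines, sizes, and pass criteria are illustrative assumptions (gpu-burn, nccl-tests' all_reduce_perf, HPL's xhpl, and fio are commonly used for these phases, but every site wires them into its own launcher and thresholds):

```python
import json
import subprocess
import sys

# Illustrative acceptance phases; commands, sizes, and ordering are examples,
# not a vendor-specified suite. Adapt to the cluster's launcher and tooling.
PHASES = [
    ("burn_in", ["./gpu_burn", "3600"]),                        # hour-long GPU stress (gpu-burn)
    ("nccl",    ["mpirun", "-np", "8", "./all_reduce_perf",     # nccl-tests collective bandwidth
                 "-b", "8", "-e", "8G", "-f", "2"]),
    ("hpl",     ["mpirun", "-np", "8", "./xhpl"]),              # HPL for sustained GPU compute
    ("storage", ["fio", "--name=mixed", "--rw=randrw", "--bs=4k",
                 "--size=10G", "--numjobs=8", "--direct=1"]),   # small-block mixed I/O
]

def run_phase(name, cmd):
    """Run one validation phase, capturing the exit code and the tail of its output."""
    print(f"[{name}] {' '.join(cmd)}")
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
        return {"phase": name, "rc": result.returncode, "tail": result.stdout[-2000:]}
    except FileNotFoundError:
        return {"phase": name, "rc": 127, "tail": "tool not installed on this node"}

if __name__ == "__main__":
    report = [run_phase(name, cmd) for name, cmd in PHASES]
    failed = [r["phase"] for r in report if r["rc"] != 0]
    print(json.dumps({"failed": failed, "phases": report}, indent=2))
    sys.exit(1 if failed else 0)
```

In practice the burn-in phase runs for 24–48 hours, and results are compared against per-SKU baselines rather than a simple pass/fail exit code.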

Commissioning teams now require full-stack validation under realistic stress, with power, cooling, controls, and orchestration running concurrently before certifying a facility for AI workloads. These core principles of throughput, thermal management, maintainability, and modularity then guide every implementation detail, from selecting medium-voltage equipment to routing coolant manifolds.

Power architecture: From utility to GPU


The power chain in an AI-ready data center must handle both the massive scale and the dynamic nature of modern training workloads. Unlike traditional IT loads, which remain relatively steady, AI clusters can swing from idle to peak power in milliseconds, creating electrical transients that propagate from the GPU back to the utility feed. Every conversion stage, protection device, and distribution path must be engineered to absorb these swings without voltage sags, harmonic distortion, or cascading failures.

Medium voltage design for resilience and flexibility:

Modern AI facilities start with medium voltage (MV) distribution at 12.47 kV or 24.9 kV, reducing transmission losses and cable sizes compared to low-voltage feeds. The architecture typically employs dual utility feeds from separate substations, or a utility feed paired with on-site generation: either natural gas turbines or, increasingly, battery energy storage systems (BESS) for shorter-duration backup.

The challenge with island-mode operation for AI workloads is that training algorithms execute synchronously across GPUs, creating oscillating load profiles that can destabilize generators not designed for such rapid transients. Modern facilities address this with synchronous condensers, flywheel systems, or an inline UPS that provides a sub-millisecond response to load steps, acting as shock absorbers between the power plant and the compute load.

The MV switchgear includes automatic transfer switches (ATS) with closed-transition capability, allowing seamless transfers between sources without interrupting compute. From the MV level, transformers step down to 480V for distribution to IT halls, with some facilities now exploring direct 480V-to-48VDC conversion to eliminate an entire conversion stage.

48V distribution cuts losses and copper:

The shift from 12V to 48V rack-level distribution reduces current by 4x at the same power delivery, cutting distribution losses by 16x since losses scale with the square of current. This becomes critical at AI densities where racks pull 100+ kW — at 12V, a single rack would require over 8,000 amps, making cable sizing and connector reliability nearly impossible.
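The arithmetic behind that claim is straightforward. The short Python sketch below compares current and resistive loss for the same 100 kW rack at 12 V and 48 V; the busbar resistance is an assumed round number used only to show the scaling:

```python
# Same rack power, two distribution voltages. The resistance value is an
# illustrative assumption; only the ratios matter here.
RACK_POWER_W = 100_000
BUS_RESISTANCE_OHM = 0.0002   # assumed end-to-end busbar/connector resistance

for voltage in (12, 48):
    current = RACK_POWER_W / voltage            # I = P / V
    loss_w = current ** 2 * BUS_RESISTANCE_OHM  # P_loss = I^2 * R
    print(f"{voltage:>2} V: {current:>7,.0f} A, distribution loss ~{loss_w / 1000:.1f} kW")

# 12 V needs ~8,333 A; 48 V needs ~2,083 A: a quarter of the current and
# one sixteenth of the I^2R loss for the same delivered power.
```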

Industry momentum is clearly toward 48VDC at the rack, with future architectures likely to push toward even higher-voltage DC distribution to further reduce conversion stages and losses.

Google reported a 30% reduction in conversion losses and a 16x reduction in distribution losses when moving to 48V architectures with direct point-of-load conversion. The 48V bus powers the power shelves in each rack, which then convert directly to the voltages required by GPUs (typically 0.8-1.2V) using high-efficiency voltage regulator modules (VRMs). Modern switched-capacitor converters achieve over 98% efficiency for the 48V-to-12V intermediate conversion when required for legacy components.

Busways rated for 400-600A at 48VDC run overhead or under raised floors, with tap boxes every few meters to feed rack power shelves. In some high-density deployments, these busways use laminated copper bars with integrated cooling channels. Redundant power shelves in each rack provide N+1 redundancy at the point of use, eliminating single points of failure.


Selective UPS strategies for AI workloads:

AI training creates dramatic power fluctuations as GPUs synchronize their operations, requiring UPS systems that can handle load swings of 50% or more in milliseconds without triggering protection circuits. Rather than protecting the entire facility with UPS, selective deployment focuses battery backup on critical subsystems.

For us, requirements vary by client, but at minimum, this typically covers storage, core networking, out-of-band management infrastructure, CDUs, and rear-door heat exchangers—the systems that preserve job state and prevent thermal runaway if power fluctuates.

The control plane, including job schedulers, network controllers, and storage metadata servers, receives complete UPS protection with runtime targets of 10-15 minutes. These systems must stay online to preserve job state and enable orderly shutdown if needed. Compute nodes often rely on "UPS-lite" configurations: capacitor banks or small lithium-ion modules that provide just 30-60 seconds of ride-through, enough to handle utility transfers or brief sags but not extended outages.
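A quick energy calculation shows why this split makes economic sense. The figures below are illustrative assumptions (control-plane and pod loads vary widely by design), but the asymmetry between full-runtime UPS and short ride-through is representative:

```python
def backup_energy_kwh(load_kw: float, runtime_s: float) -> float:
    """Energy the backup system must deliver to carry load_kw for runtime_s seconds."""
    return load_kw * runtime_s / 3600.0

control_plane_kw = 200    # schedulers, network controllers, storage metadata (assumed)
compute_pod_kw = 1_200    # one pod of GPU racks (assumed)

print("Control plane, 15 min full UPS:",
      round(backup_energy_kwh(control_plane_kw, 15 * 60), 1), "kWh")   # 50.0 kWh
print("Compute pod, 45 s ride-through:",
      round(backup_energy_kwh(compute_pod_kw, 45), 1), "kWh")          # 15.0 kWh
```

Carrying the compute pod for the full 15 minutes would require roughly 300 kWh of stored energy per pod, which is why extended battery runtime is usually reserved for the control plane.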

Modern AI-tolerant UPS systems use lithium-ion batteries that handle high-frequency charge/discharge cycles better than traditional VRLA batteries, which struggle with rapid load changes above 110% rated capacity. The UPS includes predictive controls that anticipate load ramps based on job-scheduler signals and pre-position inverters to minimize voltage deviation during transitions.

Grounding and bonding for liquid-cooled systems:

Liquid cooling introduces new grounding challenges, as conductive coolants create potential paths for electrical faults. The facility requires a comprehensive mesh bonding network (MESH-BN) that bonds all metallic components, including liquid manifolds, cold plates, and piping, to a common ground reference.

TIA-942 specifies that computer room grounding should begin with a signal reference grid (SRG) of #6 AWG copper installed beneath raised floors or above ceilings, with equipment racks bonded to this grid to minimize ground loops and electromagnetic interference. Liquid cooling components require additional considerations:

  • All metallic pipes and manifolds must be bonded to the SRG using flexible copper straps that accommodate thermal expansion
  • Isolation transformers for pumps and CDUs prevent ground loops between the cooling system and IT equipment
  • Dielectric unions at rack connections prevent galvanic corrosion while maintaining safety grounding through parallel conductors.

The isolated ground system for sensitive electronics connects to the building ground at only a single point, using the highest-impedance conductor allowed by code to minimize noise coupling while maintaining safety. Ground impedance must be kept below 5 ohms, with 1 ohm or less preferred for facilities with extensive liquid cooling.

Leakage detection integrated with power control:

Modern leak detection systems use sensing cables that can detect both water and glycol-based coolants, with location accuracy within 1 meter along runs that can extend hundreds of feet. These cables use conductive polymers that change resistance when exposed to liquid, allowing the monitoring system to pinpoint leak locations.

The leak detection system interfaces with the emergency power off (EPO) logic via the building management system (BMS), but confirmation matters more than speed. Strip-style sensing cables are prone to false alarms from condensation, humidity shifts, or residual moisture from maintenance, so robust designs require multi-sensor confirmation before escalating responses. Upon confirmed detection, the BMS can automatically close solenoid valves to isolate affected cooling zones and reduce IT load in impacted racks through server management interfaces.

Selective EPO for specific rack zones remains an option, but automatic EPO triggers should be approached with caution—an unnecessary power cut to a running training cluster often causes more damage than a contained leak.

Detection zones are hierarchical, with responses calibrated to confirmed severity (a simple escalation sketch in code follows the list):

  • Rack-level sensors trigger local valve isolation and server throttling after secondary confirmation
  • Row-level detection initiates cooling zone isolation and load migration
  • Room-level detection can trigger full EPO, but typically requires operator confirmation unless the leak directly threatens electrical infrastructure
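A minimal sketch of that confirm-then-escalate logic, in Python, is shown below. The zone names mirror the list above; the sensor counts, BMS actions, and thresholds are illustrative assumptions rather than any specific vendor's interface:

```python
from dataclasses import dataclass

@dataclass
class LeakEvent:
    zone: str              # "rack", "row", or "room"
    sensors_tripped: int   # independent sensors reporting liquid in the same zone
    near_electrical: bool = False

def respond(event: LeakEvent) -> str:
    """Return the BMS action for a leak event, requiring confirmation before escalating."""
    if event.sensors_tripped < 2:
        return "log_and_alert_operator"          # single trip: likely condensation or residue
    if event.zone == "rack":
        return "close_rack_isolation_valve; throttle_servers"
    if event.zone == "row":
        return "isolate_cooling_zone; migrate_load"
    # Room level: automatic EPO only if electrical infrastructure is directly threatened
    return "trigger_epo" if event.near_electrical else "request_operator_confirmation"

print(respond(LeakEvent("rack", sensors_tripped=1)))   # log_and_alert_operator
print(respond(LeakEvent("room", sensors_tripped=3)))   # request_operator_confirmation
```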

Liquid-cooling implementations require special attention to routing: codes increasingly prohibit coolant pipes from passing through electrical rooms, even with drip trays, forcing alternative routing paths that add complexity but reduce the risk of catastrophic failure.

The integration between leak detection, power systems, and cooling controls must be tested regularly. Monthly drip tests verify sensor sensitivity, quarterly valve actuations confirm isolation capability, and annual integrated tests validate the complete EPO sequence, including selective shutdown based on leak location and severity where required.

Rack layout & density: airflow, cable plant, and human factors

Even in predominantly liquid-cooled deployments, air remains the boundary condition. GPUs reject heat to cold plates, but power supplies, NICs, storage, and switch ASICs still rely on forced air cooling. The facility must manage both thermal paths without creating conflicts—liquid manifolds that block airflow channels, or hot exhaust recirculating into intake zones.

Front-to-back airflow discipline with hybrid cooling:

Modern AI racks employ hybrid thermal architectures: direct-to-chip liquid cooling handles 70-80% of the heat load from GPUs and memory, while forced air removes the remaining 20-30% from ancillary components. This requires maintaining traditional front-to-back airflow even as liquid manifolds occupy rear-of-rack space.

Facilities typically choose between hot-aisle containment and rear-door heat exchangers (RDHx)—combining the two is uncommon. Either approach must integrate with manifold assemblies without creating bypass paths.

The critical failure mode is hot-aisle bleed: when containment seals degrade or manifold penetrations aren't properly gasketed, exhaust air mixes with intake air, raising inlet temperatures. A 3-5°C rise in inlet temperature propagates directly to GPU baseplate temperatures, which can push chips into thermal throttling even when liquid cooling is functioning perfectly.

Facilities use computational fluid dynamics (CFD) validation during design, then verify with thermal mapping during commissioning—infrared surveys at server intakes to confirm temperature uniformity across the rack face. Real-time thermal monitoring at the node level validates that service operations don't affect neighboring hardware—temperature drift on adjacent cold plates during a swap indicates inadequate isolation.
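One way to automate that adjacency check is to compare live cold-plate temperatures against a pre-service baseline and flag any neighbor that drifts. The node names, baseline values, and 3°C limit below are illustrative assumptions:

```python
BASELINE_C = {"node-07": 52.0, "node-08": 51.5, "node-09": 52.8}  # readings before the swap
DRIFT_LIMIT_C = 3.0   # assumed allowable rise on neighbouring cold plates during service

def drifting_neighbors(serviced: str, current: dict) -> list:
    """Return neighbours whose cold-plate temperature rose past the limit during service."""
    return [
        node for node, temp in current.items()
        if node != serviced and temp - BASELINE_C[node] > DRIFT_LIMIT_C
    ]

# Readings taken while node-08 is pulled for replacement
live = {"node-07": 53.1, "node-08": 0.0, "node-09": 56.4}
print(drifting_neighbors("node-08", live))   # ['node-09'] -> thermal isolation is inadequate
```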


Cable management as operational infrastructure:

High-density racks generate three distinct cable plants: power (48V busbars or heavy-gauge DC feeds), networking (100GbE or 400GbE copper/fiber), and liquid connections (supply/return manifolds with quick-disconnects). Managing these without creating service impediments requires deliberate routing discipline.

Modern deployments standardize on:

  • Overhead cable tray, underfloor routing, or trench space for power distribution—keeping primary pathways clear for liquid piping and allowing gravity drainage of any leaks away from electrical systems
  • Top-of-rack (ToR) fiber staging with pre-terminated trunk cables and modular cassettes that allow port reconfigurations without re-pulling individual strands
  • Under-floor or side-of-rack liquid manifolds with vertical risers to each server, using flexible hoses with reinforced strain relief at connection points

The patching topology matters operationally. Spine-leaf architectures with ToR switches create local patch zones. While reconfigurations on large-scale clusters are rare, necessary interventions, such as replacing failed cabling or transceivers, happen at the rack level rather than requiring runs back to aggregation points. Patching strategies vary by installation, but best-practice provisions reserve cabling capacity from the outset. This ensures a single failure doesn't require running new cables—an operation that risks disturbing adjacent fiber bundles and causing secondary failures.

Cable routing must preserve minimum service clearances where possible: 1.2m in front of racks for component access, 1.0m behind for manifold service, with aisle widths accommodating server rail extension without blocking adjacent access. These are ideal targets, though not always practicable in dense deployments. Facilities that undersize aisles create cascading delays—one rack's maintenance blocks neighboring racks, serializing what should be parallel service operations.

Maintenance envelopes for rapid node replacement:

Traditional data centers treated node replacement as a disruptive event: drain cooling loops, power down rack segments, physically extract failed hardware, reverse the process. AI training workloads running synchronized distributed operations cannot tolerate these delays.

Modern liquid-cooled rack designs enable hot-swap maintenance through several key mechanisms:

Quick-disconnect (QD) couplings at every server allow individual nodes to be isolated without draining the rack manifold. Industrial-grade QDs use spring-loaded seals that couple and decouple in seconds with under 1 mL of spillage, meeting IP67 sealing standards. These aren't optional connectors—they're the critical component that determines whether maintenance takes minutes or hours.

Drawer-style server chassis with front-accessible components and rear liquid connections. Technicians release mechanical latches, disconnect QDs at the rear, and slide the server forward on rails. The entire sequence—from identifying the failed node to bringing replacement hardware online—aims to minimize time-to-repair for single-node swaps during production operations, though actual MTTR varies by GPU generation, failure mode, and OEM.

Redundant cooling loops at the rack level, with manifolds providing N+1 capacity. If a cold plate develops a leak or a QD fails, isolation valves confine the failure to a single server position while the rack continues operating. This requires pressure sensors and flow meters at rack granularity, feeding into BMS controls that can automatically isolate faulted segments.

The operational test is simple: can a technician replace a failed GPU node during a multi-rack training run without impacting job completion time? If the answer is no—if the job must checkpoint and pause—the facility hasn't achieved true hot-swap capability, regardless of the hardware's rated features.

Water systems, heat reuse & sustainability


Liquid cooling shifts thermal management from air handling to water infrastructure, making the facility's water loop as critical as electrical distribution. At 100kW per rack, a 1,000-rack facility rejects 100MW of heat into water—requiring primary-loop flow rates on the order of 25,000-40,000 GPM, depending on design ΔT.

Facility water loop architecture:

Modern AI facilities employ a two-stage cooling architecture: a primary loop using filtered, treated water (or water-glycol mixture) circulates through rack cold plates and rear-door heat exchangers, while a secondary loop transfers heat to outdoor cooling towers or dry coolers.

The primary loop operates as a closed system with stringent water quality requirements:

  • Conductivity < 100 μS/cm to prevent galvanic corrosion in dissimilar metal connections
  • pH maintained at 7.5-8.5 to minimize corrosion rates
  • Particulate filtration to 5 microns to prevent cold plate channel fouling
  • Dissolved oxygen < 0.5 ppm to reduce oxidation of copper and aluminum components

Coolant distribution units (CDUs) circulate primary-loop fluid, typically in warm-water mode with supply temperatures up to ~45 °C (kept above room dew point to avoid condensation), though many deployments choose ~30–40 °C to maximize chiller-less operation. Racks are often designed around a ~10–15 °C coolant ΔT to balance pump power against component temperatures; higher ΔT can further reduce flow and energy if cold-plate and manifold design maintain good flow distribution and avoid chip hot spots. Primary flow is commonly sized at roughly 1 LPM per kW of liquid-cooled load (on the order of 60–100 LPM for a 100 kW rack) at ~2–3 bar differential pressure, and CDUs are usually oversized by ~20–25% to accommodate future rack density growth.
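These flow figures follow directly from the heat balance Q = ṁ·cp·ΔT. The sketch below checks both the per-rack and whole-facility numbers; the 80% liquid-cooled fraction and the ΔT values are assumptions carried over from earlier in the article:

```python
CP_WATER = 4186.0   # J/(kg*K); glycol mixtures have slightly lower specific heat

def flow_lpm(heat_kw: float, delta_t_c: float) -> float:
    """Litres per minute of water needed to carry heat_kw at a temperature rise of delta_t_c."""
    kg_per_s = heat_kw * 1000.0 / (CP_WATER * delta_t_c)
    return kg_per_s * 60.0   # roughly 1 litre per kg for water

rack_liquid_kw = 80   # assumed: ~80% of a 100 kW rack rejected to cold plates
print("Per rack @ 15 C dT:", round(flow_lpm(rack_liquid_kw, 15)), "LPM")   # ~76 LPM
print("100 MW facility @ 12 C dT:",
      round(flow_lpm(100_000, 12) / 3.785), "GPM")                         # ~31,500 GPM
```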

The secondary loop interfaces with outdoor cooling infrastructure—induced-draft cooling towers in humid climates, or hybrid fluid coolers in arid regions where water consumption matters. Free-cooling (using outdoor air to reject heat without mechanical chillers) becomes viable when ambient wet-bulb temperatures fall below the primary loop return temperature minus the heat exchanger approach temperature.

Leak detection integrated across facility systems:

Liquid cooling introduces the operational risk that legacy air-cooled facilities never faced: coolant escaping containment. Modern deployments treat leak detection not as an afterthought but as a critical safety layer integrated with power and thermal management.

Multi-layer detection topology:

  • Point sensors at every rack-level manifold joint and server QD connection
  • Sensing cables beneath raised floors along primary pipe runs, with zone isolation every 10-15 meters
  • Drip pans with secondary containment under CDUs and major distribution headers

The detection system feeds into hierarchical response logic through the BMS:

  • Rack-level leak → isolate affected server position, throttle IT load if thermal capacity degrades
  • Row-level detection → close zone isolation valves, redistribute cooling to adjacent rows
  • Room-level alert → evaluate EPO trigger based on proximity to electrical infrastructure

False positive suppression matters operationally. Condensation during seasonal transitions, residual moisture from commissioning, or humidity migration into cold zones can all trigger sensors. Detection systems now use dual-confirmation logic—both resistance change and optical sensing—before initiating automatic isolation, with manual override for maintenance scenarios.
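A small sketch of that dual-confirmation behavior is shown below: isolation is only requested when both sensing modalities agree for several consecutive polls, and a maintenance override suppresses automatic action. The window length and poll semantics are illustrative assumptions:

```python
from collections import deque

CONFIRM_POLLS = 5                      # consecutive agreeing polls required (assumed)
history = deque(maxlen=CONFIRM_POLLS)

def poll(resistance_trip: bool, optical_trip: bool, maintenance_mode: bool = False) -> str:
    """Record one detection poll and return the action the BMS should take."""
    history.append(resistance_trip and optical_trip)
    if maintenance_mode:
        return "suppressed"                       # manual override during planned service work
    if len(history) == CONFIRM_POLLS and all(history):
        return "confirmed_leak"                   # hand off to zone isolation logic
    return "monitoring"

# Condensation often trips the resistive cable alone; both modalities must agree.
print(poll(resistance_trip=True, optical_trip=False))   # monitoring
```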

Heat reuse and sustainability integration:

At 100MW thermal load, AI facilities become meaningful heat sources for district energy systems or industrial process heat consumers. Reusing waste heat improves facility economics and sustainability metrics, but requires careful integration with cooling system design.

District heating integration is most viable in cold climates with established infrastructure. Primary loop return temperatures of 50-60°C match well with low-temperature district heating networks. The facility acts as a heat source, selling thermal energy to offset cooling costs. Helsinki, Finland, for example, has a data center strategy that explicitly encourages heat reuse into the municipal heating network, with facilities achieving 40-60% heat recovery.

However, heat reuse limits the cooling system's flexibility. Higher loop temperatures reduce cooling efficiency when ambient conditions allow free cooling. Facilities must balance heat sales revenue against increased cooling energy consumption. Economic viability depends on local energy prices—cheap electricity and expensive heat create favorable conditions, while the reverse makes heat reuse financially marginal.
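A toy annual comparison makes the trade-off concrete. Every number below is an illustrative assumption (recovered heat, energy prices, and the efficiency penalty differ by site and contract), but the structure of the calculation is the point:

```python
heat_sold_mwh = 30_000          # thermal energy delivered to the district network per year
heat_price_eur_per_mwh = 25     # assumed district-heating purchase price
extra_cooling_mwh = 4_000       # extra pump/chiller electricity from running loops hotter
power_price_eur_per_mwh = 80    # assumed facility electricity price

revenue = heat_sold_mwh * heat_price_eur_per_mwh
extra_cost = extra_cooling_mwh * power_price_eur_per_mwh
print(f"Heat revenue: {revenue:,.0f} EUR; extra cooling cost: {extra_cost:,.0f} EUR; "
      f"net: {revenue - extra_cost:,.0f} EUR")
# Cheap electricity or expensive heat tips the result positive; the reverse makes it marginal.
```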

In practice, heat reuse programs are typically implemented at the data center provider level, with benefits flowing through as reduced facility operating costs rather than line-item credits visible to tenants.

Make-up water and treatment protocols:

Even with a closed-loop primary system, facilities still require make-up water to replace evaporation in secondary cooling towers (1-2% of circulation rate) and to accommodate system leakage. Water treatment becomes critical for loop longevity:

  • Pre-treatment: Deionization or reverse osmosis to achieve the target conductivity before initial fill
  • Biocide dosing: Prevents bacterial growth in sumps and cooling tower basins, typically quaternary ammonium compounds at 15-30 ppm
  • Corrosion inhibitors: Molybdate or nitrite-based compounds for ferrous metals, azoles for copper protection
  • Side-stream filtration: Continuous 5-10% loop flow through bag filters to remove particulates generated by corrosion or biofilm.

Water quality monitoring occurs at multiple points—CDU supply, rack returns, and cooling tower basins—with automatic chemical dosing systems maintaining parameters within spec. Facilities typically budget for a complete primary loop changeout every 3-5 years as preventive maintenance, though well-maintained systems can operate longer.

Realistic PUE targets for liquid-cooled AI facilities:

Power Usage Effectiveness (PUE) has become the standard sustainability metric, calculated as total facility power divided by IT equipment power. Legacy data centers with air cooling typically achieve a PUE of 1.4-1.6. High-density liquid-cooled AI facilities can improve this, but not to the dramatic numbers sometimes claimed in marketing materials.

Realistic targets for modern AI facilities:

  • Liquid-cooled with free cooling: PUE 1.15-1.25 in temperate climates, 1.25-1.35 in hot/humid regions. Facilities in extreme cold climates, like Iceland, can push below 1.10 with near-continuous free cooling.
  • Hybrid liquid + air: PUE 1.20-1.30 depending on the ratio of direct-to-chip vs. air-cooled components
  • Legacy air-cooled retrofits: PUE 1.35-1.50 due to oversized CRAC units and inefficient distribution

The efficiency gains come from eliminating CRAC units (which consume 15-25% of IT load in air-cooled facilities) and enabling higher free-cooling hours. However, liquid cooling introduces new parasitic loads: CDU pumps, control systems, and leak detection add 2-3% overhead. The net improvement is real but incremental, not the order-of-magnitude change sometimes suggested.
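PUE itself is just total facility power divided by IT power, so the component overheads described above can be composed directly. The percentages in the sketch below are illustrative and chosen to land inside the ranges quoted earlier:

```python
def pue(it_load_mw: float, overheads: dict) -> float:
    """Total facility power divided by IT power, with overheads as fractions of IT load."""
    return (it_load_mw + it_load_mw * sum(overheads.values())) / it_load_mw

air_cooled = {"crac_and_chillers": 0.30, "power_conversion": 0.08, "lighting_misc": 0.04}
liquid_cooled = {"cdu_pumps_controls": 0.03, "heat_rejection": 0.08,
                 "power_conversion": 0.06, "lighting_misc": 0.03}

print("Air-cooled PUE ~", round(pue(50, air_cooled), 2))        # ~1.42
print("Liquid-cooled PUE ~", round(pue(50, liquid_cooled), 2))  # ~1.20
```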

Facilities pursuing aggressive PUE targets must balance efficiency with reliability. Running cooling systems at minimum capacity reduces overhead but eliminates thermal margin, leaving equipment vulnerable to failures or unexpected load spikes. Operators typically target PUE at 80-85% load rather than optimizing for peak efficiency at 100% utilization.



Emmanuel Ohiri

Dec 9, 2025, 6:45 PM
