Intel Gaudi 3 AI Accelerator: What You Need to Know

Intel's Gaudi 3 chip delivers 1835 TFLOPS of FP8 compute, double its predecessor's throughput. Intel claims 50% faster training than Nvidia's H100 on average, along with superior inference performance and power efficiency. Partners including SAP, Red Hat, and VMware help enterprises scale AI models. Flexible, scalable, and power-efficient for enterprise AI initiatives.

Intel Gaudi 3 AI Accelerator

Intel is hoping to seize an edge in the AI accelerator market with its Gaudi 3 chip, which boasts 1835 TFLOPS of FP8 compute throughput, double that of its predecessor generation.

Furthermore, Intel claims the chip trains popular models 50% faster than Nvidia's H100 on average, with superior inference performance and power efficiency against leading competitors.

Accelerate AI Models

Intel's Gaudi 3 AI accelerator advances its vision for enterprise generative AI (GenAI) by supporting inference applications at competitive price points, allowing enterprises to take GenAI initiatives from pilot to production with ease. Partner companies including SAP, Red Hat, and VMware collaborate closely on developing open, scalable systems capable of supporting full enterprise AI application stacks.

Gaudi 3 doubles the processing power of its predecessor by moving to a 5nm process and adding more Matrix Math Engines and programmable Tensor Processor Cores. It delivers 1835 TFLOPS of FP8 throughput alongside improved 16-bit (BF16) performance, 128GB of memory capacity, 3.7TB/s of memory bandwidth, and 96MB of onboard SRAM for large-dataset processing. Furthermore, 24 200Gb Ethernet ports facilitate flexible system scalability over open-standard networking.
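The 128GB memory figure matters mainly for how large a model fits on a single card. A rough, illustrative sketch (the 20% overhead for KV cache and activations is an assumption, not an Intel figure):

```python
# Back-of-envelope check: which model sizes fit in Gaudi 3's 128 GB of HBM?
# Illustrative only: weights at the chosen precision plus an assumed ~20%
# overhead for KV cache and activations.

GAUDI3_HBM_GB = 128

def fits_in_hbm(params_billions: float, bytes_per_param: int, overhead: float = 0.2) -> bool:
    """Return True if the model's weights (plus overhead) fit in one card's HBM."""
    weights_gb = params_billions * bytes_per_param  # 1e9 params * N bytes ~= N GB
    return weights_gb * (1 + overhead) <= GAUDI3_HBM_GB

print(fits_in_hbm(70, 2))  # 70B model in BF16 (2 bytes/param): ~168 GB -> False
print(fits_in_hbm(70, 1))  # 70B model in FP8 (1 byte/param): ~84 GB -> True
```

Under these assumptions, a 70B-parameter model only fits on a single card when quantized to FP8, which is one reason FP8 throughput features so prominently in the spec.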


Intel recently unveiled the Gaudi 3 AI accelerator at its Vision 2024 customer and partner conference, claiming it is 50% faster on inference than Nvidia's H100 AI accelerator and 40% more power-efficient, at a fraction of the cost. Gaudi 3 will be offered to OEMs such as Hewlett Packard Enterprise, Lenovo, and Supermicro in air- and liquid-cooled versions; next year, Intel plans to move to HBM3E memory and newer TSMC manufacturing technology for improved production.

Scale Up: Intel Gaudi 3 AI Accelerator

Intel has made significant strides since its second-generation Gaudi 2 chip debuted in 2022: Gaudi 3 doubles networking bandwidth, adds 1.5X more high-bandwidth memory, and revamps the accelerator cores to improve both efficiency and performance.


Intel claims Gaudi 3 can train up to 1.7X faster than Nvidia's market-leading H100 while also offering 50% better inference and 40% greater power efficiency across models of different sizes. Furthermore, its performance scales flexibly across clusters, super-clusters, and mega-clusters for inference, fine-tuning, or training at virtually any scale.

Intel's strategy for AI data center accelerators is to compete directly with Nvidia. Gaudi 3 was specifically designed to be assembled into massive clusters for training and running large language models (LLMs), and Intel anticipates roughly a 1.7X increase in training performance over its predecessor generation.

Intel announced plans to provide Gaudi 3 to original equipment manufacturers (OEMs) in several forms and configurations: an industry-standard OAM accelerator module – Dell Technologies, Hewlett Packard Enterprise, and Lenovo, among others, plan to use it – and a 10.5-inch dual-slot PCIe card that supports up to two 400Gb Ethernet ports for linking multiple Gaudi cards together for scale-out.

Scale Out: Intel Gaudi 3 AI Accelerator

Gaudi 3 can support thousands of accelerators connected via industry-standard Ethernet using RDMA over Converged Ethernet (RoCEv2), enabling large compute clusters. This lets enterprises tailor compute clusters to specific business requirements without the vendor lock-in of proprietary networking fabrics.
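The 24 ports of 200Gb Ethernet quoted earlier translate into a large per-accelerator fabric budget. A quick illustrative calculation (peak line rates only; in practice some ports are used for scale-up inside a node and protocol overhead reduces usable throughput):

```python
# Rough scale-out bandwidth math for Gaudi 3's Ethernet fabric (illustrative,
# using the per-spec 24 x 200 GbE port count; ignores protocol overhead and
# the scale-up/scale-out split within a node).

PORTS = 24
PORT_GBPS = 200  # gigabits per second per port

total_gbps = PORTS * PORT_GBPS  # aggregate line rate per accelerator
total_gBps = total_gbps / 8     # convert gigabits to gigabytes per second

print(total_gbps, total_gBps)   # -> 4800 600.0
```

That is 4.8 Tb/s (600 GB/s) of aggregate Ethernet line rate per accelerator, which is what makes flat, switch-based RoCEv2 clusters viable in place of a proprietary interconnect.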

Intel asserts that its Gaudi accelerators can reach three times the performance and rack density of its second-generation Gaudi parts, thanks to an open hardware and software suite, the HCCL collective-communication library, and accelerated libraries that let customers use the frameworks of their choice and run GenAI models on the platform.


The company also offers Gaudi 3 in a PCIe form factor: a standard 10.5-inch full-height, dual-slot card known as the HL-338. It contains the same hardware as the OAM version, including 1835 TFLOPS of peak performance and 128GB of HBM2e memory, but with a lower TDP of 600 watts, making it more power-efficient under sustained workloads.
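Dividing the quoted peak throughput by the card's TDP gives a naive efficiency figure. This is datasheet arithmetic only: real efficiency depends on achieved utilization, and peak FP8 numbers are rarely sustained.

```python
# Naive peak-throughput-per-watt from the stated datasheet numbers
# (illustrative ratio only, not a measured efficiency figure).

PEAK_FP8_TFLOPS = 1835

def tflops_per_watt(tdp_watts: float) -> float:
    """Peak FP8 TFLOPS divided by TDP in watts."""
    return PEAK_FP8_TFLOPS / tdp_watts

print(round(tflops_per_watt(600), 2))  # HL-338 PCIe card at 600 W -> 3.06
```

By this crude measure the 600W PCIe card lands around 3 TFLOPS per watt at peak, which is the sense in which the lower TDP improves efficiency under sustained load.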


Intel's Gaudi 3 represents an impressive leap over its predecessor in compute and memory bandwidth. Built on a more advanced process node than previous generations, it improves transistor efficiency considerably, doubles network bandwidth, and adds 1.5X more HBM, allowing it to handle larger language models and multimodal models more effectively than before.

Power Efficiency: Intel Gaudi 3 AI Accelerator

Gaudi 3 was designed to deliver performance while remaining space-efficient, making it suitable for data center servers of various shapes and sizes. Intel claims it delivers twice the FP8 performance and four times the BF16 performance of its predecessor, along with increased memory capacity and bandwidth, thanks to advanced manufacturing processes that let Intel pack more transistors onto the chiplets, or tiles, that hold the accelerator logic.

Intel claims this allows it to scale from single systems with eight OAM Gaudi accelerators up to 1,024-node clusters totaling 8,192 OAMs for training and inference of leading GenAI models. Furthermore, Gaudi 3's 200Gb Ethernet networking eliminates vendor lock-in to proprietary interconnect fabrics while scaling efficiently within large compute clusters, meeting the memory and compute requirements of GenAI models of all sizes.
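The cluster sizes quoted above are simple multiples of the eight-accelerator node. An illustrative aggregate-compute calculation (peak numbers only; real workloads never scale perfectly linearly):

```python
# Scaling arithmetic for the cluster sizes quoted above (peak datasheet
# numbers only; real training jobs will not scale linearly).

ACCEL_PER_NODE = 8
NODES = 1024
PEAK_FP8_TFLOPS = 1835

accelerators = ACCEL_PER_NODE * NODES               # total OAMs in the cluster
peak_eflops = accelerators * PEAK_FP8_TFLOPS / 1e6  # 1 EFLOPS = 1e6 TFLOPS

print(accelerators, round(peak_eflops, 2))          # -> 8192 15.03
```

So the largest configuration Intel describes works out to roughly 15 exaFLOPS of peak FP8 compute, which is the scale at which the Ethernet fabric, rather than any single card, becomes the defining design constraint.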

Gaudi 3 stands in contrast to Nvidia's B200 NVLink systems, which rely on dedicated GPU-to-GPU interconnects over passive copper cables, by using fast Ethernet networks to link multiple OAM accelerators into a cluster. Intel believes this approach will reduce total cost of ownership compared with Nvidia solutions, which require dedicated network interface controllers for each GPU as well as separate networking to span servers into clusters.

FAQs – Intel Gaudi 3 AI Accelerator

1. What is the Intel Gaudi 3 AI Accelerator?

The Intel Gaudi 3 AI Accelerator is a cutting-edge chip designed to accelerate artificial intelligence (AI) workloads in data centers. It boasts significant improvements over its predecessor, offering superior compute throughput, training speed, inference performance, and power efficiency.

2. What are the key features of the Intel Gaudi 3 AI Accelerator?

The Gaudi 3 chip features 1835 TFLOPS of FP8 compute throughput, doubling the processing power of its predecessor. It supports 16-bit (BF16) precision, has a memory capacity of 128GB, and offers 3.7TB/s of memory bandwidth. Additionally, it includes 96MB of onboard SRAM and 24 200Gb Ethernet ports for enhanced scalability and networking capabilities.


3. How does the Intel Gaudi 3 compare to competitors like Nvidia’s H100?

Intel claims that the Gaudi 3 chip trains popular models 50% faster than Nvidia’s H100 on average, while also providing superior inference performance and power efficiency. This makes it a compelling choice for enterprises looking to accelerate AI initiatives.

4. What is Intel’s vision for enterprise generative AI (GenAI) with the Gaudi 3 AI Accelerator?

Intel aims to support inference applications at competitive price points, enabling enterprises to take GenAI initiatives from pilot to production seamlessly. Partner companies such as SAP, Red Hat, and VMware collaborate on developing open, scalable systems capable of supporting full enterprise AI application stacks.

5. How does the Gaudi 3 AI Accelerator facilitate scalability?

The Gaudi 3 chip can be scaled flexibly for clusters, super-clusters, or mega-clusters for inference, fine-tuning, or training at any scale imaginable. It supports thousands of accelerators connected via industry-standard Ethernet, allowing enterprises to tailor compute clusters to meet specific business requirements without vendor lock-in.

6. What form factors are available for the Intel Gaudi 3 AI Accelerator?

Intel offers the Gaudi 3 chip in various forms and configurations, including industry-standard OAM accelerator modules and 10.5-inch dual-slot PCIe cards. These options cater to different server configurations and use cases, providing flexibility for customers.

7. How does the Intel Gaudi 3 chip improve power efficiency?

Despite its impressive performance capabilities, the Gaudi 3 chip remains space- and power-efficient, making it suitable for data center servers of various shapes and sizes. It delivers twice the FP8 performance and four times the BF16 performance of its predecessor while offering increased memory capacity and bandwidth.

8. What advantages does the Intel Gaudi 3 chip offer over competitors like Nvidia?

Intel’s Gaudi 3 chip stands out from competitors by leveraging fast Ethernet networks to link multiple OAM accelerators together in a cluster, eliminating the need for dedicated interconnects and reducing total cost of ownership. Additionally, it offers superior performance, scalability, and power efficiency compared to competing solutions.