Intelligent CIO Middle East Issue 106 | Page 77

TRAINING A STATE-OF-THE-ART IMAGE RECOGNITION MODEL CAN REQUIRE MILLIONS OF LABELLED IMAGES.
DISRUPTIVE TECH

AI workloads place high demands on network infrastructure in terms of performance, capacity and latency. These demands are difficult to meet with traditional data centre designs and technologies. AI data centre technologies can offer an alternative, providing efficient solutions for computing, storage and networking based on innovative fabric designs for back-end training and front-end inference.

Generative AI is on an unprecedented roll. Many organisations are now working with AI and Machine Learning (ML). Data centres are the foundation of AI, and data centre networks play a critical role in connecting expensive GPU servers that perform the compute-intensive processing involved in AI training.
AI training is the most technologically challenging part of the overall AI process, especially for complex deep learning models that require large amounts of data and distributed processing by GPUs to achieve optimal performance. For example, training a state-of-the-art image recognition model can require millions of labelled images.
If the network is a bottleneck, costly processing time is wasted. To speed up training, the GPUs need to be interconnected in a high-performance structure. This dedicated structure is known as the back-end fabric, which supports both GPU training clusters and storage networks, and provides high-performance, low-latency networking for each service.
Once the model has been trained, it is transferred to the AI inference phase, where it works in a real-world environment to make predictions or decisions based on new, unknown data. The AI inference clusters are connected to front-end networks that provide connectivity to the outside world, for example to handle inference requests from users or IoT devices.
Selection of network architecture
As organisations embrace AI, the first question they should ask is how to build such a data centre network for AI and ML workloads in a high-performance, cost-effective manner. GPUs and InfiniBand need to be considered first as cost drivers and limiting factors.
Modern AI and ML clusters consist of hundreds, sometimes thousands, of GPUs. They are needed to provide the massive parallel computing power required to train modern AI models.


GPUs need to work in clusters to be efficient. While scaling clusters improves the efficiency of the AI model, it also increases costs. Reducing job completion time and minimising or eliminating tail latency are key to reducing costs and increasing speed. Job completion time refers to the time it takes to train the AI model, while tail latency refers to the time the system waits for the last GPU to complete its calculations before the next training run begins.
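The relationship between tail latency and wasted GPU time can be sketched in a few lines. This is an illustrative toy model with hypothetical numbers, not a Juniper tool: in synchronous training, a step finishes only when the slowest GPU does, so every faster GPU idles until the straggler catches up.

```python
def step_time(per_gpu_seconds):
    """A synchronous training step ends only when the slowest GPU finishes."""
    return max(per_gpu_seconds)

# Hypothetical compute times for one step across four GPUs (seconds);
# the last GPU is a straggler, e.g. delayed by network congestion.
per_gpu = [10.0, 10.1, 10.2, 14.5]

jct = step_time(per_gpu)                 # job completion time for the step
wasted = sum(jct - t for t in per_gpu)   # GPU-seconds spent idle, waiting

print(jct)      # 14.5 — the step is gated by the slowest GPU
print(wasted)   # ~13.2 GPU-seconds of paid-for compute doing nothing
```

The sketch shows why a lossless, low-latency fabric matters: shaving the straggler's network delay reduces every other GPU's idle time at once.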
With the need to optimise GPU performance, Ethernet in particular is becoming an increasingly important open networking alternative for AI data centres. In the past, InfiniBand, a proprietary high-speed, low-latency networking technology, was often the first choice for fast and efficient communication between servers and storage systems.
However, Ethernet is increasingly being used because of its operational and cost benefits. Unlike a proprietary InfiniBand network, there are many network professionals who can set up and operate an Ethernet network.
Ethernet is therefore an ideal solution to meet the specific requirements of AI applications, especially with its high throughput and low latency. Network technology is constantly evolving, with recent innovations such as 800 GbE and Data Centre Bridging (DCB) increasing speed, reliability and scalability. Improvements also include congestion management, load balancing, minimised latency for job completion time optimisation, and simplified management and automation. This makes Ethernet fabrics ideal architectures for mission-critical AI traffic.
Building the network fabric
Different fabric designs can be used to network AI data centres. An any-to-any non-blocking fabric is recommended to optimise the training framework.
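The capacity-planning idea behind an any-to-any non-blocking fabric can be sketched with a simple check. This is a hypothetical sizing illustration, not a vendor design rule: in a two-tier leaf-spine design, a leaf switch is non-blocking (1:1 oversubscription) when its uplink bandwidth to the spines at least matches the downlink bandwidth it offers the GPUs.

```python
def is_nonblocking(gpus_per_leaf, gpu_link_gbps, uplinks_per_leaf, uplink_gbps):
    """True if a leaf's spine-facing uplink capacity covers its GPU-facing
    downlink capacity, i.e. 1:1 oversubscription (non-blocking)."""
    downlink = gpus_per_leaf * gpu_link_gbps
    uplink = uplinks_per_leaf * uplink_gbps
    return uplink >= downlink

# Hypothetical leaf: 32 GPUs at 400 GbE each = 12.8 Tbps of downlink,
# matched by 16 x 800 GbE uplinks = 12.8 Tbps towards the spines.
print(is_nonblocking(32, 400, 16, 800))  # True  (1:1, non-blocking)
print(is_nonblocking(32, 400, 8, 800))   # False (2:1 oversubscribed)
```

Under this rough model, halving the uplinks makes the leaf a 2:1 bottleneck, which is exactly what drives up tail latency during the all-to-all traffic bursts of distributed training.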
Refat Al Karmi, Senior Consultant META, Juniper Networks