1. FlowTracer: A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters
- Author
Jamil, Hasibul; Alim, Abdul; Schares, Laurent; Maniotis, Pavlos; Schour, Liran; Sydney, Ali; Kayi, Abdullah; Kosar, Tevfik; Karacali, Bengi
- Subjects
Computer Science - Networking and Internet Architecture, Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
The increasing complexity of AI workloads, especially distributed Large Language Model (LLM) training, places significant strain on the networking infrastructure of parallel data centers and supercomputing systems. While Equal-Cost Multi-Path (ECMP) routing distributes traffic over parallel paths, hash collisions often lead to imbalanced network resource utilization and performance bottlenecks. This paper presents FlowTracer, a tool designed to analyze network path utilization and evaluate different routing strategies. FlowTracer aids in debugging network inefficiencies by providing detailed visibility into traffic distribution and helping to identify the root causes of performance degradation, such as issues caused by hash collisions. By offering flow-level insights, FlowTracer enables system operators to optimize routing, reduce congestion, and improve the performance of distributed AI workloads. We use a RoCEv2-enabled cluster with a leaf-spine network and sixteen 400-Gbps nodes to demonstrate how FlowTracer can be used to compare the flow imbalances of ECMP routing against a statically configured network. The example showcases a 30% reduction in imbalance, as measured by a new metric we introduce.
Comment: Submitted for peer review to IEEE ICC 2025
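The hash-collision effect mentioned in the abstract can be illustrated with a small sketch. This is not FlowTracer and not the paper's imbalance metric, neither of which is specified in this record. It assumes a hypothetical set of 16 flows, 4 equal-cost paths, a SHA-256 based stand-in for a switch's ECMP hash, and a simple max-over-mean load ratio chosen only for illustration.

```python
import hashlib
from collections import Counter

def ecmp_path(flow, num_paths):
    """Pick a path index by hashing the flow's 5-tuple.

    SHA-256 is an illustrative stand-in; real switches use their own hash functions.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

def path_loads(assignments, num_paths):
    """Count how many flows land on each path index."""
    counts = Counter(assignments)
    return [counts.get(p, 0) for p in range(num_paths)]

def max_over_mean(loads):
    """Illustrative imbalance ratio (assumption, not the paper's metric): 1.0 means perfectly even."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

NUM_PATHS = 4  # hypothetical number of equal-cost paths

# 16 hypothetical flows as (src IP, dst IP, src port, dst port, protocol).
# RoCEv2 traffic is carried over UDP with destination port 4791.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "UDP") for i in range(16)]

# Hash-based placement: several flows may collide on the same path.
ecmp_loads = path_loads([ecmp_path(f, NUM_PATHS) for f in flows], NUM_PATHS)

# Static placement: flows are pinned to paths round-robin, so no collisions occur.
static_loads = path_loads([i % NUM_PATHS for i in range(len(flows))], NUM_PATHS)

print("ECMP loads:  ", ecmp_loads, "imbalance:", max_over_mean(ecmp_loads))
print("Static loads:", static_loads, "imbalance:", max_over_mean(static_loads))
```

With round-robin pinning every path carries exactly four flows, so the ratio is 1.0. The hash-based placement has no such guarantee, which is the kind of gap a flow-level tool would surface.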
- Published
2024