Lower Multi‑GPU Communication Overhead via mKernel Fused Kernels

Communication between GPUs is a major bottleneck in modern AI workloads. Studies report that moving data between devices can take up to 43.6% of the forward pass and 32% of end-to-end training time, and in Mixture-of-Experts models the share rises to nearly 47% of total execution time. The root cause is the traditional host-driven communication model: the CPU launches NCCL or NVSHMEM calls, creates separate streams, and relies on coarse-grained kernel boundaries to overlap compute and communication. As GPU compute outpaces CPU capability, microsecond-scale host orchestration (cudaLaunchKernel calls, CPU-side synchronization checks, inter-stream events) creates pipeline bubbles that leave FLOPs idle. Host-driven systems also cannot overlap at the tile or chunk level, so they cannot hide communication latency inside a single kernel.
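The overlap argument above can be made concrete with a toy timing model. The sketch below is not a measurement of any real system: the function name pipeline_time, the two-stage pipeline assumption, and all of the microsecond figures are hypothetical, chosen only to show how per-chunk host overhead and chunk granularity interact.

```python
def pipeline_time(total_compute_us, total_comm_us, chunks, per_chunk_overhead_us):
    """Toy two-stage pipeline: chunk i+1 is transferred while chunk i is computed.

    per_chunk_overhead_us stands in for host orchestration cost (launches,
    events) paid once per chunk on each stage. All inputs are illustrative.
    """
    compute = total_compute_us / chunks + per_chunk_overhead_us
    comm = total_comm_us / chunks + per_chunk_overhead_us
    comm_done = 0.0      # time at which the latest transfer finished
    compute_done = 0.0   # time at which the latest compute step finished
    for _ in range(chunks):
        comm_done += comm                                      # transfers run back to back
        compute_done = max(compute_done, comm_done) + compute  # compute waits for its chunk
    return compute_done


# Hypothetical workload: 800 us of compute, 640 us of communication.
coarse_host = pipeline_time(800, 640, chunks=2, per_chunk_overhead_us=5)   # 1135.0
fine_host = pipeline_time(800, 640, chunks=64, per_chunk_overhead_us=5)    # 1135.0
fine_fused = pipeline_time(800, 640, chunks=64, per_chunk_overhead_us=0)   # 810.0
```

In this model, splitting the work into 64 host-orchestrated chunks is no faster than 2 coarse chunks, because the per-chunk overhead cancels the gain from finer overlap. Only when the per-chunk overhead disappears, as it would inside a single kernel, does the total approach the pure compute time.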

mKernel addresses these issues by moving communication onto the GPU itself. It provides a library of persistent CUDA kernels that fuse intra-node NVLink transfers, inter-node RDMA, and dense compute into a single launch. Each kernel assigns thread blocks to specialized roles (compute, intra-node communication, inter-node send, inter-node reduce), so the SM allocation can be tuned per workload. Because the communication backend is built directly on libibverbs, mKernel issues GPU-driven RDMA writes without depending on NCCL or NVSHMEM, which removes host-side overhead and enables overlap at tile or chunk granularity across both NVLink and RDMA fabrics.
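The role specialization described above can be modelled on the host in a few lines. This is a sketch of the general idea, not mKernel's actual scheduling code: the names ROLES, assign_roles, and tiles_for_block, the contiguous block ranges, the strided tile loop, and the 132-block budget are all assumptions made for illustration.

```python
# Role names taken from the description above; the identifiers are hypothetical.
ROLES = ("compute", "intra_node_comm", "inter_node_send", "inter_node_reduce")


def assign_roles(sm_budget):
    """Return a list where index = thread-block id and value = that block's role.

    Blocks are handed out as contiguous ranges, in ROLES order, sized by
    sm_budget (an assumed layout; a real kernel may partition differently).
    """
    unknown = set(sm_budget) - set(ROLES)
    if unknown:
        raise ValueError(f"unknown roles: {sorted(unknown)}")
    roles = []
    for role in ROLES:
        roles.extend([role] * sm_budget.get(role, 0))
    return roles


def tiles_for_block(rank_in_role, blocks_in_role, num_tiles):
    """Tiles one persistent block would process: a strided loop over all tiles."""
    return list(range(rank_in_role, num_tiles, blocks_in_role))


# Illustrative budget of 132 blocks; the split is a tunable, not a recommendation.
budget = {"compute": 100, "intra_node_comm": 16,
          "inter_node_send": 8, "inter_node_reduce": 8}
roles = assign_roles(budget)
tiles_of_block_3 = tiles_for_block(3, budget["compute"], num_tiles=256)
```

Changing the numbers in budget is the host-side analogue of tuning SM resources per workload: a communication-heavy kernel would shift blocks from compute to the send and reduce roles.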

The library currently offers five fused kernels: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter. Evaluation on a two-node cluster with 8 H200 GPUs per node shows measurable speedups over baseline collectives, with further scaling work underway. Requirements are NVIDIA Hopper GPUs (sm_90a target), CUDA 12.9, PyTorch, and either ConnectX-7 InfiniBand/RoCE or AWS EFA networking. Blackwell GPU support and broader heterogeneous NIC integration are on the roadmap.
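To see why a pattern such as AllGather+GEMM can be fused at chunk granularity, consider a plain reference model. It assumes the left-hand matrix is sharded by rows across ranks, which the description above does not specify; the function names are hypothetical and the code says nothing about how mKernel implements the kernel.

```python
def matmul(a, b):
    """Plain row-by-column matrix product on lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def gather_then_gemm(shards, b):
    """Unfused reference: collect every rank's rows first, then multiply once."""
    full = [row for shard in shards for row in shard]
    return matmul(full, b)


def gemm_per_arriving_shard(shards, b, arrival_order):
    """Fused-style reference: multiply each shard as soon as it 'arrives'.

    Each shard's output rows depend only on that shard, so no step has to
    wait for the full gather to complete.
    """
    out = [None] * len(shards)
    for rank in arrival_order:
        out[rank] = matmul(shards[rank], b)
    return [row for block in out for row in block]


shards = [[[1, 2]], [[3, 4]], [[5, 6]]]   # one row of A per rank (toy data)
b = [[1, 2], [3, 4]]
reference = gather_then_gemm(shards, b)
streamed = gemm_per_arriving_shard(shards, b, arrival_order=[2, 0, 1])
```

Both paths produce the same matrix regardless of arrival order, which is the property that lets compute on one chunk proceed while later chunks are still in flight.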

By collapsing communication and computation into a single persistent kernel, mKernel reduces pipeline bubbles, improves overlap granularity, and delivers a practical path toward higher utilization of massive GPU clusters.