Compute-Communication Overlapping @CNCC 2024

Date:

Abstract

With the widespread adoption of large language models, optimizing their training and inference has become increasingly important. These models have enormous parameter counts and computational demands, so a single chip is rarely sufficient, and distributed training and inference have become the mainstream approach. However, the additional communication introduced in distributed scenarios lowers per-chip computational efficiency and wastes resources. Hiding communication latency behind computation to improve overall efficiency has therefore become an important concern in both academia and industry. Compute-communication fusion addresses this by partitioning computation and communication into fine-grained blocks and scheduling them so that the communication of one block overlaps with the computation of others, thereby improving end-to-end efficiency. Today, however, such fusion requires manually redesigning and reimplementing operator libraries, which is labor-intensive and struggles to keep pace with the rapid evolution of models. To address this, this talk will introduce the use of compilation technology to automate compute-communication fusion and code generation. It will survey recent industrial results from the perspectives of compiler design and code optimization, and discuss future directions for compute-communication fusion compilers.
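To make the block-wise overlap idea concrete, the following is a minimal sketch (not the operator library or compiler output described in the talk) of a tensor-parallel matmul in which each rank's weight shard is fetched asynchronously while the matmul on the previously received shard runs. It assumes torch.distributed has already been initialized with an NCCL backend and one GPU per rank; the function name overlapped_allgather_matmul and the column-sharded weight layout are illustrative assumptions.

```python
import torch
import torch.distributed as dist


def overlapped_allgather_matmul(x, w_local, group=None):
    """Compute x @ W where W is column-sharded across ranks (w_local is this
    rank's shard), overlapping the transfer of remote shards with computation.
    Illustrative sketch only: real fused kernels schedule at a finer granularity."""
    world = dist.get_world_size(group)
    rank = dist.get_rank(group)

    # Pre-allocate a receive buffer for every rank's shard; our own slot holds w_local.
    shards = [torch.empty_like(w_local) for _ in range(world)]
    shards[rank] = w_local

    # Kick off the first transfer before any computation starts.
    handles = [None] * world
    handles[0] = dist.broadcast(shards[0], src=0, group=group, async_op=True)

    outputs = []
    for i in range(world):
        # Start fetching the next shard while we compute with the current one,
        # so communication latency is hidden behind the matmul below.
        if i + 1 < world:
            handles[i + 1] = dist.broadcast(shards[i + 1], src=i + 1,
                                            group=group, async_op=True)
        handles[i].wait()                           # ensure shard i has arrived
        outputs.append(torch.matmul(x, shards[i]))  # compute block i

    # Column-sharded W means the per-shard results concatenate along the last dim.
    return torch.cat(outputs, dim=-1)
```

In this hand-written form the overlap schedule is fixed in the operator itself; the talk's thesis is that a compiler can derive such block decompositions and schedules automatically instead of requiring them to be rewritten for every new model or parallelism strategy.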