Recommended citation: Size Zheng, Yun Liang, Shuo Wang, Renze Chen, Kaiwen Sheng: FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System. ASPLOS 2020: 859-873 https://dl.acm.org/doi/10.1145/3373376.3378508

talks

FleXTensor @ASPLOS 2020 (online)

Published: March 16, 2020

Abstract

Tensor computation plays a paramount role in a broad range of domains, including machine learning, data analytics, and scientific computing. The wide adoption of tensor computation and its huge computation cost has led to high demand for flexible, portable, and high-performance library implementation on heterogeneous hardware accelerators such as GPUs and FPGAs. However, the current tensor library implementation mainly requires programmers to manually design low-level implementation and optimize from the algorithm, architecture, and compilation perspectives. Such a manual development process often takes months or even years, which falls far behind the rapid evolution of the application algorithms. In this paper, we introduce FlexTensor, which is a schedule exploration and optimization framework for tensor computation on heterogeneous systems. FlexTensor can optimize tensor computation programs without human interference, allowing programmers to only work on high-level programming abstraction without considering the hardware platform details. FlexTensor systematically explores the optimization design spaces that are composed of many different schedules for different hardware. Then, FlexTensor combines different exploration techniques, including heuristic method and machine learning method to find the optimized schedule configuration. Finally, based on the results of exploration, customized schedules are automatically generated for different hardware. In the experiments, we test 12 different kinds of tensor computations with totally hundreds of test cases and FlexTensor achieves average 1.83x performance speedup on NVIDIA V100 GPU compared to cuDNN; 1.72x performance speedup on Intel Xeon CPU compared to MKL-DNN for 2D convolution; 1.5x performance speedup on Xilinx VU9P FPGA compared to OpenCL baselines; 2.21x speedup on NVIDIA V100 GPU compared to the state-of-the-art.

AHS: An Agile Framework for Hardware Specialization and Software Mapping

Published: October 18, 2021

Overview

As Moore’s law is approaching to the end, designing specialized hardware along with the software that map the applications onto the specialized hardware is a promising solution. The hardware design determines the peak performance, while the software is also important as it determines the actual performance. Hardware/software (HW/SW) co-design can optimize the hardware acceleration and software mapping in concert and improve overall performance. However, the current flow designs hardware and software in isolation. More importantly, both hardware and software are difficult to design and optimize due to the low level programming and huge design space.

AMOS @ISCA 2022 (online)

Published: June 19, 2022

Abstract

Hardware specialization is a promising trend to sustain performance growth. Spatial hardware accelerators that employ specialized and hierarchical computation and memory resources have recently shown high performance gains for tensor applications such as deep learning, scientific computing, and data mining. To harness the power of these hardware accelerators, programmers have to use specialized instructions with certain hardware constraints. However, these hardware accelerators and instructions are quite new and there is a lack of understanding of the hardware abstraction, performance optimization space, and automatic methodologies to explore the space. Existing compilers use hand-tuned computation implementations and optimization templates, resulting in sub-optimal performance and heavy development costs.

In this paper, we propose AMOS, which is an automatic compilation framework for spatial hardware accelerators. Central to this framework is the hardware abstraction that not only clearly specifies the behavior of spatial hardware instructions, but also formally defines the mapping problem from software to hardware. Based on the abstraction, we develop algorithms and performance models to explore various mappings automatically. Finally, we build a compilation framework that uses the hardware abstraction as compiler intermediate representation (IR), explores both compute mappings and memory mappings, and generates high-performance code for different hardware backends. Our experiments show that AMOS achieves more than $2.50\times$ speedup to hand-optimized libraries on Tensor Core, $1.37\times$ speedup to TVM on vector units of Intel CPU for AVX-512, and up to $25.04\times$ speedup to AutoTVM on dot units of Mali GPU. The source code of AMOS is publicly available.

Compute-Communication Overlapping @CNCC 2024

Published: October 25, 2024

Abstract

With the widespread application of large language models, the optimization of training and inference for these models has become increasingly important. Large language models have a large number of parameters and require significant computational resources, often making it difficult to complete computations on a single chip. As a result, distributed training and inference have become mainstream. However, the additional communication overhead introduced in distributed scenarios leads to a decrease in chip computational efficiency, resulting in resource waste. The concealment of computation and communication latency to improve computational efficiency has become an important issue of concern in both academia and industry. The integration of computation and communication technology, through fine-grained scheduling techniques, unifies computation and communication into blocks, hiding communication latency within different blocks of computation and thereby enhancing overall computational efficiency. However, the integration of computation and communication requires manual redesign and implementation of operator libraries, leading to low development efficiency and difficulty in meeting the rapid evolution needs of models. To address this, this report will introduce the use of compilation technology to automate the integration of computation and communication and generate code. It will report on the latest industrial achievements from the perspectives of compiler design and code optimization, and predict future development directions for computation and communication integration compilers.

Design Space Exploration and Analysis for AI Compilers @Nvidia Redmond

Published: December 20, 2024

Abstract

Abstract: Dr. Size Zheng received his Ph.D. in Computer Science from Peking University and is currently a Research Scientist at ByteDance. His primary research areas include compilers, high-performance operators, and hardware-software co-optimization. He has published 20 papers in top conferences such as ASPLOS, ISCA, MICRO, and HPCA.

从计算-访存-通信优化看AI编译器设计 (in Chinese)

Published: December 27, 2024

Abstract

面向AI芯片的编译优化需面临三个角度的优化挑战：计算、访存、通信。本报告将分享三个方面的AI编译器设计与优化经验。计算角度，分享针对定制化加速单元的优化；访存角度，分享针对AI芯片的访存建模和优化；通信角度，分享计算通信融合的经验和进展。

teaching

TA of Compiler Design

Undergraduate course, Peking University, Computer Science, 2019

Project

Our course project is available here.

TA of Compiler Design

Undergraduate course, Peking University, Computer Science, 2020

Project

Our course project is available here.

TA of Compiler Design

Undergraduate course, Peking University, Computer Science, 2021

Si-Ze Zheng

Sitemap

Pages

Posts

portfolio

publications

talks

Abstract

Overview

Abstract

Abstract

Abstract

Abstract

teaching

Project

Project