Open Source Projects
Triton-distributed
Triton-distributed is a distributed compiler based on Triton for parallel systems. It provides a set of easy-to-use primitives to support the development of distributed compute-communication overlapping kernels, enabling efficient parallel computation on modern AI systems.
The project offers both low-level and high-level primitives for programming communication kernels, so users can combine communication with computation to design overlapping kernels. Triton-distributed can achieve performance comparable to or better than that of hand-tuned libraries.

Key Features
- Low-level primitives for distributed compute-communication overlapping kernels
- Support for single-node and cross-node operations (GEMM, MoE, Flash-Decoding)
- High performance: comparable to or better than hand-tuned libraries
- Easy-to-use API for programming communication kernels
- Support for multiple backends (NVIDIA, AMD GPUs)
- Comprehensive documentation and tutorials
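To make the idea of compute-communication overlap concrete, here is a minimal CPU-side sketch in plain Python. It does not use the Triton-distributed API: `fetch_chunk`, `compute`, and `overlapped_sum_of_squares` are hypothetical names, and a background thread stands in for the communication engine. While chunk i is being processed, the fetch of chunk i+1 is already in flight.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(chunks, i):
    """Stand-in for communication: return a copy of 'remote' chunk i."""
    return list(chunks[i])


def compute(chunk):
    """Stand-in for computation: sum of squares of one chunk."""
    return sum(x * x for x in chunk)


def overlapped_sum_of_squares(chunks):
    """Process chunks in order while prefetching the next one."""
    if not chunks:
        return 0
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start fetching the first chunk before any computation begins.
        pending = pool.submit(fetch_chunk, chunks, 0)
        for i in range(len(chunks)):
            current = pending.result()  # wait until chunk i has arrived
            if i + 1 < len(chunks):
                # Issue the next fetch, then compute on the current chunk.
                pending = pool.submit(fetch_chunk, chunks, i + 1)
            total += compute(current)
    return total


result = overlapped_sum_of_squares([[1, 2], [3], [4, 5]])
```

In pure Python this only illustrates the scheduling pattern; real overlapping kernels run the transfer and the computation concurrently on the GPU and its interconnect.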
Other Open Source Projects
FlexTensor
FlexTensor is an automatic schedule exploration and optimization framework for tensor computation on heterogeneous systems. It can optimize tensor computation programs without human intervention, so programmers can work only with high-level programming abstractions without considering hardware platform details.
FlexTensor systematically explores optimization design spaces composed of many different schedules for different hardware. It then combines several exploration techniques, including heuristic and machine learning methods, to find an optimized schedule configuration.
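As a rough illustration of schedule exploration, the sketch below searches tile sizes for a matrix multiplication. It is not FlexTensor's algorithm or API: the cost model, the `CACHE_ELEMS` capacity, and all function names are assumptions made for this example, and the search uses only a simple heuristic (random sampling followed by greedy neighbor refinement), with no machine learning component.

```python
import itertools
import random

M, N, K = 256, 256, 256
TILE_CHOICES = [8, 16, 32, 64, 128]
CACHE_ELEMS = 8192  # hypothetical on-chip capacity, in elements


def cost(schedule):
    """Hypothetical analytic cost: estimated off-chip element traffic."""
    ti, tj, tk = schedule
    footprint = ti * tk + tk * tj + ti * tj
    if footprint > CACHE_ELEMS:
        return float("inf")  # the three tiles do not fit on chip
    tiles = (M // ti) * (N // tj) * (K // tk)
    # Each tile step loads one A tile and one B tile; C is written once.
    return tiles * (ti * tk + tk * tj) + M * N


def neighbors(schedule):
    """Schedules that differ by one step in exactly one tile dimension."""
    for dim in range(3):
        idx = TILE_CHOICES.index(schedule[dim])
        for step in (-1, 1):
            j = idx + step
            if 0 <= j < len(TILE_CHOICES):
                cand = list(schedule)
                cand[dim] = TILE_CHOICES[j]
                yield tuple(cand)


def explore(num_samples=8, seed=0):
    """Random sampling, then greedy refinement of the best sample."""
    rng = random.Random(seed)
    space = list(itertools.product(TILE_CHOICES, repeat=3))
    samples = rng.sample(space, num_samples)
    best = min(samples, key=cost)
    while True:
        cand = min(neighbors(best), key=cost)
        if cost(cand) >= cost(best):
            return best, samples  # no neighbor improves: local optimum
        best = cand


best, samples = explore()
```

The greedy step can stop at a local optimum, which is one reason a real framework would combine several exploration techniques rather than rely on a single one.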
compiler-and-arch
compiler-and-arch is a curated list of tutorials, papers, talks, and open-source projects for emerging compiler and architecture research. This repository serves as a comprehensive resource for researchers and practitioners interested in compiler design and computer architecture.
AMOS
AMOS (Automatic Mapping Generation, Verification, and Exploration for ISA-based Spatial Accelerators) is a framework for automatic mapping generation and optimization for spatial accelerators. It provides tools for exploring and optimizing mappings for various hardware accelerators.
MatmulTutorial
MatmulTutorial is an easy-to-understand TensorOp Matmul tutorial for matrix multiplication on modern accelerators. It offers detailed explanations and examples for implementing efficient matrix multiplication kernels.
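For readers new to the topic, the loop structure behind most efficient matmul kernels is tiling (blocking). The sketch below shows that structure in plain Python; it is not taken from MatmulTutorial, does not use tensor cores, and the function name `tiled_matmul` and the `tile` parameter are chosen here for illustration only.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (m x k) by b (k x n) using square tiles of size `tile`."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0] * n for _ in range(m)]
    # The outer three loops walk over tiles of C and the reduction axis.
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # The inner loops accumulate one partial product per tile.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


product = tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
```

On an accelerator, each tile would typically be staged in fast on-chip memory and the innermost product handed to a hardware matrix unit; the Python version only shows how the iteration space is partitioned.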
