Automating Tensor Program Partitioning on Accelerator Systems with PartIR
- 16:00, 3rd December 2021 (Week 8, Michaelmas Term 2021), Zoom
The rapid rise in demand for training large neural networks has brought into focus the need for partitioning across systems of accelerator devices. Implementing various forms of partitioning is increasingly supported through program primitives, but identifying efficient partitioning strategies requires expensive experimentation and expertise.

We present the prototype of an automated partitioning system that integrates into existing compilers and existing user workflows. Our system relies on layering functional loop abstractions – which return or reduce over chunks of arrays – on top of an arbitrary array “dialect” (following the MLIR terminology) such as XLA. We use rewrite rules, reminiscent of the fusion rules from stream fusion, to express various forms of propagation of partitioning information across a program. Our system compiles functional loops to SPMD abstractions in a lower-level dialect whose types capture distributed arrays and which includes explicit array redistribution commands. This dialect can then be lowered, compiled, and executed by the “native” backend compiler and runtime (e.g. XLA) in a device-agnostic manner.

We will present the design of a search environment that controls the actions of our rewrite engine and aims to tame the size of the search space by (a) mimicking the way expert programmers attempt to partition their programs and (b) exploiting the high-level model structure already available in popular neural network libraries. We show promising initial results, such as the ability to automatically recover good partitioning strategies for important neural network architectures, and we outline the remaining challenges.
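To give a flavour of the functional loop abstractions mentioned above, here is a minimal NumPy sketch of the two rewrite shapes the abstract describes: a loop that *returns* chunks of an array (tiling a batch axis) and a loop that *reduces over* chunks (partitioning a contracting axis). The function names and structure are purely illustrative assumptions, not PartIR's actual API.

```python
import numpy as np

# Hypothetical sketch (not PartIR's actual API): two ways to rewrite a
# matrix multiplication as a functional loop over array chunks.

def matmul_tiled(x, w, n_chunks=2):
    """Loop that *returns* chunks: partition the rows of x, compute each
    output tile independently, and stack the tiles back together."""
    return np.concatenate([c @ w for c in np.split(x, n_chunks, axis=0)],
                          axis=0)

def matmul_reduced(x, w, n_chunks=2):
    """Loop that *reduces over* chunks: partition the contracting axis
    and sum the partial products."""
    xs = np.split(x, n_chunks, axis=1)
    ws = np.split(w, n_chunks, axis=0)
    return sum(xc @ wc for xc, wc in zip(xs, ws))
```

Both rewrites preserve the semantics of the original `x @ w`, which is what lets a rewrite engine apply them freely while searching for a partitioning.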
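The lower-level SPMD dialect with explicit redistribution commands can likewise be illustrated with a toy simulation, assuming we model each "device" as an entry in a Python list of shards and the redistribution command as an all-reduce. This is a sketch of the general SPMD pattern, not the dialect itself.

```python
import numpy as np

# Hypothetical simulation of an SPMD lowering (not the actual dialect):
# each "device" holds one shard of an array; an explicit all-reduce
# command sums partial results and replicates them on every device.

def all_reduce_sum(shards):
    total = sum(shards)                    # collective sum across devices
    return [total.copy() for _ in shards]  # result replicated everywhere

def spmd_matmul(x, w, n_devices=2):
    # Shard the contracting dimension of the matmul across devices.
    xs = np.split(x, n_devices, axis=1)
    ws = np.split(w, n_devices, axis=0)
    partials = [xc @ wc for xc, wc in zip(xs, ws)]  # per-device compute
    # After the redistribution command, any device holds the full result.
    return all_reduce_sum(partials)[0]
```

The key point is that the communication is explicit in the program, so a backend compiler and runtime can execute it in a device-agnostic manner.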