## Capuchin: Tensor-based GPU Memory Management for Deep Learning

Session: Tensor computation and data orchestration--Playing musical chairs!

Authors: Xuan Peng (Huazhong University of Science and Technology); Xuanhua Shi (Huazhong University of Science and Technology); Hulin Dai (Huazhong University of Science and Technology); Hai Jin (Huazhong University of Science and Technology); Weiliang Ma (Huazhong University of Science and Technology); Qian Xiong (Huazhong University of Science and Technology); Fan Yang (Microsoft Research Asia); Xuehai Qian (University of Southern California)

In recent years, deep learning has achieved unprecedented success across various domains; the key to this success is larger and deeper \emph{deep neural networks} (DNNs) that attain very high accuracy. On the other hand, since GPU global memory is a scarce resource, large models pose a significant challenge due to their memory requirements during training. This restriction limits flexibility in DNN architecture exploration. In this paper, we propose \textit{Capuchin}, a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation. The key feature of \textit{Capuchin} is that it makes memory management decisions based on dynamic tensor access patterns tracked at runtime. This design is motivated by the observation that tensor accesses are regular across training iterations. Based on the identified patterns, one can exploit the full memory optimization space and exercise fine-grained, flexible control over {\em when and how} to apply memory optimization techniques. We deploy \textit{Capuchin} in a widely-used deep learning framework, Tensorflow, and show that \textit{Capuchin} can reduce the memory footprint by up to 85\% on 6 state-of-the-art DNNs compared to the original Tensorflow. In particular, for the NLP task BERT, the maximum batch size that \textit{Capuchin} supports is 7$\times$ and 2.1$\times$ larger than that of Tensorflow and gradient-checkpointing, respectively. We also show that \textit{Capuchin} outperforms vDNN and gradient-checkpointing by up to 286\% and 55\% under the same memory oversubscription.
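The trade-off the abstract describes, evict/prefetch a tensor when its CPU-GPU transfer can hide under compute, otherwise recompute it when regeneration is cheaper, can be sketched as a simple profile-then-plan loop. The code below is an illustrative toy, not the authors' implementation: the class and function names, the PCIe-bandwidth cost model, and the decision thresholds are all our assumptions.

```python
# Toy sketch of Capuchin-style tensor memory planning (hypothetical;
# names and cost model are illustrative, not from the paper's code).
from dataclasses import dataclass, field


@dataclass
class TensorRecord:
    """Per-tensor profile gathered during one measured training iteration."""
    size_mb: float                 # tensor size in MB
    recompute_ms: float            # cost to regenerate it from its parents
    access_times: list = field(default_factory=list)  # access timestamps (ms)


def plan(records, pcie_mb_per_ms=12.0):
    """Choose keep / swap / recompute per tensor from profiled accesses.

    Heuristic: if the idle gap between two accesses can absorb a round-trip
    transfer, swap (the copy hides under compute); else recompute if that is
    cheaper than the exposed part of the transfer; else keep resident.
    """
    decisions = {}
    for name, r in records.items():
        if len(r.access_times) < 2:
            decisions[name] = "keep"  # accessed once; nothing to free between accesses
            continue
        idle_ms = max(b - a for a, b in zip(r.access_times, r.access_times[1:]))
        swap_ms = 2 * r.size_mb / pcie_mb_per_ms  # copy-out plus prefetch-in
        if idle_ms >= swap_ms:
            decisions[name] = "swap"
        elif r.recompute_ms < swap_ms - idle_ms:
            decisions[name] = "recompute"
        else:
            decisions[name] = "keep"
    return decisions


# Usage with made-up profiles: a long-idle activation swaps, a cheap-to-rebuild
# one recomputes, an expensive short-idle one stays resident.
records = {
    "conv1_out": TensorRecord(size_mb=60.0, recompute_ms=50.0,
                              access_times=[0.0, 100.0]),
    "bn_out":    TensorRecord(size_mb=120.0, recompute_ms=3.0,
                              access_times=[0.0, 5.0]),
    "attn_out":  TensorRecord(size_mb=120.0, recompute_ms=30.0,
                              access_times=[0.0, 5.0]),
    "weights":   TensorRecord(size_mb=40.0, recompute_ms=0.0,
                              access_times=[0.0]),
}
decisions = plan(records)
```

Because the access pattern repeats across iterations (the paper's key observation), a plan computed from one profiled iteration can drive eviction, prefetching, and recomputation in all subsequent ones.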