AI Large Model Applications and Foundational Systems Innovation
Talk Title: Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
Abstract: Fully exploiting the computing power of an accelerator specialized for deep neural networks (DNNs) calls for synergy between the network and hardware architectures. However, existing approaches partition a DNN computational graph into multiple sub-graphs by abstracting away the hardware architecture and then assign resources to each sub-graph, which not only produces redundant off-core data movements but also under-utilizes the hardware resources of a domain-specific architecture (DSA). This paper introduces a systematic approach for effectively scheduling DNN computational graphs on DSA platforms. By fully taking the hardware architecture into account when partitioning a computational graph into coarse-grained sub-graphs, our work enables the synergy between network and hardware architectures, addressing several challenges of prior work: (1) it produces larger but fewer kernels, converting a large number of off-core data movements into on-core data exchanges; (2) it exploits the imbalanced memory usage distribution across a DNN's network architecture, better saturating the DSA memory hierarchy; (3) it enables across-layer instruction scheduling, not studied before, further exploiting the parallelism across different specialized compute units. Results of seven DNN inference models on a DSA platform show that our work outperforms TVM and AStitch by 11.15x and 6.16x, respectively, and obtains throughput competitive with the vendor-crafted implementation. A case study on GPU also demonstrates that generating kernels for our sub-graphs can surpass CUTLASS with and without convolution fusion by 1.06x and 1.23x, respectively.
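To make the partitioning idea concrete, the following is a minimal illustrative sketch (not the paper's actual algorithm): it greedily groups a topologically ordered sequence of DNN operators into coarse-grained sub-graphs under a hypothetical on-core memory budget, so that intermediate tensors within a sub-graph stay on-core rather than spilling off-core between kernels. The operator names and working-set sizes are assumed for illustration.

```python
def partition(ops, mem_budget):
    """Greedy hardware-aware grouping sketch.

    ops: list of (name, working_set_bytes) in topological order.
    mem_budget: hypothetical on-core memory capacity.
    Returns a list of sub-graphs, each a list of operator names.
    """
    subgraphs, current, used = [], [], 0
    for name, mem in ops:
        # Close the current sub-graph when the next op would exceed
        # the on-core budget; its outputs then move off-core.
        if current and used + mem > mem_budget:
            subgraphs.append(current)
            current, used = [], 0
        current.append(name)
        used += mem
    if current:
        subgraphs.append(current)
    return subgraphs

# Toy layer sequence with assumed per-op working-set sizes.
ops = [("conv1", 40), ("relu1", 10), ("conv2", 60), ("relu2", 10), ("fc", 30)]
print(partition(ops, mem_budget=100))
# → [['conv1', 'relu1'], ['conv2', 'relu2', 'fc']]
```

Fewer, larger sub-graphs mean fewer kernel boundaries, which is where off-core data movement occurs; a real scheduler would additionally account for the memory-usage imbalance across layers and the parallelism of specialized compute units, as the abstract describes.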
Speaker Bio: Jie Zhao received his bachelor's degree from the Department of Computer Science and Technology at Tsinghua University, and obtained his Ph.D. in 2019 from the PARKAS laboratory, jointly run by the École Normale Supérieure (ENS Paris) and INRIA in France. He previously worked at the State Key Laboratory of Mathematical Engineering and Advanced Computing. His main research interests include tensor compilers, polyhedral-model-based code generation and optimization, and fundamental math function libraries. As first author, Dr. Zhao has published a number of papers at top conferences and in journals on systems software, computer architecture, and compilers, including PLDI, OSDI, MICRO, MLSys, PACT, and CC; his paper at MICRO-53 in 2020 received a Best Paper nomination.