ASPLOS 2021: Workshops and Tutorials


Schedule: Wednesday, April 14 | Thursday, April 15 | Friday, April 16
Morning: ESP; AIBench; ML Performance: Benchmarking Deep Learning Systems | NOPE; LATTE; WORDS | SCALE-Sim; High Performance Distributed Deep Learning: A Beginner’s Guide; YArch; Workshop on Systems and Architectures for Robust Software 2.0
Afternoon: Dynamic Data-Race Prediction | Securing Computer Architecture | ASTRA-Sim; TBD

ESP: the Open-Source Research Platform for Agile SoC Design and Programming

Time: Wednesday, April 14 | Morning (Half Day)


The ESP open-source platform supports research on the design and programming of heterogeneous SoC architectures.

By combining a scalable modular architecture with a system-level design methodology, ESP simplifies the design of individual accelerators and automates their hardware/software integration into complete SoCs. ESP integrates third-party components, including RISC-V processors and the NVDLA accelerator, offers an automated flow for embedded machine learning accelerators, and enables rapid FPGA-based prototyping of the SoCs. With ESP, researchers in architectures, compilers, and operating systems can evaluate new ideas by running complex user applications on top of the full Linux-based software stack while invoking many different accelerators.

Dynamic Data-Race Prediction: Fundamentals, Theory and Practice

Time: Wednesday, April 14 | Afternoon (Half Day)

Data races are arguably the most insidious of concurrency bugs, and extensive research effort has been dedicated to detecting them effectively. Predictive race detection techniques aim to expose data races missed by traditional dynamic race detectors (such as those based on Happens-Before) by inferring data races in alternate executions of the underlying program, without re-executing it. The resulting promise of enhanced coverage in race detection has recently led to the development of many dynamic race prediction techniques.
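The gap between Happens-Before detection and race prediction can be illustrated with a minimal sketch. The trace format `(thread, op, target)` and the function name `hb_races` are assumptions made for this example, not part of any tool presented in the tutorial. The code implements only a simple vector-clock Happens-Before check. The alternate execution is written out by hand here, whereas a predictive technique would infer it from the observed trace.

```python
from collections import defaultdict

def hb_races(trace):
    """Return index pairs of conflicting accesses unordered by happens-before.

    trace: list of (thread, op, target), op in {"r", "w", "acq", "rel"}.
    """
    clocks = defaultdict(lambda: defaultdict(int))  # thread -> vector clock
    lock_clocks = defaultdict(dict)                 # lock -> clock at last release
    accesses = []                                   # (index, thread, op, var, snapshot)
    for i, (t, op, x) in enumerate(trace):
        vc = clocks[t]
        vc[t] += 1                                  # every event ticks its own thread
        if op == "acq":
            # Acquire joins the clock published by the previous release.
            for u, c in lock_clocks[x].items():
                vc[u] = max(vc[u], c)
        elif op == "rel":
            lock_clocks[x] = dict(vc)
        else:
            accesses.append((i, t, op, x, dict(vc)))
    races = []
    for a in range(len(accesses)):
        for b in range(a + 1, len(accesses)):
            i, t1, op1, x1, vc1 = accesses[a]
            j, t2, op2, x2, vc2 = accesses[b]
            conflicting = x1 == x2 and t1 != t2 and "w" in (op1, op2)
            # The earlier event is ordered before the later one iff the later
            # event's clock has caught up with the earlier thread's component.
            ordered = vc1[t1] <= vc2.get(t1, 0)
            if conflicting and not ordered:
                races.append((i, j))
    return races

# Observed execution: the lock hand-off orders the two writes to x.
observed = [("t1", "w", "x"), ("t1", "acq", "l"), ("t1", "rel", "l"),
            ("t2", "acq", "l"), ("t2", "rel", "l"), ("t2", "w", "x")]
# A valid reordering of the same per-thread events exposes the race.
reordered = [("t2", "acq", "l"), ("t2", "rel", "l"), ("t1", "w", "x"),
             ("t2", "w", "x"), ("t1", "acq", "l"), ("t1", "rel", "l")]

print(hb_races(observed))   # []
print(hb_races(reordered))  # [(2, 3)]
```

The Happens-Before check reports nothing on the observed trace, yet the same program admits an execution in which the two writes are adjacent. This is the kind of race a predictive analysis aims to report from the observed trace alone.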

This tutorial aims to present the foundations of race prediction in a principled manner, consolidate a decade-long line of work on dynamic race prediction, and discuss recent algorithmic advances that make race prediction efficient and practical. It also aims to discuss some recent results on the complexity and hardness involved in reasoning about race prediction.

The techniques we will present are useful beyond data race detection and should interest people from programming languages, architecture, and the broader systems community.

The Tutorial on BenchCouncil AIBench Scenario, Training, Inference, and Micro Benchmarks across Datacenter, HPC, IoT and Edge

Time: Wednesday, April 14 | Full Day


A joint effort with seventeen industry partners, AIBench is a comprehensive AI benchmark suite that distills real-world application scenarios into AI Scenario, Training, Inference, and Micro Benchmarks across Datacenter, HPC, IoT, and Edge. AIBench Scenario benchmarks are proxies for industry-scale real-world application scenarios. Each scenario benchmark models the critical paths of a real-world application scenario as a permutation of the AI and non-AI modules. Edge AIBench is an instance of the scenario benchmark suites, modeling end-to-end performance across IoT, edge, and Datacenter. AIBench Training and AIBench Inference cover nineteen representative AI tasks with state-of-the-art models to guarantee diversity and representativeness. AIBench Micro provides the intensively used hotspot functions, profiled from the full AIBench benchmarks, for simulation-based architecture research. As AI training is prohibitively costly, AIBench Training provides two subsets, one for repeatable benchmarking and one for workload characterization, to improve affordability; they keep the benchmarks to a minimum while maintaining representativeness. Based on the AIBench Training subset for repeatable benchmarking, we provide HPC AI500 to evaluate large-scale HPC AI systems. AIoTBench implements the AI inference benchmarks on various IoT and embedded devices, emphasizing diverse light-weight AI frameworks and models. Finally, the hands-on demos illustrate how to use AIBench on the BenchCouncil Testbed, which is publicly available.

ML Performance: Benchmarking Deep Learning Systems

Time: Wednesday, April 14 | Full Day


The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform models, frameworks, and system stacks. It lacks standard tools and methodologies to evaluate and profile models or systems. Without such tools, evaluating and comparing the benefits of proposed AI innovations (be it hardware or software) on end-to-end AI pipelines is both arduous and error-prone, which stifles the adoption of those innovations in a rapidly moving field.


NOPE

Time: Thursday, April 15 | Morning (Half Day)

NOPE is a workshop for open, honest post-mortems of research projects that ran into unexpected limitations and resulted in lessons learned. In addition, it will offer a venue to discuss contributions that have been underappreciated over time. The goals of NOPE are to reflect on negative outcomes and to uncover opportunities to move forward by examining mistakes made during the research process.

Securing Processor Architectures

Time: Thursday, April 15 | Afternoon (Half Day)

This tutorial aims to teach participants about different topics in processor architecture and security, and in particular how to secure modern processor architectures. The tutorial will focus especially on threats due to information leakage (side and covert channels) and on transient execution attacks. It will also touch upon the design of secure processor architectures and trusted execution environments (TEEs), and how they are affected by information leakage and transient execution attacks. A number of defense strategies against the various attacks will be presented in the context of existing and hypothesized threats. The tutorial will also cover new research opportunities for furthering the security of processor architectures.

LATTE: Languages, Tools, and Techniques for Accelerator Design

Time: Thursday, April 15 | Full Day


LATTE is a venue for discussion, debate, and brainstorming at the intersection of hardware acceleration and programming languages research. The focus is on new languages and tools that aim to let domain specialists, not just hardware experts, produce efficient accelerators. A full range of targets is in scope: ASICs (silicon), FPGAs, CGRAs, or future reconfigurable hardware.

WORDS 2021

Time: Thursday, April 15 | Full Day

Two recent trends in data centers are trying to move away from computer servers: cloud serverless computing, which eschews “servers” by allowing users to directly deploy fine-grained programs that are triggered by external events, and hardware resource disaggregation, which breaks a server into fine-grained, network-accessed hardware resource units. The 2nd Workshop on Resource Disaggregation and Serverless (WORDS 2021) will bring together researchers and practitioners for a lively discussion on a wide range of topics within the broad definition of resource disaggregation and serverless computing. We solicit both position papers that explore new challenges and design spaces and short papers that include completed or early-stage work. The submission deadline is Feb 22.

SCALE-Sim: Systolic CNN accelerator simulator

Time: Friday, April 16 | Morning (Half Day)

SCALE-Sim is a cycle-accurate CNN accelerator simulator that provides timing, power/energy, memory bandwidth, and memory access trace results for a specified accelerator configuration and neural network architecture. It is based on the systolic array architecture used in accelerators such as Google’s TPU and Xilinx XDNN. It is developed jointly by ARM Research and Georgia Tech and is open source. SCALE-Sim enables research into DNN accelerator architectures and is also suitable for system-level studies. Designing an efficient DNN accelerator is a difficult problem that requires searching an intricate trade-off space with a large number of architectural parameters. Moreover, recent DNN workloads are increasingly memory-bound due to growing model sizes. A simulation infrastructure like SCALE-Sim, which can provide cycle-accurate estimates of performance, memory accesses, and other design metrics, is therefore a vital tool for fast and reliable design cycles. Unlike related infrastructure, which relies on analytical models to estimate the performance and operating cost of accelerator designs, SCALE-Sim lets designers capture the behavior of a simulator at each cycle of operation. The tutorial targets students, faculty, and researchers who want to (a) gain detailed knowledge of how DNN accelerators work, (b) architect and instrument novel DNN accelerators, (c) study the performance implications of dataflow mapping strategies and system-level integration, or (d) plug a DNN accelerator RTL into their system. The tutorial will be interactive: the audience will be asked to code up small methods or fill in code snippets to demonstrate the capabilities of the tools and associated APIs.
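To give a feel for what cycle-by-cycle simulation of a systolic array involves, here is a toy sketch. It is not SCALE-Sim code and does not use its API. The function name `systolic_matmul`, the output-stationary dataflow, and the choice of one processing element (PE) per output element are assumptions made for illustration, and memory, bandwidth, and result draining are not modeled.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A x B.

    A is M x K, B is K x N; the array has M x N PEs, each accumulating
    one output element. Returns (C, cycles).
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]    # one accumulator per PE
    a_reg = [[0] * N for _ in range(M)]  # operands flowing left to right
    b_reg = [[0] * N for _ in range(M)]  # operands flowing top to bottom
    cycles = M + N + K - 2               # cycles until the last PE finishes
    for t in range(cycles):
        # Input skew: row i of A is delayed by i cycles, column j of B by j.
        a_in = [A[i][t - i] if 0 <= t - i < K else 0 for i in range(M)]
        b_in = [B[t - j][j] if 0 <= t - j < K else 0 for j in range(N)]
        # Each PE latches the value its left / upper neighbour held last cycle.
        a_reg = [[a_in[i] if j == 0 else a_reg[i][j - 1] for j in range(N)]
                 for i in range(M)]
        b_reg = [[b_in[j] if i == 0 else b_reg[i - 1][j] for j in range(N)]
                 for i in range(M)]
        # Multiply-accumulate in every PE.
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles

C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)  # [[19, 22], [43, 50]] 4
```

Because of the skew, PE (i, j) sees A[i][k] and B[k][j] together at cycle i + j + k, so the per-cycle state can be inspected directly rather than estimated with a closed-form model.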

High-Performance Distributed Learning: A Beginner’s Guide

Time: Friday, April 16 | Morning (Half Day)

Recent advancements in Artificial Intelligence (AI) have been fueled by the resurgence of Deep Neural Networks (DNNs) and various Deep Learning (DL) frameworks like TensorFlow and PyTorch. In this tutorial, we will provide an overview of interesting trends in DNN design and how cutting-edge hardware architectures and high-performance interconnects are playing a key role in moving the field forward. We will also present an overview of different DNN architectures and DL frameworks that led to advancements in emerging application areas like Image Recognition, Speech Processing, and Autonomous Vehicle systems. Most DL frameworks started with a single-node design. However, approaches to parallelize the process of DNN training are also being actively explored. The DL community has adopted a range of distributed training designs that exploit communication runtimes like gRPC, MPI, and NCCL. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures, along with some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU and GPU architectures available on modern HPC clusters. Finally, we include hands-on exercises to give attendees first-hand experience of running distributed DNN training experiments on a modern GPU cluster.
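The basic pattern behind synchronous data-parallel training can be sketched in a few lines. This is a single-process simulation under stated assumptions, not the tutorial's material. The "workers" are list entries, `allreduce_mean` is a hypothetical stand-in for the collective that a runtime such as MPI or NCCL would provide, and the model is a one-parameter linear fit.

```python
def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y) ** 2 over one worker's data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def data_parallel_sgd(data, num_workers, lr=0.1, steps=50):
    """Train replicated weights; each worker sees only its own shard."""
    shards = [data[r::num_workers] for r in range(num_workers)]
    weights = [0.0] * num_workers                      # one replica per worker
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        grads = allreduce_mean(grads)                  # synchronize gradients
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

# Samples of y = 3 * x, split evenly across two simulated workers.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
weights = data_parallel_sgd(data, num_workers=2)
print(weights)
```

Every replica applies the same averaged gradient, so the replicas stay identical after each step. With equal-sized shards, the averaged gradient matches the full-batch gradient. In a real cluster the cost of the all-reduce step is where the communication runtime matters.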

ASTRA-sim: Enabling SW/HW Co-Design Exploration for Distributed Deep Learning Training Platforms

Time: Friday, April 16 | Afternoon (Half Day)


Modern Deep Learning systems rely heavily on distributed training over hardware platforms built from customized high-performance accelerators (e.g., TPUs, GPUs). Examples today include NVIDIA’s DGX-2, Google’s Cloud TPU and Facebook’s Zion. Deep Neural Network (DNN) training involves a complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective communication algorithm, network topology, and the accelerator endpoint. We call this the distributed training HW-SW co-design space. Collective communications (e.g., all-reduce, all-to-all, reduce-scatter, all-gather) are initiated at different phases for different parallelism approaches, and they play a crucial role in overall runtime if not hidden efficiently behind compute. We introduce ASTRA-sim, an open-source framework that models the co-design space described above and schedules the compute-communication interactions from distributed training over compute (e.g., SCALE-sim) and plug-and-play network simulators (e.g., analytical, Garnet, NS3). ASTRA-sim supports parameterized descriptions of the DNN model, parallelism strategy, and system architecture. It enables a systematic study of bottlenecks at the software and hardware level for scaling training. It also enables end-to-end design-space exploration for running large DNN models over future training platforms. In this tutorial, we will educate the research community about the challenges in the emerging domain of distributed training, demonstrate the capabilities of ASTRA-sim with examples, and discuss ongoing development efforts.
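As a concrete picture of how the collectives named above relate, the sketch below simulates a ring all-reduce as a reduce-scatter phase followed by an all-gather phase. It is an illustrative model only and is unrelated to ASTRA-sim's implementation or input format. The function name `ring_allreduce`, the ring algorithm, and the use of one number per chunk are assumptions made for this example.

```python
def ring_allreduce(buffers):
    """Simulate a ring all-reduce (sum) over P workers.

    buffers: list of P lists, each holding P chunks (one number per chunk).
    Returns (final buffers, number of communication steps).
    """
    P = len(buffers)
    data = [list(b) for b in buffers]
    steps = 0
    # Phase 1, reduce-scatter: after P - 1 steps, worker r owns the fully
    # reduced chunk (r + 1) % P.
    for s in range(P - 1):
        # Collect all messages first so that sends within a step are simultaneous.
        msgs = [(r, (r - s) % P, data[r][(r - s) % P]) for r in range(P)]
        for r, chunk, value in msgs:
            data[(r + 1) % P][chunk] += value
        steps += 1
    # Phase 2, all-gather: the reduced chunks travel once around the ring.
    for s in range(P - 1):
        msgs = [(r, (r + 1 - s) % P, data[r][(r + 1 - s) % P]) for r in range(P)]
        for r, chunk, value in msgs:
            data[(r + 1) % P][chunk] = value
        steps += 1
    return data, steps

result, steps = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result, steps)  # [[111, 222, 333], [111, 222, 333], [111, 222, 333]] 4
```

The step count of 2 * (P - 1) grows with the number of workers, which illustrates why the choice of collective algorithm and network topology interacts with overall training runtime.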

YArch 2021

Time: Friday, April 16 | Full Day

The third Young Architect Workshop (YArch ’21, pronounced “why arch”) will provide a forum for junior graduate students and research-active undergraduates studying computer architecture and related fields to present early-stage or ongoing work and receive constructive feedback from experts in the field as well as from their peers. Students will also receive mentoring opportunities in the form of keynote talks, a panel discussion geared toward young architects, and 1-on-1 meetings with established architects. Students will receive feedback from experts both on their research topic in general and, more specifically, on their research directions. Students will also have an opportunity to receive valuable career advice from leaders in the field, to network with their peers, and to develop long-lasting, community-wide relationships.

Workshop on Systems and Architectures for Robust Software 2.0

Time: Friday, April 16 | Full Day

Unlike Software 1.0 (conventional programs), which is manually coded with hardened parameters and explicit logic, Software 2.0 programs, usually manifested as and enabled by Deep Neural Networks (DNNs), have learned parameters and implicit logic. While the systems and architecture communities have focused, rightly so, on the efficiency of DNNs, Software 2.0 exposes a unique set of challenges for robustness, safety, and resiliency. The workshop fosters an interactive discussion about the role of computer systems and architecture research in robust, safe, and resilient Software 2.0.