Workshops and Tutorials





Tutorial on FireSim, Chipyard, and Hammer: End-to-end Architecture Research with RISC-V SoC Generators, Agile Test Chips, and FPGA-Accelerated Simulation

Saturday, March 25th – Full day

This tutorial gives a hands-on introduction to and walk-through of FireSim, Chipyard, and Hammer, which together enable end-to-end architecture/systems research with RISC-V SoC generators, agile test chips, and FPGA-accelerated simulation. We will provide attendees with free access to AWS EC2 F1 instances so that they can follow the tutorial interactively. Attendees will be able to customize an industry- and silicon-proven RISC-V microprocessor design, run their own high-performance FPGA-accelerated simulations of their design in the cloud, and learn how to push their design to silicon, guided by the FireSim, Chipyard, and Hammer developers.

Immersive Visual Computing From Sensing and Computing to Humans: A Primer for Computer Architects (Tutorial)

Saturday, March 25th – Morning

Emerging applications on the horizon such as autonomous machines, Augmented/Virtual Reality, and smart cities are all fundamentally visual computing applications. In real time and with low power consumption, these applications have to simultaneously acquire, analyze, and sometimes generate massive visual data, challenging computing systems and architecture. This tutorial will introduce the fundamental working principles of visual computing, spanning optics, image sensing, computing, and human perception.

XiangShan: An Open Source High Performance RISC-V Processor and Infrastructure for Architecture Research (Tutorial)

Saturday, March 25th – Morning

Over the past decade, agile and open-source hardware has gained increasing attention in both academia and industry. We believe that open-source hardware design and, more importantly, free and open development infrastructure have the opportunity to bring more convenience to architecture research and stimulate innovation.

In this tutorial, we will present our efforts on the XiangShan project. XiangShan is an open-source, industry-competitive, high-performance RISC-V processor. It has raised the performance ceiling of publicly accessible processors and set the competitive groundwork for future computer architecture research. Behind the processor itself, there is also an agile development platform called Minjie that integrates a broad set of development tools as infrastructure. We will demonstrate how XiangShan, together with Minjie, helps researchers realize their innovative ideas in an agile manner and obtain convincing evaluation results.

Creating a Compelling and Sustainable Tutorial

Saturday, March 25th – Afternoon

Perhaps you’ve got a great new research project that you’d like to share with the world. Maybe you built a new open-source simulator or hardware design that you want to encourage others to adopt. A natural next step is to consider creating a tutorial to advertise the work and generate a user base. That sounds like a lot of work! And would you really give it more than once?

This tutorial is intended to help lower the barrier to entry of creating an academic tutorial in architecture and related fields. Attendees will develop their goals, learn about best practices, and start to think about the nuts and bolts of running a tutorial. We will focus on enabling tutorials which are repeatable to amortise the startup effort and attract interest over longer timescales.

Predicting and Optimizing Runtime Performance of Deep Learning Models

Saturday, March 25th – Afternoon

In this tutorial, we will introduce techniques to easily find the underutilization and performance bottlenecks of GPUs for deep-learning (DL) workloads. We will then give a brief introduction to CUDA programming as an example of a current way of addressing typical performance bottlenecks and underutilization in DL workloads. We will wrap up by introducing Hidet (ASPLOS 2023 paper), a new DL compiler that allows rapid development of performant tensor programs in a higher-level language such as Python. Hidet documentation is available online.

Young Architect Workshop (YArch)

Sunday, March 26th – Full day

The Young Architect Workshop (YArch, pronounced “why arch”) is a workshop for junior graduate students and research-active undergraduate students studying computer architecture and related fields.

The central theme of the YArch workshop is to serve as a welcoming venue for early-stage graduate students (or undergrads interested in research) to present their ongoing work and receive feedback from experts within the community. In addition, this workshop aims to help students build connections with both their peers and established architects in the community.

FireSim and Chipyard User/Developer Workshop

Sunday, March 26th – Full day

The FireSim and Chipyard user and developer community has experienced rapid growth, with significant cross-institution user and developer collaborations. This workshop aims to bring together these communities to help drive the future direction of this ecosystem and spawn new collaborations.

This workshop will feature talks on computer architecture, systems, programming language, and VLSI research/development from academic and industrial users of FireSim and Chipyard. We hope that the presentations in this workshop will inspire lively discussion of FireSim/Chipyard governance, feature roadmaps, outreach activities, host platform specifications, and more.

Workshop on Languages, Tools, and Techniques for Accelerator Design (LATTE)

Sunday, March 26th – Full day

LATTE is a venue for discussion, debate, and brainstorming at the intersection of hardware acceleration and programming languages research. The focus is on new languages and tools that aim to let domain specialists, not just hardware experts, produce efficient accelerators. A full range of targets is in scope: ASICs (silicon), FPGAs, CGRAs, or future reconfigurable hardware.

Real-world Processing-in-Memory Systems for Modern Workloads

Sunday, March 26th – Full day

Processing-in-Memory (PIM) is a computing paradigm that aims at overcoming the data movement bottleneck (i.e., the waste of execution cycles and energy resulting from the back-and-forth data movement between memory units and compute units) by making memory compute-capable. Explored over several decades since the 1960s, PIM systems are becoming a reality with the advent of the first commercial products and prototypes. 

This tutorial focuses on the latest advances in PIM technology, workload characterization for PIM, and programming and optimizing PIM kernels. We will (1) provide an introduction to PIM and a taxonomy of PIM systems, (2) give an overview and a rigorous analysis of existing real-world PIM hardware, (3) conduct hands-on labs about important workloads (machine learning, sparse linear algebra, bioinformatics, etc.) using real PIM systems, and (4) shed light on how to improve future PIM systems for such workloads.

Benchmarking scale-out server workloads with CloudSuite 4.0

Sunday, March 26th – Morning

Global IT service providers rely on networks of datacenters to deliver an ever-increasing number of offerings such as online storage, search, social connectivity, e-commerce, and media streaming. However, researchers often use desktop benchmark suites (e.g., SPEC), HPC benchmark suites (e.g., PARSEC and Splash), or self-written workloads to evaluate their work. The lack of representative scale-out server workloads diminishes the impact of results and conclusions drawn from evaluating workloads from other application domains.

In this tutorial, we will introduce CloudSuite 4.0, a benchmark suite collecting various typical scale-out server workloads. We will guide the audience on launching the workloads, tuning them into a representative state, collecting various performance metrics, and extracting the instruction/data traces using both Intel Processor Trace and QEMU for evaluation.

Design of Heterogeneous SoCs for ASIC and FPGA Targets with ESP

Sunday, March 26th – Morning

The system-on-chip (SoC) lies at the core of modern computing systems for a variety of domains, from embedded systems to data centers. In any given domain, the success of a particular SoC architecture is bound to the set of special-purpose hardware accelerators that it features next to general-purpose processors. This heterogeneity of SoC components brings new challenges to hardware designers as well as software programmers.

This tutorial illustrates ESP, an open-source platform that supports research on the design and programming of heterogeneous SoC architectures. By combining a scalable modular architecture with a system-level design methodology, ESP simplifies the design of individual accelerators and automates their hardware/software integration into complete SoCs. ESP provides flows for realizing these SoCs both as FPGA prototypes and, more recently, as real ASIC implementations. This tutorial will focus on recent developments to ESP that enhance its architecture and methodology. First, we will showcase how to design a new accelerator in SystemC, leveraging the commercial Catapult HLS tool with the open-source MatchLib library. Next, we will demonstrate new dynamic partial reconfiguration capabilities of FPGA-based ESP implementations that allow for swapping accelerator implementations at runtime. Finally, we will describe the design flow that enabled the first ASIC implementations of ESP and demonstrate how to integrate a technology available in an open-source process design kit into ESP-based chip designs.

For more information, please see:
– the ESP release on GitHub
– the ESP documentation
– the ESP publications
– the ESP tutorials

Enabling Detailed Cycle-Level Simulation of AI and HPC Applications with Detailed Memory Hierarchy using STONNE, OMEGA and SST-STONNE

Sunday, March 26th – Afternoon

The design of specialized architectures for accelerating Deep Learning (DL) inference is a booming area of research. While first-generation accelerators used simple systolic structures with fixed dataflows tailored for dense Deep Neural Network (DNN) applications, more recent architectures from startups (such as SambaNova and Cerebras) and academia (such as MAERI and SIGMA) have argued for flexibility to efficiently support a wide variety of layer types, dimensions, and sparsity by enabling dataflow execution. In addition, the recent appearance of Graph Neural Network (GNN) applications has resulted in multi-phase accelerators that combine the execution of multiple kernels in a pipelined manner, making the architectures much more complex. Moreover, the complexity is further exacerbated by increasingly heterogeneous hardware. As the complexity of these accelerators grows, the analytical models currently used for design-space exploration are unable to capture execution-time subtleties, leading to inexact results in many cases. This opens up a need for cycle-level simulation tools that allow fast and accurate design-space exploration of DL accelerators and rapid quantification of the efficacy of architectural enhancements during the early stages of a design. To this end, STONNE (Simulation TOol of Neural Network Engines) is a cycle-level microarchitectural simulation framework that can plug into any high-level DL framework as an accelerator device and perform full-model evaluation of state-of-the-art systolic and flexible DNN accelerators, both with and without sparsity support.
STONNE is developed by the University of Murcia and the Georgia Institute of Technology and is open-sourced under the terms of the MIT license. In this tutorial, we demonstrate how STONNE enables research on DNN accelerators by means of several use cases that range from the microarchitectural networks-on-chip present in DNN accelerators to the scheduling strategies that can be utilized to improve energy efficiency in sparse accelerators. Further, we also present various enhancements to the core STONNE simulator: (1) OMEGA, a simulator for Graph Neural Network dataflows; (2) SST-STONNE, the integration of STONNE as an element of the Structural Simulation Toolkit (SST); and (3) emerging sparse dataflows, namely simulation of outer-product and Gustavson’s dataflows.
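
For readers unfamiliar with the two sparse dataflows named above, the following Python sketch contrasts them on a sparse matrix product C = A × B. It is only an illustration of the loop orders and is not taken from STONNE. The dictionary-of-rows storage format and the function names `gustavson_spgemm`, `outer_product_spgemm`, and `to_columns` are assumptions made for this example.

```python
from collections import defaultdict

# Sparse matrices are stored as {row_index: {col_index: value}} (an assumed
# format for this sketch; real accelerators use compressed layouts).

def gustavson_spgemm(a_rows, b_rows):
    """Row-wise (Gustavson) dataflow: each row of A selects and scales rows
    of B, and the scaled rows are merged into one row of C."""
    c = defaultdict(dict)
    for i, a_row in a_rows.items():
        for k, a_ik in a_row.items():
            for j, b_kj in b_rows.get(k, {}).items():
                c[i][j] = c[i].get(j, 0) + a_ik * b_kj
    return dict(c)

def to_columns(rows):
    """Convert {row: {col: value}} into {col: {row: value}}."""
    cols = defaultdict(dict)
    for i, row in rows.items():
        for k, value in row.items():
            cols[k][i] = value
    return dict(cols)

def outer_product_spgemm(a_rows, b_rows):
    """Outer-product dataflow: column k of A times row k of B yields a
    partial matrix, and all partial matrices are accumulated into C."""
    a_cols = to_columns(a_rows)
    c = defaultdict(dict)
    for k, a_col in a_cols.items():
        b_row = b_rows.get(k, {})
        for i, a_ik in a_col.items():
            for j, b_kj in b_row.items():
                c[i][j] = c[i].get(j, 0) + a_ik * b_kj
    return dict(c)

if __name__ == "__main__":
    A = {0: {0: 1, 2: 2}, 1: {1: 3}}
    B = {0: {1: 4}, 1: {0: 5}, 2: {1: 6}}
    print(gustavson_spgemm(A, B))
    print(outer_product_spgemm(A, B))
```

Both functions compute the same product. They differ in the order in which operands are visited and partial sums are produced, which is the property a dataflow simulator has to model.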

ASTRA-sim: Enabling SW/HW Co-Design Exploration for Distributed Deep Learning Training Platforms

Sunday, March 26th – Afternoon

Modern deep learning systems rely heavily on distributed training over hardware platforms built from customized high-performance accelerators (e.g., TPU, GPU) connected via high-performance interconnects (e.g., NVLink, XeLink). Examples today include NVIDIA’s HGX H100, Google’s Cloud TPU, Facebook’s Zion, and Intel’s Gaudi HLS-1. Deep Neural Network (DNN) training involves a complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective communication algorithm, network topology, and the accelerator endpoint. Collective communications (e.g., All-Reduce, All-to-All, Reduce-Scatter, All-Gather) are initiated at different phases for different parallelism approaches and play a crucial role in overall runtime if they are not hidden efficiently behind compute. This problem becomes paramount as recent NLP models such as GPT-3 or MSFT-1T and recommendation models such as DLRM have billions to trillions of model parameters and need to be scaled across tens to thousands of accelerator nodes. As innovation in AI/ML models continues to grow at an accelerated rate, there is a need for a comprehensive methodology to understand and navigate this complex design space to (i) architect future platforms and (ii) develop novel parallelism schemes to support efficient training of future DNN models.
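
As a rough illustration of one of the collectives mentioned above, the following Python sketch simulates a ring All-Reduce as a Reduce-Scatter phase followed by an All-Gather phase. It is not taken from ASTRA-sim or any communication library. The function name `ring_all_reduce`, the in-process "workers", and the simplification of one number per chunk are assumptions made for this example.

```python
def ring_all_reduce(buffers):
    """Simulate a ring All-Reduce (sum) over p workers.

    buffers[i] is worker i's data, split into exactly p chunks; each chunk
    is a single number here to keep the sketch short. Returns the buffers
    after the collective, all holding the element-wise sum."""
    p = len(buffers)
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched
    assert all(len(b) == p for b in data), "each worker needs p chunks"

    # Phase 1: Reduce-Scatter. In step s, worker i sends chunk (i - s) mod p
    # to its right neighbour, which adds it to its own copy of that chunk.
    # After p - 1 steps, worker i owns the full sum of chunk (i + 1) mod p.
    for step in range(p - 1):
        sends = [(i, (i - step) % p, data[i][(i - step) % p]) for i in range(p)]
        for src, chunk, value in sends:
            data[(src + 1) % p][chunk] += value

    # Phase 2: All-Gather. In step s, worker i forwards the completed chunk
    # (i + 1 - s) mod p to its right neighbour, which overwrites its copy.
    for step in range(p - 1):
        sends = [(i, (i + 1 - step) % p, data[i][(i + 1 - step) % p]) for i in range(p)]
        for src, chunk, value in sends:
            data[(src + 1) % p][chunk] = value

    return data

if __name__ == "__main__":
    workers = [[1, 2, 3, 4], [10, 20, 30, 40],
               [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
    for row in ring_all_reduce(workers):
        print(row)
```

In this scheme each worker sends 2(p - 1) chunks in total, which is why the cost of the collective depends on both the message size and the network topology it is mapped onto.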

As an ongoing collaboration between Intel, Meta, and Georgia Tech, we have been jointly developing a detailed cycle-accurate distributed training simulator called ASTRA-sim. ASTRA-sim models the co-design space described above and schedules the compute-communication interactions from distributed training over plug-and-play compute and network simulators. It enables a systematic study of bottlenecks at the software and hardware level for scaling training. It also enables end-to-end design-space exploration for running large DNN models over future training platforms. Currently, ASTRA-sim supports two compute models (roofline and SCALE-sim, a Google TPU-like simulator) and several network models (analytical network, Garnet from gem5, and NS3), covering the range from simple analytical modeling to detailed cycle-accurate simulation of large-scale training platforms. In this tutorial, we will educate the research community about the challenges in the emerging domain of distributed training, demonstrate the capabilities of ASTRA-sim with hands-on examples, and discuss ongoing development efforts.
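
To give a sense of what a roofline compute model estimates, here is a minimal Python sketch of the general roofline idea: a kernel's time is bounded by whichever is slower, its arithmetic or its memory traffic. This is not ASTRA-sim's implementation. The function names and all device and layer numbers below are hypothetical and chosen only for illustration.

```python
def roofline_time(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Estimated run time in seconds under a simple roofline model:
    the larger of the compute-bound and the memory-bound time."""
    compute_time = flops / peak_flops          # time if only arithmetic mattered
    memory_time = bytes_moved / mem_bandwidth  # time if only data movement mattered
    return max(compute_time, memory_time)

def is_memory_bound(flops, bytes_moved, peak_flops, mem_bandwidth):
    """A kernel is memory bound when its arithmetic intensity (FLOP per
    byte) lies below the device's ridge point."""
    intensity = flops / bytes_moved
    ridge_point = peak_flops / mem_bandwidth
    return intensity < ridge_point

if __name__ == "__main__":
    # Hypothetical layer: 2 GFLOP of work and 100 MB of memory traffic.
    # Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
    layer = dict(flops=2e9, bytes_moved=1e8)
    device = dict(peak_flops=1e14, mem_bandwidth=1e12)
    print(roofline_time(**layer, **device))
    print(is_memory_bound(**layer, **device))
```

A model of this kind is fast to evaluate but ignores effects such as scheduling and contention, which is where more detailed compute and network simulators come in.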