ECE 8803 / CS 8803 – HML
Spring 2025
Course Information
Instructor: Professor Tushar Krishna (tushar@ece.gatech.edu)
Office: KACB 2318
Office Hours: By Appointment
Lectures: Tue & Thu 3:30 – 5:00 pm
Room: TBA
Expected Prerequisite(s): ECE 4100 / ECE 6100 (Advanced Computer Architecture) or CS 4290 / CS 6290 (High Performance Computer Architecture). Simultaneous registration will be allowed.
Course Description
The advancement of Artificial Intelligence (AI) can be attributed to synergistic advances in big data sets, machine learning (ML) algorithms, and the hardware and systems used to deploy these models. Specifically, deep neural networks (DNNs) have shown highly promising results in tasks across vision, speech, and natural language processing. Unfortunately, DNNs come with significant computational and memory demands, which can reach Zeta (10^21) FLOPs and Tera (10^12) Bytes, respectively, for Large Language Models such as those driving ChatGPT. Efficient processing of these DNNs necessitates HW-SW co-design. Such co-design efforts have led to the emergence of (i) specialized hardware accelerators designed for DNNs (e.g., Google’s TPU, Meta’s MTIA, Amazon’s Inferentia & Trainium, and so on) and (ii) specialized distributed systems comprising hundreds to thousands of these accelerators connected via specialized fabrics. Furthermore, GPU and FPGA architectures and libraries have also evolved to accelerate DNNs.
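To give a sense of where magnitudes like Zeta FLOPs come from, the sketch below uses a common back-of-envelope rule of thumb for dense transformers: training costs roughly 6 FLOPs per parameter per token. The specific model and token counts in the example are illustrative assumptions, not course data.

```python
# Back-of-envelope estimate of LLM training compute.
# Rule of thumb for dense transformers: ~6 FLOPs per parameter per token
# (forward + backward pass). Model/token sizes below are illustrative.

def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training FLOPs: ~6 * parameters * tokens."""
    return 6 * num_params * num_tokens

# Example: a 175-billion-parameter model trained on 300 billion tokens
flops = training_flops(175e9, 300e9)
print(f"{flops:.2e} FLOPs")  # on the order of 10^23, i.e. hundreds of ZettaFLOPs
```

Even this crude estimate lands at hundreds of ZettaFLOPs, which is why single-chip efficiency alone is not enough and distributed execution becomes essential.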
This course aims to present recent advancements that strive to achieve efficient processing of DNNs. Specifically, it will offer an overview of DNNs, delve into techniques to distribute the workload, dive into various architectures and systems that support DNNs, and highlight key trends in recent techniques for efficient processing. These techniques aim to reduce the computational and communication costs associated with DNNs through hardware and system optimizations. The course will also provide a summary of various development resources to help researchers and practitioners initiate DNN deployments swiftly. Additionally, it will emphasize crucial benchmarking metrics and design considerations for evaluating the rapidly expanding array of DNN hardware designs and system optimizations proposed in both academia and industry.
Course Objectives: This course will present recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various platforms and architectures that support DNNs – both single-chip and distributed – and highlight key trends in recent efficient processing techniques that reduce the computation and communication cost of DNNs via hardware-software co-design. It will also summarize various development resources that can enable researchers and practitioners to quickly get started on DNN accelerator and system architectures, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware platforms being proposed in academia and industry.
Learning Outcomes: As part of this course, students will: understand the key design considerations for efficient DNN processing; understand tradeoffs between various hardware architectures and platforms; understand the need for, and means of, distributing ML; evaluate the utility of various DNN strategies for end-to-end efficient execution; and understand future trends and opportunities, from ML algorithms and system innovations down to emerging technologies.
Course Structure and Content: The course will involve a mix of lectures interspersed with heavy paper reading and discussions. The course will also feature guest lectures from industry practitioners. In lieu of exams, there will be multiple programming assignments involving a mix of simulation and real systems. The material for this course will be derived from papers at recent computer architecture conferences (ISCA, MICRO, HPCA, ASPLOS) on hardware acceleration, papers at ML conferences (ICML, NeurIPS, ICLR) focusing on hardware-friendly optimizations, and blog articles from industry (Google, Meta, Microsoft, NVIDIA, Intel, etc.).
Course Schedule:
Slides
- L01-Intro.pdf: publicly available for students who are not registered by Lecture 1
- Remaining slides: available on Canvas for registered students
Lab Assignments
- Lab 0: Running AI models on CPU and GPU
- Lab 1: DNN Operator Analysis
- Lab 2: Design-space Exploration of AI Accelerator
- Lab 3: Real-system Execution of Distributed AI models
- Lab 4: Design-space Exploration of Distributed AI System
- Lab 5: Topology-aware Collective Algorithms
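As a tiny taste of the kind of operator analysis Lab 1 involves, the sketch below computes FLOPs, minimum bytes moved, and arithmetic intensity for a GEMM, the core kernel in fully connected and attention layers. The function name, formulas, and example shapes are my own illustration, not lab starter code.

```python
# Illustrative operator analysis for a GEMM of shape (M x K) @ (K x N).
# A sketch in the spirit of Lab 1, not actual lab material.

def gemm_stats(M: int, K: int, N: int, bytes_per_elem: int = 2):
    """Return FLOPs, minimum bytes moved, and arithmetic intensity for C = A @ B."""
    flops = 2 * M * K * N  # one multiply + one add per MAC
    # Ideal-reuse lower bound: read A and B once, write C once
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)
    return flops, bytes_moved, flops / bytes_moved

# Example: a 4096 x 4096 x 4096 GEMM in fp16 (2 bytes per element)
flops, bytes_moved, intensity = gemm_stats(4096, 4096, 4096)
print(f"{flops/1e9:.1f} GFLOPs, {bytes_moved/1e6:.1f} MB, "
      f"intensity {intensity:.0f} FLOPs/byte")
```

For the square 4096 example this gives an arithmetic intensity of about 1365 FLOPs/byte under the ideal-reuse assumption; comparing such numbers against a platform's compute-to-bandwidth ratio is the essence of roofline-style analysis.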
FAQs
What is the required background? Do I need to know Machine Learning?
Some basic understanding of Machine Learning, especially Deep Neural Networks, will be useful. It is not an enforced requirement, but your time in the course will be better spent if you know the algorithms at a high level and can learn their HW implementation and optimizations (which is the focus of the course), rather than hearing about these algorithms for the first time.
Do I need to know how to use ML frameworks such as TensorFlow, PyTorch, etc.?
Not extensively, but having some experience running one of these frameworks will be useful.
What sort of coding background is needed? C++? RTL?
Familiarity with Python and C++ will be crucial.
Honor Code:
Students are expected to abide by the Georgia Tech Academic Honor Code. Honest and ethical behavior is expected at all times. All incidents of suspected dishonesty will be reported to and handled by the office of student affairs. You will have to do all assignments individually unless explicitly told otherwise. You may discuss with classmates but you may not copy any solution (or any part of a solution).