ECE826:
Theoretical Foundations of Large-scale Machine Learning

Spring 2024


Engineering Hall ENGR 1156



This course will explore the mathematical foundations of a rapidly evolving new field: large-scale machine learning. We will focus on recent texts in machine learning, statistics, and optimization, with the goal of understanding the tradeoffs that drive algorithmic design in this new discipline. These tradeoffs revolve around statistical accuracy, scalability, algorithmic complexity, and implementation.

Sample topics include:
  • Optimization and Learning
    • Memorization, Generalization, and Algorithmic Stability
    • Stochastic Methods for Convex and Nonconvex Settings (see the sketch after this list)
    • Expressive Power of Neural Nets, Hardness, and Recent Results
    • From Supervised Learning to Language Modeling

  • Large Scale Learning and Systems
    • Modern Architectures: Transformers and State Space Models
    • System Tradeoffs, Platforms, and Modern Architectures
    • Centralized and Decentralized Distributed Optimization
    • Delays, Communication Bottlenecks, and Adversarial Attacks
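As a concrete illustration of the stochastic-methods topic above, the following is a minimal mini-batch SGD sketch on a least-squares objective. It is written in Python/NumPy purely for illustration; the function sgd_least_squares, its default parameters, and the synthetic data are assumptions of this sketch rather than course material.

  import numpy as np

  def sgd_least_squares(X, y, step_size=0.05, batch_size=32, epochs=20, seed=0):
      """Mini-batch SGD on f(w) = (1/2n) ||Xw - y||^2 (illustrative sketch only)."""
      rng = np.random.default_rng(seed)
      n, d = X.shape
      w = np.zeros(d)
      for _ in range(epochs):
          perm = rng.permutation(n)
          for start in range(0, n, batch_size):
              idx = perm[start:start + batch_size]
              # Stochastic gradient of the least-squares loss on the current mini-batch.
              grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
              w -= step_size * grad
      return w

  if __name__ == "__main__":
      rng = np.random.default_rng(1)
      X = rng.standard_normal((1000, 20))
      w_star = rng.standard_normal(20)
      y = X @ w_star + 0.1 * rng.standard_normal(1000)
      w_hat = sgd_least_squares(X, y)
      print("distance to planted model:", np.linalg.norm(w_hat - w_star))

The per-epoch permutation samples mini-batches without replacement, a common practical choice; much of the theory in the references below instead analyzes sampling with replacement.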
[syllabus]

Lectures

Week 1

  • Introduction and Course Overview

References

  1. Convex Optimization: Algorithms and Complexity, by Sébastien Bubeck.
  2. Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David.
  3. Optimization Methods for Large-Scale Machine Learning, by Léon Bottou, Frank E. Curtis, and Jorge Nocedal.
  4. Introductory Lectures on Stochastic Optimization, by John Duchi.
  5. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, by Rie Johnson and Tong Zhang.
  6. Efficiency of coordinate descent methods on huge-scale optimization problems, by Yurii Nesterov.
  7. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition, by Hamed Karimi, Julie Nutini, and Mark Schmidt.
  8. Subgradients (course notes), by Stephen Boyd, John Duchi, and Lieven Vandenberghe.
  9. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization, by Martin Jaggi.
  10. Projection-free Online Learning, by Elad Hazan and Satyen Kale.
  11. Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming, by Saeed Ghadimi and Guanghui Lan.
  12. No bad local minima: Data independent training error guarantees for multilayer neural networks, by Daniel Soudry and Yair Carmon.
  13. Stability and Generalization, by Olivier Bousquet and André Elisseeff.
  14. Train faster, generalize better: Stability of stochastic gradient descent, by Moritz Hardt, Benjamin Recht, and Yoram Singer.
  15. HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, by Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright.
  16. Perturbed Iterate Analysis for Asynchronous Stochastic Optimization, by Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan.
  17. CYCLADES: Conflict-free Asynchronous Machine Learning, by Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, Ce Zhang, Michael I. Jordan, Kannan Ramchandran, Christopher Ré, and Benjamin Recht.
  18. The phase transition in site percolation on pseudo-random graphs, by Michael Krivelevich.
  19. Better mini-batch algorithms via accelerated gradient methods, by Andrew Cotter, Ohad Shamir, Nathan Srebro, and Karthik Sridharan, NIPS 2011.
  20. Gradient Diversity: a Key Ingredient for Scalable Distributed Learning, by Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett, AISTATS 2018.
  21. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding, by Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic, NIPS 2017.
  22. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, by Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li, NIPS 2017.
  23. Speeding up distributed machine learning using codes, by Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran, IEEE Trans. Info. Theory, 2017.
  24. Gradient coding: Avoiding stragglers in distributed learning, by Rashish Tandon, Qi Lei, Alexandros G. Dimakis, and Nikos Karampatziakis, ICML 2017.
  25. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, by Song Han, Huizi Mao, and William J. Dally, ICLR 2016.
  26. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, by Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, ECCV 2016.
  27. An Analysis of Deep Neural Network Models for Practical Applications, by Alfredo Canziani, Adam Paszke, and Eugenio Culurciello, arXiv 2016.