Large-scale Machine Learning and Optimization

Spring 2018

This course will explore the mathematical foundations of a rapidly evolving new field: large­-scale optimization and machine learning. We will focus on recent texts in machine learning, optimization, and randomized algorithms, with the goal to understand the tradeoffs that are driving algorithmic design in this new discipline. These trade­offs will revolve around statistical accuracy, scalability, algorithmic complexity, and implementation.

Sample topics include:
  • Optimization and Learning
    • Stochastic Methods for Convex and Nonconvex Settings
    • Overfitting, Generalization, and Algorithmic Stability
    • Expressive Power of Neural Nets, Hardness, and Recent Results

  • Large Scale Learning and Systems
    • System Tradeoffs, Platforms, and Bottlenecks
    • Synchronous and Asynchronous Distributed Optimization
    • Stragglers and Adversarial Attacks during Distributed Learning


Week 1

  • Introduction and Course Overview
    Slides: [lecture 1]
  • Concentration of the Empirical Risk
    Reading material: Chapters 4 and 6 of [2], Sections 2.2 and 2.3 of [3].
    Lecture notes: [pdf]
    Lecture scribe: [pdf]

Week 2

  • Computational Aspects of the ERM and Families of Loss Functions
    Reading material: Chapters 6 and 12 of [2].
    Lecture notes: [pdf]
    Lecture scribe: [pdf]
  • Convexity in Learning and Intro to Gradient Descent
    Reading material: Chapter 12 of [2], Chapter 3 of [1].
    Lecture notes: [pdf]
    Lecture scribe: [pdf]

Week 3

  • Convergence Rates and Complexity of Gradient Descent
    Reading material: Chapter 3 of [1], Chapter 14 of [2].
    Lecture notes: [pdf]
    Lecture scribe: [pdf]
  • The Stochastic Gradient Method
    Reading material: Chapter 6 of [1], Chapter 14 of [2], Section 3 of [3], Chapter 2 of [4].
    Lecture notes: [pdf]
    Lecture scribe: [pdf]

Week 4

  • Accelerating SGD with Variance Reduction
    Reading material: [5], Section 6.3 of [1], Section 5.3 of [3].
    Lecture notes: [pdf]
    Lecture scribe: [pdf]
  • Random Coordinate Descent and Importance Sampling
    Reading material: Section 6.4 of [1], Section 7.3 of [3], [6], [7].
    Lecture notes: [pdf]
    Lecture scribe: [pdf]

Week 5

  • SGD Bounds for some Nonconvex Problems
    Reading material: [11], [7].
    Lecture notes: [pdf]
    Lecture scribe: [pdf]
  • Stepsize Rules (of Thumb), Subgradients, and Projections
    Reading material: [8] , [4], Sections 1.2, 3.1 of [1].
    Lecture notes: [pdf]
    Lecture scribe:

  • Week 6

    • Challenges in Distributed and Parallel Machine Learning
      Lecture slides: [pptx]
    • Scaling-up SGD with Mini-Batches
      Reading material: [19], [20].
      Lecture slides: [pptx]
      Lecture scribe:

    • Week 7

      • Lifting the Synchronization Barriers and Freeing the Locks
        Reading material: [15]
        Lecture slides: [pptx]
        Lecture scribe: [pdf]
      • Understanding Hogwild!: Convergence Rates and Challenges
        Reading material: [15], [16]
        Lecture slides: [pptx]
        Lecture scribe: [pdf]

      • Week 8

      • No Class

      • Week 9

      • Proposal Presentations

      • Week 11

        • Serial Equivalence in Asynchronous Machine Learning
          Reading material: [17], [18]
          Lecture slides: [pptx]
          Lecture scribe: [pdf]
        • Communication Bottlenecks and Gradient Quantization
          Reading material: [21], [22]
          Lecture slides: [pptx]
          Lecture scribe:

        • Week 12

          • Mitigating Stragglers in Distributed Computation
            Reading material: [23], [24]
            Lecture slides: [pptx]
            Lecture scribe:
          • Inference on Deep Networks, Model Compression and Quantization
            Reading material: [25], [26], [27]
            Lecture slides: [pptx]

          • Week 13

            • Robustness of Predictions and Adversarial Examples
              Lecture slides: [pptx]


  1. Convex Optimization: Algorithms and Complexity, Sebastien Bubeck.
  2. Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David.
  3. Optimization Methods for Large-Scale Machine Learning, by Léon Bottou, Frank E. Curtis, and Jorge Nocedal.
  4. Introductory Lectures on Stochastic Optimization, by John Duchi.
  5. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, by Rie Johnson and Tong Zhang.
  6. Efficiency of coordinate descent methods on huge-scale optimization problems, by Yurii Nesterov.
  7. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition, by Hamed Karimi, Julie Nutini, and Mark Schmidt.
  8. Subgradients (course notes) , by Stephen Boyd, John Duchi, and Lieven Vandenberghe.
  9. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization, by Martin Jaggi.
  10. Projection-free Online Learning, by Elad Hazan and Satyen Kale.
  11. Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming, by by Saeed Ghadimi and Guanghui Lan.
  12. No bad local minima: Data independent training error guarantees for multilayer neural networks, by Daniel Soudry and Yair Carmon.
  13. Stability and Generalization, by Olivier Bousquet and André Elisseef.
  14. Train faster, generalize better: Stability of stochastic gradient descent, by Moritz Hardt, Benjamin Recht, and Yoram Singer.
  15. HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, by Feng Niu, Benjamin Recht, Christopher Ré and Stephen J. Wright.
  16. Perturbed Iterate Analysis for Asynchronous Stochastic Optimization, by Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan.
  17. CYCLADES: Conflict-free Asynchronous Machine Learning, by Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, Ce Zhang, Michael I. Jordan, Kannan Ramchandran, Christopher Ré, Benjamin Recht.
  18. The phase transition in site percolation on pseudo-random graphs, by Michael Krivelevich.
  19. Better mini-batch algorithms via accelerated gradient methods, by Cotter A, Shamir O, Srebro N, Sridharan K., 2011.
  20. Gradient Diversity: a Key Ingredient for Scalable Distributed Learning, by Yin, D., Pananjady, A., Lam, M., Papailiopoulos, D., Ramchandran, K. and Bartlett, P., AISTATS 2018.
  21. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding, D. Alistarh, D. Grubic, J. Li, R. Tomioka, M. Vojnovic, NIPS 2017.
  22. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, by W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, H. Li, NIPS 2017.
  23. Speeding up distributed machine learning using codes, by K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, K. Ramchandran, IEEE Trans. Info. Theory, 2017.
  24. Gradient coding: Avoiding stragglers in distributed learning, by R. Tandon, Q. Lei, A. G. Dimakis, N. Karampatziakis, ICML 2017.
  25. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, by S. Han, H. Mao, W. J. Dally, ICLR 2016.
  26. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, by M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, ECCV 2016.
  27. An Analysis of Deep Neural Network Models for Practical Applications, by M. Alfredo Canziani, Adam Paszke, Eugenio Culurciello, arXiv 2016.

Downloads and useful links