Dimitris Papailiopoulos

ECE901/CS838:
Large-scale Machine Learning and Optimization

Spring 2018

This course will explore the mathematical foundations of a rapidly evolving new field: large-scale optimization and machine learning. We will focus on recent texts in machine learning, optimization, and randomized algorithms, with the goal to understand the tradeoffs that are driving algorithmic design in this new discipline. These tradeoffs will revolve around statistical accuracy, scalability, algorithmic complexity, and implementation.

Sample topics include:

Optimization and Learning

Stochastic Methods for Convex and Nonconvex Settings

Overfitting, Generalization, and Algorithmic Stability

Expressive Power of Neural Nets, Hardness, and Recent Results

Large Scale Learning and Systems

System Tradeoffs, Platforms, and Bottlenecks

Synchronous and Asynchronous Distributed Optimization

Stragglers and Adversarial Attacks during Distributed Learning

[syllabus]

Lectures

Week 1

Introduction and Course Overview

Slides: [lecture 1]
Concentration of the Empirical Risk

Reading material: Chapters 4 and 6 of [2], Sections 2.2 and 2.3 of [3].
Lecture notes: [pdf]
Lecture scribe: [pdf]

Week 2

Computational Aspects of the ERM and Families of Loss Functions

Reading material: Chapters 6 and 12 of [2].
Lecture notes: [pdf]
Lecture scribe: [pdf]
Convexity in Learning and Intro to Gradient Descent

Reading material: Chapter 12 of [2], Chapter 3 of [1].
Lecture notes: [pdf]
Lecture scribe: [pdf]

Week 3

Convergence Rates and Complexity of Gradient Descent

Reading material: Chapter 3 of [1], Chapter 14 of [2].
Lecture notes: [pdf]
Lecture scribe: [pdf]
The Stochastic Gradient Method

Reading material: Chapter 6 of [1], Chapter 14 of [2], Section 3 of [3], Chapter 2 of [4].
Lecture notes: [pdf]
Lecture scribe: [pdf]

Week 4

Accelerating SGD with Variance Reduction

Reading material: [5], Section 6.3 of [1], Section 5.3 of [3].
Lecture notes: [pdf]
Lecture scribe: [pdf]
Random Coordinate Descent and Importance Sampling

Reading material: Section 6.4 of [1], Section 7.3 of [3], [6], [7].
Lecture notes: [pdf]
Lecture scribe: [pdf]

Week 5

SGD Bounds for some Nonconvex Problems

Reading material: [11], [7].
Lecture notes: [pdf]
Lecture scribe: [pdf]
Stepsize Rules (of Thumb), Subgradients, and Projections

Reading material: [8] , [4], Sections 1.2, 3.1 of [1].
Lecture notes: [pdf]
Lecture scribe:

Week 6

Challenges in Distributed and Parallel Machine Learning

Lecture slides: [pptx]
Scaling-up SGD with Mini-Batches

Reading material: [19], [20].
Lecture slides: [pptx]
Lecture scribe:

Week 7

Lifting the Synchronization Barriers and Freeing the Locks

Reading material: [15]
Lecture slides: [pptx]
Lecture scribe: [pdf]
Understanding Hogwild!: Convergence Rates and Challenges

Reading material: [15], [16]
Lecture slides: [pptx]
Lecture scribe: [pdf]

Week 8

No Class

Week 9

Proposal Presentations

Week 11

Serial Equivalence in Asynchronous Machine Learning

Reading material: [17], [18]
Lecture slides: [pptx]
Lecture scribe: [pdf]
Communication Bottlenecks and Gradient Quantization

Reading material: [21], [22]
Lecture slides: [pptx]
Lecture scribe:

Week 12

Mitigating Stragglers in Distributed Computation

Reading material: [23], [24]
Lecture slides: [pptx]
Lecture scribe:
Inference on Deep Networks, Model Compression and Quantization

Reading material: [25], [26], [27]
Lecture slides: [pptx]

Week 13

Robustness of Predictions and Adversarial Examples

Lecture slides: [pptx]
Defenses against Adversarial Examples

Lecture slides: [pptx]

Week 14

Training in the face of Adversity

Lecture slides: [pptx]
Open Problems in Large-Scale Learning

Lecture slides: [pptx]

References

Convex Optimization: Algorithms and Complexity, Sebastien Bubeck.

Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David.

Optimization Methods for Large-Scale Machine Learning, by Léon Bottou, Frank E. Curtis, and Jorge Nocedal.

Introductory Lectures on Stochastic Optimization, by John Duchi.

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, by Rie Johnson and Tong Zhang.

Efficiency of coordinate descent methods on huge-scale optimization problems, by Yurii Nesterov.

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition, by Hamed Karimi, Julie Nutini, and Mark Schmidt.

Subgradients (course notes) , by Stephen Boyd, John Duchi, and Lieven Vandenberghe.

Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization, by Martin Jaggi.

Projection-free Online Learning, by Elad Hazan and Satyen Kale.

Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming, by by Saeed Ghadimi and Guanghui Lan.

No bad local minima: Data independent training error guarantees for multilayer neural networks, by Daniel Soudry and Yair Carmon.

Stability and Generalization, by Olivier Bousquet and André Elisseef.

Train faster, generalize better: Stability of stochastic gradient descent, by Moritz Hardt, Benjamin Recht, and Yoram Singer.

HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, by Feng Niu, Benjamin Recht, Christopher Ré and Stephen J. Wright.

Perturbed Iterate Analysis for Asynchronous Stochastic Optimization, by Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan.

CYCLADES: Conflict-free Asynchronous Machine Learning, by Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, Ce Zhang, Michael I. Jordan, Kannan Ramchandran, Christopher Ré, Benjamin Recht.

The phase transition in site percolation on pseudo-random graphs, by Michael Krivelevich.

Better mini-batch algorithms via accelerated gradient methods, by Cotter A, Shamir O, Srebro N, Sridharan K., 2011.

Gradient Diversity: a Key Ingredient for Scalable Distributed Learning, by Yin, D., Pananjady, A., Lam, M., Papailiopoulos, D., Ramchandran, K. and Bartlett, P., AISTATS 2018.

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding, D. Alistarh, D. Grubic, J. Li, R. Tomioka, M. Vojnovic, NIPS 2017.

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, by W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, H. Li, NIPS 2017.

Speeding up distributed machine learning using codes, by K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, K. Ramchandran, IEEE Trans. Info. Theory, 2017.

Gradient coding: Avoiding stragglers in distributed learning, by R. Tandon, Q. Lei, A. G. Dimakis, N. Karampatziakis, ICML 2017.

Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, by S. Han, H. Mao, W. J. Dally, ICLR 2016.

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, by M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, ECCV 2016.

An Analysis of Deep Neural Network Models for Practical Applications, by M. Alfredo Canziani, Adam Paszke, Eugenio Culurciello, arXiv 2016.

Downloads and useful links

Course syllabus

Slides for lecture 1

Quizz 0

Scribe template

SGD and GD iPython notebook: examples on classification and regression tasks, on artificial and real data.

Tensorflow Playground: experiment with SGD, learning rates, NN parameters, and regularization.

ECE901/CS838: Large-scale Machine Learning and Optimization

Spring 2018

[syllabus]

Lectures

References

Downloads and useful links

Course syllabus Slides for lecture 1 Quizz 0 Scribe template SGD and GD iPython notebook: examples on classification and regression tasks, on artificial and real data. Tensorflow Playground: experiment with SGD, learning rates, NN parameters, and regularization.

ECE901/CS838:
Large-scale Machine Learning and Optimization

Course syllabus

Slides for lecture 1

Quizz 0

Scribe template

SGD and GD iPython notebook: examples on classification and regression tasks, on artificial and real data.

Tensorflow Playground: experiment with SGD, learning rates, NN parameters, and regularization.