CSE Distinguished Seminar | Michael Mahoney
Prof. Mahoney's Website 
Abstract
Deep learning has brought significant advances in computer vision, natural language processing, and related areas, leading to models that match or exceed human-level performance. An often overlooked aspect of this so-called deep learning revolution, however, is that practically useful techniques are largely ad hoc, because theory often does not provide even a qualitative guide to practice. This problem is typically “solved” by introducing many parameters and hyperparameters, which are then determined by massive tuning procedures and random or quasi-random search. This requires extensive computation, and it leads to a situation that is not scalable, not robust, and not reproducible. In this talk, I will argue that systematic use of second-order methods (which use second-derivative, or Hessian, information) provides a path forward to the next leap in the deep learning revolution, and I will show that, despite popular wisdom viewing them as purely theoretical tools, these methods can yield significant gains in practice.
First, I will discuss large-batch training, including the efficiency/inefficiency of training with SGD, KFAC, and other second-order methods. Second, I will address the common misconception that computing second-order information is slow by presenting a new scalable framework for computing Hessian information, including its full eigenvalue spectrum. The framework is written on top of Ray, and it supports adaptive scaling of compute nodes. Using this framework, I will present scaling results showing that the time to compute Hessian information is comparable to gradient computation time, and I will present a test case showing how the Hessian can be used during training. This test case uses adversarial examples to implement a form of robust optimization, thereby smoothing the Hessian landscape, and it results in significant speedups. Third, I will describe a new systematic approach to model compression using second-order information, resulting in unprecedentedly small models for a range of challenging problems in image classification, object detection, and natural language processing. In particular, second-order information can be used to provide significant improvements in the quantization of a range of modern state-of-the-art networks, including: (i) ResNet50/152, Inception-V3, and SqueezeNext for ImageNet; (ii) RetinaNet-ResNet50 for Microsoft COCO object detection; and (iii) the BERT model for natural language processing. All results are obtained using academic resources and without any expensive search, but they exceed *all* industry-level results, including expensive SGD-based Auto-ML methods. Finally, I will discuss some future directions involving stochastic second-order methods.
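As a rough illustration of why Hessian information need not be much more expensive than gradients, the sketch below estimates the largest Hessian eigenvalue using only gradient evaluations: a Hessian-vector product is approximated by a central difference of two gradients, and power iteration is run on top of it. This is a minimal sketch and not the framework described in the talk; the toy quadratic loss, the matrix `A`, and the function names `grad`, `hvp`, and `top_eigenvalue` are all assumptions made for illustration.

```python
import math
import random

# Hypothetical toy loss f(w) = 0.5 * w^T A w, so the Hessian is exactly A.
# In a real network, grad() would be one backpropagation pass.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.0, 0.0, 1.0]]


def grad(w):
    """Gradient of the toy loss: A @ w."""
    n = len(w)
    return [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]


def hvp(grad_fn, w, v, eps=1e-4):
    """Approximate the Hessian-vector product H v with two gradient calls.

    Uses the central difference (g(w + eps*v) - g(w - eps*v)) / (2*eps),
    so the Hessian matrix itself is never formed.
    """
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_fn(w_plus), grad_fn(w_minus)
    return [(a - b) / (2.0 * eps) for a, b in zip(g_plus, g_minus)]


def top_eigenvalue(grad_fn, w, iters=200, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        eig = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]
    return eig


w0 = [0.5, -1.0, 2.0]
estimate = top_eigenvalue(grad, w0)
print(f"estimated top Hessian eigenvalue: {estimate:.4f}")
```

Each power-iteration step here costs two gradient evaluations, which is the sense in which the work scales like gradient computation. In practice one would more likely obtain the Hessian-vector product from automatic differentiation than from finite differences; the finite-difference form is used here only to keep the sketch dependency-free.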
Research Interests
On the theory side, we develop algorithmic and statistical methods for matrix, graph, regression, optimization, and related problems. On the implementation side, we provide implementations (e.g., in single-machine, distributed data system, and supercomputer environments) of a range of matrix, graph, and optimization algorithms. On the applied side, we apply these methods to a range of problems in internet and social media analysis and social network analysis, as well as in genetics, mass spectrometry imaging, astronomy, climate, and a range of other scientific applications.
Bio
Michael W. Mahoney is at the University of California at Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics, and he has worked and taught at Yale University in the mathematics department, at Yahoo Research, and at Stanford University in the mathematics department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), he was on the National Research Council’s Committee on the Analysis of Massive Data, he co-organized the Simons Institute’s fall 2013 and 2018 programs on the foundations of data science, and he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets. He is currently the Director of the NSF/TRIPODS-funded FODA (Foundations of Data Analysis) Institute at UC Berkeley. He holds several patents for work done at Yahoo Research and as Lead Data Scientist for Vieu Labs, Inc., a startup reimagining consumer video for billions of users.