**Applied Algorithms for ML**

# Schedule

## Day 1: Monday, June 10, 2024

Time | Event | Location |
---|---|---|

8:00 - 9:00 AM | Breakfast | Club de la Chasse |

9:00 - 9:15 AM | Opening Remarks | Club de la Chasse |

9:15 - 10:00 AM | Adversarial Training Should Be Cast as a Non-Zero-Sum Game
One prominent approach toward resolving the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially-chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not engendered sufficient levels of robustness, and suffer from pathological behavior like robust overfitting. To understand this shortcoming, we first show that the commonly used surrogate-based relaxation used in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. The identification of this pitfall informs a novel non-zero-sum bilevel formulation of adversarial training, wherein each player optimizes a different objective function. Our formulation naturally yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains comparable levels of robustness to standard adversarial training algorithms, and does not suffer from robust overfitting.
By Volkan Cevher (EPFL) Slides | Club de la Chasse |
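For context, the zero-sum baseline that this talk critiques typically solves the inner maximization with projected gradient descent (PGD) on a surrogate loss. A minimal NumPy sketch of that standard inner step, not the talk's non-zero-sum formulation; the `grad_loss` callback and all parameter values are illustrative assumptions:

```python
import numpy as np

def pgd_attack(grad_loss, x, y, eps=0.3, alpha=0.05, steps=10):
    # Inner maximization of the zero-sum formulation: take signed
    # gradient-ascent steps on the loss, projecting the perturbation
    # back onto the L-infinity ball of radius eps around x each time.
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta = delta + alpha * np.sign(grad_loss(x + delta, y))
        delta = np.clip(delta, -eps, eps)  # projection onto the eps-ball
    return x + delta
```

With a loss whose gradient is constant, the perturbation saturates at the corner `eps * sign(gradient)` of the ball, which is the expected worst case for a linear model.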

10:00 - 10:45 AM | Self-supervised learning for the geosciences
Digitized data has become abundant, especially in the geosciences, but preprocessing it to create the “labeled data” needed for supervised machine learning is often costly, time-consuming, or even impossible. Fortuitously, in very large-scale data domains, “self-supervised” machine learning methods are now actually outperforming supervised learning methods. In this talk, I will first define self-supervised deep learning, including the notion of a “pretext task.” Then I will survey our lab’s work developing self-supervised learning approaches for several tasks in the geosciences, such as downscaling and temporal interpolation of spatiotemporal data.
By Claire Monteleoni (INRIA/Boulder) Slides | Club de la Chasse |

10:45 - 11:15 AM | Coffee Break | Club de la Chasse |

11:15 AM - 12:00 PM | Accelerating Gradient Descent by Stepsize Hedging
Is it possible to accelerate the convergence of gradient descent without changing the algorithm — simply by judiciously choosing stepsizes? Surprisingly, we show that the answer is yes.
Our proposed Silver Stepsize Schedule optimizes strongly convex functions in $\kappa^{\log_\rho 2} = \kappa^{0.7864}$ iterations, where $\rho=1+\sqrt{2}$ is the silver ratio and $\kappa$ is the condition number. This lies in between the textbook unaccelerated rate $\kappa$ and Nesterov's accelerated rate $\sqrt{\kappa}$ from 1983. The non-strongly convex setting is conceptually identical (and technically simpler), with an accelerated rate $\epsilon^{-\log_\rho 2} = \epsilon^{-0.7864}$. We conjecture, and provide partial evidence, that these rates are optimal among all possible stepsize schedules. The Silver Stepsize Schedule is non-monotonic, fractal-like, and (in the strongly convex case) approximately periodic with a period of $\kappa^{\log_\rho 2}$. Why do such stepsizes help? The core intuition is “hedging” between individually suboptimal strategies — short steps and long steps — since bad cases for one are good cases for the other, and vice versa. Properly combining these stepsizes yields faster convergence due to the misalignment of worst-case functions. Based on joint work with Jason Altschuler (UPenn). https://arxiv.org/abs/2309.07879 https://arxiv.org/abs/2309.16530
By Pablo Parrilo (MIT) Slides | Club de la Chasse |
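The quoted exponent can be checked directly: with the silver ratio ρ = 1 + √2, log_ρ 2 ≈ 0.7864, which sits between the unaccelerated exponent 1 and Nesterov's 1/2. A quick numerical check of what this means for a concrete condition number (my own illustrative choice of κ):

```python
import math

rho = 1 + math.sqrt(2)       # the silver ratio
exponent = math.log(2, rho)  # log base rho of 2
print(round(exponent, 4))    # 0.7864

# Iteration counts (up to constants) for condition number kappa = 10^4:
kappa = 10**4
print(round(kappa**1.0))       # unaccelerated gradient descent: 10000
print(round(kappa**exponent))  # Silver Stepsize Schedule: ~1399
print(round(kappa**0.5))       # Nesterov acceleration: 100
```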

12:00 - 2:00 PM | Lunch (on your own) Restaurant Recommendations | |

2:00 - 2:45 PM | The Combinatorial Side of Graph Neural Networks
Graph Neural Networks (GNNs) are a machine learning architecture to learn functions on graphs. For example, since problem instances for combinatorial optimisation tasks are often modelled as graphs, GNNs have recently received attention as a natural framework for finding good heuristics in neural optimisation approaches.
The question of which functions can actually be learnt by message-passing GNNs and which ones exceed their power has been studied extensively. In this talk, I will consider it from a graph-theoretical perspective. I will survey the Weisfeiler-Leman algorithm as a combinatorial procedure to analyse and compare graph structure, and I will discuss some results concerning the power of the algorithm on natural graph classes. The findings directly translate into insights about the power of GNNs and of their extensions to higher-dimensional neural networks.
By Sandra Kiefer (Oxford) | Club de la Chasse |
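The 1-dimensional Weisfeiler-Leman procedure mentioned above is short enough to sketch: repeatedly recolor each vertex by its current color together with the multiset of its neighbors' colors, and compare the resulting color histograms of two graphs. A minimal illustration (the adjacency-dict encoding and round count are my own choices):

```python
from collections import Counter

def wl_refinement(adj, rounds=3):
    # adj: dict mapping each node to a list of its neighbors.
    colors = {v: 0 for v in adj}  # start with a uniform coloring
    for _ in range(rounds):
        # New signature = (own color, sorted multiset of neighbor colors).
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Compress distinct signatures back to small integer colors.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors

# 1-WL distinguishes a triangle from a path on three vertices:
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
hist = lambda c: sorted(Counter(c.values()).values())
```

The triangle stays monochromatic (histogram `[3]`), while the path splits its endpoints from its middle vertex (histogram `[1, 2]`), so 1-WL tells the two graphs apart.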

2:45 - 3:30 PM | Divide-and-Conquer Posterior Sampling with Denoising Diffusion Priors
Recent advancements in Bayesian inverse problem-solving have spotlighted Denoising Diffusion Models (DDMs) as effective priors. Despite their potential, the challenge lies in efficient sampling from the complex posterior distribution these models yield. Classical approaches necessitate either cumbersome retraining of model-specific components to align with the likelihood or introduce approximation errors affecting the accuracy of the posterior. In our study, we introduce an innovative framework that leverages the inherent structure of DDMs to construct a series of simplified diffusion guidance problems. This novel method significantly reduces the approximation error associated with current techniques, without the need for retraining. We demonstrate the versatility and effectiveness of our approach across a broad spectrum of Bayesian inverse problems.
By Éric Moulines (Ecole Polytechnique) Slides | Club de la Chasse |

3:30 - 4:00 PM | Coffee Break | Club de la Chasse |

4:00 - 4:45 PM | Probabilistic Embedding in MPC and Strong OV conjecture
In this talk on Probabilistic Embedding in Massively Parallel Computation (MPC), we explore advanced techniques for embedding high-dimensional data into lower-dimensional spaces efficiently. Acknowledging key contributors and advocating for the public sharing of educational resources, the presentation delves into the utility of tree structures and the importance of maintaining non-contracting embeddings in large-scale data scenarios. Key themes include optimizing space efficiency, understanding algorithmic lower bounds, and the critical role of dynamic programming in solving complex computational problems. The discussion concludes with insights into recent research findings and the ongoing challenges in enhancing embedding methodologies for MPC.
By MohammadTaghi HajiAghayi (UMD) Slides | Club de la Chasse |

4:45 - 5:30 PM | Self-Play Preference Optimization for Language Model Alignment
Traditional reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed Self-play Probabilistic Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise loss such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.
By Quanquan Gu (UCLA) Slides | Club de la Chasse |

5:30 - 7:00 PM | Poster Session + Reception (Note: Change of location) | Rice Global Paris Center |

## Day 2: Tuesday, June 11, 2024

Time | Event | Location |
---|---|---|

8:00 - 9:00 AM | Breakfast | Club de la Chasse |

9:00 - 9:45 AM | A Bi-metric Framework for Fast Similarity Search
We propose a new "bi-metric" framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a *ground-truth* metric that is accurate but expensive to compute, and a *proxy* metric that is cheaper but less accurate. In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the accuracy of the expensive metric, while using only a limited number of calls to both metrics. Our theoretical results instantiate this framework for two popular nearest neighbor search algorithms: DiskANN and Cover Tree. In both cases we show that, as long as the proxy metric used to construct the data structure approximates the ground-truth metric up to a constant factor, our data structure achieves arbitrarily good approximation guarantees with respect to the ground-truth metric. On the empirical side, we apply the framework to the text retrieval problem with two dissimilarity functions evaluated by ML models with vastly different computational costs. We observe that for almost all data sets in the MTEB benchmark, our approach achieves a considerably better accuracy-efficiency tradeoff than the alternatives.
By Piotr Indyk (MIT) Slides | Club de la Chasse |
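The shortlist-with-proxy, re-rank-with-ground-truth pattern at the heart of this framework can be sketched generically. Everything here is an illustrative stand-in: the toy metrics, the brute-force shortlist (in place of the DiskANN/Cover Tree index construction from the talk), and the overfetch factor:

```python
import math

def cheap_metric(a, b):
    # Proxy: squared distance on a 1-D projection (first coordinate only).
    return (a[0] - b[0]) ** 2

def exact_metric(a, b):
    # Ground truth: full Euclidean distance (imagine this is expensive).
    return math.dist(a, b)

def bi_metric_search(query, points, k=1, overfetch=3):
    # Stage 1: build a shortlist using only the cheap proxy metric.
    shortlist = sorted(points, key=lambda p: cheap_metric(query, p))[:k * overfetch]
    # Stage 2: re-rank the shortlist with the expensive ground-truth metric,
    # so the costly function is evaluated on only k * overfetch candidates.
    return sorted(shortlist, key=lambda p: exact_metric(query, p))[:k]
```

The point of the design is the budget: the expensive metric is called only on the small shortlist, yet it makes the final ranking decision.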

9:45 - 10:30 AM | Artur Czumaj (U of Warwick) Slides | Club de la Chasse |

10:30 - 11:15 AM | Coffee Break | Club de la Chasse |

11:15 - 12:00 PM | Elias Bareinboim (Columbia) | Club de la Chasse |

12:00 - 2:00 PM | Lunch (on your own) Restaurant Recommendations | |

2:00 - 2:45 PM | A methodology for reconciling computer science and legal approaches to privacy
Computer systems are increasingly making decisions which involve sensitive private information. Do they withstand legal privacy requirements? We will explore some of the gaps between the legal and computer science approaches to defining privacy protection that make it hard to answer this question. Then we will present a methodology for mitigating these gaps while respecting the values of both disciplines.
The presentation is based in part on recent work with Micah Altman and Aloni Cohen. By Kobbi Nissim (Georgetown) Slides | Club de la Chasse |

2:45 - 3:30 PM | New local differentially private protocols for frequency and mean estimation
Consider the following examples of distributed applications: a texting app wants to train ML models for autocomplete based on text history residing on-device across millions of devices, or the developers of some other app want to understand common app settings by their users. In both cases, and many others, a third party wants to understand something in the aggregate about a large distributed database but under the constraint that each individual record requires some guarantee of privacy. Protocols satisfying so-called local differential privacy have become the gold standard for guaranteeing privacy in such situations, and in this talk I will discuss new such protocols for two of the most common problems that require solutions in this framework: frequency estimation, and mean estimation.
Based on joint works with subsets of Hilal Asi, Vitaly Feldman, Huy Le Nguyen, and Kunal Talwar. By Jelani Nelson (UC Berkeley) | Club de la Chasse |
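As background for the frequency-estimation problem, the classic baseline in the local model is Warner-style randomized response with a debiased aggregate; the talk's new protocols improve on schemes of this flavor. A minimal sketch for one private bit per user (this is the textbook mechanism, not the new protocols from the talk):

```python
import math
import random

def randomized_response(true_bit, epsilon):
    # Report truthfully with probability e^eps / (e^eps + 1), else flip.
    # Each user's single report is eps-locally differentially private.
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return true_bit if random.random() < p_truth else 1 - true_bit

def estimate_frequency(reports, epsilon):
    # The observed fraction of 1s is f*(2p-1) + (1-p) in expectation,
    # so invert that affine map to get an unbiased frequency estimate.
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)
```

The server never sees any user's true bit, yet over many users the debiased average concentrates around the true frequency.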

3:30 - 4:00 PM | Coffee Break | Club de la Chasse |

4:00 - 4:45 PM |
The First Optimal Parallel SGD (in the Presence of Data, Compute and Communication Heterogeneity)
The design of efficient parallel/distributed optimization methods and tight analysis of their theoretical properties are important research endeavors. While minimax complexities are known for sequential optimization methods, the theory of parallel optimization methods is surprisingly much less explored, especially in the presence of data, compute and communication heterogeneity.
In the first part of the talk [1], we establish the first optimal time complexities for parallel optimization methods (Rennala SGD and Malenia SGD) that have access to an unbiased stochastic gradient oracle with bounded variance, under the assumption that the workers compute stochastic gradients with different speeds, i.e., we assume compute heterogeneity. We prove lower bounds and develop optimal algorithms that attain them, both in the data homogeneous and heterogeneous regimes. In the second part of the talk [2], we establish the first optimal time complexities for parallel optimization methods (Shadowheart SGD) that have access to an unbiased stochastic gradient oracle with bounded variance, under the assumption that the workers compute stochastic gradients with different speeds, as before, but under the further assumption that the worker-to-server communication times are nonzero and heterogeneous. We prove lower bounds and develop optimal algorithms that attain them, in the data homogeneous regime only. Our results have surprising consequences for the literature of asynchronous optimization methods: in contrast with prior attempts to tame compute heterogeneity via "complete/wild" compute and update asynchronicity, our methods alternate fast asynchronous computation of a minibatch of stochastic gradients with infrequent synchronous update steps. [1] Alexander Tyurin and Peter Richtárik. Optimal time complexities of parallel stochastic optimization methods under a fixed computation model, Advances in Neural Information Processing Systems 36 (NeurIPS 2023). [2] Alexander Tyurin, Marta Pozzi, Ivan Ilin, and Peter Richtárik. Shadowheart SGD: Distributed asynchronous SGD with optimal time complexity under arbitrary computation and communication heterogeneity, arXiv:2402.04785, 2024. By Peter Richtárik (KAUST) Slides | Club de la Chasse |
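The alternation described in the abstract (asynchronously harvest a minibatch from workers of different speeds, then take one synchronous step) can be illustrated with a toy simulation in the spirit of Rennala SGD. The event-queue timing model, the worker speeds, and the quadratic objective are my own illustrative choices, not the papers' setup:

```python
import heapq

def rennala_style_sgd(grad_fn, x0, speeds, batch_size, steps, lr=0.1):
    # Toy simulation: faster workers contribute more stochastic gradients
    # per round; the server waits until `batch_size` gradients (all taken
    # at the current, frozen iterate) arrive, then takes one sync step.
    x = x0
    for _ in range(steps):
        # Event queue of (finish_time, worker_id); worker i needs 1/speeds[i]
        # time units per gradient.
        events = [(1.0 / s, i) for i, s in enumerate(speeds)]
        heapq.heapify(events)
        batch = []
        while len(batch) < batch_size:
            t, i = heapq.heappop(events)
            batch.append(grad_fn(x))  # gradient at the frozen point x
            heapq.heappush(events, (t + 1.0 / speeds[i], i))
        x = x - lr * sum(batch) / len(batch)  # one synchronous update
    return x
```

On f(x) = x² (gradient 2x) the iterate contracts by a factor 1 - 2·lr per round, regardless of how unevenly the workers contributed to each minibatch.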

4:45 - 5:30 PM | Machine learning for discovering new algorithms and solving mathematical problems
In the modern era of computing, algorithms for solving fundamental problems such as computing the shortest path in graphs and solving linear equations, are used billions of times every day. However, paradoxically, such algorithms which are central to modern computing were often designed well before the advent of modern computation. In fact, discovering new efficient algorithms is notoriously difficult, and often involves solving prohibitively large combinatorial problems. In this talk, I will describe how to use machine learning techniques - in particular reinforcement learning and large language models - to discover new algorithms and solve long-standing mathematical problems. We will particularly focus on designing new algorithms for fundamental computational problems, such as matrix multiplication.
By Alhussein Fawzi (DeepMind) | Club de la Chasse |

5:30 - 7:00 PM | Poster Session + Light Reception (Note: Change of location) | Rice Global Paris Center |

## Day 3: Wednesday, June 12, 2024

Time | Event | Location |
---|---|---|

8:00 - 9:00 AM | Breakfast | Club de la Chasse |

9:00 - 9:45 AM | Hash functions bridging the gap from theory to practice
Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical.
Hash functions are used everywhere in computing, e.g., in hash tables, sketching, dimensionality reduction, sampling, and estimation. Many of these applications are relevant to machine learning, where we are often interested in similarity between high-dimensional objects, and reducing the dimensionality is key to efficient processing.
Abstractly, we like to think of hashing as fully-random hashing, assigning independent hash values to every possible key, but essentially this requires us to store the hash values for all keys, which is unrealistic for most key universes, e.g., 64-bit keys. In practice we have to settle for implementable hash functions, and often practitioners settle for implementations that are too simple, in that the algorithms end up working only for sufficiently random input. However, the real world is full of structured, non-random input. The issue is severe, because simplistic hash functions will often work very well in tests with random input. Moreover, error events that should never happen in practice often happen with far too high probability. This does not show up in a few tests, but it will show up over time once the system is put into production.
Over the last decade there have been major developments in simple-to-implement tabulation-based hash functions offering strong theoretical guarantees, supporting fundamental properties such as Chernoff bounds, Sparse Johnson-Lindenstrauss transforms, and fully-random hashing on a given set w.h.p.
I will discuss some of the principles behind these developments and offer insights on how far we can bridge the gap from theory (assuming fully-random hash functions) to practice (needing something that can actually be implemented efficiently).
By Mikkel Thorup (U of Copenhagen) Slides | Club de la Chasse |
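To make the "tabulation-based hash functions" concrete: the simplest member of the family, simple tabulation, splits a key into characters and XORs one random table lookup per character. It is fast in practice and enjoys surprisingly strong concentration guarantees; the parameter choices below (8 characters of 8 bits, 64-bit outputs) are one common configuration, not the only one:

```python
import random

def make_simple_tabulation(num_chars=8, bits_per_char=8, seed=1):
    # One lookup table of random 64-bit values per 8-bit "character":
    # h(x) = T_0[c_0] xor T_1[c_1] xor ... xor T_7[c_7] for a 64-bit key x.
    rng = random.Random(seed)
    tables = [[rng.getrandbits(64) for _ in range(1 << bits_per_char)]
              for _ in range(num_chars)]
    mask = (1 << bits_per_char) - 1

    def h(key):
        out = 0
        for i, table in enumerate(tables):
            out ^= table[(key >> (i * bits_per_char)) & mask]
        return out

    return h
```

The total table space is 8 tables of 256 entries, small enough to live in cache, which is exactly what makes this family practical compared with storing a fully-random function on a 64-bit universe.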

9:45 - 10:30 AM | An alternative view of denoising diffusion models
Denoising diffusion models have led to impressive generative models in many domains. In this talk, I will present recent progress, with a focus on formulations that do not involve stochastic differential equations.
By Francis Bach (INRIA) Slides | Club de la Chasse |

10:30 - 11:00 AM | Coffee Break | Club de la Chasse |

11:00 - 11:20 AM | Rising Stars talk: Lin Yang | Club de la Chasse |

11:20 - 11:40 AM | Rising Stars talk: Amartya Sanyal
Differentially Private Algorithms with Correlated Data
Differential Privacy (DP) promises to protect individual data privacy by ensuring that the inclusion or exclusion of a single data point does not significantly alter the output of an algorithm. However, DP typically assumes no correlation between individual data records, an assumption violated in several real-world settings. This talk will explore two such scenarios: when data undergoes non-private, data-dependent pre-processing (e.g., imputation, deduplication, PCA) prior to the DP operation, and when data is adversarially selected in an online setting. In the first half of the talk, we will introduce a framework and provide a theoretical upper bound to assess privacy leakage from non-private pre-processing. In the second half, we will provide a lower bound for differentially private online learning, suggesting a learning-theoretic separation between private and non-private online learning. This talk is based on two recent papers: "Provable Privacy with Non-Private Pre-Processing," accepted at ICML 2024, and "On the Growth of Mistakes in Differentially Private Online Learning: A Lower Bound Perspective," accepted at COLT 2024.
| Club de la Chasse |