Schedule

Day 1: Monday, June 10, 2024

Time | Event | Location
8:00 - 9:00 AM | Breakfast | Club de la Chasse
9:00 - 9:15 AM | Opening Remarks | Club de la Chasse
9:15 - 10:00 AM | Adversarial Training Should Be Cast as a Non-Zero-Sum Game
One prominent approach toward resolving the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially-chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not engendered sufficient levels of robustness, and suffer from pathological behavior like robust overfitting. To understand this shortcoming, we first show that the commonly used surrogate-based relaxation used in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. The identification of this pitfall informs a novel non-zero-sum bilevel formulation of adversarial training, wherein each player optimizes a different objective function. Our formulation naturally yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains comparable levels of robustness to standard adversarial training algorithms, and does not suffer from robust overfitting.
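For orientation, here is a minimal PyTorch sketch of the standard two-player zero-sum adversarial training loop that this talk argues should be reformulated (PGD inner maximization against a surrogate cross-entropy loss); the model, data, and hyperparameters are placeholder assumptions, and this is not the speakers' non-zero-sum method.

```python
# Standard zero-sum adversarial training (illustrative placeholders throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, alpha=0.01, steps=10):
    """Inner maximization: find a perturbation inside the L-infinity ball of radius eps."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascent step on the surrogate loss
            delta.clamp_(-eps, eps)              # project back onto the eps-ball
        delta.grad.zero_()
    return delta.detach()

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization: train on the adversarially perturbed batch."""
    delta = pgd_attack(model, x, y)
    optimizer.zero_grad()                        # clear grads accumulated during the attack
    loss = F.cross_entropy(model(x + delta), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a linear model and random data.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(adversarial_training_step(model, optimizer, x, y))
```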

By Volkan Cevher (EPFL)
Slides
Club de la Chasse
10:00 - 10:45 AM | Self-supervised learning for the geosciences
Digitized data has become abundant, especially in the geosciences, but preprocessing it to create the “labeled data” needed for supervised machine learning is often costly, time-consuming, or even impossible. Fortuitously, in very large-scale data domains, “self-supervised” machine learning methods are now actually outperforming supervised learning methods. In this talk, I will first define self-supervised deep learning, including the notion of a “pretext task.” Then I will survey our lab’s work developing self-supervised learning approaches for several tasks in the geosciences, such as downscaling and temporal interpolation of spatiotemporal data.
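As a generic illustration of a pretext task (not the lab's geoscience pipeline), the sketch below trains an encoder to predict which rotation was applied to unlabeled images, so the labels come from the data itself; the encoder architecture and data are placeholder assumptions.

```python
# Rotation-prediction pretext task on unlabeled images (placeholder encoder and data).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(1 * 32 * 32, 64), nn.ReLU())
rotation_head = nn.Linear(64, 4)                 # predict one of 4 rotations: 0/90/180/270 degrees
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(rotation_head.parameters()))

def pretext_batch(images):
    """Create self-supervised (input, label) pairs by rotating each image."""
    labels = torch.randint(0, 4, (images.shape[0],))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(-2, -1))
                           for img, k in zip(images, labels)])
    return rotated, labels

images = torch.rand(16, 1, 32, 32)               # "unlabeled" data
x, y = pretext_batch(images)
loss = F.cross_entropy(rotation_head(encoder(x)), y)
loss.backward()
optimizer.step()
# The pretrained encoder can then be fine-tuned on downstream tasks with few labels.
```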

By Claire Monteleoni (INRIA/Boulder)
Slides
Club de la Chasse
10:45 - 11:15 AM | Coffee Break | Club de la Chasse
11:15 AM - 12:00 PM | Accelerating Gradient Descent by Stepsize Hedging
Is it possible to accelerate the convergence of gradient descent without changing the algorithm — simply by judiciously choosing stepsizes? Surprisingly, we show that the answer is yes.
Our proposed Silver Stepsize Schedule optimizes strongly convex functions in $\kappa^{\log_\rho 2} = \kappa^{0.7864}$ iterations, where $\rho=1+\sqrt{2}$ is the silver ratio and $\kappa$ is the condition number. This lies in between the textbook unaccelerated rate $\kappa$ and Nesterov's accelerated rate $\sqrt{\kappa}$ from 1983. The non-strongly convex setting is conceptually identical (and technically simpler), with an accelerated rate $\epsilon^{-\log_\rho 2} = \epsilon^{-0.7864}$. We conjecture, and provide partial evidence, that these rates are optimal among all possible stepsize schedules.
The Silver Stepsize Schedule is non-monotonic, fractal-like, and (in the strongly convex case) approximately periodic with a period of $\kappa^{\log_\rho 2}$. Why do such stepsizes help? The core intuition is “hedging” between individually suboptimal strategies — short steps and long steps — since bad cases for one are good cases for the other, and vice versa. Properly combining these stepsizes yields faster convergence due to the misalignment of worst-case functions.
Based on joint work with Jason Altschuler (UPenn).
https://arxiv.org/abs/2309.07879
https://arxiv.org/abs/2309.16530
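For a quick sense of the quoted rates, the snippet below (an editorial illustration; the condition number is chosen arbitrarily) evaluates the exponent $\log_\rho 2$ and compares the three iteration counts.

```python
# Numeric check of the rates quoted above (condition number chosen arbitrarily).
import math

rho = 1 + math.sqrt(2)                    # the silver ratio
exponent = math.log(2, rho)               # log_rho(2) ~= 0.7864
kappa = 10_000

print(f"log_rho(2)      = {exponent:.4f}")
print(f"unaccelerated   ~ kappa           = {kappa:.0f} iterations")
print(f"silver schedule ~ kappa^{exponent:.4f} = {kappa ** exponent:.0f} iterations")
print(f"Nesterov (1983) ~ sqrt(kappa)     = {math.sqrt(kappa):.0f} iterations")
```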

By Pablo Parrilo (MIT)
Slides
Club de la Chasse
12:00 - 2:00 PM | Lunch (on your own)
Restaurant Recommendations
2:00 - 2:45 PM | The Combinatorial Side of Graph Neural Networks
Graph Neural Networks (GNNs) are a machine learning architecture to learn functions on graphs. For example, since problem instances for combinatorial optimisation tasks are often modelled as graphs, GNNs have recently received attention as a natural framework for finding good heuristics in neural optimisation approaches. The question of which functions can actually be learnt by message-passing GNNs, and which ones exceed their power, has been studied extensively. In this talk, I will consider it from a graph-theoretical perspective. I will survey the Weisfeiler-Leman algorithm as a combinatorial procedure to analyse and compare graph structure, and I will discuss some results concerning the power of the algorithm on natural graph classes. The findings directly translate into insights about the power of GNNs and of their extensions to higher-dimensional neural networks.
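To make the combinatorial procedure concrete, here is a small sketch of 1-dimensional Weisfeiler-Leman colour refinement; the graph encoding and the example graphs are illustrative choices, not taken from the talk.

```python
# 1-dimensional Weisfeiler-Leman (colour refinement) on adjacency-list graphs.
def wl_colour_refinement(adj):
    """Iteratively refine vertex colours until the colouring stabilises."""
    colours = {v: 0 for v in adj}                          # start with a uniform colouring
    for _ in range(len(adj)):                              # stabilises within |V| rounds
        # New colour = old colour together with the multiset of neighbour colours.
        signatures = {v: (colours[v], tuple(sorted(colours[u] for u in adj[v])))
                      for v in adj}
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colours = {v: palette[signatures[v]] for v in adj}
        if new_colours == colours:
            break
        colours = new_colours
    return colours

def wl_distinguishes(adj1, adj2):
    """1-WL deems the graphs non-isomorphic iff the refined colour histograms differ."""
    union = {("a", v): [("a", u) for u in nbrs] for v, nbrs in adj1.items()}
    union.update({("b", v): [("b", u) for u in nbrs] for v, nbrs in adj2.items()})
    colours = wl_colour_refinement(union)
    histogram = lambda tag: sorted(c for (t, _), c in colours.items() if t == tag)
    return histogram("a") != histogram("b")

# Classic failure case: a 6-cycle vs. two disjoint triangles (both 2-regular).
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_distinguishes(cycle6, two_triangles))             # False: 1-WL cannot tell them apart
```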

By Sandra Kiefer (Oxford)
Club de la Chasse
2:45 - 3:30 PM | Divide-and-Conquer Posterior Sampling with Denoising Diffusion Priors
Recent advancements in Bayesian inverse problem-solving have spotlighted Denoising Diffusion Models (DDMs) as effective priors. Despite their potential, the challenge lies in efficient sampling from the complex posterior distribution these models yield. Classical approaches necessitate either cumbersome retraining of model-specific components to align with the likelihood or introduce approximation errors affecting the accuracy of the posterior. In our study, we introduce an innovative framework that leverages the inherent structure of DDMs to construct a series of simplified diffusion guidance problems. This novel method significantly reduces the approximation error associated with current techniques, without the need for retraining. We demonstrate the versatility and effectiveness of our approach across a broad spectrum of Bayesian inverse problems.
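As generic background on posterior sampling with a prior score plus a likelihood score (a textbook-style illustration, not the framework proposed in the talk), the sketch below runs Langevin dynamics on a one-dimensional Gaussian example where the exact posterior is known; all numbers are arbitrary.

```python
# Posterior sampling by combining a prior score with a likelihood score (toy Gaussian case).
import numpy as np

rng = np.random.default_rng(0)
sigma_prior, sigma_obs, y = 1.0, 0.5, 1.2             # prior x ~ N(0, 1), observation y = x + N(0, 0.25)

prior_score = lambda x: -x / sigma_prior**2           # d/dx log p(x)
likelihood_score = lambda x: (y - x) / sigma_obs**2   # d/dx log p(y | x)

eta, x = 5e-3, np.zeros(5_000)                        # step size and 5,000 parallel chains
for _ in range(20_000):
    score = prior_score(x) + likelihood_score(x)      # score of the posterior
    x = x + 0.5 * eta * score + np.sqrt(eta) * rng.standard_normal(x.shape)

post_var = 1.0 / (1.0 / sigma_prior**2 + 1.0 / sigma_obs**2)
post_mean = post_var * y / sigma_obs**2
print(f"sampled mean {x.mean():.3f}  vs exact {post_mean:.3f}")   # ~0.96
print(f"sampled var  {x.var():.3f}  vs exact {post_var:.3f}")     # ~0.20
```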

By Éric Moulines (Ecole Polytechnique)
Slides
Club de la Chasse
3:30 - 4:00 PM | Coffee Break | Club de la Chasse
4:00 - 4:45 PM | Probabilistic Embedding in MPC and Strong OV conjecture
In this talk on Probabilistic Embedding in Massively Parallel Computation (MPC), we explore advanced techniques for embedding high-dimensional data into lower-dimensional spaces efficiently. Acknowledging key contributors and advocating for the public sharing of educational resources, the presentation delves into the utility of tree structures and the importance of maintaining non-contracting embeddings in large-scale data scenarios. Key themes include optimizing space efficiency, understanding algorithmic lower bounds, and the critical role of dynamic programming in solving complex computational problems. The discussion concludes with insights into recent research findings and the ongoing challenges in enhancing embedding methodologies for MPC.

By MohammadTaghi HajiAghayi (UMD)
Slides
Club de la Chasse
4:45 - 5:30 PM | Self-Play Preference Optimization for Language Model Alignment
Traditional reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed Self-play Probabilistic Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise loss such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.
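As a toy picture of the game-theoretic target (not the paper's training procedure or loss), the sketch below runs an exponential-weights self-play update on a three-response preference matrix with intransitive preferences; the matrix is invented for illustration, and the time-averaged policy approaches the Nash equilibrium of the constant-sum game.

```python
# Self-play with exponential weights on a tiny constant-sum preference game.
import numpy as np

# P[i, j] = probability that response i is preferred over response j (P + P.T = 1).
# Preferences are intransitive, so no single response is best; the Nash equilibrium
# of this particular matrix is the mixture (1/2, 1/8, 3/8).
P = np.array([[0.5, 0.8, 0.4],
              [0.2, 0.5, 0.9],
              [0.6, 0.1, 0.5]])

pi = np.ones(3) / 3
avg, eta, T = np.zeros(3), 0.1, 20_000
for _ in range(T):
    win_prob = P @ pi                  # P(candidate y beats a response drawn from pi_t)
    pi = pi * np.exp(eta * win_prob)   # exponential-weights / self-play update
    pi = pi / pi.sum()
    avg += pi / T

print("averaged policy:", np.round(avg, 3))                    # close to (0.5, 0.125, 0.375)
print("worst-case win rate:", np.round((P.T @ avg).min(), 3))  # approaches the game value 0.5
```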

By Quanquan Gu (UCLA)
Slides
Club de la Chasse
5:30 - 7:00 PM | Poster Session + Reception (note: change of location) | Rice Global Paris Center

Day 2: Tuesday, June 11, 2024

Time | Event | Location
8:00 - 9:00 AM | Breakfast | Club de la Chasse
9:00 - 9:45 AM | A Bi-metric Framework for Fast Similarity Search
We propose a new "bi-metric" framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a "ground-truth" metric that is accurate but expensive to compute, and a "proxy" metric that is cheaper but less accurate. In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the accuracy of the expensive metric, while using only a limited number of calls to both metrics. Our theoretical results instantiate this framework for two popular nearest neighbor search algorithms: DiskANN and Cover Tree. In both cases we show that, as long as the proxy metric used to construct the data structure approximates the ground-truth metric up to a constant factor, our data structure achieves arbitrarily good approximation guarantees with respect to the ground-truth metric. On the empirical side, we apply the framework to the text retrieval problem with two dissimilarity functions evaluated by ML models with vastly different computational costs. We observe that for almost all data sets in the MTEB benchmark, our approach achieves a considerably better accuracy-efficiency tradeoff than the alternatives.
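The sketch below conveys the bi-metric pattern in its simplest brute-force form (not the paper's DiskANN or Cover Tree constructions): candidates are retrieved with a cheap proxy distance and only a short list is re-ranked with the expensive ground-truth distance; both distance functions here are invented placeholders.

```python
# Two-stage search: cheap proxy metric for candidates, expensive metric for re-ranking.
import numpy as np

def cheap_proxy_distance(q, x):
    return np.linalg.norm(q[:8] - x[:8])         # e.g., a small, fast embedding

def expensive_distance(q, x):
    return np.linalg.norm(q - x)                 # stand-in for a costly ground-truth metric

def bi_metric_search(query, data, k=3, shortlist_size=20):
    proxy = [cheap_proxy_distance(query, x) for x in data]              # cheap calls on everything
    shortlist = np.argsort(proxy)[:shortlist_size]
    exact = {i: expensive_distance(query, data[i]) for i in shortlist}  # few expensive calls
    return sorted(exact, key=exact.get)[:k]

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 64))
query = rng.standard_normal(64)
print(bi_metric_search(query, data))
```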

By Piotr Indyk (MIT)
Slides
Club de la Chasse
9:45 - 10:30 AM | Artur Czumaj (U of Warwick)
Slides
Club de la Chasse
10:30 - 11:15 AM | Coffee Break | Club de la Chasse
11:15 AM - 12:00 PM | Elias Bareinboim (Columbia) | Club de la Chasse
12:00 - 2:00 PM | Lunch (on your own)
Restaurant Recommendations
2:00 - 2:45 PM | A methodology for reconciling computer science and legal approaches to privacy
Computer systems are increasingly making decisions which involve sensitive private information. Do they withstand legal privacy requirements? We will explore some of the gaps between the legal and computer science approaches to defining privacy protection that make it hard to answer this question. Then we will present a methodology for mitigating these gaps while respecting the values of both disciplines.
The presentation is based in part on recent work with Micah Altman and Aloni Cohen.

By Kobbi Nissim (Georgetown)
Slides
Club de la Chasse
2:45 - 3:30 PM | New local differentially private protocols for frequency and mean estimation
Consider the following examples of distributed applications: a texting app wants to train ML models for autocomplete based on text history residing on-device across millions of devices, or the developers of some other app want to understand common app settings by their users. In both cases, and many others, a third party wants to understand something in the aggregate about a large distributed database but under the constraint that each individual record requires some guarantee of privacy. Protocols satisfying so-called local differential privacy have become the gold standard for guaranteeing privacy in such situations, and in this talk I will discuss new such protocols for two of the most common problems that require solutions in this framework: frequency estimation, and mean estimation.
Based on joint works with subsets of Hilal Asi, Vitaly Feldman, Huy Le Nguyen, and Kunal Talwar.
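For context, here is a sketch of the classical k-ary randomized response protocol for locally private frequency estimation, a standard baseline rather than one of the new protocols from the talk; the domain size, privacy parameter, and data distribution are arbitrary choices.

```python
# k-ary randomized response: each user perturbs locally; the server debiases the counts.
import numpy as np

def randomize(value, k, eps, rng):
    """Report the true value w.p. e^eps / (e^eps + k - 1), otherwise a uniform other value."""
    if rng.random() < np.exp(eps) / (np.exp(eps) + k - 1):
        return value
    return rng.choice([v for v in range(k) if v != value])

def estimate_frequencies(reports, k, eps):
    """Debias the observed histogram to get unbiased frequency estimates."""
    p = np.exp(eps) / (np.exp(eps) + k - 1)      # P(report = true value)
    q = 1.0 / (np.exp(eps) + k - 1)              # P(report = a specific other value)
    observed = np.bincount(reports, minlength=k) / len(reports)
    return (observed - q) / (p - q)

rng = np.random.default_rng(0)
k, eps, n = 5, 1.0, 50_000
true_values = rng.choice(k, size=n, p=[0.5, 0.2, 0.15, 0.1, 0.05])
reports = np.array([randomize(v, k, eps, rng) for v in true_values])
print(np.round(estimate_frequencies(reports, k, eps), 3))    # near the true frequencies
```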

By Jelani Nelson (UC Berkeley)
Club de la Chasse
3:30 - 4:00 PM | Coffee Break | Club de la Chasse
4:00 - 4:45 PM | The First Optimal Parallel SGD (in the Presence of Data, Compute and Communication Heterogeneity)
The design of efficient parallel/distributed optimization methods and tight analysis of their theoretical properties are important research endeavors. While minimax complexities are known for sequential optimization methods, the theory of parallel optimization methods is surprisingly much less explored, especially in the presence of data, compute and communication heterogeneity.
In the first part of the talk [1], we establish the first optimal time complexities for parallel optimization methods (Rennala SGD and Malenia SGD) that have access to an unbiased stochastic gradient oracle with bounded variance, under the assumption that the workers compute stochastic gradients with different speeds, i.e., we assume compute heterogeneity. We prove lower bounds and develop optimal algorithms that attain them, both in the data homogeneous and heterogeneous regimes.
In the second part of the talk [2], we establish the first optimal time complexities for parallel optimization methods (Shadowheart SGD) that have access to an unbiased stochastic gradient oracle with bounded variance, under the assumption that the workers compute stochastic gradients with different speeds, as before, but under the further assumption that the worker-to-server communication times are nonzero and heterogeneous. We prove lower bounds and develop optimal algorithms that attain them, in the data homogeneous regime only.
Our results have surprising consequences for the literature of asynchronous optimization methods: in contrast with prior attempts to tame compute heterogeneity via "complete/wild" compute and update asynchronicity, our methods alternate fast asynchronous computation of a minibatch of stochastic gradients with infrequent synchronous update steps.
[1] Alexander Tyurin and Peter Richtárik. Optimal time complexities of parallel stochastic optimization methods under a fixed computation model, Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
[2] Alexander Tyurin, Marta Pozzi, Ivan Ilin, and Peter Richtárik. Shadowheart SGD: Distributed asynchronous SGD with optimal time complexity under arbitrary computation and communication heterogeneity, arXiv:2402.04785, 2024.
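The toy simulation below uses a simplified timing model of my own (it is not pseudocode from [1] or [2]) to illustrate the pattern described above: the server waits only for the fastest workers to deliver a fixed-size minibatch of stochastic gradients and then performs one synchronous update; communication time is ignored and all gradients are evaluated at the current iterate.

```python
# Simulated parallel SGD with heterogeneous worker speeds (simplified timing model).
import heapq
import numpy as np

rng = np.random.default_rng(0)
num_workers, batch_size, lr, steps = 8, 4, 0.1, 200
compute_times = rng.uniform(0.1, 5.0, num_workers)     # heterogeneous per-gradient compute times

def stoch_grad(x):
    return 2 * x + rng.normal(0, 1.0, size=x.shape)    # noisy gradient of f(x) = ||x||^2

x = np.ones(10)
clock = 0.0
events = [(compute_times[w], w) for w in range(num_workers)]   # (finish time, worker id)
heapq.heapify(events)

for _ in range(steps):
    grads = []
    while len(grads) < batch_size:                     # wait only for the fastest responders
        finish_time, w = heapq.heappop(events)
        clock = finish_time
        grads.append(stoch_grad(x))
        heapq.heappush(events, (clock + compute_times[w], w))  # worker starts a new gradient
    x = x - lr * np.mean(grads, axis=0)                # synchronous update on the collected batch

print(f"finished {steps} steps at simulated time {clock:.1f}, ||x|| = {np.linalg.norm(x):.3f}")
```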

By Peter Richtarik (KAUST)
Slides
Club de la Chasse
4:45 - 5:30 PM | Machine learning for discovering new algorithms and solving mathematical problems
In the modern era of computing, algorithms for solving fundamental problems such as computing the shortest path in graphs and solving linear equations, are used billions of times every day. However, paradoxically, such algorithms which are central to modern computing were often designed well before the advent of modern computation. In fact, discovering new efficient algorithms is notoriously difficult, and often involves solving prohibitively large combinatorial problems. In this talk, I will describe how to use machine learning techniques - in particular reinforcement learning and large language models - to discover new algorithms and solve long-standing mathematical problems. We will particularly focus on designing new algorithms for fundamental computational problems, such as matrix multiplication.
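As a concrete example of the kind of object being searched for (a classical result rather than one of the newly discovered algorithms), the snippet below checks Strassen's scheme, which multiplies 2x2 blocks with 7 scalar multiplications instead of 8.

```python
# Strassen's 7-multiplication scheme for 2x2 matrix multiplication.
import numpy as np

def strassen_2x2(A, B):
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

A, B = np.random.rand(2, 2), np.random.rand(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)   # 7 multiplications suffice
print(strassen_2x2(A, B))
```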

By Alhussein Fawzi (DeepMind)
Club de la Chasse
5:30 - 7:00 PM | Poster Session + Light Reception (note: change of location) | Rice Global Paris Center

Day 3: Wednesday, June 12, 2024

Time | Event | Location
8:00 - 9:00 AM | Breakfast | Club de la Chasse
9:00 - 9:45 AM | Hash functions bridging the gap from theory to practice
Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Hash functions are used everywhere in computing, e.g., hash tables, sketching, dimensionality reduction, sampling, and estimation. Many of these applications are relevant to Machine Learning, where we are often interested in similarity between high dimensional objects. Reducing the dimensionality is key to efficient processing. Abstractly, we like to think of hashing as fully-random hashing, assigning independent hash values to every possible key, but essentially this requires us to store the hash values for all keys, which is unrealistic for most key universes, e.g., 64-bit keys. In practice we have to settle for implementable hash functions, and often practitioners settle for implementations that are too simple, in that the algorithms end up working only for sufficiently random input. However, the real world is full of structured/non-random input. The issue is severe because simplistic hash functions will often work very well in tests with random input. Moreover, the issue is often that error events that should never happen in practice happen with way too high probability. This does not show up in a few tests, but it will show up over time when you put the system into production.
Over the last decade there have been major developments in simple-to-implement, tabulation-based hash functions offering strong theoretical guarantees, supporting fundamental properties such as Chernoff bounds, sparse Johnson-Lindenstrauss transforms, and fully-random hashing on a given set with high probability. I will discuss some of the principles of these developments and offer insights on how far we can bridge from theory (assuming fully-random hash functions) to practice (needing something that can actually be implemented efficiently).
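For illustration, here is a minimal sketch of simple tabulation hashing, the style of implementable scheme the talk refers to: the key is split into characters and one random table entry per character is XOR-ed together; the parameter choices are assumptions for the example.

```python
# Simple tabulation hashing: XOR of one random table lookup per byte of the key.
import random

random.seed(0)
NUM_CHARS, CHAR_BITS, HASH_BITS = 8, 8, 64      # view a 64-bit key as 8 bytes

# One table of 2^8 random 64-bit values per character position.
TABLES = [[random.getrandbits(HASH_BITS) for _ in range(1 << CHAR_BITS)]
          for _ in range(NUM_CHARS)]

def tabulation_hash(key):
    """Hash a 64-bit integer key by XOR-ing one table entry per byte of the key."""
    h = 0
    for i in range(NUM_CHARS):
        char = (key >> (i * CHAR_BITS)) & ((1 << CHAR_BITS) - 1)
        h ^= TABLES[i][char]
    return h

print(hex(tabulation_hash(0xDEADBEEFCAFEBABE)))
```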

By Mikkel Thorup (U of Copenhagen)
Slides
Club de la Chasse
9:45 - 10:30 AM | An alternative view of denoising diffusion models
Denoising diffusion models have led to impressive generative models in many domains. In this talk, I will present recent progress, with a focus on formulations that do not involve stochastic differential equations.

By Francis Bach (INRIA)
Slides
Club de la Chasse
10:30 - 11:00 AM | Coffee Break | Club de la Chasse
11:00 - 11:20 AM | Rising Stars talk: Lin Yang | Club de la Chasse
11:20 - 11:40 AM | Rising Stars talk: Amartya Sanyal
Differentially Private Algorithms with Correlated data
Differential Privacy (DP) promises to protect individual data privacy by ensuring that the inclusion or exclusion of a single data point does not significantly alter the output of an algorithm. However, DP typically assumes no correlation between individual data records, an assumption violated in several real-world settings. This talk will explore two such scenarios: when data undergoes non-private, data-dependent pre-processing (e.g., imputation, deduplication, PCA) prior to the DP operation and when data is adversarially selected in an online setting. In the first half of the talk, we will introduce a framework and provide a theoretical upper bound to assess privacy leakage from non-private pre-processing. In the second half, we will provide a lower bound for differentially private online learning, suggesting a learning theoretic separation between private and non-private online learning. This talk is based on two recent papers: "Provable Privacy with Non-Private Pre-Processing" accepted at ICML 2024, and "On the Growth of Mistakes in Differentially Private Online Learning: A Lower Bound Perspective," accepted at COLT 2024.
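As a reminder of the baseline guarantee under discussion (an editorial illustration, not part of the talk's results), the snippet below applies the Laplace mechanism to a counting query on two neighbouring datasets; with noise scale 1/epsilon matching the query's sensitivity of 1, the two output distributions differ pointwise by a factor of at most e^epsilon.

```python
# Laplace mechanism on a counting query, evaluated on two neighbouring datasets.
import numpy as np

def private_count(data, eps, rng):
    """A count has sensitivity 1, so Laplace noise of scale 1/eps gives eps-DP."""
    return len(data) + rng.laplace(scale=1.0 / eps)

rng = np.random.default_rng(0)
eps = 0.5
data = list(range(1000))
neighbour = data[:-1]                            # the same dataset with one record removed
print(private_count(data, eps, rng), private_count(neighbour, eps, rng))
# The two noisy outputs are statistically close; the talk examines what happens to this
# kind of guarantee when records are correlated or pre-processed together.
```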
Club de la Chasse