Alireza Fallah, Ph.D. Candidate, EECS. Based on joint work with Aryan Mokhtari (UT Austin) and Asu Ozdaglar (MIT).

**Introduction**

Imagine sitting in your autonomous car, heading off on vacation. Your vehicle should follow the directions provided by the navigation app while also using multiple sensors to monitor other vehicles, road signs, street lights, etc. As a result, during the course of your journey, your car might need to take actions, such as turning or stopping, within a few seconds. The question is how your vehicle should be programmed so that it can adapt to new tasks within a short amount of time and with limited data.

In a large fraction of artificial intelligence problems, ranging from robotics to image classification and pattern recognition, the goal is to design systems that use prior experience and knowledge to learn new skills more efficiently. **Meta-learning**, or **learning to learn**, formalizes this goal by using data from previous tasks to learn update rules or model parameters that can be fine-tuned to perform well on new tasks with a small amount of data.

A particularly simple and effective approach to this problem, proposed by Finn et al., is **model-agnostic meta-learning (MAML)**. This approach finds a meta-initialization that can be updated with a gradient-based method after a new task arrives. Several recent papers have studied the empirical performance of various methods in this setting. In our work, Fallah et al., we establish the first convergence guarantees of MAML-type methods for the general non-convex setting.

**Problem Formulation**

Let 𝒯 denote the set of all tasks and let *p* be the probability distribution over the tasks *T_i* ∈ 𝒯. Each task *T_i* has a corresponding loss function *f_i(w)* which measures how well the model *w* performs on that task. We also define *f* as the expectation of *f_i* over all tasks, i.e.,

f(w) := E_{T_i ~ p} [ f_i(w) ].

Figure 1a: Traditional statistical learning problem. Figure 1b: MAML problem.

In traditional statistical learning, we aim to minimize the loss function *f*, as we expect its solution to be a proper approximation for the optimal solution of a new unseen task *T_i* (see Figure 1a).

However, in the model-agnostic meta-learning problem (Figure 1b), we aim to find the best point *w\** that performs well as an initial point for learning a new task *T_i* when *we have budget for running a few steps of gradient descent*. For simplicity, we focus on finding an initialization *w\** such that, after observing a new task *T_i*, one gradient step would lead to a good approximation for the minimizer of *f_i*. We can formulate this goal as the following optimization problem:

min_w F(w) := E_{T_i ~ p} [ f_i( w − α ∇f_i(w) ) ],

where α > 0 is the stepsize of the gradient step taken for the new task.

To solve this problem, MAML, at iteration *k*, first selects a batch of tasks *B_k*, and then proceeds in two stages: the *inner loop* and the *outer loop*. In the inner loop, for each chosen task *T_i* in *B_k*, MAML computes a mid-point using a step of stochastic gradient descent on *f_i*. Then, in the outer loop, MAML runs one step of stochastic gradient descent with an estimate of ∇F computed over *B_k*, using the mid-points for each task *T_i* evaluated in the inner loop. The final update rule of MAML can be written as:

w_{k+1} = w_k − β (1/|B_k|) Σ_{T_i ∈ B_k} ( I − α ∇²f_i(w_k) ) ∇f_i( w_k − α ∇f_i(w_k) ),

where β is the stepsize of the outer loop; in practice, the gradients and Hessians above are replaced by their stochastic estimates computed over batches of data.
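This update can be sketched concretely for hypothetical quadratic tasks f_i(w) = ½(w − c_i)ᵀA_i(w − c_i), whose Hessian A_i is known in closed form (an illustrative setup, not the paper's implementation):

```python
import numpy as np

def maml_step(w, batch, alpha, beta):
    """One exact MAML outer update over a batch of quadratic tasks
    f_i(w) = 0.5 * (w - c_i)^T A_i (w - c_i), whose Hessian is A_i."""
    grad_F = np.zeros_like(w)
    for A, c in batch:
        inner_grad = A @ (w - c)         # grad f_i(w)
        w_mid = w - alpha * inner_grad   # inner-loop mid-point
        outer_grad = A @ (w_mid - c)     # grad f_i at the mid-point
        # chain-rule correction (I - alpha * Hessian) applied to outer gradient
        grad_F += (np.eye(len(w)) - alpha * A) @ outer_grad
    return w - beta * grad_F / len(batch)  # outer gradient step

batch = [(np.eye(2), np.array([1.0, 0.0])),
         (2 * np.eye(2), np.array([0.0, 1.0]))]
w_next = maml_step(np.zeros(2), batch, alpha=0.1, beta=0.5)
```

The `(I - alpha * A)` factor is the second-order term that makes MAML expensive in high dimensions, which is the issue the next section addresses.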

Table 1: Our theoretical results for convergence of MAML, first-order approximation of MAML (FO-MAML), and our proposed Hessian-free MAML (HF-MAML) to a first-order stationary point (FOSP) in nonconvex settings. Here, *d* is the problem dimension, σ is a bound on the standard deviation of ∇f_i(w) from its mean ∇f(w), and σ̃ is a bound on the standard deviation of estimating ∇f_i(w) using a single data point.

**Results and Conclusion**

Implementing the MAML update requires computing the Hessian ∇²f_i at every iteration, which can be prohibitively costly in high-dimensional problems (large *d*). To resolve this issue, the authors in Finn et al. propose the first-order approximation of MAML (*FO-MAML*), which ignores the second-order term in the update rule of MAML. In our work, we show that if the learning rate used for updating each individual task is small, or the tasks are statistically close to each other, then the error induced by this first-order approximation is negligible. Nevertheless, in general, and in contrast to MAML, which can find an ε-first-order stationary point for any arbitrary ε > 0, the performance of FO-MAML is limited: it cannot achieve an arbitrarily small level of accuracy.
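A sketch of the FO-MAML update for hypothetical quadratic tasks f_i(w) = ½(w − c_i)ᵀA_i(w − c_i): it is identical to the MAML step except that the (I − α∇²f_i) factor is replaced by the identity.

```python
import numpy as np

def fo_maml_step(w, batch, alpha, beta):
    """FO-MAML update: same structure as MAML, but the (I - alpha * Hessian)
    factor is dropped, so no second-order information is used."""
    grad_est = np.zeros_like(w)
    for A, c in batch:
        w_mid = w - alpha * (A @ (w - c))  # inner-loop mid-point
        grad_est += A @ (w_mid - c)        # gradient at the mid-point only
    return w - beta * grad_est / len(batch)

batch = [(np.eye(2), np.array([1.0, 0.0])),
         (2 * np.eye(2), np.array([0.0, 1.0]))]
w_next = fo_maml_step(np.zeros(2), batch, alpha=0.1, beta=0.5)
```

Dropping the Hessian factor saves computation but biases the estimate of ∇F, which is the source of the persistent error term discussed above.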

In our work, we also propose a new method, *Hessian-free MAML (HF-MAML)*, which recovers the complexity bounds of MAML while requiring no computation of the Hessian or of a Hessian-vector product. In fact, we show that HF-MAML enjoys a better convergence rate than FO-MAML: for any positive ε, it can find an ε-first-order stationary point, while keeping the computational cost of each iteration at the same order as FO-MAML. Hence, HF-MAML has the best of both worlds: it has a low computational complexity, as in FO-MAML, and it can achieve any arbitrary accuracy for first-order stationarity, as in MAML. These results are summarized in Table 1, and the details can be found in Fallah et al.
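One standard way to avoid forming the Hessian, which conveys the spirit of this approach (the exact estimator used by HF-MAML is specified in Fallah et al.), is to approximate a Hessian-vector product with two extra gradient evaluations:

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, delta=1e-5):
    """Approximate (Hessian of f at w) @ v using only two gradient calls:
    (grad f(w + delta*v) - grad f(w - delta*v)) / (2*delta)."""
    return (grad_fn(w + delta * v) - grad_fn(w - delta * v)) / (2 * delta)

# sanity check on f(w) = 0.5 * w^T A w, whose Hessian is exactly A
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad_fn = lambda w: A @ w       # gradient of the quadratic
v = np.array([1.0, -1.0])
hv = hessian_vector_product(grad_fn, np.zeros(2), v)
```

Each such approximation costs only two gradient evaluations, i.e., O(d) work, instead of forming or multiplying by a d × d Hessian.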

In our numerical experiments, the loss function of task *i* is a quadratic with Hessian *M_i*. Here, *M_i* is rank one, generated as the outer product *M_i = g_i g_iᵀ*, where *g_i* is a random Gaussian vector. The variance of *g_i* controls task similarity. To capture noisy gradients and Hessians, we add random Gaussian noise with variance σ² to the tasks' gradients and Hessians.
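A minimal sketch of this task-generation step, with assumed dimensions and variance since the exact experimental values are not given here:

```python
import numpy as np

def make_task_hessians(num_tasks, dim, task_std, rng):
    """Generate rank-one Hessians M_i = g_i g_i^T from Gaussian vectors g_i;
    task_std (the std of g_i) controls how similar the tasks are."""
    hessians = []
    for _ in range(num_tasks):
        g = rng.normal(scale=task_std, size=dim)  # random Gaussian vector g_i
        hessians.append(np.outer(g, g))           # rank-one Hessian M_i
    return hessians

rng = np.random.default_rng(0)
hessians = make_task_hessians(num_tasks=20, dim=5, task_std=1.0, rng=rng)
```

A larger `task_std` spreads the vectors *g_i* further apart, producing less similar tasks, which is exactly the knob varied across Figures 2a–2c.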

- In Figure 2a, we assume gradients and Hessians are exact and focus on task variation. We set the batch size for tasks equal to the number of tasks, which is 20. In this case, even though the gradients are exact, FO-MAML converges to an error level with a gap compared to the other two methods. This is consistent with our results.
- We next consider noisy gradients and Hessians. We set the number of tasks equal to 50 and take a batch of 10 tasks at each iteration. In Figure 2b, we choose the variance of *g_i* small to ensure the tasks are relatively similar. Here, the additional persistent error of FO-MAML is negligible compared to the other term, and all three methods behave similarly.
- In Figure 2c, we increase the variance of *g_i* (the tasks are less similar). In this case, the task-dissimilarity term dominates the other term in the error of FO-MAML, and it performs worse than the other two methods.

*Alireza Fallah is a third-year Ph.D. student in the Electrical Engineering and Computer Science department, advised by Professor Asuman Ozdaglar. Before that, he earned a dual B.Sc. degree in Electrical Engineering and Pure Mathematics from Sharif University of Technology, Tehran, Iran. His research interests are mainly in optimization, the theory of machine learning, statistical inference, and game theory. His current research focuses on the theoretical understanding of optimization methods as well as developing fast and robust algorithms for machine learning applications. He has received several awards and fellowships, including the MathWorks Engineering Fellowship, the Siebel Scholarship, first prize in the International Mathematics Competition (IMC) in 2016, and silver medals in the International Mathematical Olympiad (IMO) in 2011 and 2012. He can be contacted here.*