
Multi-Task Learning as a Bargaining Game

In this article, we explore the gradient combination step in multi-task learning (MTL) as a cooperative bargaining game and discuss Nash-MTL, a novel approach, in detail.
Multi-task Learning, or MTL, is a paradigm in which a joint model is trained to simultaneously make predictions for several tasks. Joint training reduces computation costs and improves data efficiency. However, since the gradients of the different tasks may conflict, training a joint model for MTL often yields lower performance than the corresponding single-task models.
One of the main reasons for such degradation is gradient conflict: the per-task gradients may point in conflicting directions or differ greatly in magnitude, with the largest gradient dominating the update direction. This gap between MTL's potential to improve performance through better generalization and its degraded performance under poor training has a major impact on many real-world systems.
Improving MTL optimization algorithms is, therefore, an important task with significant implications for many systems.

A General MTL Optimization Scheme

Currently, most MTL optimization algorithms follow a general scheme:
  1. Compute the per-task gradients $g_{1}, \ldots, g_{K}$.
  2. Combine those gradients into a joint direction $\Delta=\mathcal{A}\left(g_{1}, \ldots, g_{K}\right)$ using an aggregation algorithm $\mathcal{A}$.
  3. Update the model parameters with a standard single-task optimization algorithm, using $\Delta$ in place of the gradient.
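To make this scheme concrete, here is a minimal PyTorch-style sketch (our own illustration, not code from the paper). The names `mtl_step` and `aggregate` are placeholders, and we assume all of the model's trainable parameters are shared across tasks.

```python
import torch

def mtl_step(model, optimizer, task_losses, aggregate):
    """One step of the generic MTL optimization scheme.

    task_losses: list of K scalar loss tensors, one per task.
    aggregate:   an aggregation algorithm A mapping the per-task gradients
                 [g_1, ..., g_K] to a joint update direction Delta.
    """
    shared_params = [p for p in model.parameters() if p.requires_grad]

    # 1. Compute the gradient of each task loss w.r.t. the shared parameters.
    per_task_grads = []
    for loss in task_losses:
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
        per_task_grads.append(torch.cat([g.reshape(-1) for g in grads]))

    # 2. Combine the K gradients into a single direction Delta.
    delta = aggregate(per_task_grads)

    # 3. Hand Delta to a standard single-task optimizer in place of the gradient.
    optimizer.zero_grad()
    offset = 0
    for p in shared_params:
        n = p.numel()
        p.grad = delta[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
```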
To date, multiple heuristics have been proposed for the aggregation algorithm $\mathcal{A}$; however, a principled, axiomatic approach to gradient aggregation has been missing.
In the paper Multi-Task Learning as a Bargaining Game, the authors propose viewing the gradient combination step as a cooperative bargaining game in which the tasks negotiate to reach an agreement on a joint direction for the parameter update. They put forward the Nash bargaining solution as a principled approach to this game and derive from it a new, efficient MTL optimization algorithm called Nash-MTL, in which the gradients are combined at each step using the Nash bargaining solution. They also prove that the algorithm converges to a Pareto optimal point in the convex case and to a Pareto stationary point in the non-convex case. In this article, we discuss the ideas proposed in this paper in detail.
This article was written as a Weights & Biases Report, a project management and collaboration tool for machine learning projects. Reports let you organize and embed visualizations, describe your findings, share updates with collaborators, and more. To learn more, check out Collaborative Reports.



Relevant Background On MTL

Let us first revisit a few concepts that form the background for understanding Nash-MTL as proposed by the authors.

Pareto Optimality

Optimization for Multi-task Learning is a specific case of Multi-objective Optimization, which can be formally defined as follows:
Given objective functions $\ell_{1}, \ldots, \ell_{K}$, the performance of a solution $x$ is measured by the vector of objective values $\left(\ell_{1}(x), \ldots, \ell_{K}(x)\right)$.
A key property of multi-objective optimization is that, since there is no natural linear ordering on vectors, it is not always possible to compare solutions, and hence there is no single clear optimum. A solution $x$ is said to dominate $x^{\prime}$ if it is better on one or more objectives and not worse on any other. A solution that is not dominated by any other is called Pareto optimal, and the set of all Pareto optimal solutions is called the Pareto front. For non-convex problems, a point is called locally Pareto optimal if it is Pareto optimal within some open set containing it.
It is important to note that there is no clear way to select between different Pareto optimal solutions without additional assumptions or prior knowledge about the user's preferences.
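To make the dominance relation concrete, here is a small NumPy sketch (our own illustration, assuming each objective is being minimized) that checks dominance and extracts the Pareto front from a finite set of candidate objective vectors:

```python
import numpy as np

def dominates(u, v):
    """True if objective vector u dominates v (for minimization):
    u is no worse than v on every objective and strictly better on at least one."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u <= v) and np.any(u < v))

def pareto_front(points):
    """Return the non-dominated (Pareto optimal) subset of a finite candidate set."""
    points = np.asarray(points)
    return np.array([p for i, p in enumerate(points)
                     if not any(dominates(q, p) for j, q in enumerate(points) if j != i)])

# Example: (1, 3) and (2, 1) are incomparable and both Pareto optimal,
# while (2, 4) is dominated by (1, 3).
print(pareto_front([[1, 3], [2, 1], [2, 4]]))
```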
For a more detailed treatment of Pareto optimality, you can refer to the video embedded below.



Nash Bargaining Solution

In this section, we provide a brief background on cooperative bargaining games and the Nash bargaining solution.
Cooperative bargaining is a process in which two or more people decide how to share a surplus that they can jointly generate. In many cases, the surplus created by the players can be shared in many ways, forcing the players to negotiate which division of payoffs to choose. Such surplus-sharing problems (also called the bargaining problem) are faced by management and labor in the division of a firm's profit, by trade partners in the specification of the terms of trade, and more.
Let's formalize a few concepts of bargaining games:
  • In a bargaining problem, we have $K$ players, each with their own utility function $u_{i}: A \cup\{D\} \rightarrow \mathbb{R}$, which they wish to maximize.
  • $A$ is the set of possible agreements, and $D$ is the disagreement point to which the players default if they fail to reach an agreement.
  • The set of possible payoffs is defined as $U=\left\{\left(u_{1}(x), \ldots, u_{K}(x)\right): x \in A\right\} \subset \mathbb{R}^{K}$, and $d=\left(u_{1}(D), \ldots, u_{K}(D)\right)$.
  • It is assumed that $U$ is convex and compact, and that there exists a point in $U$ that strictly dominates $d$, i.e., there exists $u \in U$ such that $\forall i: u_{i}>d_{i}$.
John Nash, in his seminal 1953 paper Two-Person Cooperative Games, showed that for such a payoff set $U$, the two-player bargaining problem has a unique solution satisfying the following axioms:
  1. Pareto Optimality: The agreed solution must not be dominated by another option, i.e., there cannot be another agreement that is better for at least one player and not worse for any of the players. Since the game is cooperative, it makes little sense for the players to hold one player back without any gain to the others, so it is natural to assume the agreed solution is not dominated.
  2. Symmetry: The solution should be invariant to permuting the order of the players.
  3. Independence of Irrelevant Alternatives: If we enlarge the set of possible payoffs to $\tilde{U} \supsetneq U$ and the solution of the enlarged problem, $u^{*}$, lies in the original set $U$, then $u^{*}$ remains the agreed point when the set of possible payoffs is $U$.
  4. Invariance to Affine Transformations: If we transform each utility function $u_{i}(x)$ to $\tilde{u}_{i}(x)=c_{i}\cdot u_{i}(x)+b_{i}$ with $c_{i}>0$, and the original agreement had utilities $\left(y_{1}, \ldots, y_{K}\right)$, then the agreement after the transformation has utilities $\left(c_{1} y_{1}+b_{1}, \ldots, c_{K} y_{K}+b_{K}\right)$.
The unique point satisfying all these axioms is called the Nash bargaining solution and is given as:
$$u^{*}=\arg \max _{u \in U} \sum_{i} \log \left(u_{i}-d_{i}\right) \quad \text{ s.t. } \quad \forall i: u_{i}>d_{i}$$
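As a quick illustration of this formula (a toy example of our own, not taken from the paper), consider two players splitting a unit surplus with disagreement point $d=(0,0)$ and payoff set $U=\{(u_{1}, u_{2}): u_{1}+u_{2} \le 1,\ u_{i} \ge 0\}$. The Nash bargaining solution maximizes $\log u_{1}+\log u_{2}$, which is largest on the boundary $u_{1}+u_{2}=1$ at

$$u^{*}=\left(\tfrac{1}{2}, \tfrac{1}{2}\right),$$

i.e., the surplus is split evenly between the two players.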



An illustrative example of Optimization trajectories in the loss space.

The panel above shows 5 different initializations (black dots); their trajectories are colored from orange to purple.
The losses have a large difference in scale. For linear scalarization (LS), PCGrad, and CAGrad, the optimization process is controlled by the gradient of $\ell_{2}$, since it has a larger magnitude, resulting in imbalanced solutions between tasks (mostly ending at the bottom right). These three methods also fail to converge to an optimal solution for the rightmost initialization points.
In contrast, MGDA is inclined towards the task with the smallest gradient magnitude ($\ell_{1}$). Nash-MTL, the method proposed in the paper, is invariant to changes in loss scale and produces solutions that are well balanced across the Pareto front.



Nash-MTL: The Algorithm

Now that we have set up the necessary background, let's explore the proposed Nash-MTL algorithm in detail.
Given an MTL optimization problem with parameters $\theta$, we search for an update vector $\Delta\theta$ within the ball $B_{\eta}$ of radius $\eta$ centered at zero. This is framed as a bargaining problem with agreement set $B_{\eta}$ and the disagreement point at $0$, i.e., staying at the current parameters $\theta$. The utility function of each player (task) is defined as $u_{i}(\Delta \theta)=g_{i}^{\top} \Delta \theta$, where $g_{i}$ is the gradient of the loss of task $i$ at $\theta$.
Note that since the agreement set is compact and convex and the utilities are linear, the set of possible payoffs is also compact and convex.
The authors make one main assumption beyond those used by Nash: if $\theta$ is not locally Pareto optimal, then the gradients are linearly independent. Under this assumption, the disagreement point $\Delta\theta = 0$ is dominated by another point in $B_{\eta}$, and if $\theta$ is not on the Pareto front, the unique Nash bargaining solution has the following form:
Let $G$ be the $d \times K$ matrix whose columns are the gradients $g_{i}$. The solution to $\arg \max _{\Delta \theta \in B_{\eta}} \sum_{i} \log \left(\Delta \theta^{\top} g_{i}\right)$ is $\sum_{i} \alpha_{i} g_{i}$, where $\alpha \in \mathbb{R}_{+}^{K}$ is the solution to $G^{\top} G \alpha=1 / \alpha$ and $1 / \alpha$ denotes the element-wise reciprocal.
For a detailed proof of the above claim, you can refer to Section 3.1 of the paper. You may also refer to Section 3.2 for a description of how to find the optimal solution to $G^{\top} G \alpha=1 / \alpha$.
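The paper finds $\alpha$ through a sequence of convex approximations (Section 3.2). Purely as an illustration of what is being solved, the sketch below instead treats $G^{\top} G \alpha=1 / \alpha$ as a bound-constrained nonlinear least-squares problem; this is not the authors' procedure, and `solve_alpha` is a name we introduce here.

```python
import numpy as np
from scipy.optimize import least_squares

def solve_alpha(G, eps=1e-8):
    """Find alpha > 0 satisfying (G^T G) alpha = 1 / alpha (element-wise).

    G: (d, K) matrix whose columns are the per-task gradients g_i.
    Returns the task weights alpha used to form the update sum_i alpha_i g_i.
    """
    GtG = G.T @ G
    K = GtG.shape[0]

    def residual(alpha):
        return GtG @ alpha - 1.0 / alpha

    # Start from uniform weights and keep alpha strictly positive.
    sol = least_squares(residual, x0=np.ones(K), bounds=(eps, np.inf))
    return sol.x

# Usage: delta = G @ solve_alpha(G) gives the joint update direction.
```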

Algorithm for Nash-MTL
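The algorithm figure from the paper is not reproduced here, but the aggregation it describes can be sketched by plugging the weights from `solve_alpha` into the generic scheme shown earlier. Again, this is an illustrative sketch and `nash_aggregate` is a name we introduce:

```python
import torch

def nash_aggregate(per_task_grads):
    """Nash-MTL aggregation: Delta = sum_i alpha_i g_i, with G^T G alpha = 1 / alpha."""
    G = torch.stack(per_task_grads, dim=1)          # d x K matrix of task gradients
    alpha = solve_alpha(G.detach().cpu().numpy())   # task weights, as sketched above
    alpha = torch.as_tensor(alpha, dtype=G.dtype, device=G.device)
    return G @ alpha                                # joint update direction

# Plugging it into the generic scheme:
# mtl_step(model, optimizer, task_losses, aggregate=nash_aggregate)
```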




Practical Speedup

One shortcoming of many leading MTL methods is that all task gradients are required to obtain the joint update direction. When the number of tasks $K$ becomes large, this may be too computationally expensive, as it requires $K$ backward passes through the shared backbone to compute the $K$ gradients. Prior works suggest speedups such as using feature-level gradients, i.e., gradients with respect to the shared representation, as a surrogate for the gradients of the shared parameters.
Note that this issue is not unique to Nash-MTL, but rather is shared by all methods that compute the gradients of all tasks.
In practice, the authors observe that using feature-level gradients as a surrogate for the gradients of the shared parameters dramatically degrades the performance of Nash-MTL. As an alternative, they suggest updating the gradient weights $\alpha^{(t)}$ only once every few iterations instead of at every iteration.
This simple yet effective change greatly reduces the runtime while maintaining high performance. The authors report experiments on the QM9 dataset and the MT10 benchmark in which the frequency of the task-weight updates is varied: the runtime of Nash-MTL can be brought down to roughly that of linear scalarization (or STL) while remaining competitive with the other baselines, albeit with a noticeable drop in performance compared to updating the weights at every step. Specifically, when updating the task weights once every 100 steps on the MT10 benchmark, Nash-MTL outperforms all other MTL methods while being only about 1.1× slower than the fastest baseline.
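A rough sketch of what this lazy-update schedule could look like in a training loop is shown below (illustrative only; `compute_task_losses` and `stack_task_gradients` are hypothetical helpers, and `solve_alpha` is the sketch from above):

```python
def train_with_lazy_weights(model, optimizer, batches, compute_task_losses,
                            stack_task_gradients, update_every=100):
    """Sketch: recompute the Nash bargaining weights only every `update_every` steps.

    compute_task_losses(model, batch)   -> list of K scalar losses (hypothetical helper)
    stack_task_gradients(losses, model) -> the d x K gradient matrix G (hypothetical helper)
    """
    alpha = None
    for step, batch in enumerate(batches):
        task_losses = compute_task_losses(model, batch)
        if alpha is None or step % update_every == 0:
            # Expensive path: K backward passes to build G, then solve for alpha.
            G = stack_task_gradients(task_losses, model)
            alpha = [float(a) for a in solve_alpha(G)]
        # Cheap path: with alpha fixed, sum_i alpha_i g_i is exactly the gradient
        # of the weighted loss, so a single backward pass suffices.
        total = sum(a * l for a, l in zip(alpha, task_losses))
        optimizer.zero_grad()
        total.backward()
        optimizer.step()
```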
For more details regarding these results, you can refer to Section C of the paper.



Scene Understanding Experiment

The authors follow the protocol described in the paper End-to-End Multi-Task Learning with Attention and evaluate Nash-MTL on the NYUv2 dataset. NYUv2 is an indoor scene dataset that consists of 1449 RGBD images and dense per-pixel labeling with 13 classes. This dataset is used as a multi-task learning benchmark for semantic segmentation, depth estimation, and surface normal prediction. For this experiment, the authors train a SegNet model. Each method is trained for 200 epochs with the Adam optimizer and an initial learning rate of 1e-4. The learning rate is halved to 5e-5 after 100 epochs.
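A minimal sketch of this optimization setup (our own illustration of the stated hyperparameters, not the authors' training script; the model, data loader, and update step are passed in as placeholders):

```python
import torch

def train_nyuv2(model, train_loader, step_fn, epochs=200):
    """Training setup as described above: Adam, lr 1e-4, halved after 100 epochs.

    model:   e.g. a SegNet-based multi-task network (placeholder)
    step_fn: callable that computes the three task losses for a batch and applies
             one Nash-MTL update (hypothetical helper, e.g. built from the sketches above)
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Halve the learning rate (1e-4 -> 5e-5) after 100 of the 200 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

    for epoch in range(epochs):
        for batch in train_loader:   # NYUv2: image, segmentation, depth, surface normals
            step_fn(model, optimizer, batch)
        scheduler.step()
```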

Results of Scene Understanding Experiment on NYUv2 Dataset




Conclusion

  • In this article, we discuss Nash-MTL, a novel and principled approach for multitask learning presented in the paper Multi-Task Learning as a Bargaining Game.
  • The authors frame the gradient combination step in MTL as a bargaining game and use the Nash bargaining solution to find the optimal update direction.
  • The authors highlight the importance of scale invariance for multi-task learning, specifically in setups with varying loss scales and gradient magnitudes.
  • The authors also provide a theoretical convergence analysis for Nash-MTL, showing that it converges to Pareto optimal and Pareto stationary points in the convex and non-convex settings, respectively.
  • Finally, the experiments conducted by the authors show that Nash-MTL achieves state-of-the-art results on various benchmarks across multiple domains.

