
Multi-Task Learning as a Bargaining Game

In this article, we explore the gradient combination step in multi-task learning (MTL) as a cooperative bargaining game and discuss Nash-MTL, a novel approach, in detail.
Multi-task Learning, or MTL, is a paradigm in which a joint model is trained to simultaneously make predictions for several tasks. Joint training reduces computation costs and improves data efficiency. However, since the gradients of the different tasks may conflict, training a joint model for MTL often yields lower performance than the corresponding single-task models.
One of the main reasons for such degradation is gradient conflict: the per-task gradients may point in conflicting directions or differ greatly in magnitude, with the largest gradient dominating the update direction. This gap between MTL's potential to improve performance through better generalization and its degraded performance under poor training has a major impact on many real-world systems.
Improving MTL optimization algorithms is, therefore, an important task with significant implications for many systems.

A General MTL Optimization Scheme

Currently, most MTL optimization algorithms follow a general scheme:
  1. Compute the per-task gradients $g_{1}, \ldots, g_{K}$.
  2. Combine those gradients into a joint direction $\Delta=\mathcal{A}\left(g_{1}, \ldots, g_{K}\right)$ using an aggregation algorithm $\mathcal{A}$.
  3. Update the model parameters with a standard single-task optimization algorithm, using $\Delta$ in place of the gradient.
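To make this scheme concrete, here is a minimal PyTorch-style sketch (our own illustration, not code from the paper). The names `mtl_step` and `aggregate` are placeholders, and we assume all of the model's trainable parameters are shared across tasks.

```python
import torch

def mtl_step(model, optimizer, task_losses, aggregate):
    """One step of the generic MTL optimization scheme.

    task_losses: list of K scalar loss tensors, one per task.
    aggregate:   an aggregation algorithm A mapping the per-task gradients
                 [g_1, ..., g_K] to a joint update direction Delta.
    """
    shared_params = [p for p in model.parameters() if p.requires_grad]

    # 1. Compute the gradient of each task loss w.r.t. the shared parameters.
    per_task_grads = []
    for loss in task_losses:
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
        per_task_grads.append(torch.cat([g.reshape(-1) for g in grads]))

    # 2. Combine the K gradients into a single direction Delta.
    delta = aggregate(per_task_grads)

    # 3. Hand Delta to a standard single-task optimizer in place of the gradient.
    optimizer.zero_grad()
    offset = 0
    for p in shared_params:
        n = p.numel()
        p.grad = delta[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
```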
To date, multiple heuristics have been proposed for the aggregation algorithm $\mathcal{A}$; however, a principled, axiomatic approach to gradient aggregation has been missing.
In the paper Multi-Task Learning as a Bargaining Game, the authors propose viewing the gradient combination step as a cooperative bargaining game in which the tasks negotiate to reach an agreement on a joint direction for the parameter update. They put forward the Nash bargaining solution as a principled approach to this game and derive from it a new, efficient MTL optimization algorithm called Nash-MTL, in which the gradients are combined at each step using the Nash bargaining solution. They also prove that the algorithm converges to a Pareto optimal point in the convex case and to a Pareto stationary point in the non-convex case. In this article, we discuss the ideas proposed in this paper in detail.
This article was written as a Weights & Biases Report, a project management and collaboration tool for machine learning projects. Reports let you organize and embed visualizations, describe your findings, share updates with collaborators, and more. To learn more, check out Collaborative Reports.



Relevant Background On MTL

Let us first revisit a few concepts that form the background for understanding Nash-MTL as proposed by the authors.

Pareto Optimality

Optimization for Multi-task Learning is a specific case of Multi-objective Optimization, which can be formally defined as follows:
Given objective functions $\ell_{1}, \ldots, \ell_{K}$, the performance of a solution $x$ is measured by the vector of objective values $\left(\ell_{1}(x), \ldots, \ell_{K}(x)\right)$.
A key property of multi-objective optimization is that, since there is no natural linear ordering on vectors, it is not always possible to compare solutions, and hence there is no single clear optimum. A solution $x$ is said to dominate $x^{\prime}$ if it is better on one or more objectives and not worse on any other. A solution that is not dominated by any other is called Pareto optimal, and the set of all Pareto optimal solutions is called the Pareto front. For non-convex problems, a point is called locally Pareto optimal if it is Pareto optimal within some open set containing it.
It is important to note that there is no clear way to select between different Pareto optimal solutions without additional assumptions or prior knowledge about the user's preferences.
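To make the dominance relation concrete, here is a small NumPy sketch (our own illustration, assuming each objective is being minimized) that checks dominance and extracts the Pareto front from a finite set of candidate objective vectors:

```python
import numpy as np

def dominates(u, v):
    """True if objective vector u dominates v (for minimization):
    u is no worse than v on every objective and strictly better on at least one."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u <= v) and np.any(u < v))

def pareto_front(points):
    """Return the non-dominated (Pareto optimal) subset of a finite candidate set."""
    points = np.asarray(points)
    return np.array([p for i, p in enumerate(points)
                     if not any(dominates(q, p) for j, q in enumerate(points) if j != i)])

# Example: (1, 3) and (2, 1) are incomparable and both Pareto optimal,
# while (2, 4) is dominated by (1, 3).
print(pareto_front([[1, 3], [2, 1], [2, 4]]))
```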
For a more detailed treatment of Pareto optimality, you can refer to the video embedded below.



Nash Bargaining Solution

In this section, we provide a brief background on cooperative bargaining games and the Nash bargaining solution.
Cooperative bargaining is a process in which two or more people decide how to share a surplus that they can jointly generate. In many cases, the surplus created by the players can be shared in many ways, forcing the players to negotiate which division of payoffs to choose. Such surplus-sharing problems (also called the bargaining problem) are faced by management and labor in the division of a firm's profit, by trade partners in the specification of the terms of trade, and more.
Let's formalize a few concepts of bargaining games:
  • In a bargaining problem, we have $K$ players, each with their own utility function $u_{i}: A \cup\{D\} \rightarrow \mathbb{R}$, which they wish to maximize.
  • $A$ is the set of possible agreements, and $D$ is the disagreement point to which the players default if they fail to reach an agreement.
  • The set of possible payoffs is defined as $U=\left\{\left(u_{1}(x), \ldots, u_{K}(x)\right): x \in A\right\} \subset \mathbb{R}^{K}$, and $d=\left(u_{1}(D), \ldots, u_{K}(D)\right)$.
  • It is assumed that $U$ is convex and compact, and that there exists a point in $U$ that strictly dominates $d$, i.e., there exists $u \in U$ such that $\forall i: u_{i}>d_{i}$.
John Nash, in his seminal 1953 paper Two-Person Cooperative Games, showed that for such a payoff set $U$, the two-player bargaining problem has a unique solution satisfying the following axioms:
  1. Pareto Optimality: The agreed solution must not be dominated by another option, i.e., there cannot be another agreement that is better for at least one player and not worse for any of the players. Since the game is cooperative, it makes little sense for the players to hold one player back without any gain to the others, so it is natural to assume the agreed solution is not dominated.
  2. Symmetry: The solution should be invariant to permuting the order of the players.
  3. Independence of Irrelevant Alternatives: If we enlarge the set of possible payoffs to $\tilde{U} \supsetneq U$ and the solution of the enlarged problem, $u^{*}$, lies in the original set $U$, then $u^{*}$ remains the agreed point when the set of possible payoffs is $U$.
  4. Invariance to Affine Transformations: If we transform each utility function $u_{i}(x)$ to $\tilde{u}_{i}(x)=c_{i}\cdot u_{i}(x)+b_{i}$ with $c_{i}>0$, and the original agreement had utilities $\left(y_{1}, \ldots, y_{K}\right)$, then the agreement after the transformation has utilities $\left(c_{1} y_{1}+b_{1}, \ldots, c_{K} y_{K}+b_{K}\right)$.
The unique point satisfying all these axioms is called the Nash bargaining solution and is given as:
$$u^{*}=\arg \max _{u \in U} \sum_{i} \log \left(u_{i}-d_{i}\right) \quad \text{ s.t. } \quad \forall i: u_{i}>d_{i}$$
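As a quick illustration of this formula (a toy example of our own, not taken from the paper), consider two players splitting a unit surplus with disagreement point $d=(0,0)$ and payoff set $U=\{(u_{1}, u_{2}): u_{1}+u_{2} \le 1,\ u_{i} \ge 0\}$. The Nash bargaining solution maximizes $\log u_{1}+\log u_{2}$, which is largest on the boundary $u_{1}+u_{2}=1$ at

$$u^{*}=\left(\tfrac{1}{2}, \tfrac{1}{2}\right),$$

i.e., the surplus is split evenly between the two players.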



An illustrative example of Optimization trajectories in the loss space.

The panel above shows 5 different initializations (black dots); their trajectories are colored from orange to purple.
The losses have a large difference in scale. For linear scalarization (LS), PCGrad, and CAGrad, the optimization process is controlled by the gradient of $\ell_{2}$, since it has a larger magnitude, resulting in imbalanced solutions between tasks (mostly ending at the bottom right). These three methods also fail to converge to an optimal solution for the rightmost initialization points.
In contrast, MGDA is inclined towards the task with the smallest gradient magnitude ($\ell_{1}$). Nash-MTL, the method proposed in the paper, is invariant to changes in loss scale and produces solutions that are well balanced across the Pareto front.



Nash-MTL: The Algorithm

Now that we have set up the necessary background, let's explore the proposed Nash-MTL algorithm in detail.
Given an MTL optimization problem with parameters $\theta$, we search for an update vector $\Delta\theta$ within the ball $B_{\eta}$ of radius $\eta$ centered at zero. This is framed as a bargaining problem with agreement set $B_{\eta}$ and the disagreement point at $0$, i.e., staying at the current parameters $\theta$. The utility function of each player (task) is defined as $u_{i}(\Delta \theta)=g_{i}^{\top} \Delta \theta$, where $g_{i}$ is the gradient of the loss of task $i$ at $\theta$.
Note that since the agreement set is compact and convex and the utilities are linear, the set of possible payoffs is also compact and convex.
The authors make one main assumption beyond those used by Nash: if $\theta$ is not locally Pareto optimal, then the gradients are linearly independent. Under this assumption, the disagreement point $\Delta\theta = 0$ is dominated by another point in $B_{\eta}$, and if $\theta$ is not on the Pareto front, the unique Nash bargaining solution has the following form:
Let $G$ be the $d \times K$ matrix whose columns are the gradients $g_{i}$. The solution to $\arg \max _{\Delta \theta \in B_{\eta}} \sum_{i} \log \left(\Delta \theta^{\top} g_{i}\right)$ is $\sum_{i} \alpha_{i} g_{i}$, where $\alpha \in \mathbb{R}_{+}^{K}$ is the solution to $G^{\top} G \alpha=1 / \alpha$ and $1 / \alpha$ denotes the element-wise reciprocal.
For a detailed proof of the above claim, you can refer to Section 3.1 of the paper. You may also refer to Section 3.2 for a description of how to find the optimal solution to $G^{\top} G \alpha=1 / \alpha$.
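The paper finds $\alpha$ through a sequence of convex approximations (Section 3.2). Purely as an illustration of what is being solved, the sketch below instead treats $G^{\top} G \alpha=1 / \alpha$ as a bound-constrained nonlinear least-squares problem; this is not the authors' procedure, and `solve_alpha` is a name we introduce here.

```python
import numpy as np
from scipy.optimize import least_squares

def solve_alpha(G, eps=1e-8):
    """Find alpha > 0 satisfying (G^T G) alpha = 1 / alpha (element-wise).

    G: (d, K) matrix whose columns are the per-task gradients g_i.
    Returns the task weights alpha used to form the update sum_i alpha_i g_i.
    """
    GtG = G.T @ G
    K = GtG.shape[0]

    def residual(alpha):
        return GtG @ alpha - 1.0 / alpha

    # Start from uniform weights and keep alpha strictly positive.
    sol = least_squares(residual, x0=np.ones(K), bounds=(eps, np.inf))
    return sol.x

# Usage: delta = G @ solve_alpha(G) gives the joint update direction.
```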

Algorithm for Nash-MTL
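The algorithm figure from the paper is not reproduced here, but the aggregation it describes can be sketched by plugging the weights from `solve_alpha` into the generic scheme shown earlier. Again, this is an illustrative sketch and `nash_aggregate` is a name we introduce:

```python
import torch

def nash_aggregate(per_task_grads):
    """Nash-MTL aggregation: Delta = sum_i alpha_i g_i, with G^T G alpha = 1 / alpha."""
    G = torch.stack(per_task_grads, dim=1)          # d x K matrix of task gradients
    alpha = solve_alpha(G.detach().cpu().numpy())   # task weights, as sketched above
    alpha = torch.as_tensor(alpha, dtype=G.dtype, device=G.device)
    return G @ alpha                                # joint update direction

# Plugging it into the generic scheme:
# mtl_step(model, optimizer, task_losses, aggregate=nash_aggregate)
```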




Practical Speedup

One shortcoming of many leading MTL methods is that all task gradients are required to obtain the joint update direction. When the number of tasks $K$ becomes large, this may be too computationally expensive, as it requires $K$ backward passes through the shared backbone to compute the $K$ gradients. Prior works suggest speedups such as using feature-level gradients, i.e., gradients with respect to the shared representation, as a surrogate for the gradients of the shared parameters.
Note that this issue is not unique to Nash-MTL, but rather is shared by all methods that compute the gradients of all tasks.
In practice, the authors observe that using feature-level gradients as a surrogate for the gradients of the shared parameters dramatically degrades the performance of Nash-MTL. As an alternative, they suggest updating the gradient weights $\alpha^{(t)}$ only once every few iterations instead of at every iteration.
This simple yet effective change greatly reduces the runtime while maintaining high performance. The authors report experiments on the QM9 dataset and the MT10 benchmark in which the frequency of the task-weight updates is varied: the runtime of Nash-MTL can be brought down to roughly that of linear scalarization (or STL) while remaining competitive with the other baselines, albeit with a noticeable drop in performance compared to updating the weights at every step. Specifically, when updating the task weights once every 100 steps on the MT10 benchmark, Nash-MTL outperforms all other MTL methods while being only about 1.1× slower than the fastest baseline.
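A rough sketch of what this lazy-update schedule could look like in a training loop is shown below (illustrative only; `compute_task_losses` and `stack_task_gradients` are hypothetical helpers, and `solve_alpha` is the sketch from above):

```python
def train_with_lazy_weights(model, optimizer, batches, compute_task_losses,
                            stack_task_gradients, update_every=100):
    """Sketch: recompute the Nash bargaining weights only every `update_every` steps.

    compute_task_losses(model, batch)   -> list of K scalar losses (hypothetical helper)
    stack_task_gradients(losses, model) -> the d x K gradient matrix G (hypothetical helper)
    """
    alpha = None
    for step, batch in enumerate(batches):
        task_losses = compute_task_losses(model, batch)
        if alpha is None or step % update_every == 0:
            # Expensive path: K backward passes to build G, then solve for alpha.
            G = stack_task_gradients(task_losses, model)
            alpha = [float(a) for a in solve_alpha(G)]
        # Cheap path: with alpha fixed, sum_i alpha_i g_i is exactly the gradient
        # of the weighted loss, so a single backward pass suffices.
        total = sum(a * l for a, l in zip(alpha, task_losses))
        optimizer.zero_grad()
        total.backward()
        optimizer.step()
```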
For more details regarding these results, you can refer to Section C of the paper.



Scene Understanding Experiment

The authors follow the protocol described in the paper End-to-End Multi-Task Learning with Attention and evaluate Nash-MTL on the NYUv2 dataset. NYUv2 is an indoor scene dataset that consists of 1449 RGBD images and dense per-pixel labeling with 13 classes. This dataset is used as a multi-task learning benchmark for semantic segmentation, depth estimation, and surface normal prediction. For this experiment, the authors train a SegNet model. Each method is trained for 200 epochs with the Adam optimizer and an initial learning rate of 1e-4. The learning rate is halved to 5e-5 after 100 epochs.
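A minimal sketch of this optimization setup (our own illustration of the stated hyperparameters, not the authors' training script; the model, data loader, and update step are passed in as placeholders):

```python
import torch

def train_nyuv2(model, train_loader, step_fn, epochs=200):
    """Training setup as described above: Adam, lr 1e-4, halved after 100 epochs.

    model:   e.g. a SegNet-based multi-task network (placeholder)
    step_fn: callable that computes the three task losses for a batch and applies
             one Nash-MTL update (hypothetical helper, e.g. built from the sketches above)
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Halve the learning rate (1e-4 -> 5e-5) after 100 of the 200 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

    for epoch in range(epochs):
        for batch in train_loader:   # NYUv2: image, segmentation, depth, surface normals
            step_fn(model, optimizer, batch)
        scheduler.step()
```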

Results of Scene Understanding Experiment on NYUv2 Dataset




Conclusion

  • In this article, we discuss Nash-MTL, a novel and principled approach for multitask learning presented in the paper Multi-Task Learning as a Bargaining Game.
  • The authors frame the gradient combination step in MTL as a bargaining game and use the Nash bargaining solution to find the optimal update direction.
  • The authors highlight the importance of scale invariance for multi-task learning, specifically in setups with varying loss scales and gradient magnitudes.
  • The authors also provide a theoretical convergence analysis for Nash-MTL, showing that it converges to Pareto optimal and Pareto stationary points in the convex and non-convex settings, respectively.
  • Finally, the experiments conducted by the authors show that Nash-MTL achieves state-of-the-art results on various benchmarks across multiple domains.

