
Impact of Square Root on SiLog

Many published studies in monocular depth estimation that train models with the SiLog loss apply a square root on top of the loss calculation in their code, without providing a reason or citing any experimental support for the modification. This report helps to fill that void.
Created on September 18 | Last edited on September 18
This report is the experimental component of "The Mystery of SiLog Loss". For an introduction to the topic and the context for this experiment, please refer to that article. In summary, a mysterious practice of taking the square root of the SiLog loss (Eigen et al., 2014) has made its way into the literature with seemingly no explanation. This report covers three runs which seek to provide experimental support for the analysis laid out in the article. Since the usual benefit of a square root, taming large input values, does not apply to an input built from differences of logs, the only remaining conceivable benefit of taking the square root of the SiLog loss is to generate larger gradients when the loss is near zero, at the risk of encountering very large gradients as the loss approaches zero. A larger learning rate, or a larger loss weight in the case of multiple losses, may be a safer alternative for getting more activity from the gradients during training, without introducing the concerning mathematical property of gradients that grow rapidly as the loss approaches zero, which can in theory create training instability and prevent the weights from converging.
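To make the gradient argument concrete, applying the square root simply rescales the SiLog gradients by a chain-rule factor (a standard derivation, not taken from the article):

```latex
\frac{\partial}{\partial \theta}\sqrt{L(\theta)}
  = \frac{1}{2\sqrt{L(\theta)}}\,\frac{\partial L(\theta)}{\partial \theta}
```

The factor 1/(2√L) is greater than 1 whenever L < 0.25, so the square root amplifies the gradients precisely in the low-loss regime, and the factor grows without bound as L approaches 0.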
The model being trained is a Segformer-b0 with two task heads: one for semantic segmentation trained on cross-entropy loss, and one for monocular depth trained on SiLog loss with λ = 0.25 (to increase the effect of the metric depth error component). The model was pre-trained on the semantic segmentation task before these runs, and the dataset is the synthetic SHIFT dataset for autonomous vehicle research. We compare the results of three training runs: two that differ only in whether the square root is taken of the SiLog loss, with all other variables held constant, including the learning rate, and a third without the square root but with a higher learning rate. Below are the high-level training and evaluation metrics for these runs over 5 epochs (the lr=1e-5 no-sqrt run was cut short due to time constraints, but the trend is clear).
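Before looking at the metrics, here is a minimal sketch of the loss setup described above, assuming a PyTorch-style implementation; the λ value, the optional square root, and the equal task weighting come from this report, while the function and argument names are mine:

```python
import torch
import torch.nn.functional as F

def silog_loss(pred_depth, gt_depth, lam=0.25, apply_sqrt=False, eps=1e-6):
    """Scale-invariant log (SiLog) loss of Eigen et al. (2014).

    lam weights the scale-invariance term; apply_sqrt reproduces the
    square-root-on-top variant seen in several published codebases.
    """
    # Difference of logs; eps guards against log(0).
    d = torch.log(pred_depth + eps) - torch.log(gt_depth + eps)
    loss = (d ** 2).mean() - lam * d.mean() ** 2
    return torch.sqrt(loss + eps) if apply_sqrt else loss

def total_loss(seg_logits, seg_labels, pred_depth, gt_depth,
               seg_weight=1.0, depth_weight=1.0, apply_sqrt=False):
    """Equally weighted multi-task objective, as used in these runs (sketch)."""
    ce = F.cross_entropy(seg_logits, seg_labels)
    silog = silog_loss(pred_depth, gt_depth, lam=0.25, apply_sqrt=apply_sqrt)
    # Raising depth_weight (or the learning rate) scales the SiLog gradients
    # by a constant factor; apply_sqrt instead scales them by 1 / (2 * sqrt(L)),
    # which grows without bound as the loss L approaches zero.
    return seg_weight * ce + depth_weight * silog
```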

[W&B panel: run set of 4 runs showing training and evaluation metrics]

Comparing the metrics from the run taking the square root of the SiLog loss to the one that did not, we can see that the runs were quite similar, but the square-root run did get slightly better evaluation results over time. This is likely because both the loss value and its gradients are larger near the lower limit of the training error, which gives an advantage at the same learning rate. While the depth training went slightly better in the square-root run, there was a slight decrease in the semantic segmentation scores, probably because the depth loss gradients were relatively larger and dominated the backpropagation, since the losses were weighted equally in all runs.
For comparison, the third run was done without the square root but with a learning rate 5x higher than the 1e-5 used in the other two runs, to see whether this made for a fairer comparison over time by compensating for the smaller gradients of the non-square-rooted loss. The results were better across the board for all tasks, which may indicate the original learning rate was a bit low in the first place, but it does support the theory: rather than increasing the gradients near zero by taking the square root of the loss, we can simply increase the impact of the gradients in backpropagation by raising the learning rate or the task loss weight, without introducing undesirable mathematical properties to worry about.
Looking at the smoothed training losses at step 65,000, when the shorter run was stopped, the loss for the higher learning rate run hovers around 0.029, the no-sqrt lr=1e-5 run around 0.037, and the square-root run around 0.180. Squaring the last value to put it on the same scale as the others gives 0.032, which is right in between the two no-sqrt runs.
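As a quick check on that comparison:

```latex
0.180^2 \approx 0.032, \qquad 0.029 < 0.032 < 0.037
```

At that point the square root was also amplifying the depth gradients by a factor of roughly 1/(2 × 0.180) ≈ 2.8 relative to the raw loss, which connects to the gradient figure discussed below.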
In conclusion, although the square-root loss shows a slight advantage at an equal learning rate due to its higher gradients at near-zero loss, increasing the overall effect of the gradients via a higher learning rate is the safer way to keep making progress as the training error approaches zero. In the context of gradient descent, passing a first-order function through the square root gives it the unattractive mathematical property of gradients exploding to infinity as the loss approaches zero. It must be mentioned, however, that the minimum square-rooted SiLog loss over the 95,000 training steps was 0.168, the square of which is 0.028, a point at which the gradient of the square root curve is 2.982, so the loss does not appear to get low enough to enter the region of extremely high gradients. Nevertheless, it goes against intuition to have gradients grow larger as weights approach their optima, so until convinced otherwise, I will not be using the square root on the SiLog loss when training models.
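As a sanity check on that gradient figure, with L denoting the raw (non-square-rooted) SiLog value:

```latex
\frac{d}{dL}\sqrt{L} = \frac{1}{2\sqrt{L}}, \qquad
\left.\frac{1}{2\sqrt{L}}\right|_{L = 0.168^2 \approx 0.028}
  = \frac{1}{2 \times 0.168} \approx 2.98
```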