SF-Net: Single-Frame Supervision for Temporal Action Localization

Submission to the Reproducibility Challenge 2021 for the paper "SF-Net: Single-Frame Supervision for Temporal Action Localization" by Ma et al. (2020), accepted at ECCV 2020.
Kajal Puri

Reproducibility Summary

This report is a reproduction of the ECCV 2020 paper "SF-Net: Single-Frame Supervision for Temporal Action Localization" by Ma et al. (2020). The original code is available at https://github.com/Flowerfan/SF-Net.
The paper introduces a new annotation scheme for video frames as well as a new network trained on this form of supervision; the consolidated system is called SF-Net. The authors perform various experiments on the task of temporal action localization using the benchmark datasets THUMOS14, GTEA and BEOID. This report aims to replicate their reported results and uses Weights & Biases (W&B) to track hyper-parameters and metrics during the training and evaluation phases.

Scope of Reproducibility

The authors have open-sourced the newly annotated datasets for THUMOS14, GTEA and BEOID (re-annotated with the single-frame supervision technique) along with the extracted features they used during training. Single-frame supervision is used to train temporal action localization models that predict an "actionness" score and mine pseudo background and action frames. The claim is that this approach outperforms weakly-supervised techniques by a large margin and performs on par with fully-supervised techniques.

Methodology

To reproduce the paper we mainly used the newly annotated datasets, the pre-extracted features, and the code from the open-sourced GitHub repository linked above. There were a few missing libraries and version mismatches in the requirements file, but these were resolved while reproducing the code. We reproduced the results on the THUMOS14 dataset only, as it is the largest dataset with the best results reported in the paper. We used Weights & Biases to log hyper-parameters and metrics during training and inference.

Results

We were able to reproduce the main results reported in the paper on a single GPU. Due to a lack of time and resources, we could not explore the ablation studies and a few other secondary experiments reported in the paper. The primary results we obtained are closely aligned with the reported results. We also varied the hyper-parameter values to observe their effect: although the paper reports that the technique is not sensitive to these values, we still observed a drop of a few mAP points. Overall, the results obtained from replicating the code and experiments support the authors' claim that single-frame supervision is considerably more effective than weakly-supervised methodologies.

What was easy

Because the open-sourced data, code, and extracted features were easy to locate, it was not difficult to understand the code and experiments referenced in the paper. The authors report the value of each hyper-parameter in the implementation section of the paper as well as in a separate file in their code; although some values are mismatched between the two, this was easy to spot and correct. The network is implemented in a simplified way and the whole process is divided into two modules, classification and actionness, which makes it easy to modify the code and perform further experiments.

What was difficult

As mentioned, some hyper-parameter values are mismatched between the paper and the GitHub repository. Additionally, the authors do not provide the code for reproducing the graphs/plots used in the qualitative analysis section of the paper, so it is difficult to verify those results. The extracted video features are hosted on Google Drive, which can be challenging to download since one of the files is 6.8 GB. The single-frame annotations of the three datasets are provided by the authors, but checking the validity of the method on another dataset is practically impossible unless that dataset is re-annotated according to the methodology described in the paper.

Communication with the Authors

The authors replied to our queries regarding the above claims and pointed us to code references not provided in the paper. They also actively respond to open issues on GitHub. We are thankful to them for their work and responsiveness.

Introduction

This report is a reproduction of the ECCV 2020 paper "SF-Net: Single-Frame Supervision for Temporal Action Localization" by F. Ma et al. (2020).
The paper introduces a novel annotation technique, single-frame supervision, to label datasets while minimizing the effort, time and resources required for the action localization task. Additionally, the authors introduce an end-to-end architecture designed to get the best results from this type of supervision. They demonstrate the efficacy of their algorithm by benchmarking on several datasets and comparing it with existing weakly and fully supervised techniques. This report replicates the experiments described in the paper and supports their claims by reporting very similar results.

Scope of Reproducibility

In single-frame supervision, only one frame per action instance is annotated in the video, along with its timestamp. The action localization approach is composed of three main steps. In the first step, an actionness score is predicted for each frame in the video; this score is the probability of an action happening in that frame. In the second step, pseudo background and action frames are mined, based on the single-frame annotations. In the last step, the classifier is trained using the ground-truth labels as well as the pseudo labels, and experiments are performed on the benchmark datasets. The key components of the paper are:

Actionness Score

The architecture contains an actionness module responsible for identifying positive action frames in the video. For each frame it outputs a scalar probability indicating how likely that frame is to contain an action.
To predict this score, two temporal convolution layers are followed by a fully-connected layer whose output is the actionness score. A sigmoid function is applied to this score and a binary classification loss (Action_Loss) is computed.
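To make this concrete, below is a minimal PyTorch sketch of such an actionness branch, assuming I3D features of dimension 2048. The layer sizes, kernel widths and names are our illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ActionnessModule(nn.Module):
    """Sketch of an actionness branch: two temporal convolutions followed by a
    fully-connected layer producing one raw score per frame (assumed sizes)."""
    def __init__(self, feature_dim=2048, hidden_dim=512):
        super().__init__()
        # Temporal convolutions operate over the time axis of the feature sequence.
        self.temporal_convs = nn.Sequential(
            nn.Conv1d(feature_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, features):
        # features: (batch, T, feature_dim) -> (batch, feature_dim, T) for Conv1d
        x = self.temporal_convs(features.transpose(1, 2))
        # (batch, hidden_dim, T) -> (batch, T) raw per-frame actionness scores
        return self.fc(x.transpose(1, 2)).squeeze(-1)

# The binary Action_Loss is then a sigmoid + BCE over the labeled frames, e.g.
# nn.BCEWithLogitsLoss()(scores[labeled_idx], targets[labeled_idx])
```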

Frame Mining

With single-frame supervision, each action instance has only one annotated frame, so positive action-frame examples are scarce. To overcome this problem during training, the authors introduce a novel strategy for mining pseudo action and background frames.
For action frames, the annotated frame is treated as an "anchor" frame and extended by a chosen radius into the past and future frames. For background mining, a background category is explicitly introduced so that background frames are not misclassified as action classes. To mine K background frames, the classification scores of all unlabeled frames are collected first; K is then determined as a ratio of the number of labeled frames; finally, the unlabeled frames are sorted by their background scores and the top K are selected as pseudo background frames. A frame-mining loss (Frame_Loss) is also computed based on these frames.
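The sketch below illustrates our reading of this mining step. It assumes per-frame classification scores with the background class in the last column, a fixed expansion radius, and η used as the ratio of mined background frames to labeled frames; all of these details are assumptions for illustration rather than the authors' exact code.

```python
import torch

def mine_pseudo_frames(class_scores, labeled_idx, radius=2, eta=5):
    """Sketch of pseudo-label mining (our reading of the paper).

    class_scores: (T, C+1) per-frame scores; last column assumed to be background.
    labeled_idx:  indices of the single annotated frames in this video.
    radius:       frames around each anchor treated as pseudo-action frames.
    eta:          ratio controlling how many background frames are mined.
    """
    T = class_scores.shape[0]
    labeled = torch.as_tensor(labeled_idx, dtype=torch.long)

    # 1) Pseudo-action frames: expand each annotated "anchor" frame by
    #    `radius` frames into the past and the future.
    pseudo_action = set()
    for t in labeled.tolist():
        for dt in range(-radius, radius + 1):
            if 0 <= t + dt < T:
                pseudo_action.add(t + dt)

    # 2) Pseudo-background frames: among the remaining unlabeled frames,
    #    select the top-K background scores, with K = eta * number of labels.
    unlabeled = torch.tensor([t for t in range(T) if t not in pseudo_action],
                             dtype=torch.long)
    k = min(eta * len(labeled), len(unlabeled))
    background_scores = class_scores[unlabeled, -1]
    pseudo_background = unlabeled[background_scores.topk(k).indices]

    return sorted(pseudo_action), pseudo_background.tolist()
```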

Objective Function

Additionally, a video-level loss (Video_Loss) handles the multi-label action classification problem in videos. As a result, the final objective function for training becomes:
Total_Loss = Action_Loss + α · Frame_Loss + β · Video_Loss
In the above equation, α and β are hyper-parameters in [0, 1] that weigh the relative importance of the frame-level and video-level losses.
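As a trivial illustration, the combined objective in code (with the weighting α = β = 1 used later in our experiments) would look like:

```python
def total_loss(action_loss, frame_loss, video_loss, alpha=1.0, beta=1.0):
    """Weighted sum of the three loss terms described above."""
    return action_loss + alpha * frame_loss + beta * video_loss
```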

Classifier Training

The classification module of the architecture trains the classifier on the data. Its input is the extracted features of all frames in a video, and its output is a classification score for every frame and every action class. The module comprises three fully-connected layers that produce the classification scores. These scores are used to compute the frame-level loss and are temporally pooled to compute the video-level classification loss.
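A minimal sketch of such a classification branch is shown below; the hidden sizes, the number of classes and the simple mean pooling over time are our assumptions (the exact pooling strategy in the authors' code may differ).

```python
import torch
import torch.nn as nn

class ClassificationModule(nn.Module):
    """Sketch of the classification branch: three fully-connected layers
    mapping per-frame features to per-frame class scores (assumed sizes)."""
    def __init__(self, feature_dim=2048, hidden_dim=512, num_classes=20):
        super().__init__()
        # +1 output for the explicitly modelled background class.
        self.layers = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes + 1),
        )

    def forward(self, features):
        # features: (batch, T, feature_dim) -> frame_scores: (batch, T, C+1)
        frame_scores = self.layers(features)
        # Temporal pooling (a plain mean here, as an assumption) gives
        # video-level scores for the video-level classification loss.
        video_scores = frame_scores.mean(dim=1)
        return frame_scores, video_scores
```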

Methodology

To get a low-level understanding of the method used in this paper, we investigated the code in the authors' GitHub repository: https://github.com/Flowerfan/SF-Net. To make the code run, we installed the missing libraries along with the library versions the authors were using. We also downloaded the extracted features of all datasets from Google Drive: https://drive.google.com/drive/folders/1DfLDau7hqb-5huhB3W-3XljeuFu2YcF9. In addition, we modified the code to add logging support for the Weights & Biases library so that we could track training progress and hyper-parameters (see the sketch below).
We tried to use freely available resources such as Google Colab but could not succeed, so we used GCP instead to carry out the experiments and train the network.
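The sketch below shows the kind of logging hooks we added. The project name and function names are ours, not part of the authors' code; only standard wandb.init / wandb.log calls are used.

```python
import wandb

def init_wandb_logging(lr=1e-3, batch_size=32, alpha=1.0, beta=1.0, eta=5):
    """Start a W&B run tracking the hyper-parameters used in this reproduction."""
    return wandb.init(
        project="sf-net-reproduction",   # our project name, not the authors'
        config={"lr": lr, "batch_size": batch_size,
                "alpha": alpha, "beta": beta, "eta": eta},
    )

def log_training_step(step, losses, map_scores=None):
    """Log per-iteration losses and, optionally, evaluation mAP values."""
    payload = {"iteration": step, **losses}
    if map_scores is not None:
        payload.update({f"mAP@{iou}": value for iou, value in map_scores.items()})
    wandb.log(payload)
```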

Model Descriptions

The model presented in the paper is evaluated on different datasets, mainly on the following two tasks:
  1. Segment Localisation: identifying a segment (start time, end time) for each action instance.
  2. Single-frame Localisation: detecting one frame per action instance.
An I3D network trained on the Kinetics dataset is used to extract features from the videos. The I3D model takes 16 frames as input. For the RGB stream, frames are rescaled to 256 pixels and a standard 224 x 224 center crop is applied. For the flow stream, the TV-L1 optical flow algorithm is used. Finally, the predictions of both streams are fused to capture both the appearance and the motion in the videos.
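The sketch below shows one simple way such a late fusion of the two streams could be done; the equal weighting is our assumption, since the exact fusion rule is not restated here.

```python
import torch

def fuse_two_stream(rgb_scores: torch.Tensor, flow_scores: torch.Tensor,
                    rgb_weight: float = 0.5) -> torch.Tensor:
    """Late fusion of per-frame class scores from the RGB and optical-flow
    I3D streams via a weighted average (the weighting is an assumption).
    Both inputs are (T, num_classes) tensors for the same video."""
    return rgb_weight * rgb_scores + (1.0 - rgb_weight) * flow_scores
```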

Datasets

Experiments are performed on three datasets (THUMOS14, GTEA, BEOID) to demonstrate the efficacy of the method. Since these datasets were originally annotated for fully/weakly supervised settings, the authors re-annotated them using the single-frame supervision technique. The single-frame annotations of all three datasets are open-sourced in their GitHub repository for future experiments.
  1. THUMOS14: There are 1010 validation and 1574 test videos from 101 action categories in THUMOS14. Out of these, 20 categories have temporal annotations in 200 validation and 213 test videos. The single-frame annotations are available at https://github.com/Flowerfan/SF-Net/tree/master/data/Thumos14-Annotations/single_frames
  2. GTEA: This dataset contains 28 videos in total, covering 7 fine-grained types of daily activities in a kitchen. The single-frame annotations are available at https://github.com/Flowerfan/SF-Net/tree/master/data/GTEA-Annotations/single_frames
  3. BEOID: It contains 58 videos in total, with an average of 12 action instances per video. The single-frame annotations are available at https://github.com/Flowerfan/SF-Net/tree/master/data/BEOID-Annotations/single_frames

Hyper-Parameters

We used the same hyper-parameters mentioned in the GitHub repository to reproduce the results. The primary ones are: the learning rate is 0.001, the batch size is 32, and the Adam optimizer is used for all experiments. The values of α and β are set to 1, while η, used in the background frame mining step, is set to 5 for the experiments on THUMOS14. The number of training iterations for THUMOS14 is 2000.
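For reference, a minimal sketch of this training configuration is given below, assuming a generic PyTorch model object (SF-Net's own model class is not shown here).

```python
import torch

# Hyper-parameters taken from the authors' repository for THUMOS14.
CONFIG = {"lr": 1e-3, "batch_size": 32, "alpha": 1.0, "beta": 1.0,
          "eta": 5, "num_iterations": 2000}

def build_optimizer(model: torch.nn.Module, lr: float = CONFIG["lr"]):
    """Adam optimizer with the learning rate used for all experiments."""
    return torch.optim.Adam(model.parameters(), lr=lr)
```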

Experimental Set-Up

For all training experiments on the THUMOS14 dataset, we trained the models on an NVIDIA GeForce RTX 2080 Ti GPU. Training took 3-4 hours on this GPU with the hyper-parameters mentioned above. All experiments were conducted with the code publicly released by the authors at their GitHub repository: https://github.com/Flowerfan/SF-Net

Results

We reproduced SF-Net for the temporal action localisation task on the THUMOS14 dataset specifically. We used the pre-extracted I3D video features open-sourced by the authors to train SF-Net. The results are the following:

Reproduced Results

Dataset     Model    mAP@IoU=0.3    mAP@IoU=0.5    mAP@IoU=0.7
THUMOS14    SFBAE    52.67          29.38          10.51

Original Results (Source: GitHub Repository)

Dataset     Model    mAP@IoU=0.3    mAP@IoU=0.5    mAP@IoU=0.7
THUMOS14    SFBAE    53.04          29.82          10.87
As we can see, the reproduced results are very close to the results reported in the paper and in the GitHub repository. We report the training logs along with the hyper-parameters in the panel below.
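For readers unfamiliar with the metric, the sketch below shows the temporal IoU computation underlying the mAP@IoU thresholds in the tables above: a predicted segment counts as correct at threshold t if its IoU with a matching ground-truth segment is at least t.

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union between two (start, end) segments,
    e.g. in seconds; this is the overlap measure behind mAP@IoU."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: temporal_iou((1.0, 5.0), (3.0, 8.0)) == 2.0 / 7.0, roughly 0.29
```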

Discussion

Based on the results obtained, we have sufficient evidence to validate the claims made in the SF-Net paper. SF-Net significantly outperforms its weakly-supervised counterparts and is competitive with fully-supervised methods. One remaining obstacle is the annotation cost of the datasets: as existing action-recognition datasets are labeled for the fully-supervised setting, re-annotating them with single-frame supervision would be a tedious task. The authors themselves acknowledged that they do not plan to extend the method to the ActivityNet dataset because they lack the funding to annotate such a large action-recognition dataset. The need for single-frame annotations therefore remains a challenge for further experiments with this methodology.
Due to a lack of time, code and computational resources, we could not reproduce the ablation studies, so we cannot validate the corresponding claims in the paper.

What was easy

The paper is written in a clear and concise manner, which makes it easy to understand even for readers unfamiliar with the video analytics domain. The contributions are clearly described along with the experimental setup for the datasets and the training. Moreover, the open-sourced code proved extremely helpful while reproducing the results.

What was difficult

It was a little difficult to download the pre-extracted video features from the Google Drive link because one of the files is about 7 GB. The requirements file does not install a few libraries such as tensorboard, and we had to install them manually using pip. There is a discrepancy in the number of training iterations: the paper mentions 5000 iterations, but the GitHub repository trains for only 2000. The implementation code for the ablation studies is not available either, making it difficult to validate the corresponding claims.

Communication with the Authors

The authors replied to our queries regarding the above claims and pointed us to code references not provided in the paper. They also actively respond to open issues on GitHub. We are thankful to them for this.

Conclusion

With SF-Net, the authors have successfully created a single-frame annotation technique along with a network that reduces the annotation resources required while increasing accuracy for the temporal action localization task in video analysis. There are a few remaining challenges, such as the re-annotation of datasets and the action and background frame mining, but we believe this is a productive step towards reducing supervision in the annotation process. We encourage the authors to extend this idea to other datasets if possible.

Logs