
Military Vehicles Detection in Satellite & Aerial Imagery using Instance Segmentation

The objective of this project is to explore the potential of instance segmentation for military vehicles detection in satellite and aerial imagery using transfer learning. The methodology employs the Mask R-CNN network from Meta AI Research's next-generation library, Detectron2. This report aims to present the results of the experiments conducted, providing an in-depth examination of the approach taken and the underlying thought process.
Created on September 29|Last edited on August 2
Satellite imagery featuring Main Battle Tanks and BTRs - Map Data: Google, ©2021 or newer, Maxar Technologies






Abstract

In this report, a transfer learning approach for detecting military vehicles in satellite and aerial imagery is presented using the Detectron2 Mask R-CNN framework, a deep learning-based object detection algorithm. The aim of this project is to investigate the potential of instance segmentation models and assess their performance for military vehicles detection by analyzing both qualitative and quantitative results, with the ultimate goal of identifying the best detection model and enhancing its precision and efficiency. The models were trained on a dataset of 200 satellite and aerial images in RGB format containing 10,624 annotated Russian military vehicles, and were then tested on a separate set of unseen images to evaluate their performance. The results presented show that the proposed method is able to achieve good performance and efficiency in detecting military vehicles, even in complex and cluttered environments.


Introduction

Mask R-CNN (Region-based Convolutional Neural Network) is a deep learning-based object detection model introduced by researchers at Facebook AI Research (FAIR) in 2017. It is a convolutional neural network (CNN) that extends the Faster R-CNN model, a popular object detection algorithm, by adding a "mask" branch to the network architecture, enabling the prediction of segmentation masks for each detected object. The model is trained on a large dataset of annotated images in which the objects of interest are accurately labeled and segmented. During training, the model learns to predict the locations, classes, and shapes of objects in new images by generating bounding boxes and segmentation masks; this learning process is achieved by optimizing the model parameters using the annotated training data. At inference time, the trained Mask R-CNN model can be utilized for object detection in new images, classifying each detected object and generating a segmentation mask that outlines its precise boundaries. This approach has demonstrated state-of-the-art results on a range of object detection and image segmentation benchmarks and is widely used in applications such as computer vision, robotics, and autonomous vehicles.
The task of detecting military vehicles in satellite and aerial images poses significant challenges due to factors such as high resolution, large image size, and the presence of clutter and occlusions. In recent years, there has been an increased demand for efficient and accurate methods for military vehicles detection, as these vehicles play a central role in defense and security operations, and their detection and tracking can provide valuable information for decision-making. Traditional methods, such as manual inspection and rule-based algorithms, are time-consuming and may not be robust enough to address these challenges.
The contribution of this work is the application of the Mask R-CNN algorithm using the Detectron2 framework for enhanced military vehicles detection performance and efficiency. The dataset used in this project was manually collected from satellite and aerial images sourced from Google Earth. Furthermore, the dataset was supplemented with high-resolution imagery, analytics, and geospatial data obtained from Maxar Technologies through publicly available sources on the internet. The locations of Russian military bases across the globe and their capabilities were identified using The Georgian Foundation for Strategic and International Studies map. The images were meticulously annotated using the LabelMe tool, a web-based image annotation tool developed by the MIT Computer Science and Artificial Intelligence Laboratory, to label the objects with masks and attributes. It should be noted that this research project focuses solely on Russian military vehicles due to their prevalence and visibility on Google Earth. This choice was made only to facilitate data collection and does not reflect any political or ideological bias. The project is intended solely for academic and research purposes.


Feasibility of Learning

For the success of this research, it is fundamental to establish clear project boundaries and have a comprehensive understanding of the challenges and limitations in the learning process. Conducting a feasibility analysis of the learning process helps determine whether the project is viable and identify potential obstacles or setbacks that may impede progress. In the context of military vehicles detection in satellite and aerial imagery using instance segmentation, several key factors influence the feasibility of learning. One major obstacle is the limited availability of labeled data for this specific task, which can hinder the development of a high-performing model. While publicly available datasets exist, they may not be sufficient in size, diversity, or quality to train a model capable of meeting the performance requirements for the task. Transfer learning can enhance model performance even with limited labeled data by leveraging pre-existing knowledge from a model pre-trained on a related task. In addition to the data limitation, the complexity of the task is another challenge. The variability in the appearance of military vehicles, as well as the presence of occlusions and camouflage, can make it difficult for a model to accurately detect and classify them. Factors such as lighting and weather conditions can also impact image quality and add to the complexity of the task. Finally, the computational resources required for training and deploying a model can be significant. Deep learning techniques and large neural network architectures can require considerable computational power and memory, which should be carefully considered. To minimize costs, free cloud-based training platforms like Google Colab will be used, but it is indispensable to ensure that the resources are sufficient to achieve the desired level of performance.


Proposed Method

The focus of this project lies in a transfer learning based approach for the detection of military vehicles in satellite and aerial imagery. Given the limited availability of annotated data for this specific task, transfer learning provides a viable solution for leveraging pre-existing knowledge to enhance model performance. The methodology employs the Mask R-CNN algorithm, integrated within the Detectron2 framework. The process begins with gathering and annotating a dataset of images containing military vehicles. The Mask R-CNN model, composed of a feature extractor, region proposal network, and two branches for object classification and mask prediction, is then trained on the annotated images using pre-trained weights from the COCO dataset. The fine-tuned model will then be capable of detecting and classifying military vehicles in new images, providing output in the form of bounding boxes, object masks, and classification scores.
The models in this transfer learning approach will be trained on the cloud-based platform Google Colab, which provides a Tesla T4 GPU. Performance will be evaluated using various quantitative measures, such as Average Precision (AP) for each class and at varying Intersection over Union (IoU) thresholds (AP50 and AP75), following the COCO detection evaluation metrics. In addition, weighted Average Precision (wAP) and an efficiency analysis will be included, along with visual outputs of the model predictions on diverse imagery, to gain a more comprehensive understanding of the models' performance.
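To make the evaluation protocol concrete, the snippet below is a minimal sketch (not the project's actual code) of how COCO-style AP, AP50, and AP75 values can be obtained with Detectron2's evaluation utilities. The dataset name "milvehicles_val", the checkpoint path, and the output directory are placeholders assumed for illustration.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import build_detection_test_loader
from detectron2.engine import DefaultPredictor
from detectron2.evaluation import COCOEvaluator, inference_on_dataset

# Illustrative config: the fine-tuned checkpoint path and the registered
# validation split name ("milvehicles_val") are placeholders.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = "output/model_final.pth"
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3
cfg.DATASETS.TEST = ("milvehicles_val",)

predictor = DefaultPredictor(cfg)
evaluator = COCOEvaluator("milvehicles_val", output_dir="./eval_output")
val_loader = build_detection_test_loader(cfg, "milvehicles_val")

# Runs inference over the whole split and returns COCO metrics for boxes and
# masks: overall AP plus AP50/AP75 and per-class AP, from which a weighted
# AP (wAP) can also be derived.
results = inference_on_dataset(predictor.model, val_loader, evaluator)
print(results["segm"]["AP"], results["segm"]["AP50"], results["segm"]["AP75"])
```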

Architecture of Detectron2


Figure 1: Architecture of Detectron2. Image sourced from the paper 'A Means of Assessing Deep Learning-Based Detection of ICOS Protein Expression in Colon Cancer' (2021). Licensed under CC BY 4.0.
In the Mask R-CNN model, the backbone is the part of the model that is responsible for extracting features from the input image. The backbone is typically a convolutional neural network (CNN) that is trained to recognize patterns in images. There are many different types of CNNs that can be used as the backbone; some popular choices include VGG, ResNet, and ResNeXt. It is possible to use a pre-trained CNN as the backbone and fine-tune it later on. The choice of backbone for a Mask R-CNN model can affect the performance of the model. Different backbones have different architectures and are trained on different datasets, so they may have different strengths and weaknesses. Experimenting with different backbones can help find the one that works best for the task. Every model used in this study is specified by a configuration name such as "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml". This string is the name of a configuration file for a Mask R-CNN model; the configuration file specifies the hyperparameters and other settings for the model. For Faster/Mask R-CNN, the provided baselines are based on three different backbone combinations.
C4, DC5, and FPN are the three backbone combinations used by the Faster/Mask R-CNN baselines in Detectron2:
  • C4: uses a ResNet conv4 backbone with a conv5 head. This is the original baseline configuration from the Faster R-CNN paper.

  • DC5 (Dilated-C5): uses a ResNet conv5 backbone with dilations in conv5, and standard conv and FC heads for mask and box prediction, respectively. This configuration is used by the Deformable ConvNet paper.

  • FPN: stands for "Feature Pyramid Network". It uses a ResNet+FPN backbone with standard conv and FC heads for mask and box prediction, respectively. It obtains the best speed/performance tradeoff, but the other two are still useful for research. It was introduced in the Feature Pyramid Networks for Object Detection paper.

On the other hand, X-101-32x8d, R-50, and R-101 are architectures for the backbone of the model in Detectron2. The X-101-32x8d backbone is based on the ResNeXt-101 architecture with 32 groups and 8 channels per group, while the R-50 and R-101 backbones are based on the ResNet-50 and ResNet-101 architectures, respectively. Each of these backbones can have different effects on the performance of the model. Finally, 1x/3x refers to the length of the training schedule used for the pre-trained COCO weights: a 3x model was trained for roughly three times as many iterations as a 1x model, with the learning rate decayed accordingly, which generally yields stronger pre-trained weights.
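As a concrete illustration, the short sketch below shows how such a configuration name can be resolved to its config file and COCO pre-trained weights through Detectron2's model zoo API. This is a minimal example rather than the exact code used in the project.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

# Any of the baselines discussed above can be selected by name, e.g. the
# R_50_FPN_3x, R_101_FPN_3x, or X_101_32x8d_FPN_3x instance segmentation models.
config_name = "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"

cfg = get_cfg()
# Pull in the architecture/backbone settings (ResNet-50 + FPN, 3x schedule).
cfg.merge_from_file(model_zoo.get_config_file(config_name))
# Start from the COCO pre-trained weights for transfer learning.
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config_name)

print(cfg.MODEL.BACKBONE.NAME)   # "build_resnet_fpn_backbone" for FPN configs
print(cfg.MODEL.RESNETS.DEPTH)   # 50 for R_50, 101 for R_101
```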


Data Acquisition, Image Processing & Target Features

The first step was to acquire a dataset of satellite and aerial imagery that contains labeled instances of military vehicles. The dataset should be diverse, including a variety of different types of vehicles, and should represent the range of variations and conditions that the model may encounter in real-world scenarios. The majority of the military vehicles that can be seen in satellite and aerial images available from open-source platforms like Google Earth belong to the Russian military. This can be attributed to the country's strong military presence in various regions across the world and its history of military operations. In addition, with advances in satellite and aerial imaging technology, these images provide a comprehensive overview of the locations and capabilities of Russian military bases. This, in turn, makes it possible to detect and identify the various types of Russian military vehicles, and as a result, the detection model being developed in this project is focused on detecting those vehicles.
The Georgian Foundation for Strategic and International Studies (GFSIS) assisted by providing a map of Russian military forces, which gives an overview of the locations and capabilities of Russian military bases in the region. The dataset used in this project was manually collected from aerial and satellite images of military vehicles, sourced from Google Earth. Google Earth imagery is often obtained through a number of commercial providers, including Maxar Technologies, Airbus, TerraMetrics, DigitalGlobe, and the United States Geological Survey (USGS). To supplement the dataset, high-resolution imagery, analytics, and geospatial data were obtained from Maxar Technologies via public sources on the internet.
After acquiring the images, the next step was to pre-process them. The images were cropped to a fixed dimension (1448x850) used as input to Mask R-CNN in the Detectron2 framework. Cropping also made it possible to remove the Maxar Technologies watermark logos, which could otherwise introduce bias and confuse the models. This pre-processing step allowed for better image quality and improved the models' ability to detect and classify military vehicles. To ensure visual clarity, emphasis, and consistency in the results, grayscale imagery outputs were produced with the Detectron2 Visualizer. This approach was chosen to better highlight and differentiate the features of interest in the imagery and was the technique selected for enhancing the quality of results in the context of this study.
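For illustration, a minimal sketch of this kind of fixed-size cropping is shown below using Pillow. The folder names and crop offsets are hypothetical, chosen so that the retained window excludes the watermark region.

```python
from pathlib import Path
from PIL import Image

CROP_WIDTH, CROP_HEIGHT = 1448, 850   # crop size used in this project

def crop_capture(src: Path, dst: Path, left: int = 0, top: int = 0) -> None:
    """Crop one raw capture to 1448x850, keeping a window chosen so that
    watermark overlays fall outside the retained region."""
    with Image.open(src) as img:
        box = (left, top, left + CROP_WIDTH, top + CROP_HEIGHT)
        img.crop(box).convert("RGB").save(dst)

# Illustrative batch usage over hypothetical folder names.
out_dir = Path("processed")
out_dir.mkdir(exist_ok=True)
for path in sorted(Path("raw_captures").glob("*.png")):
    crop_capture(path, out_dir / path.name)
```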
Due to the limited availability of data, data augmentation techniques such as flipping, rotating, and mirroring images were employed to increase the dataset's diversity and improve the model's generalization ability. Annotation of the dataset was done using LabelMe, a lightweight and user-friendly annotation tool. The dataset consists of 260 images in RGB format, divided into three subsets. 200 images were used for training, with 10,624 annotated targets for the Ural/KamAZ, T-72, and BTR-80 classes, accounting for 5691, 2851, and 2082 targets, respectively. To improve the representation of military vehicle features in the training dataset and enhance generalization, 21 Maxar Technologies images containing smaller-sized military vehicles were sourced from public sources on the internet. These images were carefully selected based on their relevance and quality and were augmented to increase dataset diversity, resulting in a total of 41 such images within the training subset. Additionally, 12 of the training images depicted snow-covered environments, accounting for a minority of the dataset. The validation subset included 41 images, three of which depicted snow-covered environments, while the remaining 19 images formed the test set, two of which depicted snow-covered environments. Snow-covered environments can significantly impact the appearance of military vehicles and make them harder to detect, especially for models trained on data from other environments.
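The report does not specify whether augmentation was applied offline or on the fly; as an illustration, the sketch below shows how the flips and rotations described above could be applied on the fly with Detectron2's transform API. The probabilities and the set of rotation angles are assumptions.

```python
import detectron2.data.transforms as T
from detectron2.data import DatasetMapper, build_detection_train_loader
from detectron2.engine import DefaultTrainer

# Flip/mirror/rotation augmentations consistent with the text above; the
# probabilities and angle set are illustrative assumptions.
TRAIN_AUGMENTATIONS = [
    T.RandomFlip(prob=0.5, horizontal=True, vertical=False),   # left/right mirror
    T.RandomFlip(prob=0.5, horizontal=False, vertical=True),   # up/down mirror
    T.RandomRotation(angle=[0, 90, 180, 270], sample_style="choice"),
]

class AugmentedTrainer(DefaultTrainer):
    """DefaultTrainer variant whose data loader applies the augmentations
    to both the images and their polygon annotations."""

    @classmethod
    def build_train_loader(cls, cfg):
        mapper = DatasetMapper(cfg, is_train=True, augmentations=TRAIN_AUGMENTATIONS)
        return build_detection_train_loader(cfg, mapper=mapper)
```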

Important Note : Data Usage & Copyright Clarification
It is critical to recognize that the data obtained from diverse sources may be subject to copyright restrictions and disparate usage policies. As a result, I have taken measures to ensure compliance with relevant copyright laws and ethical considerations, and therefore the dataset used in this study will not be made publicly available. I recognize the significance of respecting intellectual property rights and adhering to data usage restrictions, and have documented the data collection and pre-processing methodologies comprehensively to enable transparency and facilitate reproducibility to the fullest extent possible.


Military Vehicles

Studying the blueprints, features, and video footage of each military vehicle was decisive for creating a more accurate ground truth for the dataset. This information was used as a reference while labelling the images, which helps ensure that the labels are accurate and consistent. Acquiring detailed information about the shape, size, and features of the military vehicles facilitates labelling and minimises errors.
The detection of the following three military vehicle targets was the focus of this study:
  • Ural/KamAZ (Trucks): Ural trucks are manufactured by the Ural Automotive Plant, a Russian company, and are renowned for their rugged design and versatility in various industries, including mining, construction, and forestry. Similarly, KamAZ trucks, manufactured by Kamaz, another Russian company, are known for their durability and are widely used in industries such as construction, transportation, and logistics.

  • T-72s (Main Battle Tank): The T-72 is a main battle tank that was developed in the Soviet Union in the 1970s. It is a heavily armored vehicle with a powerful gun and is designed for use on the battlefield. The tank has undergone several upgrades and modernizations over the years to keep up with changing technologies and battlefield needs. For example, the T-72B3 variant features improved armor, a new gun and fire control system, and advanced sensors and communications equipment.

  • BTR-80 (APC): The BTR (from the Russian bronetransporter, meaning "armoured transporter") is a series of armored personnel carriers that were developed in the Soviet Union. These vehicles are designed to transport troops and equipment, and they often have mounted weapons for defensive purposes. They are not as heavily armored or as heavily armed as tanks, and they are not intended for use in direct combat.


Challenges & Limitations

Acquiring Data, Annotations & Computational Resources

Acquiring a dataset comprising military vehicles contained in satellite and aerial images presented a substantial challenge. The Georgian Foundation for Strategic and International Studies (GFSIS) provided valuable information for identifying military facilities; nonetheless, the availability of a diverse array of military vehicles at these locations was limited. This necessitated a comprehensive search across multiple Russian bases globally to obtain a more robust and diverse dataset. The images were required to meet a minimum quality threshold, and each image was manually annotated with polygons as object masks. Annotating 260 diverse images containing a total of 10,624 military vehicles with polygons was a time-consuming and labor-intensive task that required a significant amount of manual work and attention to detail. Handling occlusions caused by other objects or terrain also made it difficult for the model to accurately detect and classify military vehicles, and more data is needed for the model to learn from such occlusions.
In the process of annotations, it is crucial to accurately label the right vehicles in the images. The identification of military vehicles in satellite and aerial imagery is a challenging task, and despite best efforts, misclassifications may occur. Specifically, while annotations are made based on the perceived resemblance and features of the vehicles, there remains the possibility that other military vehicles share similar characteristics and may not be correctly identified. As such, the presence of military vehicles in the imagery is not limited to the specific classes under consideration in this study.
To clarify, the Ural/KamAZ class of vehicles in this study refers to heavy-duty trucks, but it's not limited to only these two brands. Distinguishing between different brands can be challenging, and there may be other heavy-duty trucks with similar capabilities and purposes, such as military artillery, like the BM-27 Ouragan which is based on a ZIL-135 8x8 chassis. In addition, identifying specific models of tanks and armored vehicles from satellite and aerial imagery can be challenging due to their similar visual characteristics. As a result, it can be difficult to accurately differentiate between different models of tanks, such as the T-72, T-80s, and T-90s, which share similar visual features. Similarly, the BTR-80 class of vehicles includes several variations like the BTR-80 and BTR-82, which can also be difficult to differentiate based solely on their visual characteristics. In this study, the targets were assumed based on the most popular models of each vehicle used by the Russian military.
This highlights the limitations of relying solely on satellite and aerial imagery for military vehicles detection and underscores the importance of utilizing additional sources of information and expertise to validate the results. To overcome this, thorough research into the blueprints, features, and video footage of each military vehicle made the annotations more accurate. This level of research ensured that the lack of prior knowledge in this area was overcome, making it possible to correctly label the vehicles. It is vital to acknowledge that the quality of the ground truth has a direct impact on the precision of the model, and having adequate expertise in the specific subject matter is key for producing high-quality datasets.
Training the Mask R-CNN models on the dataset of annotated images was computationally expensive and required a high-performance computing environment. Google Colab is a cloud-based platform for machine learning that allows users to train and test models on powerful GPUs and TPUs, free of charge but with usage limits. In this project, only the free version of Google Colab was used to train the Mask R-CNN models.

The Variety of the Dataset

The dataset features vehicles of varying sizes and images with varying degrees of orientation. It is generally not necessary for the size of vehicles to be fixed in order to detect them in satellite images, although the measurements of the vehicles can impact the performance of the detection model. Nevertheless, smaller vehicles can be more challenging to detect due to their size, making it essential to train the model on a variety of sizes and scales to improve its ability to detect and classify different types of military vehicles. In general, the proportions of objects in satellite images can vary due to a variety of factors, including the altitude of the satellite, the resolution of the camera, and the distance of the objects from the satellite. This means that the size of objects in satellite images can vary significantly, and the dimensions of vehicles can change depending on the specific circumstances.
The orientation at which images are captured (such as off-nadir or nadir) also has a significant impact on performance. Off-nadir images, in particular, can pose challenges for instance segmentation algorithms due to the oblique angle of capture, resulting in distortions and variations in scale that can affect detection performance. These variations in orientation can cause changes in the scale, aspect ratio, and viewpoint of the objects in the image, making it challenging for the algorithms to accurately detect and segment military vehicles. Moreover, the angle of the sun and shadows cast by the objects in the images can impact the visibility of the targets, making them harder to detect. These factors underscore the importance of carefully considering the orientation of images when developing instance segmentation models for military vehicles detection in satellite and aerial imagery, as it can have an impact on the precision and robustness of the models.

Resizing vs Cropping

Cropping and resizing are two techniques commonly used in computer vision to manipulate images, each having its own advantages and disadvantages for different purposes. At first, the resizing method was used, but it was eventually replaced with cropping due to its drawbacks.
Resizing involves changing the dimensions of an image by scaling it up or down while preserving the same aspect ratio. This technique can be useful in image classification, where the size of the input image is not a concern and the model is designed to handle a variety of image sizes. Still, it can be problematic in instance segmentation for this task. Loss of resolution, distortion of aspect ratio, and loss of context can all negatively impact the model's performance, as these issues can make it difficult for the model to detect small details, accurately classify vehicles, and understand spatial relationships between objects.
Cropping, on the other hand, is better suited to instance segmentation for this task for several reasons. By focusing on the region of interest in the image, it allows for a more object-centric approach and improved resolution, making it easier for the model to learn about the object itself. Cropping reduces the size of the image, making it easier to handle large image sizes and occlusions. It also helps remove the watermark logos, which could otherwise introduce bias and confuse the models. Furthermore, it improves efficiency by reducing the number of pixels that need to be processed, and it preserves the aspect ratio and context of the image, allowing the model to accurately segment individual vehicles and understand spatial relationships between objects.
Cropping is often preferred when the dimensions and position of the objects in the image are critical, while resizing is often preferred when the size of the input image is not vital and the model is designed to handle a wide range of image sizes.

Lack of sufficient data - 2S19 Msta

The Msta (2S19 Msta) is a self-propelled artillery vehicle that was developed in the Soviet Union in the 1980s. It is equipped with a 152mm howitzer, a type of artillery gun that fires large-caliber shells at long range. The Msta is designed to provide indirect fire support to ground forces, and it is not intended for use in direct combat.
A common challenge in machine learning is the lack of sufficient labeled data to train models. One of the classes of military vehicles in the dataset was too rare and had too few examples, so the Msta category was removed to ensure that the model had enough data to learn from for the remaining classes. When a class has very few examples in the dataset, it can be difficult for the model to learn from them and generalize to new examples of that class. Additionally, having rare classes in the dataset can increase the complexity of the model, which can lead to overfitting and poor performance on new data. Removing this vehicle helped to reduce the complexity of the model and improve its generalization. Moreover, the Msta and the T-72 have similar features in the context of satellite and aerial imagery, which is a challenge for the model. By removing the Msta category, the focus was placed on the more common classes that have more examples in the dataset, which helps the model to achieve better performance, since it is able to learn from more examples of the remaining classes and thereby generalize better. It is essential to consider the balance of the dataset, as adding classes with very few examples can lead to a more imbalanced dataset, which can have a negative impact on performance.


Hyperparameters

The following are the main training hyperparameters used, excluding the defaults set by Detectron2 (a configuration sketch follows the list):
  • cfg.SOLVER.IMS_PER_BATCH = 2
  • cfg.SOLVER.BASE_LR = 0.00025
  • cfg.SOLVER.MAX_ITER = 2000
  • cfg.SOLVER.STEPS = [] (Learning rate will not be decayed during training)
  • cfg.DATALOADER.NUM_WORKERS = 2
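For reference, a minimal end-to-end configuration sketch consistent with these settings is shown below. The dataset names, annotation paths, and the choice of the R_50_FPN_3x baseline are placeholders, and the LabelMe annotations are assumed to have already been converted to COCO-format JSON.

```python
import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Hypothetical dataset names and paths; annotations are assumed to have been
# exported from LabelMe into COCO-format JSON files beforehand.
register_coco_instances("milvehicles_train", {}, "train/annotations.json", "train/images")
register_coco_instances("milvehicles_val", {}, "val/annotations.json", "val/images")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")

cfg.DATASETS.TRAIN = ("milvehicles_train",)
cfg.DATASETS.TEST = ("milvehicles_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3   # Ural/KamAZ, T-72, BTR-80

# Hyperparameters listed above (everything else keeps Detectron2's defaults).
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 2000            # increased to 4000 in the later exploratory run
cfg.SOLVER.STEPS = []                 # no learning-rate decay during training
cfg.DATALOADER.NUM_WORKERS = 2

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```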
The current batch size of 2 (IMS_PER_BATCH = 2) may cause slow training for the given hardware resources due to the increased number of iterations required to process the entire dataset. To speed up training, increasing the batch size to 8 or 16 can make better use of GPU resources and reduce data loading time. Nevertheless, it's important to consider hardware limitations, such as available memory, while selecting a batch size that strikes a balance between speed and computational resources.
The initial training parameters of BASE_LR value 0.00025 and MAX_ITER of 2000 are sensible choices but may require adjustments based on how quickly the model can reduce the loss function. The learning rate affects the size of weight updates, and the number of iterations impacts how many times the model sees the data. It's essential to tune these hyperparameters carefully to achieve optimal performance.
By setting STEPS = [], the learning rate will not decay during training, making it a suitable option during the initial training phase with a relatively small dataset of 200 images and 10,624 annotated military vehicles. A fixed learning rate can help the model converge more quickly to a local optimum, but a learning rate that decays over time can help the model continue to improve and explore the loss landscape more thoroughly, which may lead to better performance in the long run. However, given the small size of the dataset, it might not be necessary to decay the learning rate to improve the model's generalization ability.
While increasing the number of data loader workers (NUM_WORKERS = 2) could speed up data loading if data loading is currently the bottleneck, its effect on training times may not be significant. It's vital to closely monitor and optimize hardware resources to ensure efficient use of computing power during training. By carefully adjusting hyperparameters and monitoring performance, it's possible to achieve high-quality results even with a relatively small dataset.


Feasibility Test

In this context, conducting a feasibility test prior to the main experiments is foundational in determining the viability of a proposed method. The objective of the feasibility test is to assess the potential for success of a given method and determine whether further development and implementation are justified. In this phase, the chosen model is subjected to a preliminary evaluation in a controlled environment, allowing for the identification of potential limitations, weaknesses, and areas for improvement. This information is critical in refining and optimizing the method before more extensive experimentation, and ultimately, in ensuring the success of the overall project. Qualitative results are sufficient for the purpose of a feasibility test in military vehicles detection in satellite and aerial imagery using instance segmentation. This can be especially relevant when the focus is on evaluating the general concept or determining the ability of the method to produce desirable outputs. In such instances, presenting qualitative results can provide valuable insight into the potential of the method, while avoiding the cost and complexity of obtaining more detailed quantitative results.
The qualitative evaluation was performed on the R_50_FPN_3x model to assess the feasibility of detecting military vehicles in satellite and aerial imagery. The evaluation was based on a validation set, and the results were analyzed using two different confidence thresholds, 50% and 80%, which are commonly used to evaluate the performance of object detection models. In the context of military vehicles detection, accurate localization is especially important since military vehicles can be small and difficult to detect. Setting a high IoU threshold ensures that the model is only credited with detections that closely match the ground truth, thus reducing false positives. If a model has a high precision rate but fails to reach the minimum IoU threshold, it may indicate that the model is detecting objects but not accurately localizing them within the image. By using the IoU metric with an appropriate threshold, the model's performance can be accurately evaluated and improved, leading to a more effective military vehicles detection system.

Qualitative results

The qualitative performance of the model in detecting and classifying military vehicles in different scenarios is demonstrated through visual outputs of its predictions compared to the ground truth on various images. The model's precision is evaluated at 50% and 80% thresholds only for the feasibility test.
The color-coding scheme for each class for the ground truth images annotated by LabelMe is as follows: T-72 is represented by the color Red, BTR-80 by Green, and Ural/KamAZ by Blue.
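For reference, the sketch below illustrates how such thresholded, grayscale prediction overlays can be produced with Detectron2's DefaultPredictor and Visualizer. The config, weights path, dataset name, and image file are placeholders; setting the score threshold to 0.5 or 0.8 corresponds to the 50% and 80% cut-offs used here.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor
from detectron2.utils.visualizer import ColorMode, Visualizer

# Placeholder config/weights/dataset/image names.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = "output/model_final.pth"
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # 0.5 or 0.8 for the thresholds used here

predictor = DefaultPredictor(cfg)
image = cv2.imread("example_scene.png")        # BGR image, placeholder file name
outputs = predictor(image)

# IMAGE_BW renders the background in grayscale so the colored instance masks
# stand out, matching the visual style described in the pre-processing section.
viz = Visualizer(image[:, :, ::-1],
                 metadata=MetadataCatalog.get("milvehicles_val"),
                 instance_mode=ColorMode.IMAGE_BW)
result = viz.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2.imwrite("prediction_overlay.png", result.get_image()[:, :, ::-1])
```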

Image 1

Image 1: Ground truth - Map Data: © Google, Maxar Technologies, 2021 or newer

Image 1: Threshold 50% - Map Data: © Google, Maxar Technologies, 2021 or newer

Image 1: Threshold 80% - Map Data: © Google Earth, Maxar Technologies, 2021 or newer


Image 2

Image 2: Ground Truth - Map Data: Google, ©2021 or newer, Maxar Technologies

Image 2: Threshold 50% - Map Data: Google, ©2021 or newer, Maxar Technologies

Image 2: Threshold 80% - Map Data: Google, ©2021 or newer, Maxar Technologies


Image 3

Image 3: Ground Truth - Map Data: © Google, Maxar Technologies, 2021 or newer

Image 3: Predictions exceeding 80% - Map Data: © Google, Maxar Technologies, 2021 or newer


Image 4

Image 4: Ground Truth - Map Data: © Google, Maxar Technologies, 2019 or newer

Image 4: Predictions exceeding 80% - Map Data: © Google, Maxar Technologies, 2019 or newer


Image 5

Image 5: Ground Truth - Map Data: Google, © Maxar Technologies

Image 5: No military vehicles were detected with a confidence score above 50% - Map Data: Google, © Maxar Technologies
The results of the experiment demonstrate that the proposed method can effectively detect military vehicles under diverse conditions, including varying lighting and weather. Nonetheless, the method exhibits some limitations when it comes to detecting military vehicles in snow-covered environments, as the vehicles are not detected even at the 50% threshold (Image 5). This limitation may stem from the inadequacy of the training data in capturing the unique features and patterns that characterize snowy conditions. Consequently, the model might not have been able to identify military vehicles effectively in such environments. Potential future directions could address this limitation by collecting more data from snowy environments or augmenting the existing data to improve the model's ability to detect military vehicles in challenging conditions.
Despite these limitations, the qualitative results demonstrate the feasibility of the proposed method for detecting military vehicles in satellite and aerial imagery. The results were evaluated qualitatively, and promising detections were obtained (as shown in Image 1, 2, 3, and 4). These detections demonstrate the potential of the proposed method for detecting military vehicles in satellite and aerial imagery and provide a foundation for future experiments in this project.


Model Selection Experiment

The objective of this experiment was to evaluate the effectiveness of ten instance segmentation models available in the Detectron2 framework for detecting military vehicles in satellite and aerial imagery. The experiment used a training set of 200 images and a validation set of 41 images with identical hyperparameters to the feasibility test. To determine the best performing model, the evaluation was based on standard metrics for instance segmentation, namely Average Precision (AP) for each class and at varying Intersection over Union (IoU) thresholds (AP50 and AP75), together with efficiency metrics to provide a more thorough analysis.
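The report does not list the ten configuration names explicitly; the sketch below is an illustrative way such a sweep over Detectron2 model zoo baselines could be set up, using three of the configurations discussed earlier as examples and reusing the shared hyperparameters.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

# Illustrative subset of the COCO-InstanceSegmentation baselines compared;
# the full study evaluated ten models under identical settings.
CANDIDATE_CONFIGS = [
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml",
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml",
    "COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml",
]

def make_cfg(config_name: str):
    """Build a config for one candidate model with the shared hyperparameters."""
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(config_name))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config_name)
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3
    cfg.SOLVER.IMS_PER_BATCH = 2
    cfg.SOLVER.BASE_LR = 0.00025
    cfg.SOLVER.MAX_ITER = 2000
    cfg.SOLVER.STEPS = []
    cfg.OUTPUT_DIR = "./output/" + config_name.split("/")[-1].replace(".yaml", "")
    return cfg

# Each candidate would then be trained and evaluated exactly as in the
# configuration and evaluation sketches above, and its metrics logged to W&B.
configs = [make_cfg(name) for name in CANDIDATE_CONFIGS]
```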

Quantitative results



The R_101_FPN_3x model, which utilizes a ResNet-101 backbone and a Feature Pyramid Network (FPN) for object detection, demonstrated the highest precision among all the evaluated models, achieving the highest Average Precision (AP) of 32.58%. The model also demonstrated superior performance in terms of AP at different Intersection over Union (IoU) thresholds, achieving top scores of 56.93% and 38.11% at IoU of 50% and 75%, respectively, further indicating its strong capability for the given task. The second-best performing model, X_101_32x8d_FPN_3x, achieved an AP of 30.92%, with AP at IoU of 50% and 75% of 55.29% and 34.59%, respectively.



When choosing a model for military vehicles detection in satellite and aerial imagery, it's essential to consider not only overall performance but also performance on specific classes. When comparing the R_101_FPN_3x and X_101_32x8d_FPN_3x models, it is clear that both models have good overall performance, but they perform differently on specific classes. Specifically, the X_101_32x8d_FPN_3x model achieved higher Average Precision scores for the Ural/KamAZ and BTR-80 classes, indicating that it is particularly well-suited for detecting these types of military vehicles. On the other hand, the R_101_FPN_3x model performed significantly better for the T-72, suggesting that it is the better choice for detecting this class. These results could be due to the specific characteristics of each class, as well as the composition and diversity of the dataset.

Efficiency is a crucial factor to consider when selecting a model, as it determines how quickly the model can process data and how much experimentation can be done within a given time frame.
For clarity, here is the list of metrics tracked:
  • GPU Time Spent Accessing Memory (%): This metric can be a useful indicator of the efficiency of the model's memory access, which can be valuable for performance on large datasets or on hardware with limited memory resources.
  • GPU Power Usage (% & W): This metric can be useful for monitoring the power consumption of the model during training, which can help optimize for energy efficiency.
  • GPU Temp (℃): This metric can be useful for monitoring the temperature of the GPU during training, which can help avoid overheating and optimize for stability.
  • System Memory Utilization (%): This metric can be useful for monitoring the memory usage of the system during training, which can help avoid running out of memory and optimize for stability.
  • Process CPU Utilization (%): This metric can be useful for monitoring the CPU usage of the training process and identifying potential CPU bottlenecks that may affect performance.
  • System CPU (0 & 1) Utilization (%): This metric can be useful for monitoring the overall CPU usage of the system during training and identifying potential performance bottlenecks.
  • Disk Utilization (%): This metric can be useful for monitoring the disk I/O of the training process and identifying potential I/O bottlenecks that may affect performance.
  • System Memory Usage (%): This metric can be useful for monitoring the memory usage of the training process specifically and identifying potential memory bottlenecks that may affect performance.

[Efficiency panels: all 10 models, and the 2 best models]

The graphs above show that the R_101_FPN_3x model is more efficient than the X_101_32x8d_FPN_3x model for instance segmentation, with shorter training time and similar or better efficiency results. This efficiency advantage can result in reduced training costs, faster model development and deployment, and accelerated experimentation and iteration. Therefore, the R_101_FPN_3x model is a promising choice for this task, taking into account both performance as seen in the previous analysis and efficiency considerations.

Findings of Experiment

The R_101_FPN_3x model was identified as the optimal choice for detecting military vehicles in satellite imagery based on its superior performance and efficiency. Compared to the X_101_32x8d_FPN_3x model, the R_101_FPN_3x model demonstrated better overall quantitative scores and was significantly more efficient in terms of training time, resulting in a cost-effective and faster solution. While the X_101_32x8d_FPN_3x model exhibited better performance for the BTR-80 and Ural/KamAZ classes, it lagged behind in detecting T-72 instances, showing a 14% lower detection rate than the R_101_FPN_3x model. The R_101_FPN_3x model's superior efficiency, demonstrated by its shorter training time, makes it a more suitable candidate for experimentation and tuning to address the weaknesses in the detection rates of the BTR-80 and Ural/KamAZ classes. Compared to the X_101_32x8d_FPN_3x model, which required three times more training time, the R_101_FPN_3x model allows for more iterations and adjustments to be made within a shorter period. This highlights the advantage of the FPN backbone combination in achieving the best speed/performance tradeoff.
Improving the detection rates for the BTR-80 and Ural/KamAZ classes to the level achieved for the T-72 is required, as they are commonly used by military forces and therefore have significant operational importance. By utilizing the R_101_FPN_3x model's efficiency and its ability to provide accurate detections, further optimization of these classes' performance can be achieved through tuning of hyperparameters. This has the potential to improve the overall performance of the model and increase its utility in real-world deployment scenarios.
While the feasibility test yielded promising qualitative results, the quantitative performance metrics in this experiment fell below the expectations with an Average Precision (AP) of 32.58%. Although quantitative measures are fundamental for evaluating model performance, they may not always provide a complete picture. In this case, errors in instance segmentation predictions and class identification could have affected the quantitative performance metrics. Additionally, the model may not have been adequately trained or fine-tuned for this specific task, emphasizing the need for further optimization. As such, the quantitative results should be interpreted with caution, and the qualitative results should be given equal importance in assessing the model's overall performance.


Exploratory Phase

For this exploratory experiment, the same evaluation metrics were used as in previous experiments, with the addition of weighted Average Precision (wAP), which is necessary in this scenario to balance the impact of each class on the overall performance. Since the Ural/KamAZ, T-72, and BTR-80 classes have different numbers of targets, balancing their impact on the overall performance is decisive. The goal of this experiment was to increase the number of iterations in the final model. This would lay the foundation for more complex and sophisticated experiments in later phases. Increasing the number of iterations allows the model to receive more training, which can help it better learn the features and patterns of the data. In object detection, more iterations may allow the model to better distinguish between different types of objects, leading to improved precision. However, increasing the number of iterations may also lead to overfitting if the model is not properly regularized. Understanding the basic behavior of the model leads to more informed decisions about which approaches to pursue and which techniques to use in future rounds of experimentation to enhance the model's performance in accurately detecting and classifying military vehicles.
Unfortunately, resource limitations in this study such as computational power, memory, and data storage constrained the number of experiments that could be run and the complexity of the models that could be trained. This made it more challenging to conduct a thorough hyperparameter tuning process and achieve optimal performance on the given task. If computational power is limited, training larger models or increasing the learning rate may not be feasible, while memory constraints may require a decrease in batch size, resulting in longer training times.


Quantitative results

The objective of this experiment was to assess the impact of doubling the number of iterations on the performance of the instance segmentation model R_101_FPN_3x. The model was trained using the same data and hyperparameters as the previous experiment, but with the number of iterations increased from 2000 to 4000. The second-best model from the previous model selection experiment, X_101_32x8d_FPN_3x, is also displayed alongside the previous and final R_101_FPN_3x models in order to draw more insights from the results.


The experimental results showed an overall improvement in Average Precision (AP), increasing from 32.58% to 34.80%. There was also a slight increase in AP at Intersection over Union (IoU) thresholds of 50% and 75%, with top scores rising from 56.93% to 58.88% and from 38.11% to 39.44%, respectively, indicating the model's improved capability for the given task.



The experimental results indicate notable improvements in the performance of specific classes. Compared to the baseline model, the Ural/KamAZ and BTR-80 classes exhibited significantly higher Average Precision scores of 26.79% (up from 19.97%) and 42.12% (up from 41.13%), respectively. In contrast, the final model's performance for the T-72 class showed a slight decrease in Average Precision from 36.65% to 35.49%. It is noteworthy that the weighted Average Precision (wAP) demonstrated a clear enhancement, with the final model achieving a wAP of 32.13% for these classes, compared to the baseline model's wAP of 28.59%. The wAP is a weighted mean of the Average Precision across all classes, where the weight for each class is proportional to the number of ground truth objects in that class, which is important when dealing with skewed class distribution. In this study, the wAP is a particularly useful metric because the dataset contains imbalanced classes, where some classes have significantly more instances than others. The formula for calculating wAP is ∑(AP_i * N_i) / ∑N_i, where AP_i represents the Average Precision for the i-th class, and N_i represents the number of instances or ground truth objects for that class in the dataset.
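As a sanity check, applying this formula to the final model's per-class AP values with the per-class target counts quoted earlier reproduces the reported value; note that using those counts as the weights is an assumption about how the wAP was computed.

```python
# Reproducing the weighted Average Precision from wAP = sum(AP_i * N_i) / sum(N_i),
# using the final model's per-class AP and the per-class instance counts quoted
# earlier in the report (assumed here to be the weights used).
per_class = {
    "Ural/KamAZ": {"ap": 26.79, "n": 5691},
    "T-72":       {"ap": 35.49, "n": 2851},
    "BTR-80":     {"ap": 42.12, "n": 2082},
}

weighted_sum = sum(c["ap"] * c["n"] for c in per_class.values())
total = sum(c["n"] for c in per_class.values())
print(f"wAP = {weighted_sum / total:.2f}%")   # prints wAP = 32.13%
```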


Qualitative results

The visual outputs of the final model's predictions demonstrate its ability to accurately detect and classify military vehicles in various environments, including complex and cluttered scenes. These results provide qualitative evidence of the model's effectiveness and reliability. To provide a clearer picture of the model's performance, the qualitative results are presented using 50% and 75% detection score thresholds, matching the accompanying quantitative graphs.
The color-coding scheme for each class for the ground truth images annotated by LabelMe is as follows: T-72 is represented by the color Red, BTR-80 by Green, and Ural/KamAZ by Blue.

Image 1

Image 1: Ground truth - Map Data: © Google, Maxar Technologies, 2021 or newer

Image 1: Threshold 50% - Map Data: © Google, Maxar Technologies, 2021 or newer

Image 1: Threshold 75% - Map Data: © Google, Maxar Technologies, 2021 or newer


Image 2

Image 2: Ground Truth - Map Data: Google, ©2021 or newer, Maxar Technologies

Image 2: Threshold 50% - Map Data: Google, ©2021 or newer, Maxar Technologies

Image 2: Threshold 75% - Map Data: Google, ©2021 or newer, Maxar Technologies


Image 3

Image 3: Ground Truth - Map Data: © Google, Maxar Technologies, 2021 or newer

Image 3: Predictions exceeding 75% - Map Data: © Google, Maxar Technologies, 2021 or newer


Image 4

Image 4: Ground Truth - Map Data: © Google, Maxar Technologies, 2019 or newer

Image 4: Threshold 50% - Map Data: © Google, Maxar Technologies, 2019 or newer

Image 4: Threshold 75% - Map Data: © Google, Maxar Technologies, 2019 or newer


Image 5

Image 5: Ground Truth - Map Data: Google, © Maxar Technologies

Image 5: Threshold 50% - Map Data: Google, © Maxar Technologies

Image 5: Threshold 75% - Map Data: Google, © Maxar Technologies


Experiment Outcome Interpretation, Discussion & Final Results

The final model improved Average Precision (AP) from 32.58% to 34.80%, indicating that it performed better than the previous model in detecting instances, including those of the majority Ural/KamAZ class. Specifically, the Ural/KamAZ category showed a significant improvement, with AP increasing from 19.96% to 26.79%, indicating the model's ability to detect instances in more complex examples due to greater variability within that class. Nevertheless, the fact that there was no substantial improvement in Average Precision for the other classes suggests that the model has not notably improved in detecting instances of the minority classes, and performance even decreased slightly for the T-72 target. This may be due to the class imbalance in the dataset, which can make it harder to train a model that can accurately detect instances of the minority classes.
To address this limitation of analysis, the use of weighted Average Precision (wAP) is indispensable. This metric accounts for the proportion and number of instances of each class in each image, providing a more comprehensive evaluation of the model's performance across all classes. Weighted Average Precision is especially suitable for evaluating models in this study dataset where multiple object classes (military vehicles) appear in a single image. By taking into consideration the proportion and number of instances of each class, wAP can better assess the model's ability to accurately locate and classify multiple instances of the same or different object class. As a result, wAP is considered a more reliable metric than AP for evaluating the model's performance, as it imposes a higher level of stringency and ensures a fair evaluation across all classes. There is also a risk that the model may have overfit to the Ural/KamAZ target, potentially resulting in decreased performance on other classes. Qualitative results showed that the model performed well in complex and cluttered environments, but there were instances where the model appeared to overfit with the BTR-80 class, leading to false positives (as seen in Image 1). Nevertheless, the model's ability to accurately identify military vehicles in various environments is encouraging and suggests that it can handle diverse and challenging scenarios. This is particularly noteworthy given the previous limitation observed in the feasibility test conducted on snow-covered environments (as seen in Image 5).
Based on the results, it can be concluded that increasing the number of iterations resulted in improved precision for the selected R_101_FPN_3x final model, especially for challenging examples in the dataset, and enhanced performance for the Ural/KamAZ class. The fact that the final model required twice as many iterations as the previous model suggests that the previous model may have been underfitting the data. However, increasing the number of iterations may also increase the risk of overfitting, and further experimentation is necessary to fully understand the impact of this change on the model's overall performance for all classes in the context of military vehicles detection in satellite and aerial imagery using instance segmentation.
It is important to interpret the results of this experiment in the context of the overall evaluation and to compare them to other models or baselines to determine the significance of the observed improvements. While the project was limited to evaluating a single hyperparameter due to resource constraints, further experiments are necessary to conduct a more comprehensive evaluation. These experiments should evaluate additional hyperparameters, incorporate more data, and assess other metrics such as F1 score to gain a more comprehensive understanding of the model's strengths and weaknesses. Such experiments will help to further optimize the model's performance and gain a deeper understanding of its capabilities and limitations.


Qualitative Test Results

Ultimately, the test images were carefully selected to assess the model's performance on each class individually and collectively, including the use of non-target images to ensure that the model can accurately distinguish between the absence and presence of military vehicles. False positives could have significant consequences in the context of military vehicles detection in aerial and satellite imagery, making it critical to ensure the precision of the model. The findings of the study confirm the effectiveness of the final R_101_FPN_3x model in detecting military vehicles in satellite and aerial imagery, as demonstrated by the observed improvement in precision during the model evaluation. This outcome is promising, suggesting that the model has a certain degree of generalization ability. Here are a few of the outputs obtained.
The color-coding scheme for each class for the ground truth images annotated by LabelMe is as follows: T-72 is represented by the color Red, BTR-80 by Green, and Ural/KamAZ by Blue.

Output 1 - T-72, BTR-80 & Ural/KamAZ

Output 2 - T-72 & BTR-80

Output 3 - T-72

Output 4 - BTR-80

Output 5 - Ural/KamAZ

Output 6 - Ural/KamAZ

Output 7 - Ural/KamAZ - Snow-covered environment

Output 8 - Non-Target Image

Output 9 - Non-Target Image

Output 10 - Non-Target Image

Output 11 - Non-Target Image

Conclusion & Future Work

This research project proposes an instance segmentation approach for identifying military vehicles in satellite and aerial imagery by utilizing transfer learning based on the Detectron2 Mask R-CNN framework, a cutting-edge deep learning-based object detection algorithm. Although the study was conducted with a small sample size, the results suggest a significant effect, and the analysis reveals a clear pattern. The experimental results demonstrate the effectiveness of the Detectron2 Mask R-CNN framework in detecting and classifying military vehicles in satellite and aerial imagery. The R_101_FPN_3x was identified as the best model for the task in the first phase of the experiment. This model achieved good performance and efficiency, as measured by quantitative results. The qualitative test set further confirmed the generalization capability of the model, indicating a promising future for its application. The model was also successful in accurately identifying the absence of military vehicles in the images and in detecting and classifying military vehicles in cases where manual inspection may have failed to identify them. These findings indicate that the model may offer benefits in contexts where the detection and identification of military vehicles is challenging for human observers. The results demonstrate the potential of using machine learning and computer vision techniques for detecting and classifying military vehicles in satellite and aerial imagery and pave the way for further research and development in this field.
The proposed method has potential applications in defense and security-related tasks, such as surveillance and reconnaissance, and identifying and tracking military vehicles. Furthermore, the proposed method could be extended to other types of objects, such as military buildings and infrastructure. Nevertheless, there is still room for improvement. Some of the future work that can be done includes:
  • Increasing the diversity of the dataset by incorporating additional environmental factors and occlusions to further enhance the generalization of the model.
  • Investigating the use of synthetic data generation and domain adaptation techniques to increase the number of training samples.
  • Having more rounds of experimentation by finding the optimal combination of hyperparameters.
  • Evaluating the model's performance on other metrics, such as the F1 score, to gain more understanding of the model's behavior.
  • Using other deep learning architectures such as U-Net and DeepLab, and comparing the performance with Mask R-CNN.
  • Testing the model on different satellite and aerial imagery platforms and sensor modalities to increase the robustness.