Survival of the Fittest CNN Model

Optimize the hyperparameters of a CNN with a Variable Length Genetic Algorithm. Made by Aritra Roy Gosthipaty using Weights & Biases

🚀 Introduction

TL;DR

Genes of an individual express themselves through its attributes. The genes decide upon the color of the eyes, the individual's height, and many more. An individual's fitness is directly dependent on the attributes, which depend on the genes. Individuals evolve over generations to optimize their attributes for a particular environment. We are part of the process ourselves.

What if I told you to replace individuals with CNN models and genes with hyper-parameters? How would they evolve? What would they optimize while evolving? How would you describe their fitness? We have got some answers here in this report.

Try out the code here | Kaggle Notebook

Problem Statement

An artificial neural network is made up of only weights and biases, right? Wrong! An artificial neural network is built from many prior constraints in addition to its weights and biases. These constraints, i.e., the number of neurons, the choice of activation (non-linearity), the number of layers, are commonly termed hyper-parameters. A vast field of research is based on hyper-parameter optimization: people are interested in turning not only the knobs of the weights and biases but also those of the hyper-parameters. There are some great approaches (Grid, Random, and Bayesian search, to name a few) that have already marked this field. Some researchers took it upon themselves to bridge the optimization problem with evolutionary algorithms.

Here, in this report, we look into and break down the paper Efficient Hyperparameter Optimization in Deep Learning Using a Variable Length Genetic Algorithm by Xiao et al., an ambitious step towards evolving hyper-parameters under the Darwinian rule.

Genetic Algorithm

Evolutionary algorithms work in pursuit of the Darwinian mechanism of evolution, simply put, natural selection. They aim to help a population (a collection of individuals) evolve over generations towards an optimum. The very definitions of population and optimization may vary with the problem statement and with whoever is working on it (well, many people, and now us). The Genetic Algorithm (GA) is one such evolutionary algorithm. It is composed of a few sub-mechanisms that collaboratively head towards an optimal fitness, or say, solution. These sub-mechanisms, viz. crossover, mutation, and (natural) selection, are often tweaked per the problem statement.
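The interplay of these sub-mechanisms can be sketched in a few lines of Python. Everything here is illustrative: the bit-string chromosomes and the count-the-ones fitness function are toy stand-ins, not the paper's encoding:

```python
import random

def evolve(pop_size=20, genes=8, generations=30, mutation_rate=0.1, seed=0):
    """Minimal GA loop: selection, crossover, and mutation on bit-strings."""
    rng = random.Random(seed)
    fitness = lambda chromo: sum(chromo)  # toy fitness: number of 1-bits
    population = [[rng.randint(0, 1) for _ in range(genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # selection: the fitter half survives and forms the mating pool
        population.sort(key=fitness, reverse=True)
        pool = population[:pop_size // 2]
        children = []
        while len(pool) + len(children) < pop_size:
            a, b = rng.sample(pool, 2)             # pick two parents
            point = rng.randrange(1, genes)        # single-point crossover
            child = a[:point] + b[point:]
            if rng.random() < mutation_rate:       # mutation: flip one gene
                i = rng.randrange(genes)
                child[i] = 1 - child[i]
            children.append(child)
        population = pool + children
    return max(population, key=fitness)
```

Swap the toy fitness for a model's validation accuracy and the bit-strings for hyperparameter settings, and you arrive at the scheme this report implements.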

Convolutional Neural Network

The paper that we will try to reproduce focuses only on the nitty-gritty of Convolutional Neural Networks (CNNs). CNN is a special kind of artificial neural network that is extensively used in the field of vision. These architectures have filters that convolve with an input image and learn to pick up some image patterns. The hyperparameters corresponding to a CNN can be the number of filters, the size of the filters, the type of activation function used, and much more.

Hyperparameter Optimization

In GA, we take individual organisms and group them into a population. The individuals are all genetically different. Slowly and steadily, the individuals mate, create offspring, and evolve through them. Due to this, organisms optimize with generations. This report treats each hyperparameter as a gene in a chromosome and the CNN models as the organisms. We will evolve these individuals over generations. Each individual has a fitness score linked to itself. With an evolutionary algorithm, the population treads towards increasing the average fitness score. This directly means that the end product of GA applied to hyperparameters will provide us with an optimal setting of the hyperparameters.

Genetic Algorithm

In order to understand GA, we first need to understand (or at least have a vague understanding of) optimization. Now, what is optimization? If I were to tell you that in a population of migrating zebras, the one to sacrifice itself to a river full of crocodiles would be the oldest one (let us call it Ozzy), what would you infer? This is something you probably might have seen in one of those Nat Geo documentaries, but what you did not see is that Ozzy was basically pushed into the river by the rest of the population, so that the herd could cross while the crocodiles were distracted. Why was Ozzy “the chosen one”? Because Ozzy was the oldest and could no longer contribute his genes to the next generation. Hence, it was important that the zebras who could still take part in transmitting their genes (or simply, mate) survived. This brings us to an inference: animals live to maximize the number of genes they pass on to the next generation.

Now, we all know who discovered evolution, don't we? Charles Darwin. Except that is not quite right: Darwin did not discover evolution. Darwin, just like many other scientists, proposed a mechanism for evolution, the one that is most widely accepted. “Struggle for existence” and “survival of the fittest” are the two proposed clauses of the Darwinian mechanism of evolution. If you were a mathematician working in the field of behavioral or evolutionary biology, you would be a spectator of much sorcery. Say you calculated the best size of a giraffe's heart based on its structure, food habits, and energy expenditure. If you then measured a real giraffe's heart, you would discover that it is almost exactly the size you had calculated. Moreover, this happens almost all the time, because animals evolve to attain the best working condition to survive in that period. In other words, they evolve to attain the best fit, or to “optimize” their bodies (and hence, the underlying gene sequences). This is why GA, or for that matter any evolutionary algorithm, is used to obtain an optimal solution or even a set of them.

The very phenomenon of evolution (mostly) runs on Darwinian terms, i.e., natural selection. “Survival of the fittest” comes into play right here. On this note, I should clarify that an individual is the fittest when it survives among the rest, and not the other way around. Simply put, it is the survival of the individual that decides whether it is fit, not the fitness of the individual that determines whether it will survive. GA works on the latter premise, because in an algorithm you cannot make an individual mate or survive unless it is deemed fit, but in reality it is the former that takes place.

Selection

So, here, in natural selection, the individuals are chosen based on their level of fitness (which will later serve as the accuracy of our model). The fitness of an individual is thus a crucial factor for selection in GA. The fitter the individual, the more likely are its chances to mate or to move to the next generation. This creates a group in which individuals with higher fitness occur more frequently than those with comparatively lower fitness, called the “mating pool”. Individuals having very low fitness might not occur in the mating pool at all. Among individuals with the same low fitness, some are chosen at random to be in the mating pool (see Figure 1).


-> Figure 1: Mating Pool <-

The size of the mating pool varies according to the number of offspring demanded. The distribution of individuals in the mating pool is important because it is based on individual fitness. To obtain a small number of offspring, it is wise to keep the mating pool small.
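A fitness-biased pool like this can be sampled by turning fitness scores into selection probabilities. The snippet below uses a softmax for that, echoing how the accompanying code weighs parents; the helper itself is an illustrative stand-in, not the repository's API:

```python
import math
import random

def build_mating_pool(population, fitnesses, pool_size, seed=0):
    """Sample a mating pool in which fitter individuals appear more often."""
    rng = random.Random(seed)
    # numerically stable softmax over the raw fitness scores
    exps = [math.exp(f - max(fitnesses)) for f in fitnesses]
    total = sum(exps)
    probs = [e / total for e in exps]
    # draw with replacement, so a fit individual can occur several times
    return rng.choices(population, weights=probs, k=pool_size)
```
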

Mutation

In nature, both whether and when a mutation occurs are pretty random. Mutation brings about changes in an individual ranging from minute to drastic, and it can happen before or after crossover. To tame this randomness, both the rate and the timing of mutation are predetermined in GA. In this model, we have set the mutation rate to 0.1, with mutation occurring before crossover.


-> Figure 2: Mutation <-
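In code, one way to realize this on a hyperparameter chromosome is to re-draw a single randomly chosen gene from its allowed options. The option table below is an abbreviated illustration, not the full encoding used later:

```python
import random

# Abbreviated gene options, for illustration only
OPTIONS = {
    'a_filter_size': [(1, 1), (3, 3), (5, 5)],
    'a_output_channels': [8, 16, 32, 64],
    'include_pool': [True, False],
}

def mutate(chromosome, mutation_rate=0.1, seed=None):
    """With probability `mutation_rate`, re-draw one randomly chosen gene
    from its list of allowed options."""
    rng = random.Random(seed)
    chromosome = dict(chromosome)  # copy: do not modify the parent in place
    if rng.random() <= mutation_rate:
        gene = rng.choice(list(chromosome))
        chromosome[gene] = rng.choice(OPTIONS[gene])
    return chromosome
```

Note that a re-draw may land on the same value again, so a "mutation" is not guaranteed to change the chromosome.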

Crossover

The individuals, after being selected, are allowed to mate. This mating is two parental chromosomes (one from each parent) crossing over. The process is called “crossover” owing to this chromosomal exchange. The point at which the chromosomes cross over is predetermined.


-> Figure 3: Crossover <-
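A minimal sketch of single-point crossover on two chromosome dictionaries, mirroring the exchange pictured above (the gene order is shared by both parents; the crossover point x is drawn at random here for simplicity):

```python
import random

def crossover(parent_a, parent_b, seed=None):
    """Single-point crossover: the child takes genes [0, x) from parent A
    and genes [x, end) from parent B."""
    rng = random.Random(seed)
    genes = list(parent_a)            # gene order shared by both parents
    x = rng.randrange(1, len(genes))  # crossover point, never 0 or len
    child = {}
    for i, gene in enumerate(genes):
        child[gene] = parent_a[gene] if i < x else parent_b[gene]
    return child
```
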

Heredity

This leads to the formation of offspring with traits from both the parents. The offspring inherit some traits of the mother and the rest from the father, depending on the chromosome breakage point.

To be brutally honest, evolution is a cyclic process with some amount of randomness. It is difficult to choose where to begin, and hence some trivial rules have been established; this is why GA has a specific starting and termination point. Even though this differs from the natural phenomenon, it is only fair.

💻 Hyperparameter Tuning

Problem Statement

You are given a deep learning problem. Let us take classification, for example. The dataset provided is that of CIFAR10, and you need to classify an input image in one of the 10 classes. You quickly build a CNN using tf.keras and run the forward and backward propagation. Your loss converges quite well, and you are happy with the results. Someone else comes in and tackles the problem using the same model but with a changed activation function and gets a better result. Someone else can change the size of the filters and get even better results. The model's properties that are not learned during the training process but are variable nonetheless are called hyperparameters of the model. There is extensive ongoing research on hyperparameter optimization for a model. Some well-known algorithms that have already made their mark in this field are Random search, Bayesian search, and Grid search. These are all search algorithms that may or may not have heuristics in them.

Genetic Algorithm with Hyperparameter Optimization

GA, as explained above, helps organisms optimize themselves. This optimization takes place gradually through generations. The sub-processes, namely natural selection, crossover, and mutation, help the organisms try different chromosomal configurations and optimize according to a fitness function.

With our problem statement, we can treat each of the hyperparameters as an individual gene, a particular hyperparameter setting as the organism's chromosome, and a tf.keras model as the organism. The fitness function is the validation accuracy of the model. Applying GA to these models makes them go through an evolution in which individuals of a later generation have different, and better, hyperparameters (the genes) than their ancestors. Theoretically, the more generations we run and the larger the search space we cover, the closer GA gets to the optimal solution (here, the optimal hyperparameter setting).
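The mapping can be made concrete with a skeletal Organism class. The `evaluate` callable below stands in for building a tf.keras model from the chromosome, training it, and returning validation accuracy; the class is a sketch of the idea, not the repository's exact API:

```python
class Organism:
    """One candidate CNN: the chromosome is a hyperparameter setting,
    the fitness is the validation accuracy of the model it encodes."""

    def __init__(self, chromosome):
        self.chromosome = chromosome  # genes: one entry per hyperparameter
        self.fitness = None           # filled in after evaluation

    def fitness_function(self, evaluate):
        # `evaluate` stands in for: build a tf.keras model from the
        # chromosome, train it briefly, and report validation accuracy
        self.fitness = evaluate(self.chromosome)
        return self.fitness
```
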

Experimental Setup

Try out the code here

Hyperparameter Encoding

To apply GA to our problem statement, the first and most important task is to formulate an encoding scheme for hyperparameters into genes. Chromosomes (the collection of genes) are the single most important factor in the entire algorithm. They are mutated and crossed over.

In the paper, the following encoding scheme was selected:

    options_phase0 = {
        'a_filter_size': [(1,1), (3,3), (5,5), (7,7), (9,9)],
        'a_include_BN': [True, False],
        'a_output_channels': [8, 16, 32, 64, 128, 256, 512],
        'activation_type': [ReLU, ELU, LeakyReLU],
        'b_filter_size': [(1,1), (3,3), (5,5), (7,7), (9,9)],
        'b_include_BN': [True, False],
        'b_output_channels': [8, 16, 32, 64, 128, 256, 512],
        'include_pool': [True, False],
        'pool_type': [MaxPool2D, AveragePooling2D]
    }
    options = {
        'include_layer': [True, False],
        'a_filter_size': [(1,1), (3,3), (5,5), (7,7), (9,9)],
        'a_include_BN': [True, False],
        'a_output_channels': [8, 16, 32, 64, 128, 256, 512],
        'activation_type': [ReLU, ELU, LeakyReLU],
        'b_filter_size': [(1,1), (3,3), (5,5), (7,7), (9,9)],
        'b_include_BN': [True, False],
        'b_output_channels': [8, 16, 32, 64, 128, 256, 512],
        'include_pool': [True, False],
        'pool_type': [MaxPool2D, AveragePooling2D]
    }

Each of the keys in the dictionary is a gene. The values of a key are the allowed options for that gene. All the genes collectively make up the chromosome of an individual organism. One must notice that there are two dictionaries. The first dictionary, options_phase0, holds the gene options for organisms of the zeroth phase. The other dictionary holds the gene options for organisms of every other phase; note its extra include_layer gene. We will talk about phases in depth in a later section.
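Given such an option table, a chromosome for the initial population can be drawn gene by gene. This is a sketch of what a helper like random_hyper in the accompanying code might do; the actual implementation may differ:

```python
import random

def random_chromosome(options, seed=None):
    """Draw one value per gene from its list of allowed options."""
    rng = random.Random(seed)
    return {gene: rng.choice(values) for gene, values in options.items()}
```
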

The encoding scheme for organisms of phase 0, as provided in the paper:


-> Figure 4: Encoding of chromosome <-

Variable Length Genetic Algorithm

The paper has proposed a variable-length GA to optimize hyperparameters. In classical GA, the chromosomes are of a fixed length. Since the depth of deep learning models is a hyperparameter, and the increasing depth of the model would call for more hyperparameters, we require a chromosome of varying length for our problem statement. In the Variable Length Genetic Algorithm (VLGA), the initial population with two convolutional layers is first produced. This is phase 0 of our GA. The population is then evolved through several generations. From the last generation, we obtain the best 2-layer model. After the 2-layer models have been evolved for some generations, the algorithm enters the next phase, enabling more layers and hyperparameters. For each individual of the population in phase 1, one part of the chromosome is from the best individual in phase 0, the other part is randomly generated. The population of phase 1 is then evolved, and we go on to phase 2. This algorithm continues until we find a phase where the depth of the model overfits on the dataset.

In classical GA, we have a population of organisms. The organisms are naturally selected, mutated, and then, through crossover, offspring for the next generation are produced. Natural selection is based on a fitness function: the better the fitness of an individual, the more likely it is to be selected to mate. This process iterates, and we obtain many generations. With each generation, the average fitness increases. Following the paper, we have jotted down the steps of the VLGA and break them down with code snippets below.

  1. Initialize the population size p and the number of generations g according to user input.
    p = 10
    g = 3

This means that each population will contain 10 individual organisms, and in each phase of GA, we will evolve the population for 3 generations.

  2. Create the initial population. Each individual in the population has randomized hyperparameters. Evaluate the fitness of each individual in the current generation. The fitness is the accuracy of the individual on the validation dataset.
    for idx in range(self.population_size):
      org = Organism(chromosome=random_hyper(self.phase),
                     phase=self.phase,
                     prevBestOrganism=self.prevBestOrganism)
      org.build_model()
      org.fitnessFunction(train_ds,
                          test_ds,
                          generation_number=self.generation_number)
      # append inside the loop so every organism makes it into the population
      self.population.append(org)

When an organism is created, the model is compiled, and the organism's fitness is also evaluated in the same step.

  3. Sort the individuals according to their fitness values, from high to low.
    self.sortModel()
  4. Select the fittest individuals. Select a portion of the population that has the highest fitness values, and let them survive into the next generation. The percentage of this survival rate can be set manually.
    number_of_fit = int(self.population_size * self.fitSurvivalRate)
    new_pop = self.population[:number_of_fit]
  5. Allow some less fit individuals to survive. For the rest of the individuals in the current population, give them a small chance to survive into the next generation. This probability is set manually.
    for individual in self.population[number_of_fit:]:
      if np.random.rand() <= self.unfitSurvivalProb:
        new_pop.append(individual)
  6. Randomly mutate some individuals in the next generation. The mutation rate is manually configured. If an individual is chosen to mutate, one of its hyperparameter values will be changed.
    for index, individual in enumerate(new_pop):
      if np.random.rand() <= self.mutationRate:
        new_pop[index].mutation(generation_number=self.generation_number)
  7. Produce new individuals. Randomly select two individuals from the next population to serve as parents, and perform a crossover operation to produce a child. Each hyperparameter in the child is randomly set to one of its parents' values. Repeat the crossover operation on different random individuals until the next generation has p individuals. Now the next generation becomes the current generation. In our code, we have used a simpler approach for crossover: we take parent A and parent B from the mating pool and choose a number x in the range of the length of the chromosome. In the child, genes 0 to x come from parent A, and the rest come from parent B.
    for idx in range(self.population_size - len(new_pop)):
      parents = np.random.choice(new_pop,
                                 replace=False,
                                 size=(2,),
                                 p=softmax(fitness))
      A = parents[0]
      B = parents[1]
      child = A.crossover(B, generation_number=self.generation_number)
      children.append(child)
  8. Repeat steps 3-7 until the number of generations has reached g.

  9. Select the best individual from the current population.

  10. Produce a population with a longer chromosome length for the next phase. For each individual of the population, one part of the chromosome comes from the best individual of step 9. The other part is randomly generated.

  11. Repeat steps 8-10 until convergence.

  12. Select the best individual and train it for more epochs until convergence.
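The outer phase loop that ties these steps together can be sketched as below. The `evolve_phase` callable stands in for steps 2 to 9 (building a population partly from the previous phase's best, evolving it for g generations, and returning the phase's best individual); the names are illustrative, not the repository's API:

```python
def vlga(evolve_phase, num_phases=5):
    """Outer loop of the Variable Length GA: each phase grows the
    chromosome, seeding its population from the previous phase's best."""
    best = None  # phase 0 has no predecessor to inherit from
    for phase in range(num_phases):
        best = evolve_phase(phase, prev_best=best)
    return best
```
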

📊 Results and Analysis

Try out the code here

Our experimental setup was to have 5 phases. Each phase had to go through 3 generations. There were 10 individuals in a population that evolved.

Phase 0

We started with 10 random 2-layer networks. The population was evolved through 3 generations, as shown below. As the generations evolved, the population's average fitness went up (by a small amount).

After phase 0 comes to an end, the best model is chosen. This is the model that has the highest validation accuracy among all the individuals in phase 0.

The hyperparameter setting is as follows:

{   'a_filter_size': (1, 1),
    'a_include_BN': True,
    'a_output_channels': 64,
    'activation_type': <class 'tensorflow.python.keras.layers.advanced_activations.ReLU'>,
    'b_filter_size': (3, 3),
    'b_include_BN': False,
    'b_output_channels': 512,
    'include_pool': False,
    'include_skip': False,
    'pool_type': <class 'tensorflow.python.keras.layers.pooling.AveragePooling2D'>}


-> Best model in phase 0 <-

Phase 1

The next phase is where the depth of the models increases. We take the best model from phase 0, attach randomly selected conv layers to it, and build the organisms for phase 1. We evolve this phase through 3 generations and then analyse the loss and accuracy of the models. We can see a jump in accuracy and a dip in the loss, which signifies that the models in phase 1 are better than those in phase 0. The average and best fitness metrics seem to tell the same story.

The hyperparameter configuration of the best model in phase 1 is as follows:

{   'a_filter_size': (9, 9),
    'a_include_BN': False,
    'a_output_channels': 128,
    'activation_type': <class 'tensorflow.python.keras.layers.advanced_activations.ReLU'>,
    'b_filter_size': (9, 9),
    'b_include_BN': True,
    'b_output_channels': 128,
    'include_layer': True,
    'include_pool': False,
    'include_skip': False,
    'pool_type': <class 'tensorflow.python.keras.layers.pooling.AveragePooling2D'>}


-> Best model in phase 1 <-

This process continues till phase 4. The details of all the phases are visualized below; feel free to interact with the charts. We notice one thing: after phase 1, we do not see a stark jump in accuracy in the further phases. This points to the problem of saturation: the capacity of the model can increase, but it does not help any further with validation accuracy, which is why the accuracy saturates in the later phases.

The best model chosen after the last phase is as follows:


-> The best model in the last phase <-

A little note to the reader: the model in phase 0 is built from plain convolutional layers, but the models in later phases contain a functional layer. This functional layer is the best model from the immediately previous phase. So, if you look at the model in phase 1, it contains the best model of phase 0 with convolutional layers attached to it. The last model thus has the best models of all the previous phases nested inside it.

Discussion

The report talks about evolving deep learning models and finding an optimal hyper-parameter setting. The fun thing to discuss here is that we still have hyper-parameters left to play with: in GA, these are the number of individuals in a population, the number of generations a population evolves for, and the number of phases. Comparing these hyper-parameter settings is computationally expensive and gruesome. Readers can head to the code-base of the report and fiddle with these settings at their own pace.

The paper proposes a novel variable length GA approach to hyper-parameter optimization. Along with this approach, the authors were able to model skip connections in their algorithm; in the report, we can see that the best model in the last phase has a skip connection. We could not find a good approach to solving the dimension mismatch problems with conv layers, and hence kept padding in all of them. The paper did not have an official code repository, so we submitted our code to its Papers with Code page, and our code is also featured on the paper's arXiv page. Feel free to go through the code and play with it. Valid pull requests are always welcome.

Reach the Authors

| Name | GitHub | Twitter |
| --- | --- | --- |
| Aritra Roy Gosthipaty | @ariG23498 | @ariG23498 |
| Puja Roychowdhury | @PujaRc | @pleb_talks |

Study Material

  1. Human Behavioral Biology - Stanford Biology
  2. Lecture 15 Optimization II - Genetic Algorithms by IIT Madras
  3. MIT 6.034 Artificial Intelligence, Fall 2010, Lecture 13. Learning: Genetic Algorithms