Paper Review: Understanding EfficientNet with Charts and Visualizations

Victor Calinao Jr
8 min read · May 2, 2023


This article explains EfficientNet in the simplest language, with visual aids along the way, to help you understand one of the most powerful Convolutional Neural Network (CNN) families to date.

EfficientNet is a family of convolutional neural networks (CNNs) that achieve state-of-the-art accuracy on image classification tasks while keeping the number of parameters and computations small. The key innovation of EfficientNet is a compound scaling method that uniformly scales the depth, width, and resolution of the network to optimize the trade-off between accuracy and computational resources. In this article, we review the design and performance of EfficientNet and its applications in various computer vision tasks.

Why EfficientNet?

As of this writing, EfficientNet is among the most powerful CNN models, or what we call ConvNets, and it has many variants available in Keras. There are models with even higher benchmark accuracy, called 'Vision Transformers', but that is another paper to review.

Introduction

Convolutional neural networks (CNNs) have achieved remarkable success in a wide range of computer vision tasks, including image classification, object detection, semantic segmentation, and more. However, designing efficient and effective CNNs is still a challenging task, as there is a trade-off between model complexity, computational cost, and accuracy.

EfficientNet is a recent family of CNNs that addresses this challenge by optimizing the scaling of the network architecture. The authors of EfficientNet propose a novel compound scaling method that scales the depth, width, and resolution of the network simultaneously, resulting in models that are both accurate and efficient.

Related Works

EfficientNet builds upon prior work on model scaling, including AlexNet, VGGNet, ResNet, and MobileNet. These models all scale the depth and/or width of the network to improve performance. EfficientNet, however, is the first to jointly scale depth, width, and resolution with a single principled method, rather than scaling one dimension at a time.

Methodology

The key innovation of EfficientNet is a compound scaling method that scales the depth, width, and resolution of the network together. Specifically, EfficientNet uses a compound coefficient φ (phi) that controls the overall network size. The depth, width, and resolution of the network are then scaled as follows:

  • Depth scaling: the number of layers in the network is scaled by α^φ, where α is a tunable parameter.
  • Width scaling: the number of channels in each layer is scaled by β^φ, where β is another tunable parameter.
  • Resolution scaling: the input image resolution is scaled by γ^φ, where γ is a third tunable parameter.

EfficientNet uses a simple yet effective grid search to find the optimal values of α, β, and γ under the constraint α · β² · γ² ≈ 2, so that each unit increase in φ roughly doubles the network's computation (FLOPs). The paper finds α = 1.2, β = 1.1, and γ = 1.15. Once these values are fixed, increasing φ scales the depth, width, and resolution accordingly, resulting in a family of models with varying sizes and computational costs.
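The scaling rules above can be sketched in a few lines of Python. The α, β, and γ values are the ones reported in the paper; the base depth, width, and resolution are made-up placeholder numbers for illustration.

```python
# Coefficients reported in the EfficientNet paper, found by grid search at phi = 1:
# alpha scales depth, beta scales width, gamma scales resolution.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi, base_depth=18, base_width=32, base_resolution=224):
    """Scale depth, width, and resolution together by the compound coefficient phi."""
    depth = round(base_depth * ALPHA ** phi)             # more layers
    width = round(base_width * BETA ** phi)              # more channels per layer
    resolution = round(base_resolution * GAMMA ** phi)   # bigger input images
    return depth, width, resolution

# The constraint alpha * beta^2 * gamma^2 ~= 2 means each unit increase in phi
# roughly doubles the total FLOPs of the network.
print(ALPHA * BETA ** 2 * GAMMA ** 2)  # close to 2
print(compound_scale(phi=2))           # all three dimensions grow together
```

Note how a single knob, φ, grows all three dimensions at once instead of forcing you to hand-tune each one separately.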

Compound Scaling in EfficientNet

The easiest way to increase the performance of a neural network is to add more layers, also known as making the CNN 'deeper'. For example, ResNet has versions with different numbers of layers, from ResNet-18 up to ResNet-200. However, adding more layers, or making them 'wider', also makes the network more computationally expensive to run. The goal is to increase accuracy, but in reality we always have limited resources, so we need a way to balance the quality and the cost of neural networks. The authors, from Google, published a paper with a new way to do this: a compound coefficient that tells how much to change the width, depth, and resolution of the network in a consistent way.

EfficientNet uses compound scaling to effectively boost the accuracy of the CNN model. Original image illustration from the authors.

It is a way to scale up neural networks by changing their width, depth, and resolution at the same time. Width means how many channels or filters each layer has. Depth means how many layers the network has. Resolution means how big the input image is. The compound coefficient φ tells how much to increase each of these dimensions in a balanced way: depth grows by α^φ, width by β^φ, and resolution by γ^φ, so a larger φ produces a uniformly bigger network. This way, the network can become bigger and better without wasting resources or losing information. The authors wanted to optimize for both accuracy and efficiency: they used a neural architecture search to find a good baseline network, and a small grid search to find the best scaling coefficients.

What is small grid search?

Small grid search is a way to find the best values for the scaling coefficients. It means trying different combinations of α, β, and γ, which control how much to change the depth, width, and resolution, respectively, of the neural network. The small grid search is done on a small model that uses fewer resources. The goal is to find the values that make the model perform well without breaking a certain limit of resources. Once the best values are found, they can be used to scale up bigger models with more resources.

In simple language, this is how the small grid search works in the paper:

  1. Define a range of values for α, β, and γ that control how much to change the depth, width, and resolution of the neural network. For example, α could be from 1.0 to 1.4 with a step of 0.1, β could be from 1.0 to 1.2 with a step of 0.05, and γ could be from 1.0 to 1.2 with a step of 0.05.
  2. Create a grid of all possible combinations of values for α, β, and γ within the defined range. For example, one combination could be (α=1.0, β=1.0, γ=1.0), another could be (α=1.0, β=1.05, γ=1.0), and so on until all combinations are covered.
  3. For each combination in the grid, create a small model that uses the corresponding values for α, β, and γ to scale up its width, depth, and resolution from a base model.
  4. Train and test each small model on a certain task using some measure of performance such as accuracy or error rate.
  5. Compare the performance of all small models in the grid and select the best one that gives the highest performance while staying within a certain limit of resources such as memory or computation time.
  6. Use the values for α, β, and γ from the best small model to scale up bigger models with more resources.
Grid search with two hyperparameters. Take note that EfficientNet does a small grid search over three hyperparameters (α, β, and γ). Illustration from A Guide to Hyperparameter Optimization (HPO) — (maelfabien.github.io).
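The six steps above can be sketched as a loop over the grid. The accuracy function here is a hypothetical stand-in (in the real search, each combination means training a small model and measuring its accuracy); only the search structure, with a FLOPs-style budget check, mirrors the idea.

```python
from itertools import product

def mock_accuracy(alpha, beta, gamma):
    # Hypothetical stand-in for "train the small model and measure its accuracy".
    return alpha * 0.3 + beta * 0.5 + gamma * 0.2

def flops_cost(alpha, beta, gamma):
    # FLOPs grow roughly linearly with depth and quadratically with width and resolution.
    return alpha * beta ** 2 * gamma ** 2

# Step 1: ranges of candidate values for each coefficient.
alphas = [1.0, 1.1, 1.2, 1.3]
betas  = [1.0, 1.05, 1.1, 1.15]
gammas = [1.0, 1.05, 1.1, 1.15]

# Steps 2-5: try every combination, skip the ones over budget, keep the best.
best = None
for a, b, g in product(alphas, betas, gammas):
    if flops_cost(a, b, g) > 2.0:   # resource limit: at most ~2x the baseline FLOPs
        continue
    score = mock_accuracy(a, b, g)
    if best is None or score > best[0]:
        best = (score, a, b, g)

# Step 6: best[1:] holds the (alpha, beta, gamma) used to scale up bigger models.
print(best)
```

Because the grid is searched on a small model at φ = 1, the whole loop is cheap; the expensive part, scaling up, happens only once with the winning coefficients.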

EfficientNet Architecture

When we want to make a model bigger and better, we need to start with a good small model. This is called the baseline network. The authors used a special tool called the AutoML MNAS framework to help find a good small model for mobile devices. This tool tries different combinations of layers and settings to find the best balance between how accurate and how fast the model is. The small model found with AutoML MNAS has a special type of layer called mobile inverted bottleneck convolution (MBConv), which is similar to some other models like MobileNetV2 and MnasNet. The small model is slightly bigger than those models because the authors allowed a larger FLOPs budget in exchange for significantly higher accuracy; they named this small model EfficientNet-B0. FLOPs (floating-point operations) measure how much computation a model needs: the lower the FLOPs, the faster the model. The authors then made the model bigger and better by increasing all dimensions together, using the compound scaling we already discussed. This way, they created a family of models with different sizes and performances, called EfficientNets.

The architecture for the baseline network EfficientNet-B0 is simple, making it easier to scale and generalize. EfficientNet-B0 is the baseline network developed by AutoML MNAS, while EfficientNet-B1 to B7 are obtained by scaling up the baseline network. Image illustration from original paper.
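To make the scaled family concrete, the table below lists the width multiplier, depth multiplier, and input resolution for each variant as used in the reference Keras implementation (to the best of my knowledge; check `keras.applications` for the authoritative values).

```python
# (width multiplier, depth multiplier, input resolution) per EfficientNet variant.
# B0 is the AutoML MNAS baseline; B1-B7 are compound-scaled versions of it.
SCALING = {
    "B0": (1.0, 1.0, 224),
    "B1": (1.0, 1.1, 240),
    "B2": (1.1, 1.2, 260),
    "B3": (1.2, 1.4, 300),
    "B4": (1.4, 1.8, 380),
    "B5": (1.6, 2.2, 456),
    "B6": (1.8, 2.6, 528),
    "B7": (2.0, 3.1, 600),
}

for name, (width, depth, res) in SCALING.items():
    print(f"EfficientNet-{name}: width x{width}, depth x{depth}, {res}x{res} input")
```

Notice how all three dimensions grow together from B0 to B7, exactly as compound scaling prescribes, rather than one dimension growing while the others stay fixed.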

EfficientNet Performance

EfficientNet is a powerful Convolutional Neural Network (CNN) family that achieves higher accuracy and better efficiency than existing CNNs, reducing parameter count and FLOPs (floating-point operations) by an order of magnitude. For example, EfficientNet-B7 reaches state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet while being 8.4x smaller and 6.1x faster on CPU inference than the previous best model, GPipe. Compared with the widely used ResNet-50, EfficientNet-B4 uses similar FLOPs while improving top-1 accuracy from 76.3% to 82.6% (+6.3%). In simple terms, EfficientNets are more accurate and more efficient than other existing CNNs.

EfficientNet-B7 achieves new state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy, while being 8.4x smaller than the best existing CNN. Illustration from original paper.

Looking at the bubble chart below, the family of EfficientNets leads in terms of accuracy versus parameter count, ranked by Top-1 Accuracy.

Available model applications on Keras in Top-1 Accuracy

https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRdDqIYjOeMapXKD_C1pA1fvrJXAAQLRq5qfvL1IIoi3Ff0RQjeEWBxPF6p81vUhy6o53WUt7OUItCr/pubchart?oid=235147566&format=interactive

The same holds for Top-5 Accuracy.

Available model applications on Keras in Top-5 Accuracy

https://docs.google.com/spreadsheets/d/e/2PACX-1vRdDqIYjOeMapXKD_C1pA1fvrJXAAQLRq5qfvL1IIoi3Ff0RQjeEWBxPF6p81vUhy6o53WUt7OUItCr/pubchart?oid=235147566&format=interactive

In this bubble chart, the family of EfficientNets emerges at the top with the highest accuracy among CNN models, while having fewer parameters and a smaller size.

Bonus: Demystifying Top-1 Accuracy and Top-5 Accuracy

Top-1 and Top-5 Accuracy are two ways of measuring how good a model is at recognizing pictures. Top-1 Accuracy means that the model’s most confident answer must be exactly the same as the correct answer. For example, if the picture is of a dog, the model must say “dog” as its first choice. Top-5 Accuracy means that the correct answer must appear among the model’s five most confident answers. For example, if the picture is of a dog, the model is counted as correct as long as “dog” appears anywhere in its top five guesses, even if “cat”, “wolf”, “fox”, or “bear” rank above it. Top-5 Accuracy is usually higher than Top-1 Accuracy because it gives the model more chances to be right.
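The two metrics above can be computed with a few lines of Python. The confidence scores and labels here are toy values for illustration.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    correct = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest confidence scores.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        correct += label in top_k
    return correct / len(labels)

# Two toy predictions over 6 classes; the true labels are class 0 and class 3.
scores = [
    [0.5, 0.1, 0.1, 0.1, 0.1, 0.1],     # class 0 is the top guess -> top-1 hit
    [0.3, 0.25, 0.2, 0.15, 0.07, 0.03],  # class 3 is only the 4th guess -> top-5 hit
]
labels = [0, 3]

print(top_k_accuracy(scores, labels, k=1))  # 0.5 - only the first example counts
print(top_k_accuracy(scores, labels, k=5))  # 1.0 - both examples count
```

The second example shows exactly why Top-5 is always at least as high as Top-1: a prediction that misses the first slot can still land in the top five.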

Takeaway

EfficientNet represents a significant advancement in the field of deep learning, with its efficient and effective design leading to state-of-the-art performance on various tasks. The compound scaling method used in EfficientNet has also sparked interest in further research into more efficient and scalable models. While EfficientNet focuses on designing efficient convolutional neural networks, other approaches such as Vision Transformer (ViT) have also shown promising results in image classification as seen in Papers with Code.

Future research could investigate the potential synergies between these models, such as using EfficientNet as a backbone for ViT or incorporating transformers into EfficientNet’s architecture. As the field continues to evolve, it will be exciting to see how these models, others like them, and future models we have not yet seen tackle machine learning problems.
