# FBNetV5: Neural Architecture Search for Multiple Tasks in One Run

###### Abstract

Neural Architecture Search (NAS) has been widely adopted to design accurate and efficient image classification models. However, applying NAS to a new computer vision task still requires a huge amount of effort. This is because 1) previous NAS research has been over-prioritized on image classification while largely ignoring other tasks; 2) many NAS works focus on optimizing task-specific components that cannot be favorably transferred to other tasks; and 3) existing NAS methods are typically designed to be “proxyless” and require significant effort to be integrated with each new task’s training pipeline. To tackle these challenges, we propose FBNetV5, a NAS framework that can search for neural architectures for a variety of vision tasks with much reduced computational cost and human effort. Specifically, we design 1) a search space that is simple yet inclusive and transferable; 2) a multitask search process that is disentangled from target tasks’ training pipelines; and 3) an algorithm to simultaneously search for architectures for multiple tasks with a computational cost agnostic to the number of tasks. We evaluate the proposed FBNetV5 on three fundamental vision tasks – image classification, object detection, and semantic segmentation. Models searched by FBNetV5 in a single run of search have outperformed the previous state-of-the-art in all three tasks: image classification (e.g., 1.3% higher ImageNet top-1 accuracy under the same FLOPs as compared to FBNetV3), semantic segmentation (e.g., 1.8% higher ADE20K val. mIoU than SegFormer with 3.6× fewer FLOPs), and object detection (e.g., 1.1% higher COCO val. mAP with 1.2× fewer FLOPs as compared to YOLOX).

## 1 Introduction

Recent breakthroughs in deep neural networks (DNNs) have fueled a growing demand for deploying DNNs in perception systems for a wide range of computer vision (CV) applications that are powered by various fundamental CV tasks, including classification, object detection, and semantic segmentation. To develop real-world DNN based perception systems, the neural architecture design is among the most important factors that determine the achievable task performance and efficiency. Nevertheless, designing neural architectures for different applications is challenging due to its prohibitive computational cost, intractable design space [radosavovic2020designing, dong2020bench, Ci_2021_ICCV], diverse application-driven deployment requirements [wu2019fbnet, li2021searching, xiong2020mobiledets], and so on.

To tackle the aforementioned challenges, the CV community has been exploring neural architecture search (NAS) to design DNNs for CV tasks. In general, the expectations for NAS are two-fold: First, to build better neural architectures with stronger performance and higher efficiency; and second, to automate the design process in order to reduce the human effort and computational cost for DNN design. While the former ensures effective real-world solutions, the latter is critical to facilitate the fast development of DNNs to more applications. Looking back at the progress of recent years, it is fair to say that NAS has met the first expectation in advancing the frontiers of accuracy and efficiency, especially for image classification tasks. However, existing NAS methods still fall short of meeting the second expectation.

The reasons for the above limitation include the following. First, over the years the NAS community has been overly fixated on benchmarking NAS methods on image classification tasks, driven by the commonly believed assumption that the best models for image classification are also the best backbones for other tasks. However, this assumption is not always true [xiong2020mobiledets, du2020spinenet, chen2019detnas, zhang2021dcnas], and often leads to suboptimal architectures for many non-classification tasks. Second, many existing NAS works focus on optimizing task-specific components that are not transferable or favorable to other tasks. For example, [shaw2019squeezenas] only searches for the encoder part within the encoder-decoder structure of segmentation tasks, while the optimal encoder is coupled with the decoder designs. [ghiasi2019fpn] is customized to RetinaNet [lin2017focal] in object detection tasks. As a result, NAS advances made for one task do not necessarily favor other tasks or help reduce the design effort. Finally, a popular belief in current NAS practice is that it is better for NAS to be “proxyless” and a NAS method should be integrated into the target tasks’ training pipeline for directly optimizing the corresponding architectures based on the training losses of each target task [cai2019once, cai2018proxylessnas, yu2020bignas]. However, this makes NAS unscalable when dealing with many new tasks, since adding each new task would require nontrivial efforts to integrate the NAS techniques into the existing training pipeline of the target task. In particular, many popular NAS methods conduct search by training a supernet [yu2020bignas, cai2019once, wang2021alphanet], adding dedicated cost regularization to the loss function [ding2021hr], adopting special initialization [yu2020bignas], and so on.
These techniques often heavily interfere with the target task’s training process and thus require much engineering effort to re-tune the hyperparameters to achieve the desired performance.

In this work, we propose FBNetV5, a NAS framework that can simultaneously search for backbone topologies for multiple tasks in a single run of search. As a proof of concept, we target three fundamental computer vision tasks – image classification, object detection, and semantic segmentation. Starting from a state-of-the-art image classification model, *i.e*., FBNetV3 [Dai2020FBNetV3JA], we construct a supernet consisting of parallel paths with multiple resolutions, similar to HRNet [wang2020deep, ding2021hr]. Based on the supernet, FBNetV5 searches for the optimal topology for each target task by parameterizing a set of binary masks indicating whether to keep or drop a building block in the supernet. To disentangle the search process from the target tasks’ training pipelines, we conduct search by training the supernet on a proxy multitask dataset with classification, object detection, and semantic segmentation labels. Following [ghiasi2021multi], the dataset is based on ImageNet, with detection and segmentation labels generated by pretrained open-source models. To make the computational cost and hyper-parameter tuning effort agnostic to the number of tasks, we propose a supernet training algorithm that simultaneously searches for all task architectures in one run.
After the supernet training, we individually train the searched task-specific architectures to uncover their performance.

Excitingly, in addition to requiring reduced computational cost and human effort, extensive experiments show that FBNetV5 produces compact models that achieve SotA performance on all three target tasks. On ImageNet [deng2009imagenet] classification, our model achieved 1.3% higher top-1 accuracy under the same FLOPs as compared to FBNetV3 [Dai2020FBNetV3JA]; on ADE20K [zhou2017scene] semantic segmentation, our model achieved 1.8% higher mIoU than SegFormer [xie2021segformer] with 3.6× fewer FLOPs; on COCO [lin2014microsoft] object detection, our model achieved 1.1% higher mAP with 1.2× fewer FLOPs compared to YOLOX [ge2021yolox]. It is worth noting that all our well-performing architectures are searched simultaneously in a single run, yet they beat the SotA neural architectures that are delicately searched or designed for each task.

## 2 Related Works

Neural Architecture Search for Efficient DNNs. Various NAS methods have been developed to design efficient DNNs, aiming to 1) achieve boosted accuracy vs. efficiency trade-offs [he2016deep, sandler2018mobilenetv2, hu2018squeeze] and 2) automate the design process to reduce human effort and computational cost. Early NAS works mostly adopt reinforcement learning [zoph2016neural, tan2019mnasnet] or evolutionary search algorithms [real2017large] which require substantial resources. To reduce the search cost, differentiable NAS [wu2019fbnet, wan2020fbnetv2, cai2018proxylessnas, chen2019progressive, liu2018darts] was developed to differentiably update the weights and architectures. Recently, to deliver multiple neural architectures meeting different cost constraints, [cai2019once, yu2020bignas] propose to jointly train all the sub-networks in a weight-sharing supernet and then locate the optimal architectures under different cost constraints without re-training or fine-tuning. However, unlike our work, all the works above focus on a single task, mostly image classification, and they do not reduce the effort of designing architectures for other tasks.

Task-aware Neural Architecture Design. To facilitate designing optimal DNNs for various tasks, recent works [howard2019searching, liu2021swin, wang2020deep] propose to design general architecture backbones for different CV tasks. In parallel, with the belief that each CV task requires its own unique architecture to achieve the task-specific optimal accuracy vs. efficiency trade-off, [shaw2019squeezenas, chen2019fasterseg, ghiasi2019fpn, lin2017focal, tan2020efficientdet] develop dedicated search spaces for different CV tasks, from which they search for task-aware DNN architectures. However, these existing methods mostly focus on optimizing task-specific components whose advantages are not transferable to other tasks. Recent works [ding2021hr, cheng2020scalenas] begin to focus on designing networks for multiple tasks in a unified search space and have shown promising results. However, they are designed to be “proxyless” and the search process needs to be integrated into downstream tasks’ training pipelines. This makes it less scalable to add new tasks, since it requires non-trivial engineering effort and compute cost to integrate NAS into the existing training pipeline of a target task. Our work bypasses this by using a disentangled search process, and we conduct search for multiple tasks in one run. This is computationally efficient and allows us to utilize target tasks’ existing training pipelines with no extra effort.

## 3 Method

In this section, we present our proposed FBNetV5 framework that aims to reduce the computational cost and human effort required by NAS for multiple tasks. FBNetV5 contains three key components: 1) A simple yet inclusive and transferable search space (Section 3.1); 2) A search process equipped with a multitask learning proxy to disentangle NAS from target tasks’ training pipelines (Section 3.2); and 3) a search algorithm to simultaneously produce architectures for multiple tasks at a constant computational cost agnostic to the number of target tasks (Section 3.3).

### 3.1 Search Space

To search for architectures for multiple tasks, we design the search space to meet three standards: 1) Simple and elegant: we favor simple search spaces over complicated ones; 2) Inclusive: the search space should include strong architectures for all target tasks; and 3) Transferable: the searched architectures should be useful not only for one model, but also transferable to a family of models.

Inspired by HRNet [wang2020deep, ding2021hr], we extend a SotA classification model, FBNetV3 [Dai2020FBNetV3JA], to a supernet with parallel paths and multiple stages. Each path has a different resolution, while blocks on the same path share the same resolution, as shown in Figure 2 (bottom-left). We divide an FBNetV3 into 4 partitions along the depth dimension; each partition outputs a feature map with a resolution down-sampled by 4, 8, 16, and 32 times, respectively. Stage 0 of the supernet is essentially the FBNetV3 model. For the following stages, we use the last 2 layers of each partition to construct a block per stage. During inference, we first compute Stage 0 of the supernet, and then compute the remaining blocks in topological order. Similar to [wang2020deep], we insert lightweight fusion modules (see Appendix B) between stages to fuse information from different paths (resolutions). A block-wise model configuration of the supernet can be found in Appendix A.

The aforementioned supernet contains blocks with varying significance to different tasks. By conventional wisdom, a classification architecture may only need blocks on the low-resolution paths, while segmentation or object detection would favor blocks with a higher resolution. Based on this, we search for network topologies, *i.e*., which blocks to select or skip for different tasks. Formally, for a supernet with $P$ paths, $S$ stages, and $N$ blocks, a candidate architecture can be characterized by a binary vector $a \in \{0, 1\}^N$, where $a_i = 1$ means to select block-$i$ and $a_i = 0$ means to skip it and remove the corresponding connections from and to this block. More details about the implementation of fusion modules with skipped blocks are provided in Appendix B.

We believe that this search space is simple and elegant. It only contains binary choices for each of the $N$ blocks. This is much simpler than other search space designs that consider how to mix different types of operators (convolutions and transformers) together, or how to wire operators with complicated connections. Furthermore, the search space is inclusive. As a sanity check, it includes most of the mainstream network topologies for CV tasks, *e.g*., 1) the simple linear topology of most classification models, 2) the U-Net [ronneberger2015u] and PANet [liu2018path] topologies for semantic segmentation, and 3) the Feature Pyramid Network (FPN) [lin2017feature] and BiFPN [tan2020efficientdet] for object detection, as illustrated in Appendix C. Finally, the searched architecture topology is transferable. FBNetV3 contains a series of models from small to large. We conduct search on an FBNetV3-A based supernet, and the topology can be transferred to other models. It is worth noting that transferring a topology to models of different sizes, depths, and resolutions is also a common practice adopted by works such as FPN [lin2017feature] and BiFPN [tan2020efficientdet].
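To make the topology encoding concrete, here is a toy sketch (not from the paper; the sizes, block ordering, and validity rule are illustrative) of how a binary vector over $P \times S$ blocks enumerates the candidate topologies:

```python
import itertools

# Toy sketch of the block-level search space: a supernet with P parallel
# paths and S stages, where each (path, stage) block is either kept (1) or
# skipped (0). A candidate topology is a binary vector of length N = P * S,
# so the space contains 2**N candidate architectures.
P, S = 2, 3            # toy sizes; the real supernet is larger
N = P * S

def is_valid(arch):
    """Illustrative validity rule: keep at least one block."""
    return any(arch)

all_archs = [a for a in itertools.product((0, 1), repeat=N) if is_valid(a)]

# A "linear" classification-style topology keeps only the first path's blocks
# (assuming blocks are ordered path-major: path 0 first, then path 1).
linear = tuple(1 if i < S else 0 for i in range(N))
```

Because the encoding is just a mask over blocks, linear, U-Net-like, and FPN-like topologies are all particular bit patterns within the same space.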

### 3.2 Disentangled Search Process

A popular belief is that NAS should be proxyless and the search process should be integrated into each target task’s training pipeline for achieving better results. However, implementing and integrating the search process into each target task’s pipeline can require significant engineering effort. Moreover, many NAS techniques heavily interfere with the target task’s training and thus require much engineering effort to re-tune the hyperparameters.

To avoid the above limitations, we design a search process that is disentangled from target tasks’ training pipelines. Specifically, we conduct search by training a supernet on a multitask dataset where each image is annotated with labels from all target tasks. Following [wu2019fbnet], the supernet training jointly optimizes the model weights and, more importantly, task-specific architecture distributions (*e.g*., the SEG, DET, and CLS Arch. Prob. in Figure 2). The goal of the search process is to obtain a task-specific architecture distribution from which we can sample architectures for the target tasks. The searched models can then be trained using the existing training pipeline of the target tasks without the need to implement the search process in the tasks’ training pipelines or to re-tune existing hyper-parameters. The search process is shown in Figure 2.

As there is no large-scale multitask dataset publicly available, we follow [ghiasi2021multi] to construct a pseudo-labeled dataset based on ImageNet. Specifically, we use 1) original ImageNet labels for classification, 2) open-source CenterNet2 [zhou2021probablistic] pretrained on the COCO object detection dataset to generate pseudo detection labels, and 3) open-source MaskFormer [cheng2021maskformer] pretrained on the COCO-stuff semantic segmentation dataset (171 classes) to generate pixel-wise segmentation labels. In addition, we follow [ghiasi2021multi] to filter out object detection results with a confidence lower than 0.5, and set segmentation predictions whose maximum probability is lower than 0.5 to the “don’t-care” category. As such, this dataset can easily be extended to include more tasks by using open-source pretrained models to generate task-specific pseudo labels.
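The filtering rule above can be sketched as follows; only the 0.5 thresholds come from the text, while the function names, box tuple format, and ignore index are hypothetical:

```python
# Sketch of the pseudo-label filtering described above. Only the 0.5
# thresholds are taken from the text; function names, the box tuple format,
# and the ignore index are hypothetical.
CONF_THRESH = 0.5
DONT_CARE = 255  # assumed "don't-care" ignore index

def filter_detections(boxes):
    """Keep boxes (x1, y1, x2, y2, score, cls) with score >= threshold."""
    return [b for b in boxes if b[4] >= CONF_THRESH]

def filter_segmentation(prob_map, label_map):
    """Replace low-confidence pixel labels with the don't-care index."""
    return [
        [lab if p >= CONF_THRESH else DONT_CARE
         for p, lab in zip(prob_row, label_row)]
        for prob_row, label_row in zip(prob_map, label_map)
    ]
```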

### 3.3 Search Algorithm

Our search algorithm is based on differentiable neural architecture search [liu2018darts, wu2019fbnet, wan2020fbnetv2] for its low computational cost compared with other methods, such as sampling-based methods [Dai2020FBNetV3JA, Tan_2019_CVPR]. For multiple tasks, a simple idea is to apply the conventional single-task NAS (Algorithm 1) $T$ times, once per task. To make this more scalable, we derive a novel search algorithm with a constant computational cost agnostic to the number of tasks (Algorithm 4). For better clarity, we introduce the derivation of the search algorithm in four steps corresponding to Algorithms 1, 2, 3, and 4, respectively. We summarize and compare the four search algorithms in Table 1 and visualize Algorithm 4 in Figure 2.

| Search Algorithm | #Tasks to Handle | #Forward Passes per Iter | #Backprop. Passes per Iter |
|---|---|---|---|
| Algorithm 1 | 1 | 1 | 1 |
| Algorithm 2 | $T$ | $T$ | $T$ |
| Algorithm 3 | $T$ | 1 | $T$ |
| Algorithm 4 | $T$ | 1 | 1 |

#### 3.3.1 Differentiable NAS for a Single Task

We start from a typical differentiable NAS designed for a single task, which can be formulated as

$$\min_{a \in \mathcal{A},\, w} \; \mathcal{L}^t(w, a) \tag{1}$$

where $a$ is a candidate architecture in the search space $\mathcal{A}$, $w$ is the supernet’s weight, and $\mathcal{L}^t(w, a)$ is the loss function of task-$t$ that also considers the cost of architecture $a$. Following [wu2019fbnet, dai2019chamnet], the cost of an architecture can be defined in terms of FLOPs, parameter size, latency, energy, etc.

In our work, we search in a block-level search space. For block-$i$ of the supernet, we have

$$x_i^{\text{out}} = a_i \cdot b_i(x_i^{\text{in}}) + (1 - a_i) \cdot x_i^{\text{in}} \tag{2}$$

where $x_i^{\text{in}}$, $x_i^{\text{out}}$ are the input and output of block-$i$’s function $b_i(\cdot)$, and $a_i \in \{0, 1\}$ is a binary variable that determines whether to compute block-$i$ or skip it. Under this setting, the search space for Equation (1) is combinatorial and contains $2^N$ candidates, where $N$ is the number of blocks. To solve it efficiently, we relax the problem as

$$\min_{\theta,\, w} \; \mathbb{E}_{a \sim P_\theta} \left[ \mathcal{L}^t(w, a) \right] \tag{3}$$

where $a$ is a random variable sampled from a distribution $P_\theta$ parameterized by $\theta$. For each block-$i$, we independently sample $a_i$ from a Bernoulli distribution with an expected value of $\theta_i$. The probability of architecture $a$ computes as

$$P_\theta(a) = \prod_{i=1}^{N} \theta_i^{a_i} (1 - \theta_i)^{1 - a_i} \tag{4}$$

Under this relaxation, we can jointly optimize the supernet’s weight $w$ and the architecture parameter $\theta$ with stochastic gradient descent. Specifically, in the forward pass, we first sample $a \sim P_\theta$, and compute the loss with the input data, weights $w$, and architecture $a$. Next, we compute gradients with respect to $w$ and $a$. Since architecture $a$ is a discrete random variable, we cannot pass the gradient directly to $\theta$. Previous works have adopted the Straight-Through Estimator [bengio2013estimating] to approximate the gradient to $\theta$ with the gradient to $a$. Alternatively, Gumbel-Softmax [jang2016categorical, maddison2016concrete, wu2019fbnet] can also be used to estimate the gradient. We train $w$ and $\theta$ jointly using SGD. After the training finishes, we sample architectures from the trained distribution $P_\theta$ and pass them to the target task’s training pipeline. This process is summarized in Algorithm 1.
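As a minimal illustration of the relaxation, the following sketch (illustrative names, not the paper’s implementation) samples per-block Bernoulli variables and evaluates the architecture probability of Equation (4):

```python
import random

# Illustrative sketch of the relaxed search space: each block i is kept with
# probability theta[i] (independent Bernoulli samples), and Equation (4)
# gives the probability of a complete architecture.
def sample_arch(theta, rng):
    """Draw one architecture: a_i ~ Bernoulli(theta_i), independently."""
    return [1 if rng.random() < t else 0 for t in theta]

def arch_prob(theta, arch):
    """Equation (4): P_theta(a) = prod_i theta_i^a_i * (1-theta_i)^(1-a_i)."""
    p = 1.0
    for t, a in zip(theta, arch):
        p *= t if a == 1 else (1.0 - t)
    return p
```

Averaging task losses over architectures drawn from `sample_arch` gives the Monte-Carlo estimate of the expectation in Equation (3).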

#### 3.3.2 Extending to Multiple Tasks

We are interested in searching for architectures for multiple tasks, which can be formulated as

$$\min_{\{a^t\},\, \{w^t\}} \; \sum_{t=1}^{T} \mathcal{L}^t(w^t, a^t) \tag{5}$$

This is a rather awkward way to combine $T$ independent optimization problems together. To simplify the problem, we first approximate Equation (5) as

$$\min_{\{a^t\},\, w} \; \sum_{t=1}^{T} \mathcal{L}^t(w, a^t) \tag{6}$$

where $w$ is the weight of an over-parameterized supernet shared among all tasks, and $a^t$ is the architecture sampled for task-$t$. One concern of using Equation (6) to approximate Equation (5) is that in multitask learning, the optimization of different tasks may interfere with each other. We conjecture that in an over-parameterized supernet with large enough capacity, the interference is small and can be ignored. Also, unlike conventional multitask learning, our goal is not to train a network with multitask capability, but to find optimal architectures for each task. We conjecture that the task interference has limited impact on the search results.

Using the same relaxation trick as Equation (3), we re-write Equation (6) as

$$\min_{\{\theta^t\},\, w} \; \sum_{t=1}^{T} \mathbb{E}_{a^t \sim P_{\theta^t}} \left[ \mathcal{L}^t(w, a^t) \right] \tag{7}$$

where $a^t$ are architectures sampled from task-specific distributions $P_{\theta^t}$ parameterized by $\theta^t$. To solve this, we can slightly modify Algorithm 1 to reach Algorithm 2.

With Algorithm 2, we do not gain efficiency compared with running Algorithm 1 for $T$ times, since we need to compute $T$ forward and $T$ backward passes in each iteration. With the same number of iterations, we end up with a $T$ times higher compute cost. In the next two sections, we show how we adopt importance sampling and REINFORCE to reduce the number of forward and backward passes to 1.

#### 3.3.3 Reducing Forward Passes to 1

Reviewing Algorithm 2, the need to run multiple forward passes comes from lines 4 and 5: for each task-$t$, we need to sample a different architecture from a different $P_{\theta^t}$ to estimate the expected task loss under $P_{\theta^t}$.

Using importance sampling [mcbook], we reduce the number of forward passes from $T$ to 1. Instead of sampling architectures from $T$ distributions, we can sample architectures once from a common proxy distribution $Q$ and let all tasks share the same architecture in their forward pass. Though not sampling from $P_{\theta^t}$, we can still compute an unbiased estimation of the task loss expectation as

$$\mathbb{E}_{a \sim P_{\theta^t}} \left[ \mathcal{L}^t(w, a) \right] = \mathbb{E}_{a \sim Q} \left[ \frac{P_{\theta^t}(a)}{Q(a)} \mathcal{L}^t(w, a) \right] \approx \frac{1}{M} \sum_{m=1}^{M} \frac{P_{\theta^t}(a_m)}{Q(a_m)} \mathcal{L}^t(w, a_m), \quad a_m \sim Q \tag{8}$$

$M$ is the number of architecture samples. $Q$ can be any distribution as long as it satisfies the condition that $Q(a) > 0$ wherever $P_{\theta^t}(a) > 0$; Equation (8) will then be an unbiased estimator. We empirically design $Q$ as a mixture distribution: we first uniformly sample a task $t'$ from $\{1, \dots, T\}$, and then sample the architecture from $P_{\theta^{t'}}$. For any architecture $a$, its probability can be calculated as $Q(a) = \frac{1}{T} \sum_{t=1}^{T} P_{\theta^t}(a)$, with each $P_{\theta^t}(a)$ computed by Equation (4). Using importance sampling, we redesign the search algorithm as Algorithm 3 to reduce the number of forward passes from $T$ to 1.
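The mixture proxy distribution $Q$ and the reweighting in Equation (8) can be sketched as below; this is a toy illustration with made-up task parameters, not the paper’s code:

```python
import random

# Toy sketch of the mixture proxy distribution Q described above: uniformly
# pick a task, then sample from that task's P_theta. The task parameters
# used here are made up for illustration.
def arch_prob(theta, arch):
    """Equation (4): probability of `arch` under Bernoulli parameters."""
    p = 1.0
    for t, a in zip(theta, arch):
        p *= t if a == 1 else (1.0 - t)
    return p

def sample_from_q(thetas, rng):
    """Sample one shared architecture from Q (mixture over task thetas)."""
    theta = thetas[rng.randrange(len(thetas))]
    return [1 if rng.random() < t else 0 for t in theta]

def q_prob(thetas, arch):
    """Q(a) = (1/T) * sum_t P_{theta^t}(a)."""
    return sum(arch_prob(theta, arch) for theta in thetas) / len(thetas)

def importance_weight(theta_t, thetas, arch):
    """P_{theta^t}(a) / Q(a), the reweighting factor in Equation (8)."""
    return arch_prob(theta_t, arch) / q_prob(thetas, arch)
```

Because every task reuses the single architecture drawn from `sample_from_q`, one forward pass per iteration suffices, with each task’s loss rescaled by its importance weight.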

#### 3.3.4 Reducing Backward Passes to 1

Algorithm 3 only requires 1 forward pass but $T$ backward passes. This is because to optimize the architecture distribution for task-$t$, we need to run a backward pass to compute $\nabla_a \mathcal{L}^t$, which we use to estimate the gradient to the task architecture parameter $\theta^t$ and to update it. To avoid this, we use REINFORCE [williams1992simple] to estimate the gradient as

$$\nabla_{\theta^t}\, \mathbb{E}_{a \sim P_{\theta^t}} \left[ \mathcal{L}^t(w, a) \right] \approx \frac{1}{M} \sum_{m=1}^{M} \frac{P_{\theta^t}(a_m)}{Q(a_m)}\, \mathcal{L}^t(w, a_m)\, \nabla_{\theta^t} \log P_{\theta^t}(a_m), \quad a_m \sim Q \tag{9}$$

$M$ is the number of architecture samples. Given the definition of $P_\theta$ in Equation (4), we can easily derive $\nabla_\theta \log P_\theta(a)$, with its $i$-th element simply computed as

$$\frac{\partial \log P_\theta(a)}{\partial \theta_i} = \frac{a_i}{\theta_i} - \frac{1 - a_i}{1 - \theta_i} \tag{10}$$

Equation (9) is also referred to as the score-function estimator of the true gradient. The intuition is that for any sampled architecture $a_m$, we scale its score’s gradient by the loss $\mathcal{L}^t(w, a_m)$, such that architectures that cause larger losses will be suppressed and vice versa. This technique is more often referred to as the policy gradient in reinforcement learning. For NAS, a similar technique is adopted by [casale2019probabilistic, yan2021fp] to search for classification models. Using Equation (9), we no longer need to run backpropagation to compute $\nabla_a \mathcal{L}^t$ for each task. We still need to compute the gradient to the supernet weights $w$, but we can first sum up the task losses and run the backward pass only once. This is summarized in Algorithm 4 and visualized in Figure 2. We discuss more important details of this algorithm in Appendix E. Note that we still have two for-loops in each iteration to compute the task losses from the network’s predictions and the gradient estimator for each $\theta^t$, but their computational cost is negligible compared with the forward and backward passes.
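The score function of Equation (10) is easy to verify numerically; the pure-Python sketch below (illustrative names, not the paper’s code) checks the closed-form element-wise gradient against finite differences:

```python
import math

# Sketch verifying Equation (10): the i-th element of grad_theta
# log P_theta(a) is a_i / theta_i - (1 - a_i) / (1 - theta_i). We check the
# closed form against a finite-difference approximation of log P_theta(a).
def log_prob(theta, arch):
    """log P_theta(a) for independent Bernoulli block variables."""
    return sum(math.log(t if a == 1 else 1.0 - t)
               for t, a in zip(theta, arch))

def score(theta, arch):
    """Equation (10): element-wise gradient of log P_theta(a) w.r.t. theta."""
    return [a / t - (1 - a) / (1 - t) for t, a in zip(theta, arch)]

theta, arch, eps = [0.3, 0.7], [1, 0], 1e-6
for i in range(len(theta)):
    bumped = list(theta)
    bumped[i] += eps
    finite_diff = (log_prob(bumped, arch) - log_prob(theta, arch)) / eps
    assert abs(finite_diff - score(theta, arch)[i]) < 1e-4
```

In Algorithm 4, this score vector is scaled by the importance-weighted task loss to update each $\theta^t$ without any task-specific backward pass.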

## 4 Experiments

### 4.1 Experiment Settings

We implement the search process and target task’s training pipeline in D2Go^{1}^{1}1https://github.com/facebookresearch/d2go powered by Pytorch [paszke2019pytorch] and Detectron2 [wu2019detectron2]. For the search (training supernet) process, we build a supernet extended from an FBNetV3-A model as illustrated in Section 3.1. During search, we first pretrain the supernet on ImageNet [deng2009imagenet] with classification labels for 1100 epochs, mostly following a regular classification training recipe [graham2021levit, touvron2020training]. More details are included in Appendix F.2. This step takes about 60 hours to finish on 64 V100 GPUs.
Then we train the supernet on the multitask proxy dataset for 9375 steps using SGD with a base learning rate of 0.96. We decay the learning rate by 10x at step-3125. We set the initial sampling probability of all blocks to 0.5. We do not update the architecture parameters until step-6250. We set the architecture parameter’s learning rate to be 0.01 of the weight’s learning rate. It takes about 10 hours to finish when trained on 16 V100 GPUs. More details of the search implementation can be found in Appendix F.1. After the search, we sample the most likely architectures for each task.
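For clarity, the search schedule described above can be restated as a pair of step-dependent learning-rate functions; this is a toy restatement of the stated hyper-parameters, not the actual implementation:

```python
# Toy restatement of the search schedule described above: weight LR 0.96
# decayed 10x at step 3125; architecture parameters frozen until step 6250,
# then trained at 0.01x of the weight LR. Function names are illustrative.
BASE_LR, DECAY_STEP, ARCH_START, TOTAL_STEPS = 0.96, 3125, 6250, 9375

def weight_lr(step):
    """Supernet weight learning rate at a given training step."""
    return BASE_LR * (0.1 if step >= DECAY_STEP else 1.0)

def arch_lr(step):
    """Architecture parameter learning rate (0 while frozen)."""
    return 0.0 if step < ARCH_START else 0.01 * weight_lr(step)
```

Under this schedule, the architecture parameters only start moving in the last third of supernet training, after the shared weights have largely stabilized.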

For training the searched architectures, we mostly follow existing SotA training recipes for each task [graham2021levit, cheng2021maskformer, wu2019detectron2]. See Appendix F.2 for details. For semantic segmentation, we follow MaskFormer [cheng2021maskformer] and attach a modified light-weight MaskFormer head (dubbed Lite MaskFormer) to the searched backbone. For object detection, we use Faster R-CNN’s [ren2015faster] detection head with light-weight ROI and RPN, and call the new head Lite R-CNN. See the architecture design of the two light-weight heads in Appendix G.

### 4.2 Comparing with SotA Compact Models

We compare our searched architectures against both NAS-searched and manually designed compact models for ImageNet [deng2009imagenet] classification, ADE20K [zhou2017scene] semantic segmentation, and COCO [lin2014microsoft] object detection. We search topologies for all tasks by training the supernet once, sample one topology for each task, and transfer the searched topologies to FBNetV3 models of different sizes. We use FBNetV3-{A, C, F} and build two smaller variants of FBNetV3-A by mainly shrinking its input resolution and channel sizes, respectively (see Appendix D). We name a model using the template FBNetV5-{version}-{task}. For a given task, all models share the same searched topology, as in Figure 3.

Compared with all the existing compact models, including automatically searched and manually designed ones, our FBNetV5 delivers architectures with better accuracy/mIoU/mAP vs. efficiency trade-offs on all three tasks: ImageNet [deng2009imagenet] classification (*e.g*., 1.3% higher top-1 accuracy under the same FLOPs as compared to FBNetV3-G [Dai2020FBNetV3JA]), ADE20K [zhou2017scene] segmentation (*e.g*., 1.8% higher mIoU than SegFormer with MiT-B1 as backbone [xie2021segformer] and 3.6× fewer FLOPs), and COCO [lin2014microsoft] detection (*e.g*., 1.1% higher mAP with 1.2× fewer FLOPs as compared to YOLOX-Nano [ge2021yolox]). See Tables 2, 3, 4 and Figure 1 for a detailed comparison.

| Model | Input Size | FLOPs | Top-1 Accuracy (%) |
|---|---|---|---|
| HR-NAS-A [ding2021hr] | 224×224 | 267M | 76.6 |
| LeViT-128S [graham2021levit] | 224×224 | 305M | 76.6 |
| BigNASModel-S [yu2020bignas] | 192×192 | 242M | 76.5 |
| MobileNetV3-1.25x [howard2019searching] | 224×224 | 356M | 76.6 |
| FBNetV5-A-CLS | 160×160 | 215M | 77.2 |
| HR-NAS-B [ding2021hr] | 224×224 | 325M | 77.3 |
| LeViT-128 [graham2021levit] | 224×224 | 406M | 78.6 |
| EfficientNet-B0 [tan2019efficientnet] | 224×224 | 390M | 77.3 |
| FBNetV5-A-CLS | 224×224 | 280M | 78.4 |
| EfficientNet-B1 [tan2019efficientnet] | 240×240 | 700M | 79.1 |
| FBNetV3-E [Dai2020FBNetV3JA] | 264×264 | 762M | 81.3 |
| FBNetV5-A-CLS | 224×224 | 685M | 81.7 |
| LeViT-256 [graham2021levit] | 224×224 | 1.1G | 81.6 |
| EfficientNet-B2 [tan2019efficientnet] | 260×260 | 1.0G | 80.3 |
| BigNASModel-XL [yu2020bignas] | 288×288 | 1.0G | 80.9 |
| FBNetV3-F [Dai2020FBNetV3JA] | 272×272 | 1.2G | 82.5 |
| FBNetV5-C-CLS | 248×248 | 1.0G | 82.6 |
| Swin-T [liu2021swin] | 224×224 | 4.5G | 81.3 |
| LeViT-384 [graham2021levit] | 224×224 | 2.4G | 81.6 |
| BossNet-T1 [li2021bossnas] | 288×288 | 5.7G | 81.6 |
| EfficientNet-B4 [tan2019efficientnet] | 380×380 | 4.2G | 82.9 |
| FBNetV3-G [Dai2020FBNetV3JA] | 320×320 | 2.1G | 82.8 |
| FBNetV5-F-CLS | 272×272 | 2.1G | 84.1 |

| Backbone | Head | Short Size | FLOPs | mIoU (%) |
|---|---|---|---|---|
| HR-NAS-A [ding2021hr] | Concatenation [ding2021hr] | 512 | 1.4G | 33.2 |
| MobileNetV3-Large [li2021searching] | Lite MaskFormer | 448 | 1.5G | 29.2 |
| FBNetV5-A-SEG | Lite MaskFormer | 384 | 1.3G | 35.6 |
| HR-NAS-B [ding2021hr] | Concatenation [ding2021hr] | 512 | 2.2G | 34.9 |
| EfficientNet-B0 [tan2019efficientnet] | Lite MaskFormer | 448 | 2.1G | 31.3 |
| FBNetV5-A-SEG | Lite MaskFormer | 384 | 1.8G | 37.8 |
| MiT-B0 [xie2021segformer] | SegFormer [xie2021segformer] | 512 | 8.4G | 37.4 |
| FBNetV5-A-SEG | Lite MaskFormer | 384 | 2.9G | 41.2 |
| MiT-B1 [xie2021segformer] | SegFormer [xie2021segformer] | 512 | 15.9G | 42.2 |
| FBNetV5-C-SEG | Lite MaskFormer | 448 | 4.4G | 44.0 |
| Swin-T [liu2021swin] | UperNet [xiao2018unified] | 512 | 236G | 46.1 |
| Swin-T [liu2021swin] | MaskFormer [cheng2021maskformer] | 512 | 55G | 46.7 |
| ResNet-50 [he2016deep] | MaskFormer [cheng2021maskformer] | 512 | 53G | 44.5 |
| PVT-Larger [wang2021pyramid] | Semantic FPN [kirillov2019panoptic] | 512 | 80G | 44.8 |
| FBNetV5-F-SEG | Lite MaskFormer | 512 | 9.4G | 46.5 |

| Backbone | Head | Short, Long Size | FLOPs | mAP (%) |
|---|---|---|---|---|
| ShuffleNetV2 1.0x [ma2018shufflenet] | NanoDet-m [nanodet] | 320, 320 | 720M | 20.6 |
| EfficientNet-B0 [tan2019efficientnet] | Lite R-CNN | 224, 320 | 793M | 23.1 |
| FBNetV5-A-DET | Lite R-CNN | 224, 320 | 713M | 25.0 |
| MobileDets [xiong2020mobiledets] | SSDLite [sandler2018mobilenetv2] | 320, 320 | 920M | 25.6 |
| ShuffleNetV2 1.0x [ma2018shufflenet] | NanoDet-m [nanodet] | 416, 416 | 1.2G | 23.5 |
| Modified CSP v5 [ge2021yolox] | YOLOX-Nano [ge2021yolox] | 416, 416 | 1.1G | 25.3 |
| EfficientNet-B2 [tan2019efficientnet] | Lite R-CNN | 224, 320 | 1.2G | 24.9 |
| FBNetV5-A-DET | Lite R-CNN | 224, 320 | 908M | 26.4 |
| ShuffleNetV2 1.5x [ma2018shufflenet] | NanoDet-m [nanodet] | 416, 416 | 2.4G | 26.8 |
| EfficientNet-B3 [tan2019efficientnet] | Lite R-CNN | 224, 320 | 1.6G | 26.2 |
| FBNetV5-A-DET | Lite R-CNN | 224, 320 | 1.35G | 27.2 |
| FBNetV5-A-DET | Lite R-CNN | 320, 640 | 1.37G | 28.9 |
| FBNetV5-A-DET | Lite R-CNN | 320, 640 | 1.80G | 30.4 |

| Tasks | Search Algorithm | Search Cost (GPU hours) | FLOPs | Top-1 Accuracy / mIoU / mAP (%) |
|---|---|---|---|---|
| CLS | Random | - | 769M | 81.5 |
| CLS | Single Task (Alg. 1) | 4000 | 688M | 81.9 (+0.4) |
| CLS | FBNetV5 (Alg. 4) | 4000 / 3 | 726M | 81.8 (+0.3) |
| SEG | Random | - | 2.9G | 38.8 |
| SEG | Single Task (Alg. 1) | 4000 | 2.7G | 40.4 (+1.6) |
| SEG | FBNetV5 (Alg. 4) | 4000 / 3 | 2.8G | 40.4 (+1.6) |
| DET | Random | - | 1.34G | 26.8 |
| DET | Single Task (Alg. 1) | 4000 | 1.36G | 27.3 (+0.5) |
| DET | FBNetV5 (Alg. 4) | 4000 / 3 | 1.36G | 27.2 (+0.4) |

### 4.3 Ablation Study on FBNetV5’s Search Algorithm

To verify the effectiveness of the search algorithm proposed in Section 3.3 (*i.e*., Algorithm 4), we compare the proposed multitask search (Algorithm 4) with single-task search (Algorithm 1) and random search. We sample four architectures from each of the two trained distributions (by Algorithm 4 and Algorithm 1) and from a random distribution where each block has a 0.5 probability of being sampled. We compare the sampled architectures at their best accuracy/mIoU/mAP vs. efficiency trade-off and report the results in Table 5. First, random architectures achieve strong performance, which demonstrates the effectiveness of the search space design. Still, under comparable FLOPs, models from multitask search clearly outperform randomly sampled models, achieving 0.3% higher accuracy on image classification, 1.6% higher mIoU on semantic segmentation, and 0.4% higher mAP on object detection. Compared with single-task search, models searched by multitask search deliver very similar performance (*e.g*., 2.8G vs. 2.7G FLOPs under the same mIoU on ADE20K [zhou2017scene]) while reducing the per-task search cost by a factor of 3 (the number of tasks).

### 4.4 Searched Architectures for Different Tasks

To better understand the architectures searched by FBNetV5, we visualize them in Figure 3. For the SEG model (Figure 3-top), the blocks between Fusion 1 and Fusion 6 match the U-Net pattern of gradually increasing feature resolution; see Figure 5-top for a comparison. For the DET model (Figure 3-middle), we did not find an obvious pattern to describe it, and we leave the interpretation to the reader. Surprisingly, the CLS model contains many blocks from higher-resolution paths. This contrasts with mainstream models [wu2019fbnet, cai2018proxylessnas, cai2019once, li2021searching, yu2020bignas] that only stack layers sequentially. Given that our searched CLS model demonstrates stronger performance than sequential architectures, this may open up a new direction for classification model design.

## 5 Conclusion

We propose FBNetV5, a NAS framework that can search for neural architectures for a variety of CV tasks with reduced human effort and compute cost. FBNetV5 features a simple yet inclusive and transferable search space, a multitask search process disentangled from the target tasks' training pipelines, and a novel search algorithm whose compute cost is constant and agnostic to the number of tasks. Our experiments show that in a single run of search, FBNetV5 produces efficient models that significantly outperform the previous SotA models on ImageNet classification, COCO object detection, and ADE20K semantic segmentation.

## 6 Discussion on Limitations

There are several limitations of our work. First, we did not explore a more granular search space, *e.g*., searching for block-wise channel sizes, which could further improve the searched models' performance. Second, while our framework can search for multiple tasks in one run, it does not support adding new tasks incrementally, which would further improve task scalability. One potential solution is to explore whether we can transfer the searched architectures from one task (*e.g*., segmentation) to similar tasks (*e.g*., depth estimation) without re-running the search.

## References

## Appendix A Block Configurations of FBNetV3-A Supernet

To explain how we extend an FBNetV3 [Dai2020FBNetV3JA] model to the supernet in FBNetV5, we list the code snippets below. They include the block configurations of both FBNetV3-A and the supernet extended from FBNetV3-A, and are compatible with the official implementation of FBNetV3 (https://github.com/facebookresearch/mobile-vision/blob/main/mobile_cv/arch/fbnet_v2/fbnet_modeldef_cls_fbnetv3.py).

## Appendix B Details about the Fusion Module

Following HRNet [wang2020deep], in both the supernet and the searched networks, we design the Fusion module to fuse feature maps of different resolutions with each other. The original HRNet fusion modules are computationally expensive; to reduce cost, we design a parameter-free and (almost) compute-free fusion module.

Each block in the network is fused with all blocks at the next stage. For a block on a given path: 1) if the output feature is on the same path (resolution), the fusion module is essentially an identity connection (blue arrows in Figure 4). 2) To fuse the feature to a path with a larger channel size and lower resolution, we first down-sample the input feature to the target size and repeat the original channels $\lceil c_{out}/c_{in} \rceil$ times, where $c_{in}$ and $c_{out}$ are the input and output channel sizes. If $c_{out}$ is not divisible by $c_{in}$, we drop the extra channels. This is shown as the red arrows in Figure 4. 3) To fuse the feature to a path with higher resolution and a smaller channel size, we first up-sample the feature map. Then, we zero-pad the input feature's channels such that the channel size becomes $\lceil c_{in}/c_{out} \rceil \cdot c_{out}$. Finally, we take every $\lceil c_{in}/c_{out} \rceil$ channels as a group and compute a channel-wise average to produce one output channel. This is also shown in Figure 4. Features fused to the same block are summed together as the input to that block.

This fusion module does not require any parameters, and only requires a negligible amount of compute for down-sampling, up-sampling, padding, and channel-wise average.
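The channel manipulations above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation: the `avg_pool2d`/nearest-interpolation operators and the function names `fuse_down`/`fuse_up` are our assumptions; the paper only specifies the channel repeat/drop and pad/group-average rules.

```python
import math
import torch
import torch.nn.functional as F

def fuse_down(x, c_out, stride):
    """Fuse a feature to a lower-resolution, wider path: down-sample,
    repeat the channels ceil(c_out/c_in) times, and drop the extras."""
    x = F.avg_pool2d(x, kernel_size=stride, stride=stride)  # down-sample (operator is our choice)
    c_in = x.shape[1]
    x = x.repeat(1, math.ceil(c_out / c_in), 1, 1)  # repeat the original channels
    return x[:, :c_out]  # drop extra channels when c_out % c_in != 0

def fuse_up(x, c_out, scale):
    """Fuse a feature to a higher-resolution, narrower path: up-sample,
    zero-pad channels to a multiple of c_out, then average each group of
    ceil(c_in/c_out) consecutive channels into one output channel."""
    x = F.interpolate(x, scale_factor=scale, mode="nearest")  # up-sample
    c_in = x.shape[1]
    g = math.ceil(c_in / c_out)
    x = F.pad(x, (0, 0, 0, 0, 0, g * c_out - c_in))  # zero-pad the channel dim
    n, _, h, w = x.shape
    return x.view(n, c_out, g, h, w).mean(dim=2)  # channel-wise group average
```

For example, fusing a 48-channel feature to a 32-channel path at 2x resolution pads it to 64 channels and averages each consecutive pair of channels into one output channel.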

In the supernet and the searched architectures, if any block (*e.g*., one in (Stage s-1, Path p) or (Stage s, Path p-1)) is skipped, the corresponding connections to and from that block in the fusion modules are also removed, except for the connections to and from other blocks on the same path.

## Appendix C Visualization of the Mainstream Topologies for CV Tasks

To demonstrate that our search space is inclusive, we visualize in Figure 5 how it can represent some of the mainstream network topologies for CV tasks. These include 1) the U-Net (Figure 5-top) and PANet (Figure 5-middle) topologies for semantic segmentation and 2) the FPN (Figure 5-top) and BiFPN (Figure 5-bottom) topologies for object detection.

## Appendix D Reduced FBNetV3-A Variants

We provide the code snippets below to detail the architectures of the two reduced FBNetV3-A variants, obtained by shrinking the resolution and the channel sizes of FBNetV3-A, respectively.

## Appendix E Important Implementation Details of Algorithm 4

We provide several important implementation details of Algorithm 4.

Sampling multiple architectures. Algorithms 1, 2, 3, and 4 show that we sample one architecture in each forward pass. Although this still gives an unbiased estimate of the task loss, such a small sample size leads to high variance. In practice, we implement the supernet training with distributed data parallel in PyTorch, such that each thread independently samples an architecture from the same distribution. We use 16 threads for supernet training, thereby sampling 16 architectures per iteration to reduce the estimation variance.
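As a toy illustration of the per-thread sampling (the variable names and the block count are ours; in the actual implementation each DDP rank draws its own sample independently):

```python
import torch

# Per-block sampling probabilities (the architecture parameters after a
# sigmoid); 0.5 everywhere at initialization, as in the supernet setup.
num_blocks = 24
probs = torch.full((num_blocks,), 0.5)

# Each of the 16 DDP threads draws one architecture per forward pass from
# the same distribution; here we simulate the 16 independent draws at once.
num_threads = 16
archs = torch.bernoulli(probs.expand(num_threads, num_blocks))  # {0,1} block masks
```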

Self-normalized importance sampling. In Equation (8), we compute the importance weight of a sampled architecture $a_i$ as $w_i = P_\Theta(a_i) / Q(a_i)$, where $P_\Theta$ is the current architecture distribution and $Q$ is the sampling distribution. In extreme cases where $Q(a_i)$ is too small relative to $P_\Theta(a_i)$, $w_i$ becomes very large and destabilizes the supernet training. To prevent this, we use self-normalized importance sampling and rewrite Equation (8) as

$\hat{\mathcal{L}} = \sum_{i=1}^{N} \bar{w}_i \, \mathcal{L}(a_i),$   (11)

with $\bar{w}_i = w_i / \sum_{j=1}^{N} w_j$ and $N$ the number of sampled architectures. This gives a consistent (though slightly biased) estimate [mcbook] and prevents the loss from becoming exceedingly large. During supernet training, we implement this with an all-gather operation that collects $w_i$ from all threads before computing the normalized importance weights.
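A minimal single-process sketch of the self-normalization step (in the real implementation the raw weights are first gathered across the 16 threads via all-gather; the function name is ours):

```python
import torch

def self_normalized_weights(p, q):
    """Given current-policy probabilities p_i = P_Theta(a_i) and sampling
    probabilities q_i = Q(a_i) for the architectures gathered from all
    threads, return weights that sum to 1 instead of the raw ratios p/q."""
    w = p / q          # raw importance weights
    return w / w.sum() # self-normalization

# Toy example: one thread sampled an unlikely architecture (tiny q), which
# would get a huge raw weight; normalization bounds its influence to <= 1.
p = torch.tensor([0.20, 0.25, 0.30])
q = torch.tensor([0.25, 0.25, 1e-4])
w = self_normalized_weights(p, q)
```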

Loss normalization. In Equation (9), we scale the gradient by the associated loss to determine whether we should suppress or encourage the sampled architecture. However, different tasks' losses may have different means and variances, so the gradients of different tasks can be scaled inconsistently. To address this, instead of using the raw task loss $\mathcal{L}$ in Equation (9), we use a normalized task loss computed as $\hat{\mathcal{L}} = (\mathcal{L} - \mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the task loss over the past 200 steps. The mean provides a baseline for evaluating how the sampled architecture compares with the average. This is similar to the Reinforcement Learning practice of using the "advantage" instead of the raw reward for policy gradients. The scaling factor $1/\sigma$ ensures that all losses are scaled consistently without tuning task-specific learning rates.
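A sketch of the sliding-window loss normalization, assuming a 200-step window (the class name is ours, not from the released code):

```python
from collections import deque
import statistics

class LossNormalizer:
    """Normalize a task loss by the mean/std of its last `window` values,
    mirroring the "advantage"-style baseline described above."""
    def __init__(self, window=200, eps=1e-8):
        self.history = deque(maxlen=window)  # rolling window of past losses
        self.eps = eps

    def __call__(self, loss):
        self.history.append(loss)
        mu = statistics.fmean(self.history)
        # With a single sample the std is undefined; fall back to 1.0.
        sigma = statistics.pstdev(self.history) if len(self.history) > 1 else 1.0
        return (loss - mu) / (sigma + self.eps)
```

One such normalizer would be kept per task, so each task's gradient scale is comparable regardless of its raw loss magnitude.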

Cost regularization. In addition to the original task loss, *e.g*., cross-entropy for classification, we add a cost regularization term computed as $\mathcal{L}_{cost} = \lambda \left( \sum_b m_b c_b / \sum_b c_b - \tau \right)^2$, where $c_b$ is the cost (*e.g*., FLOPs) of block $b$, and $m_b$ denotes whether block $b$ is selected. $\lambda$ is a loss coefficient and $\tau$ is the relative cost target.
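A sketch of this cost term, assuming a squared penalty on the sampled architecture's relative cost (the squared shape, the function name, and the toy numbers are illustrative, not from the released code):

```python
import torch

def cost_regularization(m, c, tau, lam):
    """Penalize the deviation of the sampled architecture's relative cost
    from the target tau; lam is the loss coefficient."""
    rel_cost = (m * c).sum() / c.sum()  # fraction of the supernet's FLOPs kept
    return lam * (rel_cost - tau) ** 2

m = torch.tensor([1.0, 0.0, 1.0, 1.0])          # block selections m_b
c = torch.tensor([100.0, 50.0, 150.0, 100.0])   # per-block FLOPs c_b
reg = cost_regularization(m, c, tau=0.8, lam=1.0)  # rel_cost = 350/400 = 0.875
```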

Warmup training. Similar to the observation of [wu2019fbnet], before training the architecture parameters $\Theta$, we need to first sufficiently train the model weights $W$. This is because at the beginning of the supernet training, the loss always drops regardless of the choice of architecture. In our implementation, we use warmup training to first train $W$ sufficiently and only then update $W$ and $\Theta$ jointly.

## Appendix F Details about the Search and Training Process Implementation

### f.1 Search Process Implementation

We introduce the implementation details of the search process of FBNetV5. As discussed in Section 3, our search is conducted by training a supernet on a multitask proxy dataset; details of the dataset creation can be found in Section 3.2, and the supernet design in Section 3.1. Our search is based on the supernet extended from an FBNetV3-A model as illustrated in Section 3.1. On top of the supernet, we attach an FBNetV3-style classification head to the end of Path 3, a Faster R-CNN head to Path 2 for object detection, and a single convolutional layer as the segmentation head on Path 2 for semantic segmentation. Note that since we only care about topologies, the heads used during the search can differ from the heads used for the downstream tasks. We pretrain the extended supernet on ImageNet for classification, and then train it on the multitask proxy dataset. We implement the search algorithm in D2Go (https://github.com/facebookresearch/d2go), powered by PyTorch [paszke2019pytorch] and Detectron2 [wu2019detectron2]. To train the supernet, we use a total batch size of 768. The images are resized such that the shorter side is 256, and we take a random 224x224 crop to feed to the model. We train the supernet for 9375 steps using SGD with a base learning rate of 0.96, decayed by 10x at 3125 steps. We set the initial sampling probability of all blocks to 0.5. We do not update the architecture parameters until 6250 steps, and the architecture parameters' learning rate is 0.01x the regular learning rate for the weights. We use 16 V100 GPUs to train the supernet, which takes about 10 hours.

### f.2 Training Process Implementation

For training the task-specific architectures searched by FBNetV5, we follow existing SotA training recipes for each task [graham2021levit, cheng2021maskformer, wu2019detectron2] and use PyTorch [paszke2019pytorch] for all the experiments.

For ImageNet [deng2009imagenet] image classification, we use the FBNetV3-style MBPool+FC classification head on top of the final feature map from Path 3 in Figure 2. We adopt the distillation-based training settings in [graham2021levit, touvron2020training] and use a large pretrained model with an 85.5% ImageNet top-1 accuracy as the teacher. We use a batch size of 4096 on 64 V100 GPUs for 1100 epochs, use SGD with momentum 0.9 and weight decay 2 as the optimizer, initialize the learning rate to 4.0 with an 11-epoch warm-up from 0.01, and decay it every epoch by a factor of 0.9875.

For ADE20K semantic segmentation, we modify MaskFormer's [cheng2021maskformer] segmentation head into a lighter version, *i.e*., we use a pixel decoder with a single 3x3 convolution layer and shrink the transformer decoder to contain only 1 Transformer layer, dubbed Lite MaskFormer. The pixel decoder is attached to the end of Path 1, and the transformer decoder is attached to Path 3. We use the same training settings as for the ResNet backbone in [cheng2021maskformer] to train all the searched architectures, except that we use 320k iterations for the bigger models (FBNetV5-A/C/F-SEG in Table 3) following [wang2021pyramid]. We initialize the backbone with the weights of the ImageNet-pretrained supernet.

For COCO [lin2014microsoft] object detection, we use the searched architectures as backbone feature extractors. We attach a Faster R-CNN [ren2015faster] head to Path 1 of the supernet. We re-design the ROI and RPN heads to be lighter and reduce the number of ROI proposals to 30, naming this version Lite R-CNN. We follow most of the default training settings in [wu2019detectron2], but use a batch size of 256 to train all the searched architectures for 150k iterations with a base learning rate of 0.16, and decay the learning rate by 10x after 140k steps. We keep an exponential moving average (EMA) of the model weights and evaluate the EMA model. As in the ADE20K settings above, we initialize the backbone with the weights of the ImageNet-pretrained supernet.

## Appendix G Design of the Lite Detection and Segmentation Heads

### g.1 Architecture of the Lite MaskFormer Head

MaskFormer [cheng2021maskformer] consists of three components: a pixel decoder (PD), a transformer decoder (TD), and a segmentation module (SM). The pixel decoder generates the per-pixel embeddings. The transformer decoder outputs the per-segment embeddings, which encode the global information of each segment. The segmentation module converts the per-segment embeddings into mask embeddings via a Multi-Layer Perceptron (MLP), and then obtains the final prediction via a dot product between the per-pixel embeddings from the pixel decoder and the mask embeddings.

We shrink both the pixel decoder and the transformer decoder to build the Lite MaskFormer used in our experiments.

Our pixel decoder takes the output of Path 1 and applies a 3x3 convolution layer to generate the per-pixel embeddings.

Our transformer decoder follows the design of MaskFormer's transformer decoder, *i.e*., the same as DETR's [carion2020end], but we shrink it to contain only 1 Transformer [bello2019attention] layer and attach it to the output of Path 3.

We further report the distribution of our models' FLOPs in Table 6. The total FLOPs is the sum of the BB, PD, TD, and SM FLOPs, computed at an input resolution of short_size x short_size following [cheng2021maskformer, xie2021segformer]. All FLOPs in Table 6 are in MFLOPs.

| Model | Short Size | BB | PD | TD | SM | Total |
|---|---|---|---|---|---|---|
| FBNetV5-A-SEG | 384 | 945 | 162 | 139 | 82 | 1328 |
| FBNetV5-A-SEG | 384 | 1389 | 162 | 139 | 82 | 1773 |
| FBNetV5-A-SEG | 384 | 2485 | 215 | 135 | 84 | 2919 |
| FBNetV5-C-SEG | 448 | 3838 | 357 | 144 | 109 | 4448 |
| FBNetV5-F-SEG | 512 | 8502 | 550 | 155 | 142 | 9350 |

### g.2 Architecture of the Lite Faster R-CNN Head

| Model | Ref. Size | BB | RPN | ROI | Total | Avg. |
|---|---|---|---|---|---|---|
| FBNetV5-A-DET | 213x320 | 399 | 152 | 182 | 733 | 713 |
| FBNetV5-A-DET | 213x320 | 601 | 152 | 182 | 935 | 908 |
| FBNetV5-A-DET | 213x320 | 1054 | 158 | 186 | 1398 | 1354 |
| FBNetV5-A-DET | 320x481 | 912 | 347 | 182 | 1441 | 1367 |
| FBNetV5-A-DET | 320x481 | 1372 | 347 | 182 | 1901 | 1800 |

For object detection, we attach a Faster R-CNN [ren2015faster] head to our searched backbones. A Faster R-CNN detector contains two components: a region proposal network (RPN) and a region-of-interest (ROI) head. We use lightweight RPN and ROI heads to save overall compute.

Our RPN head contains an inverted residual block (IRB) [sandler2018mobilenetv2] with kernel size 3, expansion ratio 3, and output channel size 96. We also use Squeeze-and-Excitation [hu2018squeeze] and HSigmoid activations following [howard2019searching]. The output of the IRB is fed to a single convolution layer to generate the RPN output.

Our ROI head contains 4 IRB blocks, all with kernel size 3, expansion ratios of 4, 6, 6, 6, and output channel sizes of 128, 128, 128, 160. These IRB blocks do not use SE or HSigmoid. We use an ROIPool operator to extract a feature map from each region of interest and reshape its spatial size to 6x6; the first IRB block further down-samples the resolution to 3x3. The output of the IRB blocks is projected by a single convolution layer to predict the ROI outputs (bounding box regression, class prediction, etc.).

During inference, we keep the top 30 regions after NMS and feed them to the ROI head. Under this setting, the FLOPs distribution of our models is shown in Table 7. Note that the total FLOPs of each model is computed at the reference input size and is the sum of the BB, RPN, and ROI FLOPs. The average FLOPs reported in Table 4 is computed over the images in the COCO val set.

## Appendix H Average FLOPs of R-CNN Models

In Table 4 and Table 7, we report the average FLOPs of our models on the COCO validation dataset. This is because our R-CNN based detection models do not use a fixed input size, while our baselines [nanodet, ge2021yolox] take a fixed input size. Using the average FLOPs of the R-CNN models makes the comparison with fixed-input-size models fairer.

During inference, our R-CNN models resize images using the following strategy. We first define two parameters, min_size (set to 224 or 320) and max_size (set to 320 or 640). For an input image, we first resize it such that its shorter side becomes min_size while keeping the aspect ratio unchanged. If the longer side of the resized image exceeds max_size, we resize the image again so that its longer side becomes max_size, again preserving the aspect ratio.
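The resizing rule above can be written as a small helper (a sketch; the function name and the rounding behavior are our choices):

```python
def resize_dims(w, h, min_size, max_size):
    """Return the (width, height) an image is resized to: scale so the
    shorter side equals min_size; if the longer side then exceeds
    max_size, rescale so the longer side equals max_size instead.
    Aspect ratio is preserved in both cases."""
    short, long = min(w, h), max(w, h)
    scale = min_size / short
    if long * scale > max_size:
        scale = max_size / long
    return round(w * scale), round(h * scale)
```

For example, a 640x480 image with min_size=224 and max_size=320 is scaled by 224/480, while a very wide 2000x500 image with min_size=320 and max_size=640 is instead capped by its longer side.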

To compute the average FLOPs, we first compute the backbone (BB), RPN, ROI, and total FLOPs of the model at a reference input size (*e.g*., 213x320 or 320x481), as in Table 7. Then, we compute the number of pixels in the reference image and the average number of pixels over all images in the dataset, and take the ratio between the average and the reference pixel counts. Finally, we compute the average FLOPs as ratio x (BB + RPN) + ROI. We do not scale the ROI FLOPs because the backbone and RPN FLOPs are determined by the input resolution, while the ROI head's FLOPs do not depend on it.
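The average-FLOPs computation then reduces to the following one-liner (a sketch with illustrative numbers, not values from Table 7):

```python
def average_flops(bb, rpn, roi, ref_pixels, avg_pixels):
    """Scale the resolution-dependent backbone (BB) and RPN FLOPs by the
    ratio of the dataset's average pixel count to the reference pixel
    count; ROI FLOPs are left unscaled because they depend only on the
    fixed number of proposals, not on the input resolution."""
    ratio = avg_pixels / ref_pixels
    return ratio * (bb + rpn) + roi

# Illustrative numbers: if the average image has half the reference pixels,
# only the backbone and RPN costs shrink accordingly.
avg = average_flops(bb=400.0, rpn=150.0, roi=180.0, ref_pixels=100, avg_pixels=50)
```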