Compression Survey
Abstract Over the past few years, deep neural networks have proved to be an essential element for developing intelligent solutions. They have achieved remarkable performance, at the cost of deeper layers and millions of parameters. Therefore, utilising these networks on resource-limited platforms for smart cameras is a challenging task. In this context, models need to be (i) accelerated and (ii) memory efficient without significantly compromising performance. Numerous works have been done to obtain smaller, faster and more accurate models. This paper presents a survey of methods suitable for porting deep neural networks to resource-limited devices, especially for smart cameras. These methods can be roughly divided into two main parts. In the first part, we present compression techniques. These techniques are categorized into knowledge distillation, pruning, quantization, hashing, reduction of numerical precision and binarization. In the second part, we focus on architecture optimization. We introduce methods to enhance network structures as well as neural architecture search techniques. In each part, we describe the different methods and analyse them. Finally, we conclude this paper with a discussion on these methods.
Keywords Deep learning · Compression · Neural networks · Architecture
A. Berthelier
Institut Pascal - 4 Avenue Blaise Pascal, 63178 Aubiere, France
Tel.: +33630899676
E-mail: [email protected]
T. Chateau
Institut Pascal - 4 Avenue Blaise Pascal, 63178 Aubiere, France
S. Duffner
LIRIS - 20, Avenue Albert Einstein, 69621 Villeurbanne Cedex, France
C. Garcia
LIRIS - 20, Avenue Albert Einstein, 69621 Villeurbanne Cedex, France
C. Blanc
Institut Pascal - 4 Avenue Blaise Pascal, 63178 Aubiere, France
[Figure: overview of the survey, covering compression techniques (knowledge distillation, pruning, quantization, hashing, numerical precision, binarization) and architecture optimization (architecture overview, neural architecture search).]
1 Introduction
Since the advent of deep neural network architectures and their massively parallelized implementations [1, 2], deep learning based methods have achieved state-of-the-art performance in many applications such as face recognition, semantic segmentation and object detection. To achieve this performance, a high computation capability is needed, as these models usually have millions of parameters. Moreover, the implementation of these methods on resource-limited devices for smart cameras is difficult due to high memory consumption and strict size constraints. For example, AlexNet [1] is over 200 MB, and the milestone models that followed, such as VGG [3], GoogleNet [4] and ResNet [5], are not necessarily time or memory efficient. Thus, finding solutions to implement deep models on resource-limited platforms such as mobile phones or smart cameras is essential. Each device has a different computational capacity. Therefore, to run these applications on embedded devices, the deep models need to have fewer parameters and to be time efficient.
A few works have focused on dedicated hardware or FPGAs with a fixed, specific architecture. Dedicated hardware is helpful to optimize a given application, but it is difficult to generalise: the CPU architectures of smartphones differ from one another. Thus, it is important to develop generic methods to help optimize neural networks. This paper aims to describe general compression methods for deep models that can be implemented on a large range of hardware architectures, especially on various general-purpose CPU architectures. Moreover, we are specifically interested in multilayer perceptrons (MLPs) and convolutional neural networks (CNNs), because these types of state-of-the-art models have a large number of parameters.
2 Compression techniques
2.1 Knowledge distillation
labelled data through this deep network. This "synthetic" dataset is then used to train a smaller mimic model (the student network), which assimilates the function that was learned by the larger model. The mimic model is expected to produce the same predictions and mistakes as the deep network. Thus, similar accuracy can be achieved between an ensemble of neural networks and its mimic model with 1000 times fewer parameters. In [11], the authors demonstrated this assertion on the CIFAR-10 dataset. An ensemble of deep CNN models was used to label some unlabeled data of the dataset. Next, the new data were used to train a shallow model with a single convolution and maxpooling layer followed by a fully connected layer with 30k non-linear units. In the end, the shallow model and the ensemble of CNNs reached the same level of accuracy. Further improvements have been made on student-teacher techniques, especially with the work of Hinton et al. [13]. Their framework utilizes the output of the teacher network to penalize the student network. Additionally, it can take an ensemble of teacher networks and compress their knowledge into a student network of similar depth.
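To make this kind of framework concrete, the following is a minimal sketch of a distillation loss combining hard labels with the teacher's softened outputs, in the spirit of [13] (assuming a PyTorch setup; the temperature T, the weight alpha and the function name are illustrative choices, not the authors' reference implementation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combine standard cross-entropy with a soft-target term (sketch)."""
    # Hard-label term: usual cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-target term: KL divergence between the softened teacher and
    # student distributions, scaled by T^2 to keep gradients comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```

In practice, the teacher's logits are computed with gradients disabled and only the student's parameters are updated with this loss.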
In recent years, other compression methods described in this paper have often been preferred. However, some works couple such knowledge transfer techniques with their own methods to achieve strong improvements. For example, the works of Chen et al. [14] and Huang et al. [15] follow this approach, employing additional pruning techniques (see section 2.3). The former uses a deep metric learning model, whereas the latter handles the student-teacher problem as a distribution matching problem, trying to match neuron selectivity patterns between the two networks to increase performance. Aguilar et al. [16] propose to distill the internal representations of a teacher model into a simplified version of it to improve the learning and the performance of the student model. Lee et al. [17] use a self-supervised learning algorithm to improve transfer learning methods. These methods are efficient; however, their performance can vary greatly depending on the application. Classification tasks are easy to learn for a shallow model, but tasks like segmentation or tracking are difficult to apprehend even with a deep model. Furthermore, Müller et al. [18] recently showed with label smoothing experiments that teacher and student networks are sensitive to the format of the data. Thus, improving knowledge distillation methods remains a difficult task.
2.2 Hashing
2.3 Pruning
derivatives. These methods also suggest that reducing the number of weights by using the Hessian of the loss function is more accurate than magnitude-based pruning such as weight decay. Additionally, it reduces the network's overfitting and complexity. However, the second-order derivatives introduce some computational overhead.
Han et al. [30] utilized an intuitive and efficient method to remove parameters. The first step is to learn the connectivity of the network via conventional training, i.e. to learn which parameters (or connections) are more important than the others. The next step consists in pruning the connections with weights below a threshold, i.e. converting a dense network into a sparse one. The crucial step of this method is then to retrain (fine-tune) the network to learn the weights of the remaining sparse connections. If the pruned network is not retrained, the resulting accuracy is considerably lower. The general steps for pruning a network are presented in Figure 3.
Fig. 3: Basic steps for pruning a deep network. Figure inspired by [31].
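As an illustration of this train-prune-retrain loop, here is a minimal sketch of magnitude-based pruning (assuming a PyTorch model; the threshold and helper names are hypothetical) that keeps binary masks so pruned connections stay at zero during fine-tuning:

```python
import torch

def magnitude_prune(model, threshold):
    """Zero out weights whose magnitude is below the threshold (sketch)."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() > 1:  # prune weight tensors, leave biases untouched
            mask = (param.abs() > threshold).float()
            param.data.mul_(mask)   # remove the small-magnitude connections
            masks[name] = mask
    return masks

def reapply_masks(model, masks):
    """Call after each optimizer step so pruned weights stay at zero."""
    for name, param in model.named_parameters():
        if name in masks:
            param.data.mul_(masks[name])
```

After calling magnitude_prune, fine-tuning proceeds as usual, with reapply_masks invoked after every optimizer step so that the network remains sparse.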
Anwar et al. [32] used a similar method. However, they state that pruning has the drawback of constructing a network that has "irregular" connections, which is inefficient for parallel computing. To avoid this problem, the authors introduced a structured sparsity at different scales for CNNs. Thus, pruning is performed at the feature map, kernel and intra-kernel levels. The idea is to force some weights to zero but also to impose sparsity at well-defined activation locations in the network. The technique consists in constraining each outgoing convolution connection for a source feature map to have a similar stride and offset. This results in a significant reduction of both feature and kernel matrices. Sparsity has been studied in numerous works in order to penalize non-essential parameters [33–36].
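As a brief illustration of how such structured sparsity can be encouraged during training, below is a minimal sketch of a group-lasso style penalty over whole convolution filters (one possible flavour of the regularizers used in [33–36]; the coefficient and the filter-level grouping are assumptions):

```python
import torch

def group_lasso_penalty(conv_weight, lam=1e-4):
    """Sum of L2 norms of entire filters (sketch).

    Pushing whole filters toward zero yields structured, filter-level
    sparsity rather than scattered individual zeros.
    conv_weight has shape (out_channels, in_channels, kH, kW).
    """
    per_filter_norm = conv_weight.flatten(start_dim=1).norm(p=2, dim=1)
    return lam * per_filter_norm.sum()

# Added to the task loss during training, e.g.:
# loss = criterion(output, target) + group_lasso_penalty(conv.weight)
```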
A similar pruning approach is used by Molchanov et al. [31]. However, different pruning criteria and technical considerations are defined to remove feature maps and kernel weights, e.g. the minimum weight criterion [30]. They assume that if an activation value (an output feature map) is small, then the corresponding feature detector is not important for the application. Another criterion involves mutual information, which measures how much information one variable contains about another. Further, a Taylor expansion is used, similarly to LeCun et al. [28], to estimate the change in cost between the pruned and the non-pruned network. In this case, pruning is treated as an optimization problem.
A recent pruning method [37] consists in removing filters that are shown to have a small impact on the final accuracy of the network. This automatically removes the filter's corresponding feature map and the related kernels in the next layer. The relative importance of a filter in each layer is measured by the sum of its absolute weights, which gives an expectation of the magnitude of the output feature map. At each iteration, the filters with the smallest values are pruned. Recently, Luo et al. [38] developed a pruning method called ThiNet which, instead of using information from the current layer to prune its unimportant filters, uses information and statistics from the subsequent layer to prune filters of a given layer. Not only weights and filters but also channels can be pruned [39] using more complex thresholding methods.
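A minimal sketch of the magnitude-based filter ranking described in [37] could look as follows (assuming PyTorch tensors; the number of filters to remove is an arbitrary example):

```python
import torch

def rank_filters_by_l1(conv_weight):
    """Return filter indices sorted by the sum of their absolute weights.

    conv_weight has shape (out_channels, in_channels, kH, kW); the L1 norm
    of each output filter approximates the magnitude of its feature map.
    """
    scores = conv_weight.abs().sum(dim=(1, 2, 3))
    return torch.argsort(scores)  # smallest first: candidates for pruning

# Example: identify the 8 weakest filters of a convolution layer.
# conv = torch.nn.Conv2d(64, 128, kernel_size=3)
# weakest = rank_filters_by_l1(conv.weight.data)[:8]
```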
Fig. 4: Comparison of the speed of AlexNet and VGG before and after pruning
on CPU, GPU and TK1. Figure from [40].
Numerous pruning methods exist and each of them has strengths and weaknesses. The main disadvantage of these methods is that pruning a network takes a long time due to the constant retraining that it demands. Recent techniques like [48] try to bypass some steps by pruning neural networks during their training using recurrent neural networks. Nevertheless, all of them result in a considerable reduction of the number of parameters. Pruning methods make it possible to eliminate 10 to 30 percent of the network's weights. Regardless of the method, the size of a network can be decreased through pruning without a significant drop in accuracy. Inference with the resulting models will also be faster (see Figure 4), but the actual speed-up depends on the method used and on the sparsity of the network after pruning.
2.4 Quantization
2.6 Binarization
3 Architecture optimization
Compression methods are widely studied. In present times, some of them are
part of popular deep learning frameworks. Tensorflow Lite [56] has tools to
quantify models, allowing to transfer models to mobile devices easier. Core
ML [57], the Apple framework for deep learning, is also able to apply some
of these methods on the devices of the brand. Thus, on the one hand, a few
compression techniques are already integrated in useful tools for developers
but on the other hand, we are still quite far from understanding the intricacies
of deep neural models.
However, these methods are usually applied on already constructed mod-
els as they aim to reduce their complexity. Thereby, recent research focuses
directly on the architectures of these deep models, i.e. creating optimized ar-
chitectures from the ground-up instead of finding methods to optimize them
afterwards. This section of the survey is adressing these approaches. Firstly,
a review of optimized architectures and modules to obtain efficient models is
performed. Secondly, we present neural architecture search (NAS) methods to
construct models "from scratch".
– the use of 1x1 filters to replace most of the 3x3 filters that are usually present in CNNs,
– decreasing the number of input channels to the 3x3 filters,
– downsampling later in the network in order to have convolution layers with larger activation maps.
The first two choices aim to reduce the overall number of parameters. The third point improves classification accuracy thanks to the large activation maps induced by the 1x1 filters and the delayed downsampling [5].
Thus, the fire module is composed of a squeeze convolution layer with only 1x1 filters, followed by an expand layer incorporating a mix of 1x1 and 3x3 filters. The final SqueezeNet model is 50 times smaller than AlexNet while maintaining the same accuracy.
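For illustration, a fire module can be sketched as follows (a PyTorch approximation; the channel counts are placeholders and details such as the exact activations follow the general description above rather than the full definition in [81]):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Sketch of a SqueezeNet-style fire module: squeeze 1x1, expand 1x1 + 3x3."""

    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # Concatenate the two expand branches along the channel dimension.
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)
```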
This architecture has been taken one step further by Nanfack et al. [82] to create the Squeeze-SegNet architecture, a deep fully convolutional neural network for pixel-wise semantic segmentation. This model is an encoder-decoder style network. The encoder part is similar to the SqueezeNet architecture, while the decoder part is composed of inverted fire modules and convolutional layers proposed by the authors, also inspired by SqueezeNet. The inverted fire module, called a DFire module, is a series of alternating expand and squeeze modules, both taken from SqueezeNet. The downsampling stages are replaced by upsampling steps, as the model needs to produce dense activation maps. Inspired by SegNet [83], the Squeeze-SegNet model is able to reach the same level of accuracy as SegNet [83] on a dataset like CamVid [84] with 10 times fewer parameters.
In 2012, Mamalet et al. [85] introduced the simplification of convolutional filters by using separable convolution layers. This work was improved by Howard et al. [86] with the MobileNet model. Inspired by the work of Chollet [87], the core layers of their architecture are based on depthwise separable convolutions [88]. Stated differently, the convolution step is factorized into two separate steps to decrease the computation time taken by multiplication operations. Firstly, a depth-wise convolution applies a single filter to each input channel. Secondly, a point-wise convolution applies a 1x1 convolution in order to combine the outputs of the depth-wise convolution. This factorisation, introduced in [88], drastically reduces the computational cost of the convolutions.
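The following sketch contrasts a standard convolution with its depthwise separable counterpart (PyTorch, hypothetical channel counts), which makes the saving explicit:

```python
import torch.nn as nn

def standard_conv(in_ch, out_ch):
    # One 3x3 convolution mixing all channels: in_ch * out_ch * 9 weights.
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

def depthwise_separable_conv(in_ch, out_ch):
    """Factorized alternative: one 3x3 filter per input channel, then a 1x1
    point-wise convolution to combine channels (in_ch * 9 + in_ch * out_ch weights)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depth-wise
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # point-wise
    )
```

For example, with in_ch = 64 and out_ch = 128 (ignoring biases), the standard layer has about 74k weights, whereas the separable version has fewer than 9k.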
[Figure: Squeeze-SegNet architecture: a SqueezeNet-like encoder built from convolution layers, maxpooling and Fire modules, followed by a decoder built from deconvolution layers, upsampling and DFire modules.]
In the past few years, the understanding of deep networks has grown thanks to the development of new modules and architectures. However, knowing which model to use for a specific application is still a difficult task. Tasks have become more challenging, and the key to solving them is to find an architecture that fits them well. But the more challenging the task, the more difficult it is to design a network "by hand". Since constructing a proper architecture can be time-consuming, work has been done to study the possibility of letting networks automatically grow, adapt or even construct their own architectures. It is interesting to note that the first works in this field were rooted in physics and biology. Rosenblatt [90], Kohonen [91] and Willshaw et al. [92] related the organisation of brain structures to neural networks in order to find theoretical self-organising processes. Since then, numerous works on the subject have been carried out, and they can be grouped under the name of neural architecture search (NAS).
We give an overview of the different methods in this field. We begin by introducing NAS with the early works in the domain regarding neural gas, followed by neuroevolution methods, inspired by genetic algorithms, and network morphism methods, which aim to transform trained architectures. In these methods, the designed architectures are mostly optimized to obtain the best performance for a given task; however, the size or memory consumption of these structures may not be optimized. Thus, in a last section, we describe supergraph methods capable of finding structures optimized with respect to these criteria.
Following these initial works, the neural gas methods introduced by Martinetz and Schulten [93] were among the first approaches to push forward the idea of self-organized structures. They aimed to find an optimal data representation based on feature vectors. In the early 1990s, the works of Fritzke [94–97] studied the basis of self-organizing and incremental neural networks by extending the neural gas methods into growing structures. Two main ideas were explored:
The first one, described in [97], was to develop an unsupervised learning approach for data visualisation, clustering and vector quantization in order to find a suitable architecture automatically. In this work, a neural network can be seen as a graph on which a controlled growth process is applied. Furthermore, a supervised learning approach was also developed by adding radial basis functions. This addition made it possible, for the first time, to add new units and supervise the training of the parameters at the same time. Moreover, the new units were no longer added randomly, leading to small networks that were able to generalise better. The method was tested on vowel recognition problems and obtained better results than the nearest neighbour algorithm (the former state-of-the-art technique).
The second idea, described by Fritzke in [94], is an extension of the first method based on a Hebb-like learning rule. As opposed to [93], the model has no parameters that change over time but is still capable of continuous learning until a performance criterion is met. From a theoretical point of view, these research works helped to better understand how information is passed and transmitted inside a network.
3.2.2 Neuroevolution
Fig. 7: Example of a crossover step. The two parent structures (left) are ran-
domly decomposed at certain points and reconstructed to build offspring struc-
tures (right).
Fig. 8: The two networks compute the same solution, even though their units appear in a different order. This makes crossover impossible, or one of the main units will disappear. Figure from [100].
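To make the crossover step of Fig. 7 concrete, a toy sketch is given below (a hypothetical list-based genome and a single crossover point; neuroevolution systems such as NEAT [100] additionally align genes by historical markings to avoid the problem illustrated in Fig. 8):

```python
import random

def single_point_crossover(parent_a, parent_b):
    """Cut both parent genomes at a random point and swap the tails (sketch).

    Genomes are plain lists of genes, e.g. layer or connection descriptors.
    """
    point = random.randint(1, min(len(parent_a), len(parent_b)) - 1)
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2

# Example with letter genomes similar to Fig. 7:
# single_point_crossover(list("BCHAGEFD"), list("AEFGBCHD"))
```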
approach is able to find better structures than human design, given enough computation time. These methods automatically find adapted and efficient structures for a specific task. However, the computational cost is high.
3.2.4 Supergraphs
111]. Thus, in principle, the fact that child models use their parameters for different purposes is not a problem.
order to reach this balance, two different hierarchical search spaces are used: one to factorise a deep network into a sequence of blocks and another one to determine the layer architecture of each block. Thus, different layers can use different operations, but all the layers in one block share the same structure. The appropriate block is chosen at different depths of the network to reduce the overall latency. These improvements reduce the search space while finding architectures that perform better and faster than MobileNets [86, 113].
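One common way to encode such an accuracy/latency trade-off, used for instance by MnasNet [112], is to scale the accuracy of a candidate architecture by its measured latency relative to a target budget. A minimal sketch follows (the exponent value is an assumption inspired by [112]):

```python
def latency_aware_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """Reward used to rank candidate architectures (sketch).

    With w < 0, models slower than the target budget are penalized; the
    accuracy is scaled by (latency / target) ** w.
    """
    return accuracy * (latency_ms / target_ms) ** w

# A model at 80 ms against a 75 ms budget keeps most of its reward:
# latency_aware_reward(0.75, 80.0, 75.0)  # ~0.747
```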
precision method can be one of the solutions. The speed gained by limiting the numerical precision is significant, especially if the structure is well designed. Nevertheless, higher precision is needed in some steps and accuracy can vary significantly. In the end, compressing a deep model will always lead to a trade-off between accuracy and computational efficiency.
Faster models provide a great benefit for resource-limited devices, and further work needs to be done in this direction if we want to leverage all of their power on mobile devices. However, finding new methods to compress deep models is not the only solution. We can also focus on how the models are constructed in the first place. For example, a compact architecture like SqueezeNet [81] is able to reach the same accuracy as a deep model like AlexNet [1] while being 50 times smaller.
Beyond the size of the model, computational efficiency is crucial for running such algorithms on mobile platforms. Despite the effort on hardware optimization, algorithmic optimizations like [85] and recent works such as MobileNet [86] and ShuffleNet [89] have shown that it is promising not only to compress models but also to construct them intelligently. Thus, a well-designed architecture is the first key to optimized networks.
The works on NAS design an optimized architecture (performance-wise and computational efficiency-wise) for specific tasks. Though this is a challenging exercise, some research works have already shown promising results through new algorithms and theories like the lottery ticket hypothesis [114]. All these works push forward the understanding of the mechanics behind deep models and move us towards building optimized models capable of solving challenging applications at a lower cost.
References
9. J. Cheng, P. Wang, G. Li, Q. Hu, and H. Lu, “Recent Advances in Efficient Compu-
tation of Deep Convolutional Neural Networks,” Frontiers of Information Technology
& Electronic Engineering, 2018.
10. Y. N. Dauphin and Y. Bengio, “Big neural networks waste capacity,” 2013.
11. J. Ba and R. Caruana, “Do deep nets really need to be deep?,” NIPS, pp. 2654–2662,
2014.
12. C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” pp. 535–541,
2006.
13. G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,”
NIPS 2014 Deep Learning Workshop, pp. 1–9, 2014.
14. Y. Chen, N. Wang, and Z. Zhang, “Darkrank: Accelerating deep metric learning via
cross sample similarities transfer,” 2017.
15. Z. Huang and N. Wang, “Like what you like: Knowledge distill via neuron selectivity
transfer,” 2017.
16. G. Aguilar, Y. Ling, Y. Zhang, B. Yao, X. Fan, and C. Guo, “Knowledge distillation
from internal representations,” 2020.
17. H. Lee, S. J. Hwang, and J. Shin, “Self-supervised label augmentation via input trans-
formations,” ICML, 2020.
18. R. Müller, S. Kornblith, and G. Hinton, “When does label smoothing help?,” NIPS,
2019.
19. K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, “Feature hash-
ing for large scale multitask learning,” pp. 1113–1120, 2009.
20. W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, “Compressing neural
networks with the hashing trick,” pp. 2285–2294, 2015.
21. R. Spring and A. Shrivastava, “Scalable and sustainable deep learning via randomized
hashing,” pp. 445–454, 2017.
22. J. Ba and B. Frey, “Adaptive dropout for training deep neural networks,” pp. 3084–
3092, 2013.
23. A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hash-
ing,” Proceedings of the 25th International Conference on Very Large Data Bases,
pp. 518–529, 1999.
24. R. Shinde, A. Goel, P. Gupta, and D. Dutta, “Similarity search and locality sensitive
hashing using ternary content addressable memories,” pp. 375–386, 2010.
25. N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and
P. Dubey, “Streaming similarity search over one billion tweets using parallel locality-
sensitive hashing,” Proceedings of the VLDB Endowment, pp. 1930–1941, 2013.
26. Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng, “Query-aware locality-sensitive
hashing for approximate nearest neighbor search,” Proceedings of the VLDB Endow-
ment, pp. 1–12, 2015.
27. A. Shrivastava and P. Li, “Asymmetric lsh (alsh) for sublinear time maximum inner
product search (mips),” pp. 2321–2329, 2014.
28. Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal Brain Damage," Advances in Neural Information Processing Systems, pp. 598–605, 1990.
29. B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal
brain surgeon,” pp. 164–171, 1993.
30. S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for
efficient neural network,” NIPS, pp. 1135–1143, 2015.
31. P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional
neural networks for resource efficient transfer learning,” ICLR, 2017.
32. S. Anwar, K. Hwang, and W. Sung, “Structured pruning of deep convolutional neural
networks,” ACM Journal on Emerging Technologies in Computing Systems, 2017.
33. W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep
neural networks,” NIPS, p. 2082–2090, 2016.
34. H. Zhou, J. M. Alvarez, and F. Porikli, Less Is More: Towards Compact CNNs, pp. 662–
677. Cham: Springer International Publishing, 2016.
35. J. M. Alvarez and M. Salzmann, “Learning the number of neurons in deep networks,”
pp. 2270–2278, 2016.
36. V. Lebedev and V. Lempitsky, “Fast convnets using group-wise brain damage,”
pp. 2554–2564, 2016.
37. H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient
ConvNets,” ICLR, pp. 1–10, 2017.
38. J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural
network compression,” ICCV, 2017.
39. Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning Efficient Convolu-
tional Networks through Network Slimming,” ICCV, pp. 2736–2744, 2017.
40. S. Han, H. Mao, and W. J. Dally, “Deep Compression - Compressing Deep Neural
Networks with Pruning, Trained Quantization and Huffman Coding,” ICLR, pp. 1–13,
2016.
41. R. Yu, A. Li, C. Chen, J. Lai, V. I. Morariu, X. Han, M. Gao, C. Lin, and L. S. Davis,
“NISP: pruning networks using neuron importance score propagation,” CVPR, 2018.
42. Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu,
“Discrimination-aware channel pruning for deep neural networks,” in Advances in Neu-
ral Information Processing Systems 31, pp. 875–886, 2018.
43. Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, “Filter pruning via geometric median for
deep convolutional neural networks acceleration,” CVPR, 2019.
44. S. Lin, R. Ji, C. Yan, B. Zhang, L. Cao, Q. Ye, F. Huang, and D. S. Doermann,
“Towards optimal structured CNN pruning via generative adversarial learning,” CVPR,
pp. 2790–2799, 2019.
45. T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank
matrix factorization for Deep Neural Network training with high-dimensional output
targets,” IEEE International Conference on Acoustics, Speech and Signal Processing,
pp. 6655–6659, 2013.
46. T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Re-
view, pp. 455–500, 2009.
47. Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and
acceleration for deep neural networks,” arXiv preprint arXiv:1710.09282, 2017.
48. J. Lin, Y. Rao, J. Lu, and J. Zhou, “Runtime neural pruning,” pp. 2178–2188, 2017.
49. Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks
using vector quantization,” 2014.
50. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to
document recognition,” Proceedings of the IEEE, pp. 2278–2324, 1998.
51. Y. Choi, M. El-Khamy, and J. Lee, “Towards the limit of network quantization,” ICLR,
2017.
52. D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014.
53. J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning
and stochastic optimization,” Journal of Machine Learning Research, pp. 2121–2159,
2011.
54. M. D. Zeiler, “ADADELTA: An Adaptive Learning Rate Method,” p. 6, 2012.
55. G. E. Hinton, N. Srivastava, and K. Swersky, “Lecture 6a- overview of mini-batch
gradient descent,” COURSERA: Neural Networks for Machine Learning, p. 31, 2012.
56. M. Abadi, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015.
Software available from tensorflow.org.
57. J. Ahmad, J. Beers, M. Ciurus, R. Critz, M. Katz, A. Pereira, M. Pringle, and J. Rames,
iOS 11 by Tutorials: Learning the New iOS APIs with Swift 4. Razeware LLC, 1st ed.,
2017.
58. C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. Lecun, “Neu-
Flow: A runtime reconfigurable dataflow processor for vision,” IEEE Computer Society
Conference on Computer Vision and Pattern Recognition Workshops, 2011.
59. V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240 G-ops/s mo-
bile coprocessor for deep neural networks,” IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Workshops, pp. 696–701, 2014.
60. V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on
cpus,” in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011,
2011.
61. A. Iwata, Y. Yoshida, S. Matsuda, Y. Sato, and Y. Suzumura, “An artificial neural net-
work accelerator using general purpose 24 bit floating point digital signal processors,”
Proc. IJCNN, pp. 171–175, 1989.
88. L. Sifre and S. Mallat, "Rigid-Motion Scattering For Image Classification," CoRR, 2014.
89. X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional
neural network for mobile devices,” 2017.
90. F. Rosenblatt, "Perceptrons and the Theory of Brain Mechanisms," 1962.
91. T. Kohonen, “Self-organized formation of topologically correct feature maps,” Biolog-
ical Cybernetics, vol. 43, no. 1, pp. 59–69, 1982.
92. D. J. Willshaw and C. von der Malsburg, "How patterned neural connections can be set up by self-organization," Proceedings of the Royal Society of London. Series B, Biological Sciences, vol. 194, no. 1117, pp. 431–445, 1976.
93. T. M. Martinetz, S. G. Berkovich, and K. J. Schulten, ““Neural-Gas” Network for
Vector Quantization and its Application to Time-Series Prediction,” 1993.
94. B. Fritzke, “A Growing Neural Gas Learns Topologies,” Advances in Neural Informa-
tion Processing Systems, vol. 7, pp. 625–632, 1995.
95. B. Fritzke, “Supervised Learning with Growing Cell Structures,” Advances in Neural
Information Processing Systems 6, no. 1989, pp. 255–262, 1994.
96. B. Fritzke, "Fast learning with incremental RBF networks," Neural Processing Letters, vol. 1, no. 1, pp. 2–5, 1994.
97. B. Fritzke, “Growing cell structures-A self-organizing network for unsupervised and
supervised learning,” Neural Networks, vol. 7, no. 9, pp. 1441–1460, 1994.
98. D. J. Montana and L. Davis, “Training feedforward neural networks using genetic algo-
rithms,” Proceedings of the International Joint Conference on Artificial Intelligence,
pp. 762–767, 1989.
99. D. Floreano, P. Dürr, and C. Mattiussi, “Neuroevolution: from architectures to learn-
ing,” Evolutionary Intelligence, vol. 1, no. 1, pp. 47–62, 2008.
100. K. O. Stanley and R. Miikkulainen, “Evolving neural networks through augmenting
topologies,” Evolutionary Computation, 2002.
101. N. J. Radcliffe, “Genetic set recombination and its application to neural network topol-
ogy optimisation,” Neural Computing & Applications, vol. 1, no. 1, pp. 67–90, 1993.
102. D. Thierens, “Non-redundant genetic coding of neural networks,” pp. 571–575, 1996.
103. R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju,
H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat, “Evolving Deep Neural Net-
works,” 2017.
104. T. Elsken, J. H. Metzen, and F. Hutter, “Simple and efficient architecture search for
convolutional neural networks,” 2018.
105. H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, “Efficient Architecture Search by
Network Transformation,” pp. 2787–2794, 2018.
106. H. Jin, Q. Song, and X. Hu, “Efficient Neural Architecture Search with Network Mor-
phism,” 2018.
107. H. Cai, J. Yang, W. Zhang, S. Han, and Y. Yu, “Path-level network transformation for
efficient architecture search,” in Proceedings of the 35th International Conference on
Machine Learning, pp. 678–687, 2018.
108. S. Saxena and J. Verbeek, “Convolutional Neural Fabrics,” NIPS 2016, 2016.
109. H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, “Efficient neural architecture search
via parameters sharing,” in Proceedings of the 35th International Conference on Ma-
chine Learning, vol. 80, 2018.
110. T. Veniat and L. Denoyer, “Learning time/memory-efficient deep architectures with
budgeted super networks,” in Conference on Computer Vision and Pattern Recogni-
tion, pp. 3492–3500, 2018.
111. B. Zoph, D. Yuret, J. May, and K. Knight, “Transfer Learning for Low-Resource Neural
Machine Translation,” 2016.
112. M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, “MnasNet: Platform-Aware
Neural Architecture Search for Mobile,” CVPR, 2019.
113. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Inverted Residuals
and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmenta-
tion,” 2018.
114. J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable
neural networks,” in ICLR, 2019.