DEEP LEARNING FRAMEWORKS

A PREPRINT
Jordan University of Science and Technology
Irbid, Jordan
[email protected]
May 7, 2020
ABSTRACT
Deep Learning (DL) is one of the hottest fields in machine learning as DL approaches produced
results superior to the state-of-the-art in several areas such as image processing and natural language
processing (NLP). To foster the growth of DL, several open source frameworks appeared providing
implementations of the most common DL algorithms. These frameworks vary in the algorithms
they support and in the quality of their implementations. The purpose of this work is to provide a
qualitative and quantitative comparison among three of the most popular and most comprehensive
DL frameworks (namely Google’s TensorFlow, University of Montreal’s Theano and Microsoft’s
CNTK). The ultimate goal of this work is to help end users make an informed decision about the best
DL framework that suits their needs and resources. To ensure that our study is as comprehensive as
possible, we conduct several experiments using multiple benchmark datasets from different fields
(image processing, NLP, etc.) and measure the performance of the frameworks’ implementations of
different DL algorithms. For most of our experiments, we find that CNTK's implementations are superior to the others under consideration.
1 Introduction
Deep Learning (DL) is the hottest field in Machine Learning (ML). The idea of DL is to train a multi-layer Neural
Network (NN) on a dataset in order to allow it to handle real world tasks. Although the theoretical concepts behind
DL are not new, DL has enjoyed a surge of interest over the past decade due to many factors including its successful
application to several problems (many of which have great commercial potential) and the improved affordability of the
required computing infrastructure.
DL approaches have significantly outperformed state-of-the-art approaches in many classical problems of many fields
such as image processing, computer vision, speech processing, natural language processing (NLP), etc. Moreover, the
scientific community (from both academia and industry) has quickly and massively adopted DL. Open source
implementations of successful DL algorithms quickly appeared on code sharing websites, and were subsequently used
by many researchers in different fields.
Several DL frameworks exist such as TensorFlow, Theano, CNTK, Caffe, Torch, Neon, pylearn, etc. Each one of these
frameworks has different features and performance characteristics. Further, each framework utilizes different techniques
to optimize its implementation of DL algorithms. Although the same algorithm is implemented in different frameworks,
the performance of the different implementations can vary greatly. Researchers/practitioners looking to employ such
algorithms in their research or application face a difficult choice, since the number of different implementations is high
and the effort invested by the research community in scientifically comparing these implementations is limited.
In this work, we aim at providing qualitative and quantitative comparison between popular open source DL frameworks.
To be more specific, we focus on three very popular DL frameworks, namely Theano (from MILA Lab, University of
Montreal), TensorFlow (from Google), and CNTK (from Microsoft). These frameworks support multi-core CPUs as
well as multiple GPUs. All of them use cuDNN, which is a DL library from NVIDIA that provides highly tuned
implementations for standard routines such as forward and backward convolution, normalization, pooling, and activation
layers.1 We compare these frameworks by training different NN architectures on five different standard benchmark
datasets for various tasks in image processing, computer vision and NLP.2 Despite their importance, comparative studies
like ours that focus on performance issues are rare. A comparative study of the frameworks is important in order to
enable people who are interested in applying DL in their research and/or applications to make informed decisions about
which of the existing frameworks suits their needs.
The rest of this paper is organized as follows. Section 2 presents our survey of the literature, while Section 3 explains in detail what NNs are. Section 4 discusses the frameworks, the way they were used to train the datasets and a brief
comparison between them. The methodology we follow is discussed in Section 5. Experimental results and the
discussion are detailed in Section 6. The work is concluded with final thoughts presented in Section 7.
2 Literature Survey
Only a few researchers have conducted comparative studies of state-of-the-art DL frameworks running on different hardware platforms (CPU and GPU) to highlight the advantages and limitations of each framework when applied to different deep NN architectures, and to enable developers to optimize the running performance of DL frameworks.
One of the first and most prominent examples of comparative studies between DL frameworks was carried out by
Bahrampour et al. [2]. The authors compared five DL frameworks: Caffe, Neon, TensorFlow, Theano, and Torch, in
terms of speed (gradient computation time and forward time), hardware utilization and extensibility (ability to support
different types of DL architectures) after applying various convolutional algorithms on the aforementioned frameworks.
They conducted their experiments on a single machine for both CPU (multithreaded) and GPU (NVIDIA Titan X)
environments. The comparison between frameworks was carried out by training convolutional and stacked autoencoder
(AE) networks on the MNIST and ImageNet datasets. They also trained long short-term memory (LSTM) networks [3]
on the IMDB dataset [4].3 The authors reported several observations/findings. In terms of extensibility, they found that
Theano and Torch were the best as they can support various DL architectures and libraries. Moreover, they found that
TensorFlow was a very flexible framework, especially when deploying different parts of the computational graph across devices. Finally,
emphasizing ease of use, they noticed that Caffe was the easiest. In terms of performance, they noticed that Torch was
the best for training and testing their DL architectures on a CPU platform. Theano came in second while Neon gave the
worst performance. On a GPU platform, for convolutional and fully connected networks, they found that Torch was the
best followed by Theano. Moreover, they noticed that Theano was the fastest on small networks and Torch was the
fastest on large networks followed by Neon. For recurrent networks (LSTM), they found that Theano’s results were the
best in terms of performance. TensorFlow on single GPU was the worst compared to other studied frameworks.
Shi et al. [5] did a comparative study between several DL frameworks including Caffe, MXNet, CNTK, TensorFlow,
and Torch. They considered three types of NN: fully connected NN (FCN), convolutional NN (CNN) and
recurrent NN (RNN). Moreover, they used different hardware environments including two CPU platforms and three
GPU platforms. They considered the running time and the convergence rate as the metrics to evaluate the selected
frameworks. They used synthetic datasets to measure running time performance and real-world datasets to measure the
convergence rate in their experiments. The results were as follows. For the synthetic datasets, they evaluated the performance of FCN using a large NN (FCN-S), and used AlexNet and ResNet-50 on the ImageNet dataset. For the real-world datasets, they used the MNIST dataset with a small FCN (FCN-R) and the CIFAR10 dataset with AlexNet-R and ResNet-56. For RNN, they chose two LSTM layers for testing. After experimentation, they found that all tested
frameworks achieved significant speed-up using GPU over CPU. For CPU platform, they found that TensorFlow
was the best compared to other tools. On a single GPU, Caffe, CNTK and Torch performed better than MXNet and
1 The frameworks now have updates which were not present during the writing of this paper.
2 In an earlier version of this work [1], we only considered two datasets.
3 More details will be given later about these neural network architectures and these datasets.
TensorFlow on FCN implementations. For small CNN, Caffe and CNTK achieved good performances. For RNN
(LSTM), CNTK was the fastest as it was five to ten times better than the other tools. Finally, on multi-GPU platforms,
all implementations had higher throughput and convergence rate.
Goldsborough [6] showed the timeline of ML software libraries for DL. He focused on TensorFlow’s results and its
basic properties including computational paradigms, its distributed model and programming interface. He compared
TensorFlow to other DL frameworks including Theano, Torch and Caffe, qualitatively and quantitatively. In qualitative
terms, he compared aforementioned frameworks using several categories including frontends, programming model style,
how gradients are computed, and distributing the execution of computational graph. Table 1 shows a summary of this
comparison. In quantitative terms, he reviewed works (such as [2, 7]), which contain comparisons between TensorFlow
and other DL frameworks. From the LeNet benchmark in [8], he noted that TensorFlow ranked second after Torch in forward and backward measures but came last among the tested frameworks in terms of overall performance. From the results of the convolutional network benchmark in [9], he noted that TensorFlow came in second place behind Torch in forward and backward propagation time. Finally, in [10], the benchmarks were CNN models including the AlexNet architecture, and an LSTM network operating on the Penn TreeBank dataset [11]. He noted that TensorFlow was the best framework for small models, followed by Theano and then Torch. For large models, TensorFlow ranked second after Theano, and Torch came last.
Chintala4 applied different ImageNet benchmarks to a variety of convolutional network types including AlexNet, GoogleNet, Overfeat, and OxfordNet using different open source DL frameworks such as Caffe, Theano, Torch, TensorFlow, Chainer, etc. He conducted his experiments on an NVIDIA Titan X GPU using two new packages for Fast Fourier Transform (FFT) computation [12]: the first is based on the NVIDIA library (cuFFT) and the other on a Facebook FFT implementation (fbfft). Moreover, he used the native version of each DL framework. After experimentation, he found that using fbfft resulted in a speed-up over cuFFT for all applied CNNs. Moreover, he found that the fastest framework for CNN was Torch, followed by TensorFlow.
The Theano development team (Al-Rfou et al. [7]) discussed the Theano framework, its features, how to use it, and showed
recent improvements on it. They did a performance comparison between Theano and other frameworks including Torch
and TensorFlow on three types of ML models including CNN, RNN and sequence-to-sequence mapping RNN. Finally,
they showed the computation speed using multiple GPUs. The results were as follows. On a single GPU platform, for
CNN, they measured the processing time of four different convolutional models (AlexNet [13], OverFeat [14], VGG [15], and GoogLeNet [16]) on the ImageNet dataset. They reported results for each framework per minibatch for the forward and backward passes. They found that Theano was slower than Torch and TensorFlow, although, in terms of overall performance, the frameworks were close to each other. For RNN, they used LSTM on the Penn TreeBank dataset [11] and reported the results on
small, medium and large LSTM models. They found that Theano was in the second place after TensorFlow for the
small model, but Theano was the fastest for the medium and large models. They also showed that Torch was slower than
Theano on all tested models. Finally, for sequence-to-sequence model [17], the input was video frames and the output
was the English sentence describing the input. The input video frame was preprocessed by a GoogLeNet (pre-trained for
classification on ImageNet). They compared Theano to TensorFlow and excluded Torch because there was no available
implementation in Torch. They reported the processing time per minibatch using three different batch sizes (32, 64 and
128). They found that Theano was the fastest on smaller batches. However, on large ones, TensorFlow was the superior
one. They repeated the previous RNN model (LSTM) on multi-GPU platforms (2 GPUs and 4 GPUs) using Platoon. When synchronizing after each batch, they measured speed-ups between 1.6X and 1.7X for 2 GPUs and 3.2X for 4 GPUs. When synchronizing after every 100 batches, they found a 2X speed-up for 2 GPUs and a 3.9X-4X speed-up for 4 GPUs.
Kovalev et al. [18] presented a comparative study between five DL frameworks: Theano, Torch, Caffe, TensorFlow, and
DeepLearning4J, in terms of training and prediction speed and classification accuracy. They used the MNIST dataset of handwritten digits to test FCNs in the five frameworks. Their computational experiments were run only on CPU; they reported the results for two kinds of scaling factors applied to the FCN networks: changing the network's depth (number of internal layers) and changing the network's width (number of neurons). Moreover, they tested the NN
4 https://github.com/soumith/convnet-benchmarks
with two different activation functions, Tanh and ReLU. With the Tanh nonlinearity, they found that the training time remained approximately 30 seconds as they changed the number of layers from one to four in all frameworks except DeepLearning4J, for which the training time started at 140 seconds for one layer and grew to 210 seconds for four layers. For the prediction time, they found that Theano, Torch, Caffe, and TensorFlow
consumed less than 0.4 second. However, for DeepLearning4J, it started from 0.75 second when using one layer and
increased up to 1.1 second for four layers. In terms of classification accuracy, they found that Theano, DeepLearning4J,
and Caffe achieved high accuracy starting from 94% for one layer and going up to 98% for four layers. For Torch and
TensorFlow, the accuracy dropped as the number of network layers increased. With the ReLU nonlinearity, the training time was much lower than with the Tanh function. As for the prediction time, they observed that ReLU gave results similar to those of Tanh. In terms of accuracy, they found that Torch's accuracy grew as the number of layers increased, whereas the other frameworks behaved the same as with Tanh.
Finally, they changed the number of neurons in the hidden layers of networks for ReLU function only, and reported the
speed and accuracy values. They found that DeepLearning4J framework was the slowest in training and prediction
times, and time consumed increased with the increase in the number of neurons. The final classification accuracy when
changing the internal layer sizes from 64 to 1024 neurons remained around 97% for Theano, Caffe, TensorFlow, and
DeepLearning4J. However, in the case of Torch, the classification accuracy grew with the layer size (starting from 70% and reaching 98%).
Bastien et al. [19] presented new features added to Theano in order to improve its performance
in different benchmarks. They conducted a comparative study (in terms of features and performance) between Theano
and Torch7 on NN benchmarks, and between Theano and RNNLM on RNN benchmarks. In their comparison, they
used three learning architectures: logistic regression, NN with one hidden layer (500 units) and Deep NN (DNN) with
three hidden layers (1000 hidden units each). They found that, when training the one-hidden-layer model on CPU without mini-batches, Theano outperformed Torch7. However, on the logistic regression benchmark, Torch7 thrived because, at each call, it performed less computation. On the GPU, with a batch size of one, Torch7 outperformed Theano. When using mini-batches, Theano was faster than (or had an equivalent speed
to) Torch7 on all three learning architectures under consideration. When they applied RNN on Theano and RNNLM
with batch size of one, they found that RNNLM was faster than Theano on smaller models. However, for bigger sizes,
Theano was faster.
Ding et al. [20] introduced a Theano-based implementation of AlexNet for the ImageNet dataset and ran it on multiple GPUs in order to accelerate the training process. They compared the training time of their Theano implementation to that of the Caffe library running on a single GPU. On a single GPU with a batch size of 256, Caffe was shown to be faster than Theano. However, on two GPUs with a batch size of 128, Theano was found to be faster than Caffe running on one GPU.
Dai and Berleant [21] studied benchmarking principles and machine learning hardware setup including GPUs, FPGAs
and ASICs. Moreover, they studied deep learning software frameworks. They introduced 11 qualitative benchmarking
metrics for hardware devices and six metrics for deep learning software frameworks. Moreover, they compared 18 deep
learning frameworks and divided them into three categories, namely mature frameworks (including Caffe, Facebook Caffe2, Chainer, DyNet, MXNet, CNTK, TensorFlow, Keras, Neon, PlaidML, PyTorch and Theano), developing frameworks (including Apache SINGA, BigDL, DL4J and PaddlePaddle) and inactive frameworks (including Torch
and Purine). The 11 qualitative benchmarking hardware aspects for GPUs, FPGAs and ASICs discussed computing
performance, low latency, energy efficiency, compatibility, research costs, research risks, upgradability, scalability, chip
price, ubicomp and time to market. On the other hand, the six qualitative benchmarking metrics for DL frameworks covered the license type, interface codes (API), compatible hardware, reliability, tested DL networks and tested datasets, which encompass a wide range of data such as image, voice and text datasets. They
also mentioned the MLPerf benchmarking organization that offers useful benchmarks to evaluate training and inference
on DL hardware devices. MLPerf benchmarks include benchmark metrics and datasets such as ImageNet and COCO
image datasets, WMT English-German translation datasets and MovieLens-20M recommendation datasets. Other evaluation criteria mentioned were DL algorithms and DL frameworks such as TensorFlow, PyTorch, MXNet, Caffe and Sinan.
Coleman et al. [22] introduced DAWNBench, a benchmark focused on end-to-end training time to achieve a state-of-the-art (SOTA) accuracy level, as well as inference time at that accuracy. They studied how different optimizations, including the choice of optimizer, stochastic depth and multi-GPU training, affect end-to-end training performance. The initial release of DAWNBench provided end-to-end training and inference tasks such as image classification on CIFAR-10 and ImageNet as well as question answering on SQuAD, together with reference implementations for each task. DAWNBench differs from other benchmarking platforms (Baidu DeepBench, Fathom, and TensorFlow Benchmark) because it focuses on end-to-end performance, whereas the others use the time needed to train on a single minibatch of data as the key metric, while disregarding the resulting accuracy of the trained model. Moreover, other benchmarks focus on timing individual low-level operations utilized in DL computations, while DAWNBench measures time to a pre-specified level of accuracy
taking into account both hardware and statistical performance. The training procedure used a batch size of 128 and SGD with a weight decay of 0.0005 and a momentum of 0.9; a learning rate of 0.01 was used for the first five epochs, followed by an initial learning rate of 0.1 for 90 epochs. The learning rate was decayed by a factor of 10 every 45 epochs and training was terminated after 185 epochs. For larger batch sizes, the initial learning rate was linearly scaled from the base learning rate of 0.1 corresponding to a batch size of 128. The authors considered three optimization techniques. Adam is an adaptive optimizer for gradient descent that reports considerable speed-ups over other adaptive optimizers when training CNN on CIFAR-10. Single-node multi-GPU training used 4 GPUs in two different settings: first, with the same minibatch size (128) as the baseline approach, but distributed across the 4 GPUs; second, with the minibatch size effectively multiplied by four, where every GPU is given a minibatch of size 128. Stochastic depth is another technique, which can be thought of as a form of regularization similar to dropout; it reportedly improves training time and accuracy.
The previous studies focused their comparisons only on processing time. None of them dealt with CPU and GPU utilization or memory consumption. This work covers these metrics to determine which of the considered frameworks achieves the best performance. Finally, and most importantly, our comparisons involve more datasets from more fields than previous studies.
3 Neural Networks
We start this section with a glimpse of history related to neural networks.
A single layer NN is a network that consists of a single hidden layer between the input layer and the output layer. The
hidden layer has many units called neurons. See Figure 1.
A perceptron is considered a binary classifier as it has only two possible results: 0 or 1. This result is determined by
computing a single output from multiple input values. This is done by computing a weighted sum of input values
and then passing the result through some nonlinear activation function like the Heaviside step function (used as a threshold function) [24], as shown in Figure 3. The output neuron in the output layer connects to all inputs to produce one output
value.
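To make this computation concrete, the following minimal NumPy sketch (an illustration rather than any framework's implementation; the weights and bias are chosen here so that the unit computes a logical AND) shows the weighted sum followed by the Heaviside step activation:

```python
import numpy as np

def heaviside(z):
    """Heaviside step function used as the threshold activation."""
    return np.where(z >= 0.0, 1, 0)

def perceptron_output(x, w, b):
    """Weighted sum of the inputs followed by the step activation."""
    return heaviside(np.dot(w, x) + b)

# Illustrative weights and bias that make the perceptron compute a logical AND.
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(np.array(x, dtype=float), w, b))
```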
The advantages of the perceptron include simplicity of its architecture and its “light” computation requirement allowing
its efficient use with very large datasets. The main drawback of the perceptron is that it only learns linearly-separable
functions. In order to solve this problem, a multilayer perceptron (also known as DNN) was suggested by Ivakhnenko
et al. [25] to get more powerful learning mechanisms.
An activation function is a nonlinear function that is used in different types of NN. It takes the weighted input data (typically the result of a matrix multiplication between the inputs and the weights) and transforms it into a nonlinear output. In DL implementations, nonlinear activation functions create
complex features with every layer. Implementations with a linear activation function would behave like a single-layer
network (no matter how many hidden layers) because summing these layers would give just another linear function.
This is the reason why nonlinear activation functions are used more widely in DL networks. However, it is possible that
some NN may contain neurons with linear activation function in the output layer. These neurons require a nonlinear
activation function in previous parts of the network. There are several types of activation functions including the
sigmoid function shown in Figure 4.
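For illustration, the sigmoid and two other widely used activations can be written in a few lines of NumPy (a standalone sketch, independent of any of the frameworks discussed later):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real input into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes inputs into the (-1, 1) range."""
    return np.tanh(z)

def relu(z):
    """Keeps positive inputs and zeroes out negative ones."""
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))
```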
6 https://goo.gl/h9cAJm
7 https://appliedgo.net/perceptron/
8 http://cs231n.github.io/neural-networks-1/
It is to be noted that the ReLU function is a special case of the Maxout function. The Maxout neuron has all the benefits
of a ReLU unit and does not have its drawbacks (dying unit), which makes Maxout one of the most common activation
functions used in DL networks.
Another related model is logistic regression. It is a regression model developed in 1958 by Cox [27] that estimates the relationship between statistical input variables in order to make predictions of an output variable.
It uses the logistic sigmoid function to generate a prediction as to which of multiple classes the input data belongs.
Logistic regression is used for different areas like medical and social sciences where it is used for analytical purposes
and interpretation of results from experiments. It is used for very large datasets because of its simplicity and speed. The final layer of a DL model can be constructed using logistic regression, where the network has multiple feature learning layers that pass features into a logistic regression layer to classify the inputs.
Logistic regression can be binomial, ordinal or multinomial. In the binomial type, the observed output for a dependent
variable has two possibilities: 0 or 1. In multinomial logistic regression, the output can have more than two types (e.g.,
“Disease A” vs “Disease B” vs “Disease C”) which are not ordered. In ordinal type, the dependent variables must be
ordered.
In linear regression algorithms (using least square), Gradient Descent (GD) is an optimization algorithm that is used to
find the values of the parameters in a way that minimizes the cost function (the least square error). If a huge dataset is trained using GD, calculating the parameters will be expensive and take a long time: if there are millions of sample points, GD must go through all of them in every iteration to calculate the parameters. To solve these problems, a
variation of GD called Stochastic GD (SGD) is used. SGD is an optimization method used to train models including
support vector machines (SVM), logistic regression, graphical models, etc. To calculate the parameters in SGD, a sample of the training set (or a single training example) is used in every iteration instead of the entire dataset. This method is much faster and less costly than GD.
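The following NumPy sketch contrasts the two ideas on a toy least-squares problem: instead of computing the gradient over all samples at once, SGD updates the parameters from one randomly chosen sample at a time (the data, learning rate and number of epochs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # 1,000 samples with 3 features (illustrative)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr = 0.01

# Stochastic gradient descent on the least-squares cost: one sample per update.
for epoch in range(5):
    for i in rng.permutation(len(X)):
        err = X[i] @ w - y[i]             # prediction error for a single sample
        w -= lr * err * X[i]              # gradient of 0.5 * err**2 with respect to w
print(w)                                  # should be close to true_w
```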
Backpropagation of errors is a learning method that is used to train NN. It is used along with an optimization algorithm
such as GD. The modern version of backpropagation was proposed by Linnainmaa [28], where he published the
general method for automatic differentiation (AD) of discrete connected networks of nested differentiable functions.
Backpropagation repeatedly performs two steps for training the network: propagation and weight update. In the first
step, the input vector is forward propagated layer-by-layer until it reaches the output layer and produces the output
of this vector. For each of the neurons in the output layer, an error value is calculated using a loss function, which
compares the output of the network to the desired output. It then calculates the gradient of the loss function with respect
to the weights. Backpropagation sends these error values backwards starting from the output layer until it reaches the
first layer. In the second step, this gradient value is fed to an optimization method (e.g., GD) to update the weights in order to minimize the loss function. Backpropagation is considered a supervised learning method because it requires
the knowledge of the desired output to calculate the loss function gradient, but some unsupervised networks such as AE
can use it.
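The two steps described above can be sketched for a tiny one-hidden-layer network with sigmoid units and a squared-error loss; the layer sizes, synthetic targets and learning rate below are illustrative, and plain GD is used as the optimizer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 4))                    # 64 samples with 4 inputs (illustrative)
y = (X[:, :1] + X[:, 1:2] > 0).astype(float)    # a simple synthetic target

W1, b1 = rng.normal(scale=0.1, size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(8, 1)), np.zeros(1)
lr = 0.5

for step in range(500):
    # Forward pass: propagate the input layer by layer to the output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: gradients of the squared-error loss for each layer.
    d_out = (out - y) * out * (1 - out)          # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)           # error signal propagated to the hidden layer

    # Weight update (here plain gradient descent on the averaged gradients).
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0)
```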
The rapid improvement in DL methods makes the training of any DL network a complex and time consuming process.
To address these issues, many software tools (frameworks) have appeared to develop these methods in an easy and
efficient manner.
A multilayer NN is a network consisting of more than one hidden layer between the input layer and the output layer,
as shown in Figure 7. One popular example is the Multilayer Perceptron (MLP) network, which consists of an input
layer, one or more hidden layers of computation neurons and an output layer. MLP can learn linear and nonlinear
functions in contrast to the single layer perceptron that only supports linear functions. MLP has a large number of
features. Moreover, it uses backpropagation technique for training the network. Each node in its hidden layers is a
neuron that applies a nonlinear activation function. The input values are passed from the input nodes to the first hidden
layer, which applies some calculations to them using the activation function. The resulting signals are then passed as
input signals to the next hidden layer. This procedure is repeated until the signals reach the output layer.
The wide adoption of Deep Neural Networks (DNN) gave rise to the new field of Deep Learning (DL). The DNN
is simply a NN with more than one hidden layer of nonlinear processing units which extract the features by passing
input data from a layer to another until a desirable output is produced. One of the most used algorithms in DL is the
Backpropagation algorithm, which is used to train the network by updating the weights of the connections between
sequential hidden layers. This process is repeated many times until the output matches the desired output.
There are several types of DL architectures such as DNN, Convolutional DNN (CDNN), Deep Belief Networks (DBN)
and RNN. Approaches based on these architectures have been achieving significant performance improvements when
applied to several tasks in speech recognition, computer vision, NLP, etc.
More than half a century ago, Ivakhnenko et al. [25] introduced deep MLP (Figure 8), where thin but deep models
(three hidden layers) with polynomial activation functions were used. The authors used statistical methods to select the
best features in each layer, and forwarded these features to the next layer until the output layer is reached. Finally, they
used layer-by-layer backpropagation algorithm to train the network. A deeper network with eight layers was introduced
in [29] which was trained using the Group Method of Data Handling (GMDH) algorithm.
In 1980, a network with multiple convolutional and pooling layers was introduced in [30], where it was trained
using reinforcement learning. The challenge for this model was the training of the multiple layers. At that time, backpropagation of errors was an inefficient and incomplete way to train such deep models.
In 1989, at Bell Labs, LeCun gave the first efficient and practical application of backpropagation [31]. He applied
backpropagation to a deep convolutional network in order to classify the handwritten digits of the MNIST dataset. This
approach achieved good results. Unfortunately, it consumed a lot of time, which rendered it impractical for many years.
In 1993, RNNs trained with unsupervised learning were introduced to solve the time consumption problem and were used for very deep learning tasks (more than 1,000 subsequent layers) [3]. After that, a DL method called long short-term memory (LSTM) for RNN was proposed by Hochreiter and Schmidhuber in 1997 [3], where it was used in deep learning tasks that require memories of events (like speech). Moreover, it avoided the vanishing gradient problem, in which no learning signal reaches the early layers of the network during the training of a deep network.
9 https://bit.ly/2KgyOdY
10 https://goo.gl/vcqNjP
Figure 8: The architecture of the first known deep network by Ivakhnenko et al. [25].10
A big shift in the field of DL occurred when more people started to use the graphics processing units (GPUs) in the
training process. This increased the computational speed allowing NN to produce better results by using more training
data. However, training using huge amounts of data brought back to light the vanishing gradient problem. To solve this
problem, the models were learning in a layer-by-layer fashion using unsupervised learning. This required the features
of early layers to be initialized with suitable features beforehand (pre-trained). However, in supervised learning, the
features at early layers need to be adjusted during the learning process. One solution, the pre-training solution,
was initially developed for RNN in 1992 [32] and for feedforward networks in 2006 [9]. A second solution for the
vanishing gradient problem in RNN was the LSTM [3].
By 2011, the rapid increase in the speed of GPUs had reached its peak, which led many researchers such as Ciresan et al. [33] to train deep networks without using pre-training techniques and to introduce deep learning networks constructed from convolutional layers, max-pooling layers, and several fully connected layers followed by a final classification layer [10]. Krizhevsky et al. [34] used a similar architecture with rectified linear activation functions
and dropout function. Since then, the research in DL using GPUs has accelerated rapidly.
Convolution, which is widely used in DL networks, is a mathematical operation that mixes an input data function, g, with a convolutional kernel (filter), f, in order to produce a transformed feature map (a modified version of the original input), as shown in the following equation.
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\,g(t-\tau)\,d\tau = \int_{-\infty}^{\infty} f(t-\tau)\,g(\tau)\,d\tau \qquad (6)$$
Convolution has different applications in many fields such as probability, statistics, computer vision, imaging, etc.
In probability theory, convolution is similar to cross-correlation, while in statistics it updates the weights over the normalization of the input vector. A Convolutional Neural Network (CNN or ConvNet), depicted in Figure 9, is a network with multiple layers of convolutions applying a nonlinear activation function like ReLU or Tanh (and is mostly used for image processing). CNN layers filter input data to produce useful feature map information. These layers have
parameters that are updated repeatedly to produce the desired output. In feedforward NN with fully connected layers,
each input neuron is connected to all neurons of the next layer. However, with CNN, after calculation of the output by
applying convolutions on the input data, each node only connects itself with the closest neighboring neurons (local
connectivity). Convolutional layer sizes shrink as the network becomes deeper.
The architecture of CNN primarily has three types of layers including convolutional layers, pooling layers, and fully
connected layers. The convolutional layers are the main block of a CNN that do most of the computational operations.
The pooling layers are applied after the convolutional layers, where these layers partition the input data into non-
overlapping sets (windows), and reduce each set to a single value (subsampling) by applying a max operation (outputs
the maximum value in each set) to the result of each filter. The pooling layers have benefits of reducing the spatial
11 http://www.asimovinstitute.org/neural-network-zoo/
size of data, reducing the number of parameters, reducing computations, and controlling overfitting. Finally, after all
generated features are combined, they are used to find the final classification via fully connected layers, whose neurons
are fully connected to all activations in the previous layer.
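As a concrete illustration of the pooling step, the following NumPy sketch reduces each non-overlapping 2x2 window of a feature map to its maximum value (a toy example, not taken from any of the frameworks under study):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling over a 2D feature map."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]      # drop odd edges, if any
    windows = trimmed.reshape(h // 2, 2, w // 2, 2)    # group pixels into 2x2 windows
    return windows.max(axis=(1, 3))                    # keep the maximum of each window

fm = np.arange(16).reshape(4, 4)
print(max_pool_2x2(fm))
# [[ 5  7]
#  [13 15]]
```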
Many CNN architectures exist such as:
• LeNet: Developed by LeCun et al. [35] in the 1990s, LeNet was used to read zip codes, recognize characters,
etc.
• AlexNet: Developed by Krizhevsky et al. [34], AlexNet was used in computer vision tasks. It has a similar
architecture to LeNet, but with deeper, bigger, and more stacked convolutional layers on top of each other.
• GoogleNet: Introduced by Szegedy et al. [16] from Google, GoogleNet had a reduced number of parameters
in the network (it had 4M parameters, compared to AlexNet’s 60M).
• VGGNet: Introduced by Simonyan and Zisserman [15], the goal of VGGNet’s architecture was to prove that a
good performance depends on the depth of the network.
• ResNet (Residual Network): Introduced by He et al. [36], ResNet’s architecture does not have fully connected
layers at the end of the network. Moreover, it uses batch normalization.
RNN is a type of NN, first introduced in 1990 by Elman [37]. After that, Elman and others began to develop the concept
of RNN. In 1993, a modified RNN model was developed to solve very deep learning tasks which required more than
100 subsequent layers. RNN connections between neurons form a directed cycle (fed data from previous layer and from
themselves) as shown in Figure 10. This makes it suitable for sequential information. Because RNN is built
in a way that fits sequential information, it is used in many tasks in NLP, speech recognition, image capturing, language
translation, etc.
Unlike Simple NN and CNN models, in RNN, the output is dependent on the previous computations. Thus, a RNN has
an internal memory to save previous computations. For instance, RNN is popular in NLP tasks where it predicts next
word depending on previous words in any given sequence using the internal memory in neurons of the hidden layers.
Long short-term memory (LSTM) is a special type of RNN (shown in Figure 11) proposed in 1997 by Hochreiter and
Schmidhuber [3] in order to solve the vanishing gradient problem. It uses LSTM neurons (memory cells with three
gates: input, output and forget) instead of simple neurons in the hidden layers. Additionally, LSTM are designed in a
way to avoid the long-term dependency problem. While simple RNN can “remember” previous information for short
time periods, LSTM can remember them for much longer time periods. However, if the information is not used, it will
be lost.
An LSTM cell is more complex than simple (or vanilla RNN) cells as it has a memory to store previous sequences.
Each cell contains gates that manage its state and output. Each gate within a unit uses the sigmoid activation function
to decide whether it is triggered or not, which makes the change of state or addition of information conditional. The three gates within a memory cell are: a forget gate to decide what information to discard from the cell, an input gate to decide how to update the memory state depending on the input values, and an output gate to decide what to output based on the input and the memory of the LSTM cell.
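For reference, a standard formulation of these gates and of the cell state update (the usual LSTM notation, not a framework-specific definition) is:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(cell output)}
\end{aligned}
$$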
4 Deep Learning Frameworks
The frameworks considered in this comparative study are CNTK, TensorFlow and Theano. Moreover, we use Keras on
top of these frameworks as discussed later. All of these frameworks provide flexible APIs and configuration options for
performance optimization. Software versions of the frameworks12 are shown in Table 2 and their properties are shown
in Table 3.
4.1 CNTK
Microsoft Cognitive Toolkit (CNTK) is an open source DL framework developed by Microsoft Research [38] for
training and testing many types of NN across multiple GPUs or servers. CNTK supports different DL architectures like
Feedforward, Convolutional, Recurrent, LSTM, and Sequence-to-Sequence NN.
12 This work was conducted in the summer of 2017. The versions we consider were the latest ones. Since then, these frameworks have been updated.
In CNTK, a Computational Network learns any function by converting it to a directed graph, where leaf nodes represent input values or learnable parameters while the other nodes represent matrix operations applied to their children. In this way, CNTK has an advantage as it can automatically compute the gradients for all the computations that are required to
learn the parameters. In CNTK, users specify their networks using a configuration file that contains information about
the network type, where to find input data, and the way to optimize parameters [39].
The CNTK interface supports APIs for several languages such as Python, C++ and C# on both GPU (CUDA) and CPU platforms. According to its developers,13 CNTK was written in C++ in an efficient way: it removes duplicated computations in forward and backward passes, uses minimal memory and reduces memory reallocations by reusing allocated memory. The framework's installation is discussed in [40].
4.2 Theano
Theano14 is an open source Python library developed at the MILA lab at the University of Montreal as a compiler for mathematical expressions that lets users and developers optimize and evaluate their expressions using NumPy's syntax (a Python library that supports large, multi-dimensional arrays) [41, 7]. Theano automatically optimizes the selection of computations, translates them into lower-level languages such as C++ or CUDA (for GPUs) and then compiles them into Python modules that run efficiently on CPUs or GPUs.
Theano's development started in 2008 and it has a larger research community and ecosystem than many DL libraries. Several software packages, such as Pylearn2, Lasagne, and Keras, have been built on top of Theano, providing a higher-level user interface that aims to make it easier to express and train different architectures of deep learning models. The framework's installation is discussed in [40].
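As a brief illustration of Theano's symbolic style, the toy expression below builds a small computation graph and compiles it into a callable function (the shapes and values are illustrative):

```python
import numpy as np
import theano
import theano.tensor as T

# Declare symbolic variables and build a symbolic expression graph.
x = T.dmatrix('x')
W = theano.shared(np.random.randn(3, 2), name='W')
b = theano.shared(np.zeros(2), name='b')
y = T.nnet.sigmoid(T.dot(x, W) + b)

# Compile the graph into an optimized callable (for CPU or GPU).
predict = theano.function(inputs=[x], outputs=y)
print(predict(np.random.randn(4, 3)))
```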
4.3 TensorFlow
TensorFlow is an open source framework developed by Google Brain Team [42]. It uses a single data flow graph,
expressing all numerical computations, to achieve excellent performance. TensorFlow constructs large computation
graphs where each node represents a mathematical operation, while the edges represent the communication between
nodes. This data flow graph makes the communication between sub-computations explicit, which makes it possible to execute independent computations in parallel or to partition a computation across multiple devices [42]. The
framework’s installation is discussed in [40].
Programmers of TensorFlow define large computation graphs from basic operators, then distribute the execution of
these graphs across a heterogeneous distributed system (computation can be deployed to one or more CPUs or GPUs on different hardware platforms such as desktops, servers, or even mobile devices). The flexible architecture of TensorFlow
allows developers and users to experiment and train a wide variety of deep neural network models, and it is used for
deploying machine learning systems into production for different fields including speech recognition, NLP, computer
vision, robotics, and computational drug discovery. TensorFlow provides APIs for several languages such as Python, C++, and Java for constructing and executing a graph (the Python API is the most complete and the easiest to use).15 The
framework’s installation is discussed in [40].
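A comparable toy example in TensorFlow's graph-style (1.x) API, which was the interface available when this study was conducted, first builds the graph and then executes it in a session (shapes and values are illustrative):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph-style API, as used at the time of this study

# Build the computation graph: nodes are operations, edges carry tensors.
x = tf.placeholder(tf.float32, shape=(None, 3), name='x')
W = tf.Variable(tf.random_normal([3, 2]), name='W')
b = tf.Variable(tf.zeros([2]), name='b')
y = tf.nn.sigmoid(tf.matmul(x, W) + b)

# Execute the graph in a session (on CPU or GPU, depending on the installation).
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: np.random.randn(4, 3)}))
```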
13 https://docs.microsoft.com/en-us/cognitive-toolkit/cntk-evaluation-overview
14 Theano is no longer supported; however, it was so when this paper was written. https://groups.google.com/forum/#!topic/theano-users/7Poq8BZutbY
15 https://www.tensorflow.org/
4.4 Keras
Keras is an open source DL library developed in Python. It runs on top of CNTK, Theano or TensorFlow frameworks.
Keras was created by Google engineer Chollet in 2015 as a part of the research project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System). Keras is designed to allow the fast expression of deep neural networks and easy and fast prototyping (modularity and extensibility) [43]. The framework's installation is
discussed in [40].
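A minimal Keras example of this high-level style is shown below; the layer sizes are illustrative, and the backend (CNTK, Theano or TensorFlow) is selected through the Keras configuration rather than in the code itself:

```python
from keras.models import Sequential
from keras.layers import Dense

# A small fully connected model; the executing backend is chosen in keras.json.
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(784,)))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```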
5 Methodology
The goal of this experimental study is to compare the aforementioned frameworks (Theano, TensorFlow and CNTK) by
using them to train CNN and RNN models on standard benchmark datasets of classical problems in image processing
(MNIST, CIFAR-10, and Self-driving Car) and NLP (Penn TreeBank and IMDB). We then evaluate each framework’s
performance through the following metrics:
• Running time
• Memory consumption
• CPU and GPU utilization
• Number of epochs
We aim at comparing the aforementioned frameworks using a GPU-equipped laptop that runs the Windows 10 operating system and has the following specifications:
It is worth mentioning that our goal is to compare the resources consumed by each framework to reach a certain accuracy
level for each problem. So, we experimented with different epoch counts in order to make sure that the accuracies of all frameworks are close to each other.
5.1 Datasets
5.1.1 MNIST
The MNIST (Mixed National Institute of Standards and Technology) dataset (shown in Figure 12) is a computer vision
database for handwritten digits. It is widely used for training and testing in the field of machine learning [35]. MNIST
has a training set of 60,000 images and a testing set of 10,000 images. It is a subset of a larger set available from NIST.
Each image is 28 × 28 pixels, which can be represented as an array of numbers. We can flatten this array into a vector of 28 × 28 = 784 numbers. Each image in MNIST has a corresponding label, a number between 0 and 9 representing
the digit appearing in the image. See Figure 12.
Our goal is to construct a CNN to classify MNIST images. The training will be carried out on both CPU and GPU
environments using different frameworks including the aforementioned ones. We then evaluate the performance of each
framework.
5.1.2 CIFAR-10
The CIFAR-10 dataset is a labeled subset of the 80 million tiny images dataset, collected by Krizhevsky et al. [34, 44]. It consists of 60,000 32 × 32 color images evenly distributed over ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse,
ship, and truck. There are 50,000 training images and 10,000 test images.
16 https://goo.gl/Gm8xR7
Figure 13 shows the classes in the dataset, as well as 10 random images from each class. The classes are completely mutually exclusive, i.e., there is no overlap between them. For instance, the “Automobile” class includes sedans, SUVs,
etc. On the other hand, the “Truck” class includes only big trucks. To avoid overlap, neither one of these two classes
includes pickup trucks.
5.1.4 IMDB
The IMDB dataset [4], another dataset to which we apply a CNN, is drawn from an online database of information regarding films, TV programs and video games. It consists of 25,000 movie reviews, each labeled by its sentiment (positive/negative). The reviews have been preprocessed and encoded as sequences of integer word indexes (see Figure 15). Words are indexed by their overall frequency in the dataset, so that index i encodes the i-th most frequent word in the data, which allows quick filtering operations.
19 https://goo.gl/S8a6P4
20 https://github.com/upul/Behavioral-Cloning
CNN is used for the MNIST, CIFAR-10, IMDB and Self-Driving Car datasets, where a different network architecture is
used for each dataset. When applying CNN on Theano and TensorFlow, Keras is used for coding all experiments, while, in CNTK, Keras is used only for the Self-Driving Car and IMDB datasets. Thus, the CNTK MNIST and CIFAR-10 experiments are done without Keras.
For the MNIST and CIFAR-10 datasets, two convolutional layers with ReLU activation function are used after the input
layer. The activation function is used to reduce the training time and to prevent vanishing gradients. After each CNN
layer, a max-pooling layer is added in order to down-sample the input and to reduce overfitting. In the max-pooling
layer, the stride with which the filter (window) is slid must be specified. When the stride is one, the filter is moved one pixel at a time; when the stride is two, the filter moves two pixels at a time, which produces spatially smaller output volumes. After each max-pooling layer, the dropout method is used in order to reduce overfitting by forcing the model to learn many independent representations of the same data through randomly disabling neurons in the learning phase. The sequential architecture (layers part only) used on the MNIST and CIFAR-10 datasets is shown in [40].
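The exact layer configurations are given in [40]; the Keras sketch below only illustrates the general pattern described above (two convolutional layers with ReLU, each followed by max pooling and dropout, and a classification head), with illustrative filter counts, dropout rates and a channels-last MNIST-sized input:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
# Two convolutional layers with ReLU, each followed by max pooling and dropout.
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
# Classification head.
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```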
Another example of applying CNN is on the Self-Driving Car dataset, where the network has the same components as
the ones used with the MNIST and CIFAR-10 datasets, but with a deeper model that consists of five convolutional layers
with Exponential Linear Unit (ELU) activation function. The sequential architecture of this CNN is shown in [40]. The
convolutional layers are used for feature engineering. The fully connected layer is used for predicting the steering angle
(final output). The dropout avoids overfitting and, finally, the ELU activation function is used to solve the problem of
the vanishing gradient.
The final example of applying CNN is on the IMDB dataset. The movie reviews in this dataset are composed of
sequences of words of different lengths. These words are encoded by mapping movie reviews to sequences of word
embeddings where words are mapped to vectors of real numbers; the network architecture consists of an embedding
layer followed by a 1D convolution layer which is used for temporal data followed by a global max-pooling operation.
These sequences are padded to have the same size as the largest sequence because they have different lengths. The
sequential architecture used on IMDB dataset is shown in [40].
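Again, the exact architecture is given in [40]; the following Keras sketch only mirrors the description above (an embedding layer, a 1D convolution over the padded sequences, global max pooling and a sigmoid output), with an illustrative vocabulary size, embedding dimension and filter count:

```python
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

max_words, max_len = 5000, 400   # illustrative vocabulary size and padded review length

# x_train is assumed to be a list of integer-encoded reviews; it would be padded with
# x_train = sequence.pad_sequences(x_train, maxlen=max_len) before training.

model = Sequential()
model.add(Embedding(max_words, 50, input_length=max_len))  # word indexes -> dense vectors
model.add(Conv1D(250, 3, activation='relu'))               # 1D convolution over the sequence
model.add(GlobalMaxPooling1D())                            # keep the strongest response per filter
model.add(Dense(1, activation='sigmoid'))                  # positive/negative sentiment
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```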
The other neural network type we consider is RNN with LSTM. One of the most popular uses of LSTM is for text
analysis tasks such as the ones associated with the Penn TreeBank (PTB) dataset. We adopted word-level prediction experiments on PTB, which consists of 929k training words, 73k validation words, and 82k test words. It has 10k words in
its vocabulary. We trained models of two sizes (small LSTM and medium LSTM) using the same architecture presented
in [46].
To evaluate language models on the PTB implementation, a special metric called perplexity is used, where better prediction accuracy is achieved when the perplexity value is as low as possible. Perplexity is essentially the inverse of probability, which means that minimizing the perplexity is the same as maximizing the probability. The goal of applying the PTB dataset is to fit a probabilistic model which assigns probabilities to sentences. This is done by predicting the next word in a text given the history of previous words. LSTM cells represent the core of the model, which processes one word at a time and computes the probabilities of the possible values of the next word in the sentence. The memory state of the network is initialized with a vector of zeros and updated after reading each word.
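For reference, the perplexity over a test sequence of $N$ words is typically computed as the exponential of the average negative log-probability assigned to the words, so a lower perplexity indeed corresponds to a higher assigned probability:

$$
\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_1, \dots, w_{i-1})\right)
$$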
In the small LSTM model, two hidden layers (with 200 LSTM units per layer) are used with Tanh activation function.
The weights are initialized to 0.1. We trained it for four epochs with a learning rate of one, and then decreased the learning rate by a factor of two after each subsequent epoch, for a total of 13 training epochs. The size of each batch is 20, and the network is unrolled for 20 steps. The sequential architecture used for the PTB dataset is shown in [40].
6 Experimental Results and Discussion
In this section, we discuss the results of our experiments. Table 4 shows the CPU and GPU processing times for each
dataset. For the image classification datasets (MNIST and CIFAR-10), one can observe the superiority of CNTK over TensorFlow and Theano in terms of GPU and CPU multithreading; however, on CIFAR-10 using 8, 16 and 32 CPU threads, TensorFlow was faster than CNTK. On the other hand, Theano proved to be more time consuming than the other frameworks. Moving to the sentiment analysis dataset (IMDB), CPU multithreading was not performed because the CNTK implementation of this experiment is written in Python, in which multithreading is not supported. Without CPU multithreading (the CPU uses the default number of existing physical cores, i.e., one thread per core), the superiority of TensorFlow is revealed in both CPU and GPU environments. The results for the text analysis dataset (Penn TreeBank) show the superiority of TensorFlow over CNTK and Theano, for the CPU with 8 threads as well as for the GPU. Moving to the video analysis dataset (Self-Driving Car), the superiority of TensorFlow is revealed in both CPU and GPU environments, while CNTK proved to be more time consuming than the other two frameworks.
Figures 17–22 show the CPU multithreading and GPU processing times, while Tables 5–9 present the utilization levels of the CPU, GPU and their memories for each model on each framework. The processing times clearly show the advantage of GPU over CPU for training deep convolutional and recurrent neural networks. The advantage of a fast GPU is more significant when training complex models with larger data, as in the Self-Driving Car dataset. From the CPU results, the best performance occurred when the number of threads equals the number of physical CPU cores, where each thread possesses a single core. In our work we used a laptop with 8 cores, so for each dataset the best performance in terms of processing time was achieved using 8 threads, as shown in Figures 17–22. The metrics of each framework were measured in order to explain cases where one of the selected frameworks fails to perform well.
We noticed poor performance of Theano on most datasets compared to CNTK and TensorFlow. This could be explained by its low CPU utilization compared to the aforementioned frameworks, whereas CNTK and TensorFlow use all available resources (high CPU utilization). CNTK outperformed both TensorFlow and Theano while training on the MNIST and CIFAR-10 datasets. This is highly likely due to the use of the BrainScript21 format, a custom network description language that makes CNTK more flexible for neural network customization. On the other hand, TensorFlow uses Eigen,22 a C++ template (BLAS) library for linear algebra including matrices, vectors, numerical solvers, and related algorithms, which helps TensorFlow perform better than CNTK and Theano on RNN.
Comparing our work to previous works, such as those of Bahrampour et al. [2] and Shi et al. [5], we note the following. Bahrampour et al. [2] based their comparative study on three main aspects: speed, hardware utilization, and extensibility. They used three NN types (CNN, AE, and LSTM) to train the MNIST, ImageNet, and IMDB datasets on the Caffe, Neon, TensorFlow, Theano and Torch frameworks. They used the following hardware: an Intel Xeon CPU E5-1650 v2 @3.5GHz (with multi-threading), an NVIDIA GeForce GTX Titan X/PCI/SSE2, 32GB of DDR3 memory, and an SSD drive. Their results were as follows. When training on CPU, Torch performed the best followed by Theano, while Neon had the worst performance. Moreover, Theano and Torch were the best in terms of extensibility, TensorFlow and Theano were very flexible, and Caffe was the easiest to use. Regarding training on GPU, for larger convolutional and fully connected networks, Torch was the best followed by Neon, while for smaller networks Theano was the best. For LSTM, Theano's results were the best, while TensorFlow's performance was not competitive compared with the other studied frameworks.
21 https://docs.microsoft.com/en-us/cognitive-toolkit/BrainScript-Network-Builder
22 http://eigen.tuxfamily.org/index.php?title=Main_Page
On the other hand, Shi et al. [5] based their comparative study on two metrics: processing time and convergence rate. The neural networks used were fully connected NN, CNN and RNN, trained on the ImageNet, MNIST, and CIFAR10 datasets using the Caffe, CNTK, MXNet, TensorFlow, and Torch frameworks. They used the following hardware: two types of multi-threaded CPU, a desktop CPU (Intel i7-3820, 8 threads) and a server-grade CPU (Intel Xeon E5-2630, 32 threads), as well as three types of NVIDIA GPU (GTX 980, GTX 1080, Tesla K80), with two Tesla K80 cards used to evaluate multi-GPU performance. The results of TensorFlow were the best when using CPU. When using a single GPU, Caffe, CNTK and Torch performed better than MXNet and TensorFlow on FCN. As for small CNN, Caffe and CNTK achieved good performance, and for RNN (LSTM), CNTK was the fastest (5-10x faster than the other frameworks). Using multi-GPU implementations, all frameworks had higher throughput and accelerated convergence.
In addition to processing time, we also report the utilization levels of the CPU, GPU and their memories for each model
on each framework under consideration. These results are shown in Tables 5–9. The utilization levels for both CPU and
GPU are high for all models. The only surprising numbers are the CPU utilization values for Theano, which were very low.
The tables also show that the utilization levels are rather small for both types of memory. This applies to all models for
all frameworks. However, the tables also show that, in most cases, CNTK had the lowest memory utilization while
TensorFlow had the highest. Surprisingly, the case is almost reversed for the video analysis dataset (the Self-Driving
Car dataset), where CNTK had the highest utilization and Theano had the lowest. Another unexpected finding of these
experiments is that the models of the IMDB generally needed the largest portions of memory.
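Utilization figures of this kind can be gathered with standard operating-system tools while training is running. The Python sketch below shows one possible way to take such a sample using psutil and the nvidia-smi command-line tool; it is an illustration of this type of measurement, under the assumption that both tools are installed, and not the exact instrumentation used in our experiments.

import subprocess
import psutil

def sample_utilization():
    # CPU utilization (%) averaged over a 1-second window and the share of
    # system memory currently in use (%), both obtained through psutil.
    cpu_percent = psutil.cpu_percent(interval=1.0)
    mem_percent = psutil.virtual_memory().percent

    # GPU utilization (%) and used GPU memory (MiB) for GPU 0, read through
    # the nvidia-smi command-line tool (which must be on the PATH).
    out = subprocess.check_output([
        "nvidia-smi", "--id=0",
        "--query-gpu=utilization.gpu,memory.used",
        "--format=csv,noheader,nounits",
    ]).decode()
    gpu_percent, gpu_mem_mib = [float(v) for v in out.strip().split(",")]

    return cpu_percent, mem_percent, gpu_percent, gpu_mem_mib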
In this paper, we have provided a qualitative and quantitative comparison between three of the most popular and most comprehensive DL frameworks, namely Microsoft's CNTK, Google's TensorFlow, and University of Montreal's Theano. The main goal of this work was to help end users make an informed decision about the best DL framework that suits their needs and resources. To ensure that our study is as comprehensive as possible, we have used multiple benchmark datasets: MNIST, CIFAR-10, Self-Driving Car, and IMDB, which were trained using a multilayer CNN architecture, and the Penn TreeBank dataset, which was trained using an RNN architecture. We have run our experiments on a laptop with the Windows 10 operating system and measured the performance and utilization of the CPU (with multithreading), the GPU, and memory. For most of our experiments, we found that CNTK's implementations are superior to the other ones under consideration.
References
[1] Ali Shatnawi, Ghadeer Al-Bdour, Raffi Al-Qurran, and Mahmoud Al-Ayyoub. A comparative study of open
source deep learning frameworks. In 2018 9th International Conference on Information and Communication
Systems (ICICS), pages 72–77. IEEE, 2018.
[2] Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, and Mohak Shah. Comparative study of deep learning
software frameworks. arXiv preprint arXiv:1511.06435, 2015.
[3] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[4] Andrew L Maas et al. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting
of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150.
Association for Computational Linguistics, 2011.
[5] Shaohuai Shi, Qiang Wang, Pengfei Xu, and Xiaowen Chu. Benchmarking state-of-the-art deep learning software
tools. arXiv preprint arXiv:1608.07249, 2016.
[6] Peter Goldsborough. A tour of tensorflow. arXiv preprint arXiv:1610.01178, 2016.
[7] Rami Al-Rfou et al. Theano: A python framework for fast computation of mathematical expressions. arXiv
preprint arXiv:1605.02688, 2016.
[8] Arthur L Samuel. Some studies in machine learning using the game of checkers. IBM Journal of research and
development, 3(3):210–229, 1959.
[9] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks.
science, 313(5786):504–507, 2006.
[10] Hector P Martinez, Yoshua Bengio, and Georgios N Yannakakis. Learning deep physiological models of affect.
IEEE Computational Intelligence Magazine, 8(2):20–33, 2013.
[11] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of
english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
[12] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast
convolutional nets with fbfft: A gpu performance evaluation. arXiv preprint arXiv:1412.7580, 2014.
[13] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997,
2014.
[14] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated
recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[15] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556, 2014.
[16] Christian Szegedy et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1–9, 2015.
[17] Li Yao et al. Describing videos by exploiting temporal structure. In Proceedings of the IEEE international
conference on computer vision, pages 4507–4515, 2015.
[18] Vassili Kovalev, Alexander Kalinovsky, and Sergey Kovalev. Deep learning with theano, torch, caffe, tensorflow,
and deeplearning4j: Which one is the best in speed and accuracy? In Pattern Recognition and Information
Processing (PRIP 2016), 2016.
[19] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas
Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. arXiv
preprint arXiv:1211.5590, 2012.
[20] Weiguang Ding, Ruoyan Wang, Fei Mao, and Graham Taylor. Theano-based large-scale visual recognition with
multiple gpus. arXiv preprint arXiv:1412.2302, 2014.
[21] Wei Dai and Daniel Berleant. Benchmarking contemporary deep learning hardware and frameworks: A survey of
qualitative metrics. In International Conference on Cognitive Machine Intelligence (CogMI), 2019.
[22] Cody Coleman et al. Dawnbench: An end-to-end deep learning benchmark and competition. In 31st Conference
on Neural Information Processing Systems (NIPS 2017), 2017.
[23] Frank Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical
Laboratory, 1957.
[24] Lokenath Debnath and Dambaru Bhatta. Integral transforms and their applications. CRC press, 2014.
[25] Alekseı̆ Grigor’evich Ivakhnenko and Valentin Grigorévich Lapa. Cybernetic predicting devices. Technical report,
DTIC Document, 1966.
[26] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks.
arXiv preprint arXiv:1302.4389, 2013.
[27] David R Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B
(Methodological), pages 215–242, 1958.
[28] Seppo Linnainmaa. The representation of the cumulative rounding error of an algorithm as a taylor expansion of
the local rounding errors. Master’s thesis, Univ. Helsinki, 1970.
[29] Alexey Grigorevich Ivakhnenko. Polynomial theory of complex systems. IEEE transactions on Systems, Man,
and Cybernetics, 1(4):364–378, 1971.
[30] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern
recognition unaffected by shift in position. Biological cybernetics, 36(4):193–202, 1980.
[31] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and
Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–
551, 1989.
[32] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks.
Neural Computation, 4(1):131–139, 1992.
[33] Dan Claudiu Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber. Flexible,
high performance convolutional neural networks for image classification. In Twenty-Second International Joint
Conference on Artificial Intelligence, 2011.
[34] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural
networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[35] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[37] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
[38] Dong Yu et al. An introduction to computational networks and the computational network toolkit. Technical
Report MSR-TR-2014–112, Microsoft, 2014.
[39] Dong Yu, Kaisheng Yao, and Yu Zhang. The computational network toolkit [best of the web]. IEEE Signal
Processing Magazine, 32(6):123–126, 2015.
[40] Ghadeer Al-Bdour. Comparative study between deep learning frameworks using multiple benchmark datasets.
Master’s thesis, Jordan University of Science and Technology, 2017.
[41] James Bergstra et al. Theano: A cpu and gpu math compiler in python. In Proc. 9th Python in Science Conference,
pages 1–7, 2010.
[42] Martín Abadi et al. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX
Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[43] François Chollet et al. Keras, 2015.
[44] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report,
University of Toronto, 2009.
[45] Richard S Wallace, Anthony Stentz, Charles E Thorpe, Hans P Moravec, William Whittaker, and Takeo Kanade.
First results in robot road-following. In IJCAI, pages 1089–1095, 1985.
[46] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint
arXiv:1409.2329, 2014.