20.2.1 Creating neural networks
The command neural_network
is used for creating trainable feed-forward neural networks.
- neural_network takes one mandatory argument and several optional arguments:
- topology or net, where topology is a vector of positive integers
defining the number of neurons in each layer and net is an existing neural network.
- If topology is given, then a neural network is constructed from scratch.
The first number in topology is the size of the input layer and the last number is the size of
the output layer. The numbers in between are sizes of the hidden layers. If topology
contains only two numbers m and n, then a hidden layer of size 2(m+n)/3 is added automatically.
The number of layers will be referred to as L below.
- If net is given, then a copy of net is constructed with optional
modifications of the parameters as specified by the additional arguments (see below).
- Optionally, block_size=bs, where bs is a positive integer specifying
the number of samples that can be processed by the network at a time. This value can be set to
the batch size or some smaller number for faster execution (by default, bs=1).
- Optionally, learning_rate=rate, where rate is a positive real
number which is used as the learning rate of the network (by default, rate=0.001).
Alternatively, rate can be a two-element vector
[rinit,schedule] where rinit=η0
is the initial rate (a real number) and schedule is a function η(t), t∈ℤ
which is used as the schedule multiplier for rinit (t is initially zero and increases
by 1 after each weight update).
In this case, the learning rate in iteration t is ηt=η0η(t) (an illustrative schedule is shown in the examples at the end of this section).
- Optionally, momentum=mom, where mom is a number µ∈[0,1)
which specifies the momentum parameter of the network (it is usually set between 0.5 and 1.0).
Alternatively, mom can be specified as a two-element vector [β1,β2] in which
case the Adam (adaptive moment estimation) method is used. To use Adam with the default (paper)
parameters β1=0.9 and β2=0.999, simply set mom to adaptive.
(By default, µ=0, i.e. no momentum is used.)
- Optionally, func=act, where act is the activation function used in
the hidden layers. It can be specified either as a univariate function
(e.g. tanh or logistic) or a list in which the first element is a univariate
function f(x,a,b,…) with parameters a,b,…, and the other elements are the fixed
values a0,b0,… of the parameters.
In the latter case the value of f at x is obtained by computing f(x,a0,b0,…) (a parametrized activation is illustrated in the examples at the end of this section).
ReLU activation is specified by the symbol ReLU.
By default, ReLU activation is used for regression models while tanh is used for
classifiers (see below).
- Optionally, output=outact, where outact is the activation
function used in the output layer. It is specified in the same way as the hidden activation
function (see above). By default, identity is used in regression models, while sigmoid
and softmax are used in binary and multi-class classifiers, respectively.
- Optionally, erf=err, where err is the error function used for
network training. It is either a bivariate function g(t,y) or a list
in which the first element is a function g(t,y,a,b,…) with parameters
a,b,… and the other elements are the fixed values a0,b0,… of the parameters.
In the latter case the error is computed as g(t,y,a0,b0,…). Here
t is the expected output and y is the output read at the final layer of
the network after forward-propagation of the input. By default, half mean squared distance is
used in regression models while log-loss and cross-entropy are used in binary and
multi-class classifiers, respectively. These error functions
can be specified by using shorthands MSE,
log_loss and
cross_entropy.
The corresponding definitions are:
MSE: f(t,y) = (1/(2n)) ∑i (ti−yi)²,
log-loss: f(t,y) = −∑i (ti log(yi) + (1−ti) log(1−yi)),
cross-entropy: f(t,y) = −∑i ti log(yi),
where the sums run over the n neurons of the output layer.
- Optionally, labels=lab or classes=lab,
where lab is the list of labels
corresponding to the output neurons. If this option is set, then the activation for hidden
layers is set to tanh, while the activation for the output layer is set to
logistic except with classes and multiple labels, in which case the softmax activation
is used. (By default, lab is unset, i.e. the network is a regression model.)
With labels the network becomes a multi-label classifier (multiple labels may be assigned
to the output), while with classes it becomes a binary/multi-class classifier
(exactly one label is assigned to the output).
- Optionally, weights=wi, where wi is either a list of
matrices containing initial weights for the model or a random variable used for generating
initial weights automatically.
- In the former case, wi is a list [W(1),W(2),…,W(L−1)]
where W(k) is the matrix specifying the weights between the kth and (k+1)-th layer.
The element wij(k) is the weight corresponding to the link from the ith neuron in
the kth layer to the jth neuron in the (k+1)-th layer. Optionally, the initial bias for
the kth layer may be specified as an additional row in the matrix W(k).
- In the latter case, wi is either a constant or a random variable X
as returned by the command
randvar which may optionally contain one or two symbolic parameters: nin
and optionally nout, which correspond to the size of the preceding layer (“fan-in”)
and the size of the next layer (“fan-out”). If symbols are present, then wi must be
a list in which the first element is the random variable and other elements are the symbols
(nin first, then optionally nout).
These symbols are substituted with the sizes of the kth and (k+1)-th layers, respectively.
The commonly used initializations by He, Glorot and LeCun
can be specified as strings "he", "glorot" and "lecun" with the suffix
"-uniform" or "-normal" for the uniform or Gaussian variant, respectively (see the examples at the end of this section).
By default, the uniform random variable
U(−1/√nin,1/√nin) is
used for weight initialization. Note that bias weights are always initialized to zero.
- Optionally, weight_decay=wd, where wd is a positive real number α
or a list of L−1 such numbers [α1,α2,…,αL−1]. These numbers are the
L2-regularization coefficients used in the model (by default, they are all equal to zero, i.e. no
regularization is performed). If a single number α is given, then the coefficients for all
layers are set to that value. If the list is given, then αk is used for the kth layer
(αk is a weight decay coefficient for the weights in W(k)). Note that regularization
is not applied to bias weights.
- Optionally, title=str, where str is a string holding the name of
the network, which is printed alongside its one-line description (by default, the network is not named).
- neural_network(topology ⟨,options ⟩)
or neural_network(net ⟨,options ⟩)
returns the network object net which can be trained by using the
train command (see Section 20.2.2).
- After the network is trained to sufficient accuracy,
you can feed it an input inp by calling the command net(inp), which returns the list
of values in the final layer or output label(s) if the network is a classifier. inp can also
be a matrix, in which case each row is processed and the list of results is returned.
Alternatively, you can pass two arguments to net as in the command net(inp,res) where
res is (the list of) expected output(s). The return value in this case is the (list of) error(s)
made by the network in attempting to predict res.
- Hyperparameters and other properties of the network net can be fetched by using the command
net[property], where property is one of:
- block_size,
for obtaining the block size,
- learning_rate,
for obtaining the learning rate and possibly the schedule multiplier,
- labels,
for obtaining the list of output labels,
- momentum,
for obtaining the momentum/Adam parameter(s),
- title,
for obtaining the network name,
- topology,
for obtaining the network topology, i.e. the list of layer sizes,
- weight_decay,
for obtaining the list of L2-regularization parameters,
- weights,
for obtaining the list of weight matrices (bias weights are contained in the
last row in each of these matrices).
To fetch the contents of the kth layer neurons, the command net[k] can be used,
where k∈{0,1,…,L−1}. This returns the layer as a matrix with the number of rows equal
to the block size, in which the ith row corresponds to the ith sample passed forward through the
network. This is useful, e.g., for obtaining hidden representations from autoencoders.
- Although neural_network is flexible when it comes to custom activation and error
functions, the resulting network is optimized for speed only when the options func,
output and erf are left unset, i.e. when the default activation/error function(s) are used.
- Networks can be saved to disk by using the command write (see Section 23.5.3)
and loaded by using the command read (see Section 3.5.2). This is useful for
storing networks after training and loading them on demand.
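For instance, with a network stored in the variable net, this workflow could be sketched as follows (the file name "mynet.cas" is only illustrative):
write("mynet.cas",net)
read("mynet.cas")
The second command re-creates the variable net in the current session.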
Examples
To create a network with three layers of size 2, 3, and 1, input:
neural_network([2,3,1])
    a neural network with input of size 2 and output of size 1
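Several optional hyperparameters can be combined in one call. As an illustrative sketch, a similar network with a decaying learning-rate schedule, Adam updates and a small amount of weight decay could be created with (the particular values are arbitrary):
neural_network([2,3,1],learning_rate=[0.05,t->0.97^t],momentum=adaptive,weight_decay=1e-4)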
To use GELU activation x↦ x·Φ(x) for hidden neurons:
neural_network([2,3,1],func=unapply(x*normal_cdf(x),x))
    a neural network with input of size 2 and output of size 1
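A parametrized activation is given as a list containing the function followed by the fixed parameter values (see the func option above). For instance, the scaled hyperbolic tangent x↦a·tanh(x) with a=1.7159 could be specified as follows (the scaling factor is merely illustrative):
neural_network([2,3,1],func=[(x,a)->a*tanh(x),1.7159])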
To define a penguin classifier:
net:=neural_network([10,15,7,3],classes=["adelie","chinstrap","gentoo"])
    a classifier with input of size 10 and 3 classes
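The weight initialization scheme and a network name can also be chosen at creation time. For instance, the same classifier with Glorot (uniform) initialization and a custom title could be defined as follows (the title string is only illustrative):
neural_network([10,15,7,3],classes=["adelie","chinstrap","gentoo"],weights="glorot-uniform",title="penguins")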
Now we create a copy of net with block size changed to 10:
netcopy:=neural_network(net,block_size=10)
    a classifier with input of size 10 and 3 classes
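Hyperparameters of an existing network can be read back with the bracket syntax described above; for instance, the following calls should return the layer sizes [10,15,7,3] and the block size 10, respectively:
netcopy[topology]
netcopy[block_size]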