Biomedical Engineering - Methods and Applications of AI in Biomedicine

Completed notes of the course - AN2DL


MACHINE LEARNING VS DEEP LEARNING

Deep learning was developed to analyze data that need to be interpreted through patterns (specific structures). For example, consider the ECG signal: the doctor analyzes the waves, i.e. the patterns, to make a diagnosis, i.e. a prediction. This prediction is based not on each individual sample, but on the structure, on the pattern. Deep learning is made to provide predictions over this type of signal.

Supervised learning: classification
One task is to associate an input with a given class or category. We need to determine whether an image contains a car or a motorcycle. The input is the image. The target is the type of content of the image (in this case car or motorcycle), so the output of our network is binary. What was typically done to solve this problem is: given an image, extract some information (features); this is done by a smart program that takes, for example, the color intensity.

Supervised learning: regression
The input domain is still an image, but the target is not a category: it is a continuous domain, like a price. The regression problem is to predict how much a car or a motorcycle costs. The target domain is not binary but a real number. What regression has in common with classification is that, in order to solve these problems, we always need a training set, i.e. target values associated with images.

Unsupervised learning: clustering
There are machine learning problems where we are not given inputs with expected outputs. There are some figures, and we ask the machine to group them into distinct categories. So, depending on how similar they are, we can identify groups (for example bicycles, cars, motorcycles...). This task has to be solved without any supervision.

Machine Learning Paradigms
We have a training set D, which in the case of supervised learning contains an input x_n and a target value t_n (the desired output), while in unsupervised learning we don't need a target value: we have a set of images, and we want to find regularities and group similar inputs. With massive amounts of computational power, machines can now recognize objects and translate speech in real time. Artificial intelligence is finally getting smart.

HAND-CRAFTED FEATURES

Example: in a post office there are plenty of parcels traveling along a conveyor belt, and we need to recognize 3 types of items: a box, a bag, or 2 stacked boxes. The objective is to recognize when there are 2 boxes, so that they can be separated and each put in its own space. What we do is place a camera over the belt. The camera makes things easy: what we see is a grey intensity level for each pixel, related not to the color but to the distance from the camera (a depth image). We can classify each image as a double, a parcel (box) or a bag. The first image represents a single bag, the second a double box and the third a single box. We have to find a program that recognizes the difference between these 3. This is an example of supervised learning because there is a target.

How do we solve this problem? The idea is to measure the area, for example by counting the pixels at 25 cm of depth in the third photo: we count how large the object is. We use this information to determine whether the item is an envelope, a double or a parcel. We can do the same for other measurements, like the maximum height. We can thus translate the image classification problem into a classification problem over these simple quantities, which we call features.
We call them features because they are very peculiar in describing the images. We can also look at other elements, such as the perimeter, the minimum height, the ratio, the area, etc. We can map the image into a vector describing the most significant characteristics for solving the classification problem. This is feature extraction, and it was the only way to classify images before deep learning. It is the same thing a doctor does to make a diagnosis: if we analyze, for example, the heartbeat morphology, we can extract information such as the PR interval, the width of the QRS complex and so on. The input is very complicated, because the image is a signal, and in order to obtain information from it we need to extract features.

The training set
Let's return to our post office domain. Let's try to separate the images into the 3 categories using 2 features: height and area. We can place each picture at a specific position in our plane depending on the values of its features, looking just at the location of the center of each picture. Now let's consider the whole training set. What we need in order to classify these images is to provide some rules. Given an input image we get a point, a vector of average height and area, and we want to determine the regions where the envelopes, the parcels and the doubles lie. If the height is below 2.5 we have a parcel; if it is over 6.2 we have an envelope; in between we also have to look at the area: when the area is high we have doubles, and when it is low we have envelopes. In the end we obtain this classifier: a tree classifying image features, i.e. a decision tree. Each leaf of the tree is associated with a label. This classifier has a few parameters: the thresholds (the split values, e.g. 200/250 pixels) and which feature is examined at each split (height or area). Changing these parameters a bit, we change the division into regions, and by doing this we could obtain a better performance. But in order to adjust these parameters, we need to quantitatively assess how good we are. We can count how many elements fall in the wrong region → 14.2% error. If we change our separation lines a bit, i.e. change the parameters, we could obtain a better performance.
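As a concrete recap, here is a minimal sketch of the decision rules above as code. The height thresholds (2.5 and 6.2) come from the notes; the area threshold of 225 and all names and units are illustrative assumptions:

```python
def classify_parcel(height: float, area: float, area_threshold: float = 225.0) -> str:
    """Classify a depth image from two hand-crafted features (illustrative thresholds)."""
    if height < 2.5:
        return "parcel"       # low items -> single parcel box
    if height > 6.2:
        return "envelope"     # per the rule in the notes
    # intermediate height: disambiguate using the area
    return "double" if area > area_threshold else "envelope"

print(classify_parcel(height=2.0, area=100.0))   # -> parcel
print(classify_parcel(height=4.0, area=300.0))   # -> double
```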
Data driven models
What machine learning does is, given the training set, identify the best parameters to reduce the loss. The model parameters (e.g. neural network weights) are set to minimize a loss function (e.g. the classification error in case of discrete output, or the reconstruction error in case of continuous output). This can definitely boost the image classification performance. This is how, during training, the computer learns.
• An annotated training set is always needed
• Classification performance depends on the training set
• Generalization is not guaranteed
A better classifier from the literature is the neural network. Inside each of its circles there is a number. If our feature vector is, say, 6-dimensional, the first column of circles, the first layer of the neural network, contains exactly the values of that vector: this is what is called the input layer. Then, the output layer has the same size as the number of categories: what we have in those circles is the probability of the input belonging to each category. In the middle we have the hidden layers, plenty of circles which contain numbers, the results of some computations.

What is important to understand is that the first part is hand-crafted, while the second part is data driven, because it is something we automatically tune to optimize the separation between features. There are some advantages: since the first part is defined by ourselves, we know what is happening in the model, and we need few images to understand which features are meaningful (the area, the height and so on). In contrast, this approach doesn't work if we give the model a visual recognition task that is too complex.

Example with flowers
The features are the petal length and the sepal width. There are situations in which classification and modeling are easy to perform, and others where the model is more difficult to implement. Sometimes the features that we extract are not able to separate the two classes, because the two clouds of dots completely overlap. In this case what we need to do is make the feature extraction data driven as well. The deep learning paradigm consists in learning also the first part of the algorithm, the feature extraction, and then using the learned classifier to easily classify the images into categories. If the features are able to distinguish the elements well, the classification task will be easy. So, the main difference between machine learning and deep learning is that in machine learning we need to recognize patterns and rely on a feature extraction step, which returns a vector; then we can apply a classifier over this vector. With deep learning, the feature extraction part also becomes data driven. The main advantage is that we extract features exactly to obtain a better separation into categories and to improve the classification performance. The matter of fact is that we improve the visual recognition performance of the model.

DEEP LEARNING
Image classification means that we are given a natural image, and we have to identify the category it belongs to. Before 2012 this was done with hand-crafted features. With the introduction of deep learning the error improved, until it got below human error. This was possible thanks to two conditions: large collections of annotated data, and parallel computing architectures (which allow us to use sophisticated models having millions of parameters). There are also very established frameworks for training these models (TensorFlow and PyTorch).

Advanced Visual Recognition Problems with DL
We can use DL not only to classify images, but also to solve modern advanced visual recognition problems, like segmenting images. Over these pictures fake colors are drawn, and we can see an annotation per pixel: the model tells us which element in the picture is a car, which is a human and so on. DL can not only segment persons, but also estimate the position of their joints, or the position of the eyes. We can reduce noise in pictures, recovering more details from images. We can render real images in the famous styles used by painters such as van Gogh. Or we can associate captions with photos via a neural network, in combination with a model for text. Finally, there is a site capable of generating images of completely invented faces that do not exist in reality, and there are models where a text description makes the generation of an image possible.

FROM PERCEPTRONS TO FEED FORWARD NEURAL NETWORKS
People think machine learning and deep learning are the same thing, but they are not.
Many things are solvable with artificial intelligence, such as scheduling, problem solving and so on; machine learning is not the whole of AI but only a part of it, and deep learning is only a part of machine learning. In deep learning you jointly learn the features used for classification or regression, while in machine learning you have to extract the features that are most suited for a given task. Neither is always the better choice: it depends on the number of examples we have and on how easy it is to design features. If we have little data, machine learning will work very well, often outperforming deep learning, especially in very specific tasks. But whenever designing features is not easy, we will prefer deep learning. So, deep learning is a new frontier.

Neural networks were invented when AI started: biological neural networks were "invented" when life started, and artificial neural networks were invented in the 50s together with AI. Most of the techniques we will speak about were invented between the 50s and the 90s, and even the convolutional neural network was invented between the 90s and 2000. Only recently did we reach the amount of data needed to train very complex deep neural networks. The official starting date of AI is 1955, when a group of researchers proposed a workshop to discuss different topics, more or less 7: automatic computers, neural networks, self-improvement (which is machine learning), abstractions, and others.

In the 40s/50s computers were already fast and precise in computing. But they had limits: they needed to be programmed, and they were not robust to noise when interacting with noisy data or directly with the environment. Also, they were not parallel and not fault tolerant, so if only a part of the processor breaks, everything is done. When neural networks started, they were conceived as machines; today we have to think of a neural network mostly as software. Most neural network tools are software, and we have to be specialized in running them. Only recently has there been a sort of new wave of computers which implement the neural network directly in hardware, but nowadays neural networks are still mostly software.

To overcome these limitations, researchers took inspiration from the brain. The human brain has a huge number of computing units:
- 10^11 (one hundred billion) neurons
- 7000 synaptic connections per neuron to other neurons
- In total, from 10^14 to 5 x 10^14 synapses (100 to 500 trillion) in adults, up to 10^15 synapses (1 quadrillion) in a three-year-old child
The idea was to approximate this way of performing computation by designing hardware with a lot of distributed, simple nonlinear units, so that computation was distributed, redundant and parallel. The very first model that came up is the perceptron. It was a first approximation of the biological neuron.

Computation in biological neurons
The idea is to mimic what happens in neurons; this is a first-level description of a neuron. You can imagine the cell as a sort of accumulator of charge, which arrives from other neurons, because the communication mechanism between neurons is electrochemical. Electrochemical charge arrives at a neuron through small connections called dendrites. The charge accumulates inside the cell and keeps accumulating until the potential inside the neuron passes a given threshold. When the charge is above the threshold, all the charge is released through one connection called the axon and reaches all the other connected neurons.
The place where the axon touches another neuron is called a synapse. So, the synapse is the place where the charge is exchanged. This is a nonlinear mechanism. There is a mechanism in the synapses such that, when some charge arrives, it will increase or decrease the charge of the following neuron: depending on the type of synapse, the incoming charge will increase or decrease the accumulated charge → excitatory – inhibitory. The perceptron tries to mimic this: once the accumulated charge is above the threshold, it releases the charge.

Computation in artificial neurons
Information is transmitted through chemical mechanisms:
- Dendrites collect charges from synapses, both inhibitory and excitatory
- Accumulated charge is released (the neuron fires) once a threshold is passed
If we measure the neural activity of the brain, what we see is a pattern of neurons firing in a synchronous way, and the overall computation of the brain is the firing activity of its neurons. It is also usually said that the neuron spikes, and computation in the real brain is done by modulating the frequency, intensity and phase of this firing. We want to do something easier: we don't model any dynamics in the neuron. We will use a very simple model. We model the accumulation of charge with a summation: there are some signals coming from different neurons (x1, ..., xi, ..., xI) and we just sum them. The effect of the synapses, which modulate the effect of the signals, is obtained with weights wi; they can be either positive or negative (excitatory – inhibitory synapses). Then we have the accumulated charge. We compare this charge against a threshold b, the bias: if the difference between the charge and the bias is positive, we have the firing effect; if it is negative, we have no output, or a negative one. The output can be high or low, firing or not firing → 1 – firing, 0 – not firing. We basically have a step function; we can imagine the neuron as a threshold unit. As notation we will use +1/-1 because it is simpler in calculations. The output is called h_j: h is the function computed by the neuron, and j is the index of the neuron. h is a function of x (the input vector), but also of some parameters (weights and bias). We can rewrite h as:

h_j(x, w) = sign( sum_{i=1..I} w_i * x_i - b )

where the capital I represents the size of the input vector: for example, an 8x8 image has size 64. Setting b = -w_0, we can imagine that the bias is just a fictitious input always equal to 1, weighted by minus the threshold. This is useful because the bias then becomes w_0 * 1, and we can simply move this piece inside the summation, which now goes from 0 to I. Conventionally, input 0 is the bias input and is always equal to 1; its weight w_0 is the bias. Now the bias is no longer a threshold but an input weight. Having this sum from 0 to I, we can also write this part in a compact form, as a scalar product: we have a vector of weights, indexed from 0 to I, and a vector of inputs, whose first entry is 1; the sum from 0 to I of w_i * x_i is simply w^T x. The huge speed-up in deep learning and neural networks is given by GPUs (graphics processing units), which have been designed to perform massively parallel scalar products. There are also TPUs (tensor processing units), which perform even better than GPUs because they deal with tensors, which act on 3 or more dimensions.
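A minimal sketch of this artificial neuron in code (NumPy assumed; the weight and input values are arbitrary, for illustration):

```python
import numpy as np

def perceptron(x: np.ndarray, w: np.ndarray) -> int:
    """h(x, w) = sign(w^T [1, x]): threshold unit with the bias folded into w_0."""
    x = np.concatenate(([1.0], x))   # prepend the conventional bias input x_0 = 1
    a = w @ x                        # accumulated "charge": a scalar product
    return 1 if a >= 0 else -1       # fire / not fire

# 2 inputs -> 3 weights (bias weight first); values chosen arbitrarily
print(perceptron(np.array([0.5, -2.0]), np.array([0.2, 1.0, 0.3])))   # -> 1
```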
The idea seen so far was developed in 1943 by McCulloch and Pitts. They studied this model of the neuron, and in their case the Heaviside step (0-1) was the thresholding function. The first physical realization of their model was built in 1957 by Rosenblatt. The weights were adjustable because they were potentiometers, and the whole thing was a big electric network: for each of these variable resistances (weights) there was an electric motor to change it. So, weights were encoded in potentiometers, and weight updates during learning were performed by electric motors. In 1960 Bernard Widrow introduced the idea of representing the threshold value as a bias term: originally the bias was a part of the cell that could not change, and here the weights were initialized randomly.

Think about the perceptron as a piece of hardware and think about a truth table whose first column (the bias input) is always equal to 1. If we set the first weight equal to 1, the second equal to 1 and w_0 equal to -1/2, we get the logic OR computation. With another configuration of the weights, we obtain the logic AND port. The idea is to use this perceptron as learnable logic; it is equivalent to a general Boolean function. The real key point is that the programmer just needs to set the weights, and we get the program. The hardware was half of the story; the other half was the means to program this hardware.

In parallel to this research, Donald Hebb in 1949 was studying the mechanism of learning and how neurons change their synaptic connections, providing the first theory of learning in the brain. He was a psychologist who studied the behavior of neurons in order to understand learning. One of his results is what is called the Hebbian rule: the strength of a synapse, i.e. how big its weight is (small, big, negative, positive...) and how much it contributes to the accumulation of charge in the cell, increases with the simultaneous activation of the input and the output of the neuron. The idea is that if input and output are synchronous, we reinforce that connection, such that the synchronous input-output will happen more easily next time. Hebbian learning says that the more the perceptron is used, the better it discriminates. The rule is:

w_i^{k+1} = w_i^k + Δw_i^k,  with  Δw_i^k = η * x_i^k * t^k

where k is the iteration and the delta represents a (small) variation. This change is related to:
- η: the learning rate → how fast we modify the weights
- x_i^k: the i-th perceptron input at time k
- t^k: the desired target at time k → if the input and the target are both high, Δw is a positive quantity, which means we are increasing the weight; if the input is high but the target is low, Δw is a negative quantity, and we are decreasing the weight.
This is an automatic way to select the weights.
1. For the perceptron, mathematically the learning rate is not important (we can set it equal to 1).
2. Where do we start from? We start from a random initialization. If an example is predicted correctly, there is nothing to do; if, on the other hand, the result is wrong, something needs to be changed → this is why the procedure is iterative. We fix the weights one sample at a time (online), and only if the sample is not correctly predicted.
Does this always converge? If the problem can be learned, yes. Let's just assume that, by following this very simple procedure, we can train the weights of the perceptron and it will converge to a solution.
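A runnable sketch of this learning procedure, using the OR data and the +1/-1 encoding of the worked example below (learning rate and zero initialization follow the notes; the tie-breaking at exactly zero charge is our own convention):

```python
import numpy as np

# OR truth table with the bias input x_0 = 1 prepended to each sample
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
t = np.array([-1, 1, 1, 1], dtype=float)           # OR targets

w = np.array([0.0, 0.0, 0.0])                       # the notes' second example
eta = 0.5                                           # learning rate

for epoch in range(100):                            # one pass over the data = one epoch
    errors = 0
    for x_n, t_n in zip(X, t):
        y = 1.0 if w @ x_n >= 0 else -1.0           # ties broken toward "fire"
        if y != t_n:                                # update only misclassified samples
            w += eta * x_n * t_n                    # Delta w_i = eta * x_i * t
            errors += 1                             # each correction = one iteration
    if errors == 0:                                 # all records correct: converged
        break

print("weights:", w)   # different initializations can give different valid solutions
```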
Perceptron example
This is an OR function, because the output is always high if at least one of the inputs is high. Let's start with a random initialization:
- start from random weights: w = [1 1 1]
- choose a learning rate: η = 0.5
- cycle through the records, fixing those which are not correct
- end once all the records are correctly predicted
Compute w^T x and compare the result obtained with the column of the OR gate. For example, [1 1 1] x [1 -1 -1]^T = -1 → compare with OR → equal → ok. Do the same procedure for all the rows: 1 ok, 2 ok, 3 ok, 4 ok.
Another example: w = [0 0 0]
1. We get 0 → not correct → update the weights → compute Δw_i = η x_i t
2. We get -1/2 → not correct → update
3. We get -1 → not correct → update
4. We get 3/2 → ok
One pass through the data is called an epoch. Each time you make a correction is called an iteration, and each pass through the data is an epoch. Let's assume we have the possibility to correct 2 samples at a time → 2 iterations at a time. Every time we finish an epoch, we restart from the first sample. Does this always converge? Yes: if it is possible to reach 0 error, it will converge. Does it always converge to the same set of weights? No: trying different initializations and different learning rates, you will find different sets of weights that are all fine for our network.

Perceptron math
We said that the perceptron is a nonlinear function applied to the input and the weights; we remove the bias because we are including it in the weights. What the perceptron does is give us the sign of a linear combination (thresholding). The perceptron computes a hyperplane in the (high dimensional) input space: if the input is above the plane, it outputs 1; if it is below the plane, it outputs -1. So, the perceptron is a linear classifier. If we do the math in 2D and solve for one of the input variables, we see that the geometrical region where the weighted sum of the input equals zero is a line. We have zero errors if we can separate positive and negative samples with a straight line. Example: we have a line that separates the positive and the negative samples. Are there Boolean functions that are not solvable with a line? Yes, the most famous is the XOR: the result is positive if and only if exactly one of x1 and x2 is positive. The perceptron cannot solve it. This problem was highlighted in 1969 by Minsky and Papert: they proved that if the problem is not linearly separable, there is no way to solve it with a perceptron, even by increasing the number of inputs, and this happens in every dimension. This caused what is known as the first winter of artificial intelligence.
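To see the Minsky–Papert limitation concretely, a brute-force sketch (a numerical illustration, not the proof): no weight setting on a coarse grid reproduces the XOR truth table with the same ±1 encoding used above.

```python
import numpy as np
from itertools import product

X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
t_xor = np.array([-1, 1, 1, -1])     # +1 iff exactly one input is +1

# Try every weight triple on a coarse grid: none matches all four rows,
# because XOR is not linearly separable in any weight setting at all.
grid = np.linspace(-2, 2, 21)
solvable = any(
    np.all(np.sign(X @ np.array([w0, w1, w2])) == t_xor)
    for w0, w1, w2 in product(grid, repeat=3)
)
print("XOR solvable by one perceptron?", solvable)   # -> False
```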
The perceptron alone does not work anymore, and we need alternative solutions:
- Nonlinear boundaries
- Alternative input representations
One idea to solve this problem is the kernel trick: map the problem into a higher dimensional space. If we have a single perceptron, the decision boundary is a line and there is not much we can do. But if we have a perceptron that computes an output, and another perceptron which computes an output out of that, the regions that the model can approximate become quite interesting → complex regions. So, with a perceptron with 2 layers, or in general multiple layers of perceptrons, we can approximate complex models. But while the idea works, the implementation doesn't: Hebbian learning doesn't work anymore. We cannot apply the usual update formula, because for the hidden units the circled elements (their target values) are missing. Nowadays we are able to train multilayer neural networks, but we do not use perceptrons anymore: the main trick is to remove the threshold function.

So, Hebbian learning is not suitable for the multilayer perceptron, because there is no way to assign targets (and hence errors) to the hidden units during training. The solution, found in 1985, was to devise a general procedure to train the multilayer perceptron. The idea was either to change the perceptron into a linear model with the Adaline rule, or to invent ad-hoc training procedures that could be applied to the multilayer perceptron. The general solution came with backpropagation, the algorithm used nowadays to train feed forward neural networks.

Feed forward neural networks
What is the difference between the multilayer perceptron and the feed forward neural network? The fact that the neurons are not perceptrons: they don't have a sign function or a step function, but a continuous activation function. It is because of this small change that we can train the model. Let's start defining the elements of a FFNN. First of all, we have the so-called input layer, which is basically a fictitious layer where the input arrives and is distributed to the first layer of neurons. Its size is given by the problem: if we have a regression problem with 3 variables and one output, the number of inputs in the input layer is 3. At the top there is a 1: that's the bias, and there is one for each neuron, in each layer. The output layer is also defined by the problem: if we are performing multi-class classification, we might have 1 output for each class; if we have a bidimensional regression, then we have 2 outputs. The minimal network has 1 input and 1 output, but it is not powerful. Then we have the hidden layers, hidden because they are not directly visible, being covered by the input and output layers; we can have as many as we want. Nowadays the idea is to have hundreds of layers, but for now let's assume 1-2 layers.

If we think of this model as a big function, it is a nonlinear function, so a nonlinear model, and its output depends on how many layers we put there and which function we put in each neuron. It is a function because the output is a function of the previous layer, which is a function of the previous one, until we reach a function of the input. The specific function is defined in terms of the number of neurons, the number of layers, the activation functions, and the weights and biases. Once we have defined the size of the problem, we have to define the weights, and this is the role of the training algorithm: find the weights from the examples. Each layer is connected to the previous one with a set of weights; so, for each layer we have one matrix of weights, and each neuron in the middle layer is computed as a weighted sum of the previous layer. Assume that the first middle layer has J1 neurons: the matrix of weights of layer 1 has size (I+1) x J1, where the +1 accounts for the bias. Then we have a matrix for the second layer, of size (J1+1) x J2, and for the third, (J2+1) x K. It is very easy to count these parameters.
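A quick sanity check of the matrix sizes above; the layer sizes used here are illustrative assumptions:

```python
def n_parameters(I: int, hidden: list[int], K: int) -> int:
    """Parameters of a fully connected FFNN: one (size+1) x next_size matrix per layer."""
    sizes = [I] + hidden + [K]
    return sum((sizes[l] + 1) * sizes[l + 1] for l in range(len(sizes) - 1))

# (I+1)*J1 + (J1+1)*J2 + (J2+1)*K = 65*100 + 101*100 + 101*3
print(n_parameters(I=64, hidden=[100, 100], K=3))   # -> 16903
```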
What we have described so far is the so-called fully connected architecture: every neuron of a layer is connected to all the neurons of the previous one. We can have simplified architectures where many of these connections are removed or shared, as in convolutional neural networks, but for now this is our reference. There is another restriction, which says that each layer is connected only to the previous one. This is not necessarily true: there are also skip connections, connections that go, for example, from the input toward the output; these are called skip or shortcut connections. But for now, let's assume that the function of each layer depends only on the function of the previous layer. With this architecture, if we think about the path of the input, it always goes forward: we don't have any connection within the same layer, and we don't have any connection going back. It is for this reason that this network is called a feed forward neural network. So, the solution was to restrict the kind of architecture under study to feed forward neural networks, and to impose that the activation function of the neuron is differentiable. This is because, if we can differentiate each activation function, then the entire architecture can be differentiated. The key point of backpropagation is the possibility to compute the derivative of the function computed by this network with respect to each single weight.

Which activation function?
The activation function is the nonlinear function computed inside the neuron; with the perceptron it was the step or sign function applied to the weighted sum. With FFNNs we cannot use the sign (it is not differentiable); in theory we could use any function, but in practice only a handful, say 3-5, are commonly used. For sure, the most common is the linear activation function: its output is g(a) = a, and its derivative is equal to 1. Another activation function, the most famous historically, is the sigmoid. We can imagine it as a continuous approximation of the step function, with the y axis limited between 0 and 1: when the input is negative the output is below 0.5, when the input is positive it is above 0.5. The derivative of the sigmoid at a point is the value at that point times 1 minus the value at that point, g'(a) = g(a)(1 - g(a)): fast to compute. The last activation function is the continuous version of the sign function, the tanh. The hyperbolic tangent goes through the origin, tends to -1 when the weighted sum of the input is very negative and to +1 when it is very positive, and is symmetric with respect to zero. Its derivative is again convenient: g'(a) = 1 - g(a)^2. Is the function the same for each layer? No: in principle we could have a different activation function for each neuron, but there are rules for using the different activation functions. These activation functions also have some problems, related to the norm of the gradient.
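For reference, the three activation functions just described, with their derivatives (a sketch in NumPy):

```python
import numpy as np

def linear(a):   return a
def d_linear(a): return np.ones_like(a)          # derivative is constant 1

def sigmoid(a):  return 1.0 / (1.0 + np.exp(-a))
def d_sigmoid(a):
    g = sigmoid(a)
    return g * (1.0 - g)                         # value times (1 - value)

def tanh(a):     return np.tanh(a)
def d_tanh(a):   return 1.0 - np.tanh(a) ** 2    # also reuses the output

a = np.array([-2.0, 0.0, 2.0])
print(sigmoid(a), d_sigmoid(a))
print(tanh(a), d_tanh(a))
```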
Output layer in Regression and Classification
There are 2 classical tasks in supervised learning: regression and classification. In regression we try to predict a numerical value, so the codomain spans the entire real line. In this case we cannot use a sigmoid activation function in the output, because it is bounded between 0 and 1, nor can we use the hyperbolic tangent, for the same reason (bounded between -1 and 1). So, if we have a regression problem, the output has to be a linear function. This is the best practice: if we want to use the sigmoid function for some reason, we have to normalize the output between 0 and 1 and then predict the normalized output. Classification means selecting one item, label or class out of a set and predicting it. Let's assume we have a binary classification problem, so 2 classes. In that case we have 2 choices:
- Code the 2 classes with +1/-1, and in that case we can use the hyperbolic tangent as output.
- More often, we prefer to code the binary classes with 0 and 1, because with this coding we can interpret the output as a probability: if the output of the network is 0.99, we can read it as a 99% probability of being class 1 and a 1% probability of being class 0.
In this second case we use a sigmoid output activation, and we can interpret the output as the probability of class 1. In the case of multiple classes, the solution is called one hot encoding. It means that if we want to predict 1 out of 3 classes, we design the network with 3 output neurons (for classes 0, 1 and 2), and the targets are coded as t_0 = [0 0 1], t_1 = [0 1 0], t_2 = [1 0 0]. By simply doing this, however, there is no straightforward way to enforce the outputs of the 3 neurons to sum to 1, which we need if we want to interpret the output of the network as a probability: we would like the outputs of the network to sum up to 1. Each neuron represents one of the classes, and when we perform classification we just pick the most probable one. So, what we do is use a sort of normalized activation function: the output of each neuron k, where K is the number of classes, is equal to

y_k = exp(z_k) / sum_{c=1..K} exp(z_c)

This output is called softmax. "Max" because, thanks to the exponential, it takes the maximum of the outputs and makes it bigger with respect to the others; "soft" because, unlike the max, which is a non-differentiable operator, it is differentiable: we can compute its derivative with respect to each input. For the rest of the network, we can choose either the sigmoid or the hyperbolic tangent.
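Before moving on, a sketch of the softmax just defined; subtracting the maximum input is a standard numerical-stability trick (it cancels in the ratio and does not change the result):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())      # stable: avoids overflow in the exponentials
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
print(y, y.sum())   # outputs sum to 1, so they can be read as probabilities
```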
Neural networks are universal approximators
Why restrict ourselves to the sigmoid and the hyperbolic tangent? Because in 1991 the so-called universal approximation theorem of neural networks was proved, which says that "a single hidden layer feedforward neural network with S-shaped activation functions can approximate any measurable function to any desired degree of accuracy on a compact set". We don't need other elements to approximate a function. It is like the Fourier transform: we can approximate any signal just by summing sinusoidal functions, and we can approximate any nonlinear function by using polynomials. In this case, we can approximate any nonlinear function just by having a lot of hidden neurons with S-shaped activations. Regardless of the function we are learning, a single layer can represent it. Two caveats:
1. The fact that a set of weights approximating any function exists does not mean we can find it: our problem is to find these weights.
2. If we want to approximate complex functions, we might need an exponential number of hidden neurons. The problem is that the more neurons we add, the more flexible the function becomes, because we have more parameters; and the number of parameters grows quadratically with the number of neurons: if we put 100 neurons in each of two consecutive hidden layers, we have 10000 weights between them.
There is another theorem which says that for classification we just need 2 middle layers.

Optimization and learning (Supervised learning)
We have a FFNN with 1 or 2 middle layers, and we have chosen its activation functions. Now the problem is how to learn the parameters of this network. Let's focus on regression. Assume that our model is a nonlinear function, with a linear output and sigmoids in the hidden neurons. Thinking of this as a big parametric model means we have decided an analytical shape, albeit a complex one; we can then think of the output of this network as some function of the input and of the parameters of the model, and we want to find these parameters. So, let's assume we have a dataset, a set of examples with inputs and desired outputs: input x1 → desired output t1, and so on.

What we would like to do, in terms of learning, is to have the output of our model as close as possible to the desired value. So, learning means taking these data and finding the parameters such that our model, at least on these points, behaves as we would like it to. In classification this means having the same probability distribution; in regression, having the minimum squared error; and there are other criteria for being as close as possible to the target. We can define supervised learning as finding the parameters of our model such that, at least on our examples, the model fits the data. Our parametric model in this case is the output of our neural network, and the main requirement is that this output is as close as possible to our target. Here h is the activation function of the middle neurons and g is the activation function of the output; they have different names because usually output and middle neurons have different activation functions. The output of the network is the g function applied to the weighted sum, over the J hidden neurons, of the outputs of those neurons, each of which is h applied to a weighted sum of the input:

y(x) = g( sum_{j=0..J} w_j^(2) * h( sum_{i=0..I} w_{ji}^(1) * x_i ) )

Training this network is equivalent to finding all the parameters such that the network approximates our data. Let's see a simple example in regression. What we have to do is find the parameters of this network that minimize the squared differences between our targets and the outputs of the network:

E = sum_n ( t_n - g(x_n | w) )^2 → the sum of squared errors

Let's assume this is our data, where each point is a target, an example; the blue line is one of the possible functions that our network can generate, depending on the parameters; the grey segments are the differences between the targets and the model predictions. What we want to find are the parameters which minimize this error. In this specific case the blue line is a linear model, but we can obtain very complex functions by adding hidden layers and hidden neurons. How do we know how many layers? One for regression, two for classification, for now. How many neurons in these layers? We don't know, but we will see some methods to obtain the right number J of hidden neurons. Now let's discuss how we find the weights which minimize this error function.

Nonlinear optimization 101
We cannot use straightforward least squares. In linear regression we had a quadratic error function and a linear model → a unique minimum, which we could compute using the pseudoinverse. But now we cannot, because the weights do not appear as linear terms: they appear inside nonlinear functions. So, differently from linear regression, what we have to do is minimize a nonlinear function. Minimizing a function means finding the parameters for which the derivative of the function is 0; the points where we have a minimum are among the stationary points of the function. So, our goal is to find these stationary points, setting the derivative equal to zero. This is not easy: given the complexity and non-linearity of the function, we don't have a closed-form solution. Moreover, there are many different stationary points, and we cannot compute them exactly. So, what we do in practice is use a very simple iterative solution to find these minima.
This iterative solution starts by:
- initializing the weights (randomly, as we did with the perceptron)
- computing the derivative → this is why we want a differentiable function: to train the model we need to minimize the cost function, which in the literature is also known as the loss function; we minimize it iteratively via gradient descent, and for that we need to compute the derivative of the cost function with respect to the weights, and the activation functions are part of this derivative.
In 1985 the idea was exactly this: use differentiable activation functions, so that we can compute derivatives, so that we can use gradient descent to compute the weight updates.

Gradient descent – Backpropagation
There are different gradient descent algorithms. Recalling that we have to minimize the cost function, and that we are facing a regression problem, our error is the sum of the squared differences between the function and the targets. So, ideally, we get an error function that we want to minimize. We start from a random weight initialization and perform gradient descent: we compute the derivative of the error function at those weights. If the derivative says that going left the function increases and going right it decreases, we go right, in the direction of decrease (for the first point): we make a step in the opposite direction of the gradient,

w^{k+1} = w^k - η * dE/dw evaluated at w^k

The gradient is the derivative of the function and points in the direction of increase. So, if the gradient is positive we have to decrease the weights, because we go in the opposite direction; if the gradient is negative (meaning that the function decreases as the weights grow), we have to increase the weights. That is why there is a minus in the formula. How big do we make these steps? We don't want to make very big steps, otherwise we might not converge: going from one step to the next, if the step is too big we might jump over the minimum; but if the step is too small, convergence becomes very slow. The first problem of this strategy is that the solution we find depends on where we start, on the initialization, because we can also converge to a local minimum. One strategy to mitigate this problem uses a parallelism: a rock rolling down the mountain does not stop at a too-small obstacle, it tries to continue. So, one strategy is to use an inertial term, adding a term called momentum, which acts as a sort of memory of the previous gradients. But even this might not be the solution: in some cases, when we reach the first valley, the momentum is not large enough to escape it, even though it is not the absolute minimum. Then we have to start from different points and observe which gives the best convergence value of the error function.
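A minimal sketch of gradient descent with momentum on an invented one-dimensional error function (the function, rates and starting point are all illustrative assumptions):

```python
def E(w):  return w**4 - 3 * w**2 + w     # a toy non-convex "error" (invented)
def dE(w): return 4 * w**3 - 6 * w + 1    # its derivative

eta, alpha = 0.01, 0.9                    # learning rate, momentum coefficient
w, v = 2.0, 0.0                           # starting point; try others too

for k in range(500):
    v = alpha * v - eta * dE(w)           # inertial memory of previous gradients
    w = w + v                             # step opposite to the gradient

print(w, E(w))
# With momentum the iterate rolls past the shallow valley near w ~ 1.1 and
# settles near the global minimum w ~ -1.3; with alpha = 0 it gets stuck.
```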
Gradient descent example
We have to compute the gradient of some error function: we want to minimize the cost function E. Suppose we are interested in the derivative with respect to one particular weight, say weight 3,5 – j = 3 and i = 5, element (3,5) of the matrix of weights of the first layer. Our goal is to compute the derivative of the error function (the sum of squared errors) with respect to the weight w_{3,5} of the first layer, i.e. dE/dw^(1)_{3,5}.
(*) the derivative of the sum over the hidden units reduces to the single element j = 3, because it is the only one that depends on w_{3,5};
(**) d/dw_{3,5} ( w_{3,0} x_0 + w_{3,1} x_1 + ... + w_{3,5} x_5 + ... + w_{3,I} x_I ) = x_5
If we compute the derivative for a weight of the second layer, it is simpler, because the chain has fewer elements: the inner part relative to the first layer does not appear.

Gradient descent: in the general formula, the update sums, over all possible examples, the contribution of each example to the derivative of the error:

w^{k+1} = w^k - η * sum_n dE_n/dw

From now on, let's dwell on this final summation. It means that, before updating the weights, we do not look at only one example: we put together the derivatives of all the examples and correct the weights by taking all the samples into account. This is called batch, and it is the opposite of "online": online means correcting after every sample (as the perceptron did), while here we correct with all the samples at a time. So, within an epoch we do not have many iterations, but only one. Batch updating has many good properties: for instance, gradient descent will not go back and forth, because if 2 samples pull the error function in different directions, the 2 contributions cancel out. However, it has a problem. GPUs, which contain thousands of cores, are very common nowadays, and the batch approach becomes impractical with them: the data must go into the GPU's memory. Batch works if we can load both our model and all the data into the memory of the GPU; otherwise we keep going through the memory of the processor, and all the advantages of the GPU's parallel processing disappear. Modern datasets with thousands of images don't fit in the GPU, so variations of batch have been proposed. In batch mode we compute the gradient of the error function with respect to each weight by averaging, over the dataset, the contribution of each individual sample, if we can fit the data in memory. If not, we can do a sort of approximation by resorting to the online approach: we load one sample into memory, we update the gradient, then load the second sample, update the gradient, and so on. Online learning is related to stochastic gradient descent (SGD): the idea of approximating the batch gradient with a sequence of single-sample updates is called stochastic because there is a stochasticity in the result, due to the stochasticity of the sampling process. Since the GPU memory is mapped onto the processor memory, while the GPU updates the gradient a parallel thread can load the next data point into the GPU memory, so that the GPU barely notices this stream of data. SGD is unbiased, which means that on average we get the same result, but it has high variance, so it is very noisy. We can also do something in between: if in our GPU memory we can fit several samples, instead of updating the weights one sample at a time we update with a subset of samples at a time, usually multiples of 8 (8, 16, 32). This method is called mini-batch gradient descent. We group the data into batches; the smallest batch has size 1, in which case we do SGD. This is a tradeoff between the very smooth batch gradient descent and the high-variance SGD: the bigger the batch size, the smoother the training. Note that if the per-sample gradients are summed, with the same learning rate a batch of 10 samples takes steps roughly 10 times bigger than SGD, so training may look unstable, as if the learning rate were too big. It is important to remember that the data are represented as a tensor, a 3D or 4D array, whose first dimension is the batch size.
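A sketch of mini-batch gradient descent on an invented linear toy problem; setting batch_size to the dataset size gives batch gradient descent, setting it to 1 gives SGD (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # toy dataset
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
eta, batch_size = 0.01, 32                     # multiples of 8 are typical

for epoch in range(20):                        # one epoch = one pass over the data
    idx = rng.permutation(len(X))              # reshuffling: the "stochastic" part
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        err = X[b] @ w - t[b]
        grad = X[b].T @ err / len(b)           # average gradient over the mini-batch
        w -= eta * grad                        # one iteration

print(w)   # close to the true weights [1.0, -2.0, 0.5]
```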
Can we compute the gradient in an automatic way? The answer is yes. To understand how, we need some concepts related to automatic differentiation. Naively, we would compute the gradient for each weight separately, associating one gradient computation to each weight. But there are gradient evaluation techniques which require just 2 passes:
- Let's assume we have x as a real number and 2 functions, f and g (from the real domain to the real domain).
- Consider z = f(g(x)) = f(y), where y = g(x).
- Now compute the derivative of z w.r.t. x: dz/dx = (dz/dy) * (dy/dx) = f'(g(x)) * g'(x).
The same holds for backpropagation. The derivative of the error w.r.t. a weight is: the derivative of the error w.r.t. the output of the network, times the derivative of the output of the network w.r.t. the input of the output neuron, times the derivative of the input of the output neuron w.r.t. the output of the middle layer, times the derivative of the output of the middle layer w.r.t. the input of the middle layer, times the derivative of the input of the middle layer w.r.t. the weight we care about. There is a very simple pattern, related to the feed forward structure of the network, that allows us to compute the derivative immediately. This way of computing derivatives is known as the chain rule, because we can decompose one complex derivative as a chain of simpler ones. So, when we get an input, we can compute the value of each hidden neuron, and once we know the output of a neuron we also know its local derivative: for a sigmoid, the derivative is the output times (1 - output). During this pass we can therefore store, for each of these nodes, the values needed for the derivative of the error. This first evaluation is called the forward pass, and it is the same operation we do when we compute the output of the network; the only difference is that during training we also store the values needed for the derivatives. Then, to compute the gradient, we do what is called the backward pass: if we are interested in the derivative w.r.t. a particular weight, we take the input feeding that weight, multiply it by the gradient of the following node, then the following weight, then the gradient of the next node, and so on until we reach the derivative of the error → the opposite direction with respect to the forward pass. For example, if we have a shortcut connection from x_i straight to the output, the corresponding gradient is simply the product of the derivative of the error, the derivative of the output activation, and the input x_i.
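Putting the two passes together, a sketch of the forward and backward passes on a tiny one-hidden-layer network with sigmoid hidden units and a linear output, for a single sample (all values are illustrative; bias terms are omitted for brevity):

```python
import numpy as np

def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))

x = np.array([1.0, 0.5, -0.3])          # input
t = 0.7                                  # target
W1 = np.array([[0.1, -0.2, 0.4],
               [0.3, 0.1, -0.5]])        # first-layer weights (2 hidden neurons)
w2 = np.array([0.2, -0.1])               # output weights

# Forward pass: compute and STORE the intermediate values
a1 = W1 @ x                              # input to the hidden neurons
h = sigmoid(a1)                          # hidden outputs (also give the derivative)
y = w2 @ h                               # linear output

# Backward pass: chain rule, from the error back to each weight
dE_dy = 2 * (y - t)                      # derivative of the squared error
grad_w2 = dE_dy * h                      # dE/dw2 = dE/dy * dy/dw2
dE_dh = dE_dy * w2                       # propagate through the output weights
dE_da1 = dE_dh * h * (1 - h)             # sigmoid derivative: output * (1 - output)
grad_W1 = np.outer(dE_da1, x)            # dE/dW1 = dE/da1 * da1/dW1

print(grad_w2, grad_W1, sep="\n")
```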
Some details: what is the right error function to use? Are there other error functions? Nowadays we don't use the term error function; we talk about loss functions. There are many loss functions, depending on the task and on the bias we want to introduce in the network. This is the cost of the network, which we want to optimize. How can we design a cost?

A note on maximum likelihood estimation
Let's assume we have some samples, from x_1 to x_N, and let's assume they come from a normal distribution with some unknown mean but known variance. This is a common situation: for example, measuring the temperature in a room. We don't know the temperature of the room, but we expect the data coming from the thermometer to be distributed around the mean, plus or minus some instrument error, and we know the typical error of the instrument. So, if we estimate the mean, we estimate the temperature of the room. We assume that each of these points came from a Gaussian distribution, and we want to find the best parameter θ given the data. Suppose we consider 3 possible Gaussians: we prefer the purple one, because it is the one which makes the observed samples most likely. Maximum likelihood estimation uses exactly this criterion, based on the data: it selects the parameters which make the data most likely. Now that we know the principle, we can make hypotheses that render the observed points as likely as possible.

Maximum likelihood estimation: The Recipe
Let's assume we have a distribution (any) with a vector of parameters θ. To find the maximum likelihood estimate:
- First, write the likelihood of the data, L = P(data | θ).
- Often, we take the logarithm of this probability: l = log P(data | θ). This is because logarithms transform products into sums, which are preferred; moreover, if we let a computer multiply a lot of numbers, some of which are below 0.1, after a few hundred points we go into underflow.
- Then we compute the derivatives to find the maximum, and we select the parameters which make the derivative of the likelihood equal to zero.
- We have to check that it is a maximum and not a minimum.
How do we maximize instead of minimize? Instead of gradient descent, we use gradient ascent, because we are trying to maximize: instead of updating with -η times the derivative, we update with +η times the derivative.
Let's do the example with the Gaussian. The likelihood represents the joint probability of having observed all the data; assuming independent samples, we compute this joint probability as a product, multiplying the likelihoods of the individual points. So we get a product of Gaussians, evaluating the Gaussian density at each x. The next step is taking the logarithm, which turns the product into a sum. Then we compute the maximum: we take the derivative w.r.t. the parameters (here the parameter we are interested in is the mean), and the last step is to set the derivative equal to zero.
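Worked out for the Gaussian with known variance σ² and unknown mean μ, the recipe gives the sample average (a sketch of the derivation):

```latex
L(\mu) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\,
         e^{-\frac{(x_n-\mu)^2}{2\sigma^2}}
\qquad\Rightarrow\qquad
l(\mu) = -\frac{N}{2}\log(2\pi\sigma^2)
         -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2

\frac{\partial l}{\partial \mu}
  = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n-\mu) = 0
\qquad\Rightarrow\qquad
\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n
```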
We said that our neural network approximates the targets t over a set of observations, but we never said where these t values come from. From statistical machine learning we know that samples come from an unknown function plus some zero-mean noise. Let's make the same assumption: my targets, my observations, come from a function plus some noise; let's say that this function is exactly our neural network, and that the noise is Gaussian with zero mean. What does this mean? Remember that we have some data, distributed according to some function plus some noise. Let's assume the targets come from our network, with some added error, and that this error is Gaussian with zero mean. If we cut our function g at some input x, the data along this vertical line will be distributed as a Gaussian centered on the function: it is like moving a Gaussian distribution along the curve, where the mean of the Gaussian is the output of our neural network. Indeed, writing that our target is given by some function plus some zero-mean Gaussian noise is equivalent to saying that our data are distributed as a Gaussian whose mean is given by the network: if our data are "3 + zero-mean Gaussian noise", they are distributed as a Gaussian with mean equal to 3; instead of 3 we have another deterministic value, the output of our network, so the data come from a Gaussian whose mean depends on the input.

Maximum likelihood estimation for regression
Now that we are saying that our data follow a distribution, let's use maximum likelihood to compute the parameters of this distribution. Same trick we used for the Gaussian; the only unusual part is that, instead of the mean of the distribution, we have the output of our neural network. We compute the likelihood of the data, putting the output of the network in place of the mean; then we compute the log likelihood of the data → same result as before, with this one difference of the output instead of the mean. Now we just have to look for the weights that maximize the likelihood. The final result is that the solution of maximum likelihood estimation is the minimization of the sum of squared errors. There is one case in which this reasoning is wrong → classification.

Neural networks for classification
We cannot represent the target of a network that does classification, i.e. that selects one class, as a number plus some Gaussian noise. The output of a classifier is a multinomial distribution over the classes, which tells us that class 1 has probability 0.3, class 3 has 0.5, and so on. Even in the binary case the target is either 0 or 1: we cannot think of it as a number plus some noise. So, the output of our network will describe a random variable which is either 1 or 0: if the output of the network is 0.999, it is telling us that the target will be 1 with probability 0.999; if the output is 0.001, it is predicting 0 with that probability. So, the appropriate way of modeling the target of a classification problem is the Bernoulli distribution, a binary probability whose parameter is the probability given by the network. So, the output of the network is the probability of class 1, and the expected target is 1 with that probability. Now we have a parametric distribution, and we have a tool, maximum likelihood. What would be the best network, according to maximum likelihood, in the case of binary classification with 0 and 1? Let's apply maximum likelihood.

Maximum likelihood estimation for classification
The Bernoulli distribution gives the probability of class 1 when the class is 1, and 1 minus the probability of class 1 when the class is 0; in exponential notation, with y the network output:

P(t | x) = y^t * (1 - y)^(1 - t)

Applying the same trick as in regression, we find that in classification we don't use the sum of squared errors; the resulting loss function has a name: cross-entropy,

E = - sum_n [ t_n log(y_n) + (1 - t_n) log(1 - y_n) ]

We can also write it in a short form for the multi-class case: the output is a vector with as many neurons as there are classes, t is the target vector, and the loss is t transposed times the log of the output of the network, summed over the samples: E = - sum_n t_n^T log(y_n). For class 1 we get 1 * log(output); for class 0 we get 1 * log(1 - output). For the exam: a classic exercise is one where they give the code of a network and we have to say what it does. The task of the network is written in its loss function: if it is cross-entropy, we know we are predicting classes.
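A sketch of the binary cross-entropy just derived; the clipping epsilon is our own guard against log(0), not part of the formula:

```python
import numpy as np

def binary_crossentropy(t: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> float:
    """E = -sum_n [ t_n log y_n + (1 - t_n) log(1 - y_n) ]."""
    y = np.clip(y, eps, 1.0 - eps)       # avoid log(0) for saturated outputs
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

t = np.array([1.0, 0.0, 1.0])            # binary targets
y = np.array([0.9, 0.2, 0.6])            # sigmoid outputs: P(class 1)
print(binary_crossentropy(t, y))          # confident correct answers -> low loss
```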
NEURAL NETWORKS TRAINING AND OVERFITTING
Last week we discussed the universal approximation theorem, which states that with a FFNN with just one middle layer, if we put in enough neurons, so enough parameters, we can fit any nonlinear function. It means that whatever function we want to approximate (regression or classification), we can. On the other hand, this is not all good news:
- It doesn't mean that we can find the necessary weights.
- Another problem with FFNNs is that we may have to put in a lot of hidden neurons.
- Additionally, the model might be useless in practice if it doesn't generalize. Generalizing means providing a proper output on data different from those used for training: new data coming from the same distribution, but not exactly the same samples.
Ockham was a philosopher who stated that if we don't need a complex explanation of a phenomenon, we should go for the simplest one. In machine learning too, when we have 2 models with the same performance we should choose the simplest, because the more complex the model, the more likely the overfitting → we are very good at fitting the training data, but the model does not behave properly on new data: it has just memorized the data points and has not learned the phenomenon that generated them.

Model complexity
An example from regression: let's assume we have some data that come from a quadratic function (a parabola). If we try to fit these data with a linear regression (line M1), we might not be able to fit the points properly: we say that the model is too simple and underfits the data → the model has high bias. The model that we assume in order to fit the data is called the inductive hypothesis. With neural networks we have the opposite problem: we have a very flexible model. Theory states that we can learn any nonlinear function, which means that with enough neurons, enough parameters, we are able to pass through all the points. In this case the model Mn (n = number of parameters) has the problem that passing very close to the points in training doesn't mean that, when another point comes from the quadratic function, we will pass close to it too. In general, what happens is that we end up farther from the true model than we expect: the model is good at predicting the training samples, but not good at predicting new data → overfitting, the opposite situation, in which we lose generalization capability. This overfitting issue is very common with neural networks, so we have to discuss techniques to avoid it.

How to measure generalization?
When we train the model, we try to minimize the error function, i.e. the sum of squared errors on a training set, and we obtain the minimum value reached by gradient descent. Evaluating the quality of the model by this minimum value is quite biased: this error tells us only half of the story. It tells us that the model is training, that backpropagation is working, that we can get a very low error, but it doesn't tell us what the error is going to be on new data. So, we need to evaluate the model on data different from those used for training. The dataset used for evaluating the model is called the test set: we train the model using a training set and, when we finish, to understand how good the model is, we test it on new data. Where does this test set come from? Either it is provided to us from outside, or it consists of data that we set aside and do not use until then. When we talk about the test set, we talk about the very last step, always, even if the model doesn't work: if we go back and change things after the test, we cheat, and that is dangerous. In practice, we use 3 sets: the training set (to set the weights), the test set (used to assess the quality of the final model) and the validation set (used to test how good certain choices are, such as removing or adding some neurons). When we split the data, we might do it randomly.
Sometimes we would like to have the same distribution in the training and the test sets. If we have a regression problem, it is simple: shuffle or subsample from the distribution to get the test set. With classification we have to be more careful: if some classes are more common than others and we just randomly subsample, in the test set we might not even have a single instance of the rare classes, while we want to guarantee that training and test have the same distribution of classes. So, we use a technique called stratified sampling (a sketch is shown at the end of this section). Say we put 90% of the data in the training set and 10% in the test set: with stratified sampling we maintain the 90%/10% division within each class, and in this way we guarantee that the distribution of classes in the training data and in the testing data is the same.

When we get some data, we said that we have a dataset. We can split these data into a training set, used to develop the model, and a test set, used to assess how good the model is; the test is the last step we perform, for the final model assessment. Model development means 2 things: deciding the structure of the model (number of layers, neurons, activation functions) and training the weights. We take the training data and split them in 2: a proper training set, and data selected to perform model selection, the validation set. For example, suppose we consider 10 or 100 neurons: we train 2 neural networks on the training set, one with 10 neurons and the other with 100, using backpropagation. Once we have finished, we have to decide which of the 2 we want to use, and we compare them on the validation set. Suppose that, according to the validation set, the best one is the network with 100 neurons; we might then think that the bigger the number of neurons, the better the model. But what happens if we take 200 neurons? We develop a third network with 200 neurons and, on the validation set, we check whether it is really better. If it is better, we go on with 1000 neurons.
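Finally, a sketch of the stratified sampling described above (labels and sizes are illustrative; in practice a library routine such as scikit-learn's train_test_split with its stratify argument does the same job):

```python
import numpy as np

def stratified_split(labels: np.ndarray, test_frac: float = 0.1, seed: int = 0):
    """90%/10% split that preserves the class proportions in both sets."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):                    # split each class separately
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n_test = max(1, int(round(test_frac * len(idx))))
        test_idx.extend(idx[:n_test])              # 10% of THIS class to test
        train_idx.extend(idx[n_test:])             # 90% of THIS class to train
    return np.array(train_idx), np.array(test_idx)

labels = np.array([0] * 900 + [1] * 90 + [2] * 10)   # class 2 is rare
tr, te = stratified_split(labels)
print(np.bincount(labels[tr]), np.bincount(labels[te]))  # same proportions
```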