
Biomedical Engineering - Methods and Applications of AI in Biomedicine


From ML to DL

ML paradigms
Machine Learning is the field of study focused on developing algorithms and statistical models able to perform tasks without being specifically programmed to solve them, learning instead the optimal procedure to obtain the best outcomes. Machine learning algorithms usually consist of data-driven models designed to be fed with hand-crafted feature vectors and to make predictions or decisions based on them. In ML approaches, those feature vectors consist of structured datasets containing features extracted from raw data, usually by experts in the field (HAND-CRAFTED FEATURES). In DL approaches, on the other hand, the feature extractor itself is data-driven, i.e., designed to be learned from data.

Hand-crafted vs Data-driven
Hand-crafted features may be more convenient than data-driven feature extraction in situations where the characteristics of the data are well understood and the relevant information can be easily identified and extracted by domain experts. In this case, the use of hand-crafted features in combination with ML/statistical models results in models that:
∟ are more efficient during training and less expensive in terms of resources (time, computational power), while performing well even with limited training data available
∟ are interpretable and adjustable (when you know what the model uses to make predictions and how, you can give more or less relevance to some features to adjust the output)
∟ exploit the expert's knowledge
∟ require more design/programming effort
∟ are not general and portable, meaning that they may not perform well or be useful on other datasets or in other contexts. This is because the extracted features are usually heavily tailored to the specific problem/case, whereas features learned from data can in many cases be extended to other domains or reused to some extent (e.g., as a starting point, such as pre-trained weights for feature extraction).

However, data-driven feature extraction methods can be more powerful and may be more appropriate when a large amount of unstructured data is available and the relevant features are not immediately obvious. In this case, the deep learning approach:
∟ enables the solution of problems that would not have been possible otherwise, extracting patterns from natural images or texts that are far beyond human understanding
∟ does not require any a priori knowledge of the input-target relationship
∟ is more portable, since features are learned directly from data and may to some extent generalize better on input characteristics (usually higher performance on general tasks)
∟ is not interpretable, although explainable AI methods can help
∟ requires large datasets to reach high performance and powerful GPUs to be trained
∟ is very time- and computation-consuming, especially when dealing with complex models and huge amounts of data

Structured vs unstructured data
Structured data refers to information organized in a specific format, typically a table or spreadsheet. Examples of structured data include medical records, customer data, etc. Structured data is often used as input for supervised machine learning algorithms, as it provides an organized set of information on which the target output can be predicted. An example is a medical record containing a patient's personal and health data (e.g., age, ethnicity, diagnosis, treatment) on which to estimate the survival rate by means of a proportional hazards model.
Unstructured data, on the other hand, refers to information that is not organized in a specific format, such as text documents, images, time series and videos. To deal with unstructured data it is often (not always) more convenient to use deep learning approaches, letting the algorithm learn to extract features and patterns. It is worth noticing that the data-driven extraction of features is often more convenient for unstructured data because the lack of a structured format makes its analysis and search more difficult; but when domain knowledge is available, the problem is well understood, and the relevant information can be easily identified and extracted by domain experts, hand-crafted features combined with ML or statistical approaches may be more effective and save far more resources. To conclude, both structured and unstructured data can be used as input for AI models to uncover insights and make predictions from data.

Supervised vs Unsupervised Neural Networks
Artificial neural networks (ANNs) are computing systems inspired by the biological neural networks that constitute animal brains.

Brain computational model (a model of brain functioning, on which the ANN is inspired)
An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. An artificial neuron receives signals, processes them, and forwards output signals to the neurons connected to it. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. The "signal" at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs → neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold. The connections are called edges. Neurons and edges typically have a weight that adjusts as learning proceeds; the weight increases or decreases the strength of the connection.

Perceptron model
It is the model of a single connection unit → single neuron model:
• The neuron receives signals from other neurons → INPUTS
• The neuron accumulates the overall incoming charge (synaptic wave) → INPUT SUM
• The neuron processes the incoming signals with a non-linear behaviour → NON-LINEAR ACTIVATION (thresholding according to the bias)
• The neuron forwards its output to other neurons connected to it (action potential along the neuron's axon, transmitted to other neurons at the synapses) → OUTPUT CONNECTIONS
• Edges' strength is modulated → WEIGHTED CONNECTIONS
• The learning process adjusts the strength of connections → LEARNING PROCEDURE with WEIGHT ADJUSTMENT PARAMETERS

Each neuron is characterized by M+1 parameters:
• M input connections → M input connection weights
• Bias → 1 added weight for the unitary input (modelling the threshold for the activation function)

LEARNING PROCEDURE
In neuroscientific theory, the brain's learning procedure is modelled by Hebbian learning:

Hebbian Learning
"The strength of a synapse increases according to the simultaneous activation of the relative input and the desired target." (Donald Hebb, The Organization of Behavior, 1949)
In other words, simultaneous activation of cells leads to pronounced increases in synaptic strength between those cells, or even more simply: "Cells that fire together wire together."
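As a minimal sketch (in numpy, with made-up numbers), the single-neuron model above reduces to a weighted input sum plus bias, followed by a thresholding non-linearity:

```python
import numpy as np

def perceptron_output(x, w, b):
    """Single-neuron model: weighted INPUT SUM + bias, then a
    non-linear (sign) activation."""
    a = np.dot(w, x) + b      # cumulate the incoming charge
    return np.sign(a)         # thresholded output: +1 / 0 / -1

# M = 3 inputs -> M + 1 parameters (3 weights + 1 bias)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])
print(perceptron_output(x, w, b=-0.1))   # -> -1.0
```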
Given a single perceptron model with an initialized set of weights (w0, w1, …, wn) (n weights + 1 bias) on the input connections, and a set of samples, each identified by an n-dimensional vector (x1, …, xn) and the desired output (target, here considered binary, +1/−1), for each input perform the following steps: compute the perceptron output and, whenever it differs from the target, adjust the weights towards the target (Hebbian-style update, e.g. $w_i \leftarrow w_i + \eta\, t\, x_i$); repeat over the dataset until an epoch with no errors occurs.

Does this process converge? YES, if the dataset is PERFECTLY LINEARLY SEPARABLE.

The output of the single perceptron is computed as

$$f(w_0 + w_1 x_1 + \dots + w_n x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \dots + w_n x_n > 0 \\ 0 & \text{if } w_0 + w_1 x_1 + \dots + w_n x_n = 0 \\ -1 & \text{if } w_0 + w_1 x_1 + \dots + w_n x_n < 0 \end{cases}$$

Thus, the single perceptron works as a linear classifier with a decision boundary given by the hyperplane $w_0 + w_1 x_1 + \dots + w_n x_n = 0$.
> we have INFINITE SOLUTIONS (W*) that correspond to the same hyperplane
> according to the INITIALIZATION we get a different solution

BUT, if the data is NON-LINEARLY SEPARABLE, then the procedure will NOT CONVERGE, since there will never be an epoch with no errors!
> Solution: use a MULTI-LAYER PERCEPTRON to have a more complex representation available to fit the problem (no longer just a linear classification) = aggregate neurons in layers and stack multiple layers
> PROBLEM: the Hebbian procedure can't work…
> SOLUTION: Feed-Forward Neural Networks with BACKPROPAGATION

Single perceptron models can also represent Boolean operators (e.g. AND and OR are linearly separable and representable; XOR is not).

Feed-forward NN
Feed-forward neural networks are non-linear models consisting of a set of neurons, characterized by their non-linear activation function, and weighted connections, in which the output of each layer depends only on the combined sum of the signals coming from the previous layers (feed-forward). An FFNN is characterized by its
● parameters (weights + biases)
● hyperparameters (number of units per layer, number of layers, activation functions, etc.)
NB: all functions contributing to the final output must be DIFFERENTIABLE to allow the training procedure with gradient descent.

Activation functions
ReLU disadvantages:
• non-differentiable in zero → not actually a problem, the point can be defined so that h'(0) = 0
• unbounded: $h(a) \to \infty$ as $a \to \infty$, which might lead to big weight updates
• non-centered output → $E[\mathrm{ReLU}(a)] \neq 0$
• dying neurons (the only real issue): ReLU neurons can be pushed to a state where they become inactive for almost all inputs (no input manages to overcome the ReLU threshold) → the output and the derivative remain null → null update for the weights → the capacity of the model is significantly reduced (as if there were fewer neurons). BUT it can be solved using:
  • a proper weight initialization
  • small learning rates
  • a non-null derivative for negative inputs (e.g. Leaky ReLU)
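A minimal numpy sketch of ReLU and sigmoid with their derivatives (using the h'(0) = 0 convention above), illustrating why a dying neuron stops learning:

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def relu_grad(a):
    # non-differentiable in zero: by convention we define h'(0) = 0
    return (a > 0).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)

# dying-ReLU illustration: strongly negative pre-activations give
# zero output AND zero derivative -> null weight update forever
a = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(a))        # [0.  0.  0.  0.5 3. ]
print(relu_grad(a))   # [0. 0. 0. 1. 1.]
```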
BACKPROPAGATION
The process of propagating the error back through the network to adjust the weights in the direction of loss minimization (with a step determined by the learning rate), performed during ANN training. Backpropagation is an algorithm used in supervised learning to train ANNs. Specifically, it is the method employed to adjust the weights in the direction of loss minimization, i.e., to adjust the strength of connections to increase the probability of having a prediction as close as possible to the target. The algorithm consists of 2 steps:
1. FORWARD PASS: use the ANN to compute the predicted output and the relative error (with respect to the ground truth)
2. BACKWARD PASS: the error is propagated back through the network (layer by layer) and each weight is updated by a factor proportional to the obtained error, in the direction of loss minimization

LOSS FUNCTION → the metric defined during ANN design to properly evaluate the network performance (= how close the current output is to the desired one)
GRADIENT DESCENT → weights are adjusted according to the gradient of the loss function > each weight is updated by a factor proportional to the derivative of the error function with respect to it
LEARNING RATE → determines how big the update step is
CHAIN RULE → the analytic method employed to compute the gradient throughout the network (with respect to each connection to be updated) in an efficient manner

The adjustment of each weight is given by:

$$w_{k+1} = w_k - \eta \left. \frac{\partial E(w)}{\partial w} \right|_{w_k}$$

where $w_k$ is the weight of the generic connection between the i-th neuron of the n-th layer and the j-th neuron of the previous layer at the k-th iteration, $\partial E(w)/\partial w$ is the derivative of the loss function with respect to that weight, and $\eta$ is the learning rate.

So, to update the weights, the gradient of the loss function must be computed. To do so we can use the chain rule: the derivative of a composed function is the product of the partial derivatives:

$$\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \cdot \frac{\partial g(x)}{\partial x}$$

Applied to our case (sum-of-squared-errors loss over N samples, network output $g(x_i, w)$):

$$\frac{\partial E(w)}{\partial w} = \frac{\partial}{\partial w} \sum_{i=1}^{N} (t_i - g(x_i, w))^2 = \sum_{i=1}^{N} 2\,(t_i - g(x_i, w)) \left( -\frac{\partial g(x_i, w)}{\partial w} \right)$$

For a network with one hidden layer of K units, $g(x_i, w) = g\!\left(\sum_{k=1}^{K} w_k^{(2)} h_k(\cdot)\right)$, with $h_k(\cdot) = h_k\!\left(\sum_{j=1}^{J} w_{kj}^{(1)} x_j\right)$.

For the adjustment of the weights in W(2):

$$\frac{\partial E(w)}{\partial w_k^{(2)}} = -2 \sum_{i=1}^{N} (t_i - g(x_i, w)) \; g'\!\left(\sum_{k=1}^{K} w_k^{(2)} h_k(\cdot)\right) h_k(\cdot)$$

For the adjustment of the weights in W(1):

$$\frac{\partial E(w)}{\partial w_{kj}^{(1)}} = -2 \sum_{i=1}^{N} (t_i - g(x_i, w)) \; g'(\cdot) \; w_k^{(2)} \; h_k'\!\left(\sum_{j=1}^{J} w_{kj}^{(1)} x_{i,j}\right) x_{i,j}$$

> LEARNING PROCESS: in this way the weights are updated moving closer to a local minimum of the loss… once the local minimum has been reached, the weights found are the locally optimal parameters for our model, providing the best performance.
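To make the two passes concrete, here is a minimal numpy sketch of one hidden layer trained with SSE and plain gradient descent (toy data; tanh hidden units and a linear output g, so g' = 1 — a simplifying assumption, not the notes' setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy regression data: N samples, J inputs, K hidden units
N, J, K = 64, 3, 8
X = rng.normal(size=(N, J))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

W1 = rng.normal(0.0, 1.0 / np.sqrt(J), size=(J, K))  # first-layer weights
W2 = rng.normal(0.0, 1.0 / np.sqrt(K), size=K)       # second-layer weights
eta = 1e-3                                           # learning rate

for epoch in range(500):
    # 1. FORWARD PASS
    H = np.tanh(X @ W1)              # hidden activations h_k(.)
    y = H @ W2                       # network output g(x, w) (linear g)
    err = t - y                      # (t_i - g(x_i, w))

    # 2. BACKWARD PASS (chain rule, as in the derivation above)
    dW2 = -2.0 * H.T @ err           # dE/dW2
    dH = -2.0 * np.outer(err, W2)    # error propagated to the hidden layer
    dA = dH * (1.0 - H ** 2)         # through tanh'(a) = 1 - h(a)^2
    dW1 = X.T @ dA                   # dE/dW1

    W1 -= eta * dW1                  # w_{k+1} = w_k - eta * dE/dw
    W2 -= eta * dW2
```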
LEARNING RATE
The learning rate determines the magnitude of the weight adjustment.
> it determines whether and which local minimum we'll find
if it is too big, we may miss the local minimum, going back and forth over it
if it is too small, we may get stuck in a local minimum far from the optimum
> SOLUTION 1: decrease the learning rate when there are no more improvements
> SOLUTION 2: use a momentum: add to the weight update an inertial factor that gains momentum during the loss decay and in some way forces the descent to keep going in the taken direction. This allows the decay to proceed towards the minimum even when there are
o saddle points (flat loss, null gradient), which might make the learning process stagnant
o small local minima (jittery path)
as a ball would do rolling down a hill, thanks to its inertia.

To account for the momentum, we can use a moving average over the past gradients, possibly exponentially weighted so that recent gradients are given more weight (Exponential Moving Average, EMA).
NB: this solution also adapts the weight update to the gradient stability. A stable decay gains higher momentum, resulting in a bigger update step (past contributions with concordant sign → greater momentum with that sign → contribution to keep going in that direction). When the gradient oscillates, the moving average provides a small contribution, thus a smaller update step (different signs → mutual negative interference → small contribution in whatever direction).

Stagnant process due to saddle points
Let's assume the initial weights of the network under consideration correspond to point A. With gradient descent, the loss function decreases rapidly along the slope AB, as the gradient along this slope is high. But as soon as it reaches point B, the gradient becomes very low and the weight updates around B are very small. Even after many iterations, the cost moves very slowly before getting stuck at a point where the gradient eventually becomes zero. In this case, ideally, the cost should have moved to the global minimum point C, but because the gradient disappears at point B, we are stuck with a sub-optimal solution.
How can momentum fix this? Now imagine a ball rolling from point A. The ball starts rolling down slowly and gathers some momentum across the slope AB. When the ball reaches point B, it has accumulated enough momentum to push itself across the plateau region B and finally follow slope BC to land at the global minimum C.

Adjustment coherent with gradient stability
Case 1: when all the past gradients have the same sign (stable), the summation term becomes large and we take large steps while updating the weights. Along the curve BC, even if the learning rate is low, all the gradients along the curve have the same direction (sign), thus increasing the momentum and accelerating the descent.
Case 2: when some of the gradients have positive sign and others negative (unstable), the summation term becomes small and the weight updates are small. If the learning rate is high, the gradient at each iteration around the valley C alternates its sign between positive and negative, and after a few oscillations the sum of past gradients becomes small, making the weight updates small from there on and damping the oscillations.

BATCH VARIATIONS
The larger the batch, the smaller the learning rate should be, to avoid too-big update steps, which eventually might:
> delay the convergence
> bring gradient descent instability

$$\eta_{\text{traditional (full batch)}} < \eta_{\text{mini-batch}} < \eta_{\text{SGD}}$$
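A minimal sketch of the EMA momentum update described above (the 0.9/0.1 weighting is a common but arbitrary choice):

```python
import numpy as np

def momentum_step(w, grad, velocity, eta=0.01, beta=0.9):
    """One update with EMA momentum: concordant past gradients build up
    a large step (plateaus, stable slopes); alternating signs cancel out
    (oscillations around a valley are damped)."""
    velocity = beta * velocity + (1.0 - beta) * grad
    return w - eta * velocity, velocity

# usage: carry the velocity along the iterations
w, v = np.zeros(4), np.zeros(4)
for grad in [np.ones(4), np.ones(4), -np.ones(4)]:   # stand-in gradients
    w, v = momentum_step(w, grad, v)
```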
Universal approximation theorem
The universal approximation theorem states that a feed-forward neural network with a single hidden layer containing a finite number of neurons with non-linear activations can approximate any continuous function (= any input-output relation) to any desired degree of accuracy, given enough data.
> a single-hidden-layer neural network with enough neurons can approximate any function underlying any given input-output problem, given enough data
> a single layer is enough to model any problem
> any REGRESSION problem can theoretically be approximated using an FFNN with 1 single hidden layer
> any CLASSIFICATION problem can theoretically be approximated using an FFNN with 1 hidden layer plus an additional layer, necessary to map the continuous output to a discrete value (class label)
However, the theorem does not guarantee that:
∟ the number of required neurons in the layer will be tractable (it may be very huge)
∟ the optimal solution (optimal weights) will be found by training
> so in real applications, deeper architectures might be a better option
> more layers can provide a better solution, especially when the data is highly complex or highly non-linear, or when the data is noisy or has outliers

Loss functions
The loss function is what defines, during training, how good the output of the model is and how to update the weights accordingly. Therefore, it is strictly task-dependent:

$$E(w) = \sum_{i=1}^{N} \mathcal{L}(x_i; w, t_i) \quad \text{with } \mathcal{L}(\cdot) \text{ task-dependent}$$

For example, when some sort of numerical distance is a suitable metric for performance evaluation, as in regression, the best choice is the Sum of Squared Errors. Conversely, in classification problems the performance can't be measured in terms of continuous distances and the output cannot be approximated by a Normal/Gaussian distribution. In these cases it is better to model the target distribution as a Bernoulli distribution (the likelihood of belonging or not to a class) and to use the cross-entropy as loss.

REGRESSION → Sum of Squared Errors:
$$\mathcal{L}(x_n; w, t_n) = (t_n - g(x_n, w))^2$$
BINARY CLASSIFICATION → Binary Cross-Entropy:
$$\mathcal{L}(x_n; w, t_n) = -\left[ t_n \log(g(x_n, w)) + (1 - t_n) \log(1 - g(x_n, w)) \right]$$
MULTI-CLASS CLASSIFICATION → Categorical Cross-Entropy (with K classes):
$$\mathcal{L}(x_n; w, t_n) = -\sum_{k=1}^{K} t_{n,k} \log(g_k(x_n, w))$$

Besides these, in complex tasks custom loss functions can also be used. In designing a loss function:
∟ consider how to quantify the difference between the desired output and the given output
∟ exploit the a priori knowledge of the problem and of the data distribution

Weighted loss
Weighted loss is a way to handle class imbalance in a dataset, where one class has many more examples than the other. The idea is to assign a higher weight to the minority class to compensate for its underrepresentation, making the model pay more attention to those instances that are seen less frequently. A common approach is to set the weights inversely proportional to the class frequencies. The mathematical formulation for a weighted loss function in a multi-class classification problem is:

$$E(w) = \sum_{i=1}^{N} \lambda(t_i) \cdot \mathcal{L}(x_i; w, t_i) \quad \text{with } \lambda(t_i) = \text{weight for class } t_i$$
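A minimal sketch of such a weighted loss for the binary case, with weights set inversely proportional to the class frequencies (the normalization constant is one common convention among several):

```python
import numpy as np

def class_weights(t):
    """Weights inversely proportional to the class frequencies."""
    classes, counts = np.unique(t, return_counts=True)
    return dict(zip(classes, len(t) / (len(classes) * counts)))

def weighted_bce(y, t, lam, eps=1e-12):
    """E(w) = sum_i lambda(t_i) * L(x_i; w, t_i), with L = binary CE."""
    L = -(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
    w = np.where(t == 1, lam[1], lam[0])
    return np.sum(w * L)

t = np.array([0, 0, 0, 0, 1])      # imbalanced: class 1 is the minority
print(class_weights(t))            # {0: 0.625, 1: 2.5} -> minority upweighted
```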
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a method for estimating the parameters of a statistical model given a set of observations. The idea is to find the parameter values that make the observed data most probable according to the model. Given a dataset $D$ whose distribution is $p(x_n; \theta)$, the aim of MLE is to learn the parameters $\theta$ of a model that provides a data representation as close as possible to $p(x_n; \theta)$ → maximize the log of the data likelihood:
1. Define the log-likelihood: $L(\theta) = \log(p(x_n; \theta))$
2. Find the parameters that maximize it: $\arg\max_\theta L(\theta)$, with $\frac{\partial L}{\partial \theta_j} = 0$ (solving the set of simultaneous equations)

In ANNs, this is performed by minimizing the loss function (→ the lower the loss, the closer the estimated representation to the real distribution). Performing the MLE, it can be demonstrated that the Sum of Squared Errors and the binary Cross-Entropy are the optimal loss functions for regression and classification respectively.

i.i.d. = independent and identically distributed. Training data follow a distribution that is supposed to be close to the real distribution (if unbiased). A model with defined parameters identifies a specific representation for the provided dataset. We want the learned representation to be as close as possible to the real distribution → maximize the likelihood of having predictions around the most likely real values.

Regression
Given a dataset $D = [x_1, \dots, x_N]$ and a relative target function $T = [t_1, \dots, t_N]$ following a Gaussian distribution $T \sim N(\mu, \sigma)$:

$$N(\mu, \sigma): \quad p(k; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(k-\mu)^2}{2\sigma^2}}$$

Solving the regression problem consists in finding the model $M(w)$ providing the output $g(x_i, w)$ that best approximates the target $T$:

$$t_i = g(x_i, w) + \epsilon_i, \quad \epsilon \sim N(0, \sigma_\epsilon^2) \;\Rightarrow\; t_i \sim N(g(x_i, w), \sigma_\epsilon^2)$$

$$p(t_i; g(x_i, w), \sigma_\epsilon^2) = \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}}\, e^{-\frac{(t_i - g(x_i, w))^2}{2\sigma_\epsilon^2}}$$

1. Log-likelihood:

$$L(\theta) = \log p(T; \theta) = \log \prod_{i=1}^{N} p(t_i) = \sum_{i=1}^{N} \log p(t_i) = N \log\!\left(\frac{1}{\sqrt{2\pi\sigma_\epsilon^2}}\right) - \frac{1}{2\sigma_\epsilon^2} \sum_{i=1}^{N} (t_i - g(x_i, w))^2$$

2. Find the parameters that maximize the log-likelihood:

$$w^* = \arg\max_w \left( N \log\!\left(\frac{1}{\sqrt{2\pi\sigma_\epsilon^2}}\right) - \frac{1}{2\sigma_\epsilon^2} \sum_{i=1}^{N} (t_i - g(x_i, w))^2 \right) = \arg\min_w \sum_{i=1}^{N} (t_i - g(x_i, w))^2$$

→ that is the Sum of Squared Errors.

Conclusion: the sum of squared errors is the optimal loss function in regression problems where the target is a set of i.i.d. samples following a Gaussian distribution (linear mean and constant variance) that is properly approximated by the regression line = the prediction error is white noise. Otherwise it is a sub-optimal solution.

Classification
Given a dataset $D = [x_1, \dots, x_N]$ and a relative binary target function $T = [t_1, \dots, t_N]$, where $t_i \in \{0, 1\}\ \forall i$, consisting of a set of i.i.d. samples following a Bernoulli distribution $T \sim Be(q)$:

$$Be(q): \quad p(k; q) = q^k (1-q)^{(1-k)} \quad \text{with } k \in \{0, 1\}$$

The aim is to define a model for classification whose output approximates the probability of the target:

$$p(t_i \mid x_i) = g(x_i, w), \quad t_i \sim Be(g(x_i, w)) \;\Rightarrow\; p(t_i \mid g(x_i, w)) = g(x_i, w)^{t_i} (1 - g(x_i, w))^{(1 - t_i)}$$

1. Log-likelihood:

$$L(\theta) = \log p(T; \theta) = \sum_{i=1}^{N} \log\!\left( g(x_i, w)^{t_i} (1 - g(x_i, w))^{(1-t_i)} \right) = \sum_{i=1}^{N} t_i \log(g(x_i, w)) + (1 - t_i) \log(1 - g(x_i, w))$$

2. Find the optimal parameters:

$$w^* = \arg\max_w L(w) = \arg\min_w \left( -\sum_{i=1}^{N} t_i \log(g(x_i, w)) + (1 - t_i) \log(1 - g(x_i, w)) \right)$$

Conclusion: the binary cross-entropy is the optimal loss function for a binary classification problem whose targets are i.i.d. samples following a Bernoulli distribution. Otherwise it is still a sub-optimal option.
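The Gaussian-MLE/SSE equivalence is easy to check numerically; a sketch with toy data and a constant model g:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
t = 2.0 + rng.normal(0.0, sigma, size=100)   # i.i.d. Gaussian targets

candidates = np.linspace(0.0, 4.0, 401)      # candidate constant outputs g
loglik = np.array([np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                          - (t - g) ** 2 / (2 * sigma**2))
                   for g in candidates])
sse = np.array([np.sum((t - g) ** 2) for g in candidates])

# same optimum: maximizing the Gaussian log-likelihood == minimizing the SSE
assert np.argmax(loglik) == np.argmin(sse)
```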
Weights initialization
The final result of the gradient descent is affected by weight initialization.

Update rule: $\Delta w \propto \frac{\partial E(w)}{\partial w^{(1)}} = -2 \sum_{n=1}^{N} \dots \cdot g'(\cdot) \cdot w^{(2)} \cdot h'(\cdot) \cdot x$

If we initialize:
• $w = 0$ → $\Delta w = 0$ → no learning
• $w \sim N(0, \sigma^2)$: $\sigma_w \ll 1 \Rightarrow a \ll 1$, and $\sigma_w \gg 1 \Rightarrow a \gg 1$, then:
  RELU: $h(a) = 0$ and $h'(a) = 0$ → vanishing gradient (*); $h(a) \gg 1$ and $h'(a) = 1$ → exploding gradient
  SIGMOID: $h(a) \approx 1/2$ and $h'(a) \approx 1/4$ → vanishing gradient; $h(a) \to 1$ and $h'(a) \to 0$ → vanishing gradient
  TANH: $h(a) \approx 0$ and $h'(a) \approx 1$ → vanishing gradient (*); $h(a) \to 1$ and $h'(a) \to 0$ → vanishing gradient
(*) not true that it is exactly 0, but small weights lead to small inputs at each layer, so vanishing updates are more likely → solution: Xavier initialization or its variants

Xavier initialization and its variants
Initialize the weights so that the input passes through the layer with unchanged variance (= the layer's output variance equals the layer's input variance).
• if the neuron's activation function is linear and we assume $w_i$ and $x_i$ to be i.i.d.:

$$h(\cdot) = \sum_{i=1}^{I} w_i x_i \;\Rightarrow\; \mathrm{Var}[h(\cdot)] = \mathrm{Var}\!\left[\sum_{i=1}^{I} w_i x_i\right] = I \cdot \mathrm{Var}[w_i x_i] = I \cdot (\mu_w^2 \sigma_x^2 + \mu_x^2 \sigma_w^2 + \sigma_x^2 \sigma_w^2)$$

• if both x and w are zero-mean processes: $\mathrm{Var}[h(\cdot)] = I \cdot \sigma_x^2 \sigma_w^2$
> so, to build the network with layers that don't affect the input variance:

$$\mathrm{Var}[h(\cdot)] = \sigma_x^2 \;\Rightarrow\; \sigma_w^2 = \frac{1}{I} \;\Rightarrow\; w_0 \sim N\!\left(0, \frac{1}{I}\right)$$

≠ if the neuron's activation function is non-linear, or if the hypotheses don't hold, then $\mathrm{Var}[h(\cdot)] \neq \sigma_x^2$, but $w_0 \sim N(0, \frac{1}{I})$ will still provide good results.
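A quick numerical check of the variance-preservation argument (linear activation, zero-mean i.i.d. inputs and weights):

```python
import numpy as np

rng = np.random.default_rng(0)
I = 512                                     # fan-in of the layer

x = rng.normal(0.0, 1.0, size=(10_000, I))  # zero-mean, unit-variance inputs
w = rng.normal(0.0, np.sqrt(1.0 / I), I)    # Xavier: w ~ N(0, 1/I)

h = x @ w                                   # linear pre-activation
print(np.var(h))                            # ~1.0: input variance preserved
```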
Train and test sets
Subdivision of the available data into independent and strictly separated sets to assess model performance without overestimating it by testing the model on training data (if you assess the model performance on samples that were included in the training set, you are overestimating the model's performance on unseen data in future applications).
> TRAINING SET: must be used to find a parameter configuration → model fitting
> VALIDATION SET: must be used to assess the performance of the fitted parameter configuration and to tune the hyperparameters → model selection (go back and forth from training to validation until satisfied; once satisfied…)
> TEST SET: must be left aside until the very last evaluation (using the test set to validate the model during the design phase would lead to overfitting the test set too) → model assessment
➔ model selection on the validation set makes sense since the prediction error on the validation set is supposed to be very similar to the one on the test set. In some cases the performance can be slightly better on the validation set, due to the partial overfitting that results from adjusting the design on it. But it can also happen that the error is higher on the validation set than on the test set, due to the reduced size of the set, which might cause an unfavourable bias.

Split techniques
> RANDOM SPLIT: random sampling of the original dataset with defined proportions.
> BOOTSTRAPPING: random sampling with replacement of the original dataset. The final sets will have the a priori defined dimensions, but each observation can be present in a set multiple times. This is usually done to increase the number of training datapoints when dealing with small datasets in ML. NB: in DL it is much less common, since it doesn't make sense to perform DL on small datasets, unless using Transfer Learning or other techniques that exploit different mechanisms to learn the data-driven feature extractor.
➔ always STRATIFY the split in classification problems, especially when there is a significant class imbalance (= maintain the class distribution throughout the split)

Validation techniques
Possible solutions to perform validation.
NB: the test set must always be defined with a hold-out approach (just retain about 20% of the provided dataset for final evaluation).
• Hold-out: when you have a very big dataset available, you can simply retain a portion of the training data for validation (about 10-15%). NB: to rely on this approach, the dataset must be large enough to support the hypothesis that the original dataset contains enough variability to provide a reasonable degree of generalization in representing the phenomenon, and that the random, stratified split maintains this property (→ the reduced validation set is still a general representation of the phenomenon). Otherwise the split will be biased.
• K-fold cross-validation: to exploit all training data to learn the model, do not retain a validation set, but divide the training data into k groups (= k folds). Then, iteratively:
for i = 1, …, k → retain the i-th fold for validation and train the model on the remaining data → i-th parameter configuration $W_i$ → k models
Final prediction: use the model with $\mathrm{Avg}(W_i)$ → $p(D)$, OR use all the models → $p_i(D)$ and $p(D) = \mathrm{Avg}(p_i(D))$
NB: it is NOT a solution to use the parameter configuration that provided the best performance among the k ones, since in that case you would still end up with a biased estimation over a specific folding. So you must in some way combine the k models obtained.
NB: the more the folds, the lower the overall bias.
Best practice:
1. define a set of possible models [M1, M2, …]
2. perform k-fold cross-validation on each Mi and find an unbiased estimation of its prediction error (computing the mean error over the k predictions)
3. choose the best model Mj as the one that performs best according to the unbiased estimation of the prediction error
4. re-train Mj on the whole training data → obtain the final parameter configuration
5. test what you obtain on test data
• Leave-one-out cross-validation: the extreme case of k-fold CV with k = number of datapoints. This leads to performing N trainings on a dataset with N datapoints, obtaining N parameter configurations for the model.
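A minimal sketch of the k-fold estimate of the prediction error (step 2 of the best practice above); fit, predict and error are hypothetical callables standing in for any model:

```python
import numpy as np

def k_fold_cv_error(X, t, k, fit, predict, error, seed=0):
    """Unbiased prediction-error estimate: the mean error over k folds."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]                                   # retain i-th fold
        train = np.concatenate(folds[:i] + folds[i + 1:])
        params = fit(X[train], t[train])                 # train on the rest
        errs.append(error(predict(params, X[val]), t[val]))
    return np.mean(errs)                                 # mean over the k folds
```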
Batch Normalization
Pre-process the input data to remove the "covariate shift", which means applying a transformation to have:
- zero mean and reduced variance ("rescale") or unit variance ("whitening") → $\frac{x - \mu}{\sigma}$
- features that are not correlated with each other (a technique called "decorrelation")
➔ Batch Normalization layers are placed inside the network (after fully connected layers and non-linearities). BUT the covariate shift is not known a priori, thus the parameters regulating how much normalization is needed ($\gamma, \beta$) are not known a priori → they must be fit on data → batch normalization is included in the learning process → this is not a problem since it is a linear transformation, thus differentiable → gradient computation and backpropagation are still feasible.
➔ each batch normalization layer has 4 parameters (per channel), of which 2 are trainable ($\gamma$ and $\beta$) and 2 are computed on the input ($\mu$ and $\sigma$).
ATTENTION! Any statistic must be computed after the dataset split and only on the training set; then it can be applied to the validation and test sets.
➔ batch normalization can perform differently at test time, since the statistics used have been fitted on training data
➔ batch normalization has no overhead at inference time: once the parameters have been computed ($\mu$ and $\sigma$) or learned ($\gamma$ and $\beta$) on training data, only a shift and scale is performed (linear operator) at inference time, which can even be merged with the previous conv/dense layer

Benefits:
- get rid of bias and non-relevant information included in the specific batch
- improve training (the network will learn more quickly and efficiently, thanks to the possibility of using higher learning rates and improving the gradient flow through the network)
- reduce the dependence of the final result on weight initialization
- introduce a form of regularization that slightly reduces the need for dropout
- make the final model less sensitive to perturbations in parameters
- useful in gradient-based optimizers

TRAINING TIME
Dataset = $[B_1, \dots, B_k]$ → each training epoch goes over k mini-batches; for each $B_i = [x_1, \dots, x_m]$, compute the mini-batch mean and variance:

$$\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_j \quad \text{and} \quad \sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} (x_j - \mu_i)^2$$

normalize the input:

$$\hat{x}_j = \frac{x_j - \mu_i}{\sigma_i}$$

and apply a further scale-and-shift operation:

$$y_j = \gamma \hat{x}_j + \beta$$

The latter shift-and-scale is included to allow re-correcting the normalization, reducing or completely annulling its effect. Indeed, $\gamma$ and $\beta$ are parameters included in the learning process, so they can be learned to balance the normalization effect:

$$y_j = \gamma \hat{x}_j + \beta = \gamma\, \frac{x_j - \mu_i}{\sigma_i} + \beta, \quad \text{so if } \gamma = \sigma_i \text{ and } \beta = \mu_i \;\Rightarrow\; y_j = x_j$$

TEST TIME
At test time the batch is normalized using the global statistics estimated with the training running averages:

$$\mu_E = \mathrm{avg}(\mu_i) \quad \text{and} \quad \sigma_E^2 = \mathrm{avg}(\sigma_i^2)$$

and the learned parameters $\gamma$ and $\beta$ are used to weight the normalization effect.

BN in CNNs
Typically, BN layers in CNNs are placed between dense layers and activations, but sometimes also between convolutional layers. In this case, batch normalization is computed channel by channel. The most common approach in CNNs is to zero-center the data and to normalize every pixel. PCA/whitening is not commonly used in CNNs.
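A minimal sketch of the train/test behaviour described above (the eps term and the exponential running average are standard implementation details, not spelled out in the notes):

```python
import numpy as np

class BatchNorm:
    """Minimal per-feature batch normalization sketch (framework-free)."""
    def __init__(self, n_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(n_features)      # learnable scale
        self.beta = np.zeros(n_features)      # learnable shift
        self.mu_E = np.zeros(n_features)      # running averages of the
        self.var_E = np.ones(n_features)      # mini-batch statistics
        self.m, self.eps = momentum, eps

    def forward(self, x, training):
        if training:                          # use mini-batch statistics
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.mu_E = self.m * self.mu_E + (1 - self.m) * mu
            self.var_E = self.m * self.var_E + (1 - self.m) * var
        else:                                 # use global training statistics
            mu, var = self.mu_E, self.var_E
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta # final shift-and-scale
```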
Overfitting prevention

Generalization
The basic hypothesis in phenomenon modelling is: "A model that has been properly fit on a reasonably large dataset is able to represent the phenomenon with a sufficient degree of generalization to approximate also the behaviour of unseen samples".
> a model must provide a general representation of the phenomenon
How to measure the model's generalization power? By testing the model on an independent dataset. The model's error on training data:
∟ is not a good metric of the model performance on real data, since it includes the noise on which the model has been trained, while testing the performance on independent data provides a realistic idea of the generalization power of the model (= validation set)
> at fixed complexity it is likely to have higher performance on training than on validation; this is because the model has fit some specific patterns in the training data that are probably not part of the underlying phenomenon (or not included in the validation set, or vice versa if the model lacks something)
∟ is expected to tend to zero as the model complexity increases, while the error on the test set is expected to have a minimum at the optimal model complexity and then to increase again due to overfitting on the training data
> the optimal solution would be to have a completely independent dataset, but this is often not the case… The classical approach is to divide the provided dataset into strictly separated sets, each one to be used in a different phase (training, validation, test).

Model's complexity
To have generalization, it is fundamental to use a model of proper complexity.
Ockham's principle of parsimony applied to model complexity: if 2 models (with different complexity) have roughly the same performance, then it is good practice to prefer the simpler one. This is because:
∟ the higher complexity provides more degrees of freedom to fit the data
∟ thus, if the model complexity is much higher than the intrinsic complexity of the problem, the risk is to strictly represent the given dataset in a specific way (overfitting) instead of identifying the general input-output mechanism underlying the observed phenomenon
∟ moreover, real data are known to be noisy, and an overrated number of parameters would also allow for noise fitting (= including noise in the learned representation)

Solutions to prevent overfitting
1. Early stopping
Considering the trend of the validation error against the training error along the epochs:
∟ training consists in adjusting the model parameters to better represent the problem epoch by epoch (fixed complexity)
∟ the prediction error is expected to progressively decrease thanks to gradient descent, until the local minimum of the loss is found
∟ however, it can happen that this minimum loss is given by a parameter configuration that corresponds to an overfitted model, and that the maximum generalization power was reached earlier
∟ so, an effective method to prevent the degeneration of the parameter configuration towards overfitting is to monitor both the training and the validation error at each epoch → ONLINE ESTIMATION OF THE GENERALIZATION ERROR
∟ to identify early a further decrease of the training error that is not reflected in the validation error
∟ i.e., the moment when the model starts to overfit; stop once the configuration that provided the best performance on validation data has been reached = stop before the max number of epochs has been reached if the next iterations are expected only to reduce the generalization power of the model
➔ in practice this is performed by setting (see the sketch after this list):
max_iter = …
metric_to_be_monitored = validation error or accuracy
patience = how many epochs you are willing to wait with no improvements before asserting that the validation trend has reached a plateau (won't improve), so as to stop the training
… but with this technique you are still retaining a portion of the training data to be used for validation, so you are still not exploiting all datapoints to learn the parameters…
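A minimal patience-based sketch of the mechanism; `model` is a hypothetical object exposing train_one_epoch(), validation_error() and get/set_weights():

```python
def train_with_early_stopping(model, max_iter=500, patience=10):
    """Stop training once the validation error has stopped improving."""
    best_err, best_w, waited = float("inf"), None, 0
    for epoch in range(max_iter):
        model.train_one_epoch()
        err = model.validation_error()         # metric_to_be_monitored
        if err < best_err:                     # validation still improving
            best_err, best_w, waited = err, model.get_weights(), 0
        else:
            waited += 1
            if waited >= patience:             # plateau reached -> stop
                break
    model.set_weights(best_w)                  # restore the best configuration
    return model
```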
2. Regularization
The technique is based on the assumptions that:
- overfitting is introduced when the model's complexity is much higher than the problem complexity (because the extra degrees of freedom are employed to fit non-relevant oscillations in the data)
- if some neurons are forced to be ineffective, then the complexity of the model is reduced, even if the actual structure is not simplified in terms of number of neurons (if a neuron has an almost null effect on the net, it is as if it weren't there)
➔ introducing a factor that constrains the model's freedom during training, forcing the ineffectiveness of some neurons in some way, will lead to simpler models and a lower likelihood of overfitting

L2 regularization on weights
Also known as WEIGHT DECAY, it is a method to prevent overfitting based on regularization with the L2 norm:

$$\mathrm{Loss}_R(w) = \mathrm{Loss}(w) + \gamma \, \lVert w \rVert_2^2$$

Idea: if the weight of a connection is forced to be close to zero, then the related hidden neuron becomes less effective, resulting in simpler models.
➔ introducing a factor in the loss function that keeps the weights around small values along the updates will help to keep the learned representation simple and to avoid overfitting
➔ moreover, small weight updates will also result in a smoother gradient descent
➔ the regularization factor introduced in the loss function corresponds to an a priori assumption on the weight distribution (Normal distribution, zero mean and variance $\sigma_w^2$): Maximum A-Posteriori estimation (MAP): find the parameters that maximize the likelihood of the weights given the data, knowing that the weights follow a specific distribution.

Regularization rate ($\gamma$)
It defines the strength of the regularization by imposing the proportion between the weight-variance upper bound and the data variance:

$$\sigma_w^2 = \frac{1}{\gamma} \sigma^2 \;\Rightarrow\; \begin{cases} \gamma = 0 \Rightarrow \sigma_w^2 \to \infty \Rightarrow \text{no regularization} \\ \gamma < 1 \Rightarrow \sigma_w^2 > \sigma^2 \Rightarrow \text{weak regularization} \\ \gamma > 1 \Rightarrow \sigma_w^2 < \sigma^2 \Rightarrow \text{strong regularization} \end{cases}$$

Regularization reduces the importance given to datapoints during the weight adjustment = the value of gamma is inversely proportional to the importance given to the data in the gradient descent.
➔ the value of gamma for each regularized layer is an additional hyperparameter to be tuned

Dropout (or Stochastic Regularization)
Randomly turning off neurons in the network at each iteration. This prevents neuron co-adaptation (= prevents neurons from learning similar representations, which would use more degrees of freedom than actually needed to model the problem).
Dropout rate: the percentage of neurons in each hidden layer to be switched off at each iteration ➔ a hyperparameter to be tuned
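Minimal sketches of the two regularizers (the inverted-dropout rescaling is a common convention, not spelled out in the notes):

```python
import numpy as np

def l2_regularized_loss(loss, weights, gamma):
    """Loss_R(w) = Loss(w) + gamma * ||w||_2^2 (weight decay)."""
    return loss + gamma * sum(np.sum(W ** 2) for W in weights)

def dropout(h, rate, training, rng=np.random.default_rng(0)):
    """Switch off a fraction `rate` of the hidden units at each iteration
    (inverted-dropout rescaling keeps the expected activation unchanged,
    so nothing needs to be done at test time)."""
    if not training:
        return h
    mask = (rng.random(h.shape) >= rate).astype(h.dtype)
    return h * mask / (1.0 - rate)
```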
Hyperparameter tuning

Hyperparameter tuning with early stopping
The traditional early stopping approach applies to epochs during fitting, but a similar approach can be used to prevent overfitting also when setting the model complexity. We know that the higher the complexity, the lower the training error, but also that the test/validation error doesn't share the same trend: the higher the complexity, the more prone the model is to overfitting. So, to define a model complexity that is high enough to properly describe the intrinsic complexity of the problem, without leaving too many degrees of freedom to be fitted on non-informative oscillations in the data, we can use the online generalization-error estimation also over the model complexity. How?
NB: the pipeline is valid both with and without early stopping at point 3.

Hyperparameter search
• GRID SEARCH
Grid search is a method for hyperparameter tuning that involves specifying a set of possible values for each hyperparameter, and then training and evaluating a model for each combination of hyperparameter values. The goal is to find the combination of hyperparameter values that results in the best performance on a validation set. It is called grid search because it creates a grid with the different values of each hyperparameter as the rows and columns, and it checks the performance of the model at every intersection. It is simple to implement, but it can be computationally expensive and time-consuming, especially when the number of hyperparameters or the number of possible values for each hyperparameter is large.
• RANDOM SEARCH
Random search is a method for hyperparameter tuning that involves specifying a range of possible values for each hyperparameter, sampling random combinations within those ranges, and training and evaluating a model for each combination. The goal is the same: find the combination of hyperparameter values that results in the best performance on a validation set. One of the advantages of random search is that it has been shown to perform better than grid search in some cases, particularly when the number of hyperparameters or the number of possible values for each hyperparameter is large. This is because random search explores a wider range of the hyperparameter space and is less likely to get stuck in suboptimal regions. It also requires less computation time than grid search, as it doesn't check all combinations. However, it can be less efficient than grid search if the number of iterations is low and doesn't cover the entire search space.
• BAYESIAN SEARCH
Bayesian optimization is a method for hyperparameter tuning that uses Bayesian inference to model the underlying function that maps hyperparameter values to the performance of the model on a validation set. The algorithm starts by defining a prior probability distribution over the hyperparameter space. As it evaluates the model with different hyperparameter values, it updates the prior distribution to form a posterior distribution based on the observed data (performance on the validation set). Then it samples the next set of hyperparameters from the posterior distribution, and trains and evaluates the model with those hyperparameters. This process continues for a specified number of iterations or until a stopping criterion is met. The algorithm tries to balance exploration of the hyperparameter space with exploitation of the regions where good performance is likely to be found. One of the advantages of Bayesian optimization is that it can converge faster and find better solutions than grid or random search, because it uses the information from previous evaluations to inform the next set of evaluations. It can also handle constraints, and it can be useful when the number of hyperparameters is large and the search space is complex. However, it can be more computationally intensive than grid search or random search, and it requires a good understanding of Bayesian inference to implement.
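A minimal random-search sketch; `train_and_validate(hp) -> validation score` is a placeholder for fitting a model with hyperparameters hp and scoring it on the validation set, and the search space shown is an illustrative example:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search(train_and_validate, n_iter=20):
    """Sample random hyperparameter combinations and keep the best one."""
    best_hp, best_score = None, -np.inf
    for _ in range(n_iter):
        hp = {                                      # example search space
            "lr": 10 ** rng.uniform(-5, -1),        # log-uniform learning rate
            "n_hidden": int(rng.integers(16, 513)),
            "dropout_rate": rng.uniform(0.0, 0.6),
        }
        score = train_and_validate(hp)
        if score > best_score:
            best_hp, best_score = hp, score
    return best_hp, best_score
```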
Model evaluation

Computer vision

Digital images
● higher-dimensional images ● medical MRI ● videos

Spatial transformations

Linear spatial filters
Filtering consists in a convolution between a given image and a filter (or kernel). Convolution is defined as correlation up to a filter flip (as in signal theory), so it is important to consider this flip when designing filters for image processing and to use them consistently. NB: in CNN arithmetic this flip is not considered, since filters are learned from data, so it is only important to use what the net learned in a consistent way.

Padding
The operation of extending the input with default values, in this context specifically to handle local transformations at the boundaries (where the neighbourhood exceeds the image dimension).

Binary target matching
Linear classifiers

Issues in image classification
Is image classification a challenging problem? Yes, it is; the main challenges are illustrated in the original slides (figure not reproduced in these notes).

Convolutional Neural Networks (CNNs)

Data-driven feature extractor
So far, we understood that using each pixel as an independent feature is inefficient and ineffective. Thus, we need to extract some relevant features from images on which classification can be based.
Option 1: Hand-crafted features
pros:
• exploit a priori / expert information
• features are interpretable (you might understand why they are not working)
• you can adjust features to improve your performance
• limited amount of training data needed
• you can give more relevance to some features
cons:
• requires a lot of design/programming effort
• not viable in many visual recognition tasks that are easily performed by humans (e.g., when dealing with natural images)
• risk of overfitting the training set used in the feature design
• not very general and "portable"
Option 2: Data-driven feature extractor
A NN using convolutional layers and other techniques to concentrate information into a reduced spatial extent, in a way that is optimal for the specific task. CNNs are typically made of blocks of:
CONVOLUTIONAL LAYER → ACTIVATION → POOLING
> padding and pooling affect the width and height of the output
> the number of filters in the conv layer determines the depth

In numbers…
Given:
- an input image of dimension R×C with N channels → $I: (R \times C, N)$
- a conv layer with K filters A×B with stride (S, S) → $f: (A \times B, N)$ [each filter has as many channels as the input map]
Number of parameters:
- each filter has an A×B matrix of weights for each channel, plus a bias → $N(A \cdot B) + 1$ weights
- total for K filters → $K \cdot (A \cdot B \cdot N + 1)$
Output:
- each filter is convolved with the image, producing one channel of the output → K output channels
- padding defines how R×C is reduced to R'×C' (assuming A, B odd):
SAME → $R' = R$ and $C' = C$
VALID → $R' = R - (A - 1)$ and $C' = C - (B - 1)$
FULL → $R' = R + A - 1$ and $C' = C + B - 1$
- a stride equal to S means that in both directions we take one output value every S input values → $O: (\frac{R'}{S} \times \frac{C'}{S}, K)$
** 75% of the values are discarded because only 25% are kept: the final output is R/2 × C/2 × K, i.e. ¼ of the initial number of values R × C × K

Architecture

Global Average Pooling (GAP) layers
An example to highlight spatial invariance is shown in the original slides.

Sparsity and weight sharing
A CNN can be seen as a Multi-Layer Perceptron with sparse connectivity and weight sharing.
MLP: dense connectivity = each neuron is connected with all neurons of the previous layer, so its output depends on all inputs.
CNN: unfolding the structure, the similarity emerges.
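A small calculator for the output size and parameter count, following the simplified formulas above (the stride rule R'/S is the notes' convention; hypothetical helper):

```python
def conv_layer(R, C, N, K, A, B, S=1, padding="valid"):
    """Output size and parameter count for a conv layer (A, B odd)."""
    n_params = K * (A * B * N + 1)            # K filters: AxBxN weights + bias
    if padding == "same":
        Rp, Cp = R, C
    elif padding == "valid":
        Rp, Cp = R - (A - 1), C - (B - 1)
    else:                                     # "full"
        Rp, Cp = R + A - 1, C + B - 1
    return (Rp // S, Cp // S, K), n_params    # one output channel per filter

# e.g. a 32x32 RGB image through 16 filters 3x3, stride 1, valid padding
print(conv_layer(32, 32, 3, 16, 3, 3))        # ((30, 30, 16), 448)
```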
Receptive field (RF)
The receptive field is the region of the input image that a particular neuron in a convolutional layer is "looking at", i.e. receiving information from. The output of shallow layers has a smaller receptive field than that of deeper ones, since the aim of convolutional blocks is to shrink information into reduced spatial extents, so as to take information coming from wide areas of the input image and compress it into small areas of the output volume.
• Convolutional layers, general formula: $RF_{i-1} = s_i \cdot RF_i + (N - s_i)$, with N the kernel size and $s_i$ the stride of layer i
• Pooling layers, general formula (identical, with N the pooling size): $RF_{i-1} = s_i \cdot RF_i + (N - s_i)$
Exercises: (note that stride = 1 is implicit for each conv layer and stride = 2 for each max-pooling)

Training procedure

$$MP(x) = \begin{cases} x & \text{at the max location} \\ 0 & \text{at other locations} \end{cases} \qquad MP'(x) = \begin{cases} 1 & \text{at the max location} \\ 0 & \text{at other locations} \end{cases}$$

… keep track of the max location and define the derivative accordingly.

CNN issues
The representations of images that CNNs learn are effective → they manage to grasp perceptual dissimilarity, but they are likely to be NON-INTERPRETABLE and defined in very HIGH-DIMENSIONAL SPACES. This can be a problem in terms of interpretability.
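A minimal sketch of the backward RF computation using the formula above (hypothetical helper; `layers` lists (kernel_size, stride) from input to output):

```python
def receptive_field(layers):
    """Apply RF_{i-1} = s_i * RF_i + (N - s_i) from the deepest layer back."""
    rf = 1                                    # a single output neuron
    for n, s in reversed(layers):
        rf = s * rf + (n - s)
    return rf

# e.g. two 3x3 convs (stride 1) followed by a 2x2 max-pooling (stride 2)
print(receptive_field([(3, 1), (3, 1), (2, 2)]))   # -> 6
```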