DIPARTIMENTO DI ELETTRONICA, INFORMAZIONE E BIOINGEGNERIA
Politecnico di Milano

Machine Learning (Code: 097683)
January 27, 2025
Surname:            Name:            Student ID:            Row:        Column:
Time: 2 hours        Maximum Marks: 33

• The following exam is composed of 8 exercises (one per page). The first page must be filled in with your name, surname, and student ID. The following pages should be used only inside the large squares present on each page. Any solution provided outside these spaces, or without a motivation, will not be considered for the final mark.
• During this exam, you are not allowed to use electronic devices, such as laptops, smartphones, tablets, and/or similar. Likewise, you are not allowed to bring with you any notes, books, written schemes, and/or similar. You are also not allowed to communicate with other students during the exam.
• The first reported violation of the above-mentioned rules will be annotated on the exam and will be considered in the final mark decision. The second reported violation will imply the immediate expulsion of the student from the exam room and the annulment of the exam.
• You can write the exam either with a pen (black or blue) or a pencil. It is your responsibility to provide a readable solution. We will not be held accountable for accidental partial or total cancellation of the exam.
• The exam can be written either in English or in Italian.
• You are allowed to withdraw from the exam at any time without any penalty. You are allowed to leave the room no earlier than half the duration of the exam. You are not allowed to keep the exam text with you when leaving the room.
• Three of the points will be given based on how quickly you solve the exam. If you finish earlier than 30 min before the end of the exam, you will get 3 points; earlier than 20 min, 2 points; earlier than 10 min, 1 point (the points cannot be accumulated).
• The box on Page 10 can only be used to complete Exercises 7 and/or 8.

         Ex. 1  Ex. 2  Ex. 3  Ex. 4  Ex. 5  Ex. 6  Ex. 7  Ex. 8  Time   Tot.
Marks:    / 7    / 7    / 2    / 2    / 2    / 2    / 4    / 4    / 3    / 33

Exercise 1 (7 marks)

Describe and compare Ridge regression and Lasso regression.

Exercise 2 (7 marks)

Describe the logistic regression model and how it is trained.

Exercise 3 (2 marks)

Consider the following snippet of code:

1  X1 = np.column_stack((dataset[['feature1']].values, dataset['feature2']))
2  X2 = np.column_stack((dataset[['feature1']].values, dataset['feature2']**2))
3  y = dataset['target'].values
4
5  model1 = LinearRegression()
6  model1.fit(X1, y)
7  mse1 = mean_squared_error(y, model1.predict(X1))
8
9  model2 = LinearRegression()
10 model2.fit(X2, y)
11 mse2 = mean_squared_error(y, model2.predict(X2))
12
13 best_model = model1 if mse1 < mse2 else model2

1. Describe the procedure performed in the code above, detailing line by line the operations and their purpose.
2. Is the process used in the code correct? Explain why or why not. If incorrect, propose a fix.
3. Compare and contrast the two models (model1 and model2) in terms of flexibility and risk of overfitting.

Solution:

1. The code fits two linear regression models:
   • Line 1 constructs X1, which consists of feature1 and feature2.
   • Line 2 constructs X2, which consists of feature1 and the square of feature2, introducing a nonlinear transformation of feature2.
   • Line 3 extracts the target variable y.
   • Lines 5–7 train a linear regression model (model1) on X1 and compute its mean squared error (MSE) on the training set.
   • Lines 9–11 train another linear regression model (model2) on the transformed dataset X2 and compute its MSE.
   • Line 13 selects the model with the lower MSE as the best model.
2. The process of using training MSE to select the best model is flawed: evaluating performance on the training set does not provide a reliable estimate of the generalization error. A validation set or cross-validation should be used to select the best model (a sketch of this fix is given below).
3. The two models are not directly comparable, since there is no inclusion relation between the sets of features considered by the two models. We cannot say which one is more flexible or more prone to overfitting.
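The fix proposed in point 2 of the solution can be sketched as follows. This is a minimal, self-contained illustration (not part of the exam text): it rebuilds the two feature matrices and chooses between them by 5-fold cross-validation. The toy DataFrame and its contents are hypothetical stand-ins for the dataset in the snippet.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the exam's `dataset`, for illustration only.
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=200), rng.normal(size=200)
dataset = pd.DataFrame({'feature1': f1, 'feature2': f2,
                        'target': 2 * f1 + f2 ** 2 + rng.normal(scale=0.1, size=200)})

X1 = np.column_stack((dataset[['feature1']].values, dataset['feature2']))
X2 = np.column_stack((dataset[['feature1']].values, dataset['feature2'] ** 2))
y = dataset['target'].values

# 5-fold cross-validated MSE (sklearn returns negated MSE, so flip the sign).
cv_mse1 = -cross_val_score(LinearRegression(), X1, y,
                           scoring='neg_mean_squared_error', cv=5).mean()
cv_mse2 = -cross_val_score(LinearRegression(), X2, y,
                           scoring='neg_mean_squared_error', cv=5).mean()

# Select on the estimated generalization error, then refit on all the data.
best_X = X1 if cv_mse1 < cv_mse2 else X2
best_model = LinearRegression().fit(best_X, y)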
Exercise 4 (2 marks)

State whether the following statements about Monte Carlo (MC) and Temporal Difference (TD) estimation methods are true or false. Motivate your answers.

1. MC methods are model-free, while TD methods require a model of the environment.
2. MC methods are unbiased, while TD methods are biased.
3. Given an episode of length T, you can generate T samples for the MC estimation of the value function.
4. MC estimation works better than TD estimation if the problem is not Markovian.

Solution:

1. False. Both MC and TD methods are model-free: they do not require a model of the environment (i.e., knowledge of the transition probabilities or the reward function), but learn directly from sampled interactions with it.
2. True. MC methods are unbiased because they compute the return from the complete episode, which is an unbiased sample of the true expected value under the given policy. TD methods, on the other hand, are biased because they bootstrap, i.e., they estimate the value function using estimates of future values rather than the true future values (see the sketch after this exercise).
3. True. MC methods generate one return sample at each time step t of the episode. For an episode of length T you can therefore generate T samples, one for each state visited during the episode; each sample is the cumulative reward from time t until the end of the episode.
4. True. MC methods do not rely on the Markov property, since they compute the return from the actual sequence of rewards observed in an episode. In contrast, TD methods assume the Markov property because they use the value of the next state in their bootstrapping process. If the problem is not Markovian, the TD value estimates may be inaccurate due to the violated assumption, making MC the better choice.
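To make the contrast in statement 2 concrete, here is a minimal sketch (not exam material) of the two update rules on a small random-walk chain; the chain, step size, and episode count are assumptions chosen only for this example.

import numpy as np

rng = np.random.default_rng(0)
N_STATES, GAMMA, ALPHA = 5, 1.0, 0.05
V_mc, V_td = np.zeros(N_STATES), np.zeros(N_STATES)

def episode():
    """Random walk from the middle state; reward 1 only when exiting to the right."""
    s, traj = N_STATES // 2, []
    while 0 <= s < N_STATES:
        s2 = s + rng.choice((-1, 1))
        r = 1.0 if s2 == N_STATES else 0.0
        traj.append((s, r, s2))
        s = s2
    return traj

for _ in range(5000):
    traj = episode()
    # MC: move each visited state toward the full observed return G_t
    # (an unbiased sample of the true value).
    G = 0.0
    for s, r, _ in reversed(traj):
        G = r + GAMMA * G
        V_mc[s] += ALPHA * (G - V_mc[s])
    # TD(0): move toward r + gamma * V(s'), bootstrapping on the current
    # estimate of the next state's value (biased, but lower variance).
    for s, r, s2 in traj:
        v_next = V_td[s2] if 0 <= s2 < N_STATES else 0.0
        V_td[s] += ALPHA * (r + GAMMA * v_next - V_td[s])

print("MC estimates:   ", np.round(V_mc, 2))   # true values are 1/6, ..., 5/6
print("TD(0) estimates:", np.round(V_td, 2))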
Exercise 5 (2 marks)

Are the following statements about Support Vector Machines (SVMs) true or false? Motivate your answers.

1. When using an SVM, the computational cost of computing predictions scales with the number of training samples.
2. When training a soft-margin SVM, the noisier the data, the larger the value that should be chosen for the hyperparameter C.
3. Hard-margin SVMs can be successfully applied also to datasets that do not appear to be linearly separable.
4. The aim of the kernel trick is to limit the computational cost of SVM training on datasets with a very large number of samples.

Solution:

1. FALSE: SVMs are sparse models, and the prediction cost scales with the number of support vectors, not with the number of training samples.
2. FALSE: the larger C is, the lower the bias (and, conversely, the higher the variance). This is exactly the opposite of what we should do on a noisy dataset, where a smaller C is preferable.
3. TRUE: by applying a kernel function, the problem may turn out to be linearly separable in the induced feature space (see the sketch below).
4. FALSE: the computational cost of training an SVM is, unfortunately, cubic in the number of training samples. What the kernel trick limits is the cost of computing a large (possibly infinite) number of features.
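Statements 1 and 3 can be illustrated empirically. The following minimal sketch (an assumed example, not part of the exam) fits an RBF-kernel SVC on a synthetic dataset that is not linearly separable in input space; the dataset and hyperparameter values are arbitrary choices for illustration.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the input space.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

# The RBF kernel makes the classes separable in the induced feature space.
clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)
print("training accuracy:", clf.score(X, y))

# Predictions only involve the stored support vectors, not all 200 samples.
print("support vectors per class:", clf.n_support_)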
Exercise 6 (2 marks)

You work as a data scientist for a wildlife conservation organization. Your goal is to predict the health condition of birds in a sanctuary based on non-invasive measurements such as wing span, weight, beak length, and feather reflectance. Your organization has limited computational resources, but it can count on a large dataset labeled by expert ornithologists, who assessed the birds' health conditions based on medical tests and observations. For each of the following scenarios, explain how you would model this task as a machine-learning problem and propose the method you would use to solve it.

• The health condition of the birds is represented as a percentage, where 0% means the bird is severely ill and 100% means the bird is in perfect health.
• Recommend one of several predefined treatments for the birds (e.g., "Dietary Change," "Medical Checkup," "No Intervention") to improve their health. The actual impact of the treatments is unknown and can only be evaluated after application (e.g., by observing how the bird's condition changes over time).

Solution:

1. This is a regression problem in which the target variable (the health condition) is a continuous value ranging from 0 to 100, and the input variables are the measured features (wing span, weight, etc.). Given the limited computational resources, a parametric method such as linear regression is recommended, possibly with regularization to prevent overfitting. Since the dataset is large, stochastic gradient descent could be used for training.
2. This is a Multi-Armed Bandit (MAB) problem in which each "arm" corresponds to a possible treatment (e.g., "Dietary Change," "Medical Checkup," "No Intervention"). The goal is to maximize the birds' health improvement while dynamically learning the effectiveness of each treatment from contextual features (e.g., the measurements of the bird). A contextual bandit approach would be appropriate here, with features such as wing span, weight, and beak length guiding the decision. Algorithms like Thompson Sampling or Upper Confidence Bound (UCB), adapted to contextual data, could be used: they balance exploration (trying different treatments to gather more information) and exploitation (choosing the treatment currently estimated to be the most effective).

Exercise 7 (4 marks)

Consider an algorithm for a MAB with binary rewards, choosing an arm $A_t \in \{a_1, a_2\}$ and obtaining the reward $R_t$ at each round $t$ over a time horizon of $T = 10$ rounds. The table shows $R_t$ (recall that only the reward of the chosen arm is revealed to the algorithm):

t                   1   2   3   4   5   6   7   8   9   10
Rewards from a_1    1   0       1   1       0       1
Rewards from a_2            1           0       1       0

1. Knowing that the expected reward of $a_1$ is $\mu_1 = 0.7$ and that of $a_2$ is $\mu_2 = 0.3$, compute the expected regret over the time horizon $T$.
2. Compute the regret over the time horizon $T$, or state what extra information you would need to compute it. Recall that the regret is computed over realizations, while the expected regret is computed over expected rewards.
3. Which arm would UCB1 play at the next round $t = 11$? Consider that $\sqrt{\ln(10)/3} \simeq 0.87$ and $\sqrt{\ln(10)/2} \simeq 1.07$. Motivate your answer.
4. What can you say about the arm that Thompson Sampling (initialized with a uniform prior at $t = 0$) would play at the next round $t = 11$?

Solution:

1. The first arm is optimal, so the expected regret can be computed as $\sum_{t=1}^{10} (\mu_1 - \mu_{A_t}) = 0 + 0 + 0.4 + 0 + 0 + 0.4 + 0 + 0.4 + 0 + 0.4 = 1.6$, or, more simply, since the only suboptimal arm has been played 4 times, as $4 \times (\mu_1 - \mu_2) = 1.6$.
2. To compute the (random) regret, we would need, at each round, the realization of the reward of the arm that was not played.
3. The UCB1 indices of the two arms are $U_1 = \frac{4}{6} + \sqrt{\frac{2\ln(10)}{6}} \simeq 1.54$ and $U_2 = \frac{2}{4} + \sqrt{\frac{2\ln(10)}{4}} \simeq 1.57$, so UCB1 would play the second arm (the computation is reproduced in the sketch below).
4. We cannot know with certainty which arm TS will play. The posterior distributions over the expected rewards are Beta(α = 5, β = 3) for arm $a_1$ and Beta(α = 3, β = 3) for $a_2$. The first is more skewed toward high rewards, so it is more likely that TS will play $a_1$.
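For points 3 and 4, the following minimal sketch (illustrative, not exam material) recomputes the UCB1 indices at t = 11 and estimates, by sampling the two Beta posteriors, how often Thompson Sampling would play each arm; the variable names and the number of posterior draws are arbitrary.

import numpy as np

t = 10                              # rounds played so far
pulls = {'a1': 6, 'a2': 4}          # times each arm was played
successes = {'a1': 4, 'a2': 2}      # total reward collected by each arm

# UCB1 index: empirical mean + sqrt(2 ln(t) / n_pulls).
for arm in ('a1', 'a2'):
    mean = successes[arm] / pulls[arm]
    bonus = np.sqrt(2 * np.log(t) / pulls[arm])
    print(f"U_{arm} = {mean + bonus:.2f}")
# The indices come out as roughly 1.54 for a1 and 1.57 for a2, so UCB1 plays a2.

# Thompson Sampling with a uniform Beta(1, 1) prior: after s successes in
# n pulls, the posterior is Beta(1 + s, 1 + n - s).
rng = np.random.default_rng(0)
draws1 = rng.beta(1 + 4, 1 + 2, size=100_000)   # Beta(5, 3) for a1
draws2 = rng.beta(1 + 2, 1 + 2, size=100_000)   # Beta(3, 3) for a2
print("estimated P(TS plays a1):", (draws1 > draws2).mean())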
Exercise 8 (4 marks)

Consider a dataset with 2-dimensional feature vectors $x_i = (x_{i1}, x_{i2})$ and corresponding binary labels $y_i \in \{-1, 1\}$ for $i \in \{1, 2, 3, 4\}$:

$D = \{((1,1), -1), ((2,3), 1), ((3,3), -1), ((4,1), 1)\}$

You are required to apply kernelized K-Nearest Neighbors (KNN) with $K = 3$ to classify the new data point $x_{new} = (2, 2)$ using the following kernels:

1. Linear kernel: $k(x_i, x_j) = x_i^\top x_j$;
2. Polynomial kernel: $k(x_i, x_j) = (x_i^\top x_j + 1)^2$;
3. Gaussian kernel (RBF): $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, with $\sigma = 1$.

Report and comment on all the steps.

Solution:

1. Linear kernel:
   $k(x_{new}, x_1) = x_{new}^\top x_1 = (2)(1) + (2)(1) = 4$
   $k(x_{new}, x_2) = x_{new}^\top x_2 = (2)(2) + (2)(3) = 10$
   $k(x_{new}, x_3) = x_{new}^\top x_3 = (2)(3) + (2)(3) = 12$
   $k(x_{new}, x_4) = x_{new}^\top x_4 = (2)(4) + (2)(1) = 10$
   The similarities with the training points are 4, 10, 10, 12. The three closest neighbors are $x_3$, $x_4$, and $x_2$, with labels $-1$, $1$, $1$ respectively. Therefore, the predicted label for $x_{new}$ is 1 (majority label).
2. Polynomial kernel:
   $k(x_{new}, x_1) = ((2)(1) + (2)(1) + 1)^2 = (4 + 1)^2 = 25$
   $k(x_{new}, x_2) = ((2)(2) + (2)(3) + 1)^2 = (10 + 1)^2 = 121$
   $k(x_{new}, x_3) = ((2)(3) + (2)(3) + 1)^2 = (12 + 1)^2 = 169$
   $k(x_{new}, x_4) = ((2)(4) + (2)(1) + 1)^2 = (10 + 1)^2 = 121$
   The similarities with the training points are 25, 121, 121, 169. The three closest neighbors are $x_3$, $x_2$, and $x_4$, with labels $-1$, $1$, $1$ respectively. Therefore, the predicted label for $x_{new}$ is 1 (majority label).
3. Gaussian kernel:
   $k(x_{new}, x_1) = \exp\left(-\frac{(2-1)^2 + (2-1)^2}{2(1)^2}\right) = \exp\left(-\frac{1+1}{2}\right) = \exp(-1) \approx 0.3679$
   $k(x_{new}, x_2) = \exp\left(-\frac{(2-2)^2 + (2-3)^2}{2(1)^2}\right) = \exp\left(-\frac{1}{2}\right) \approx 0.6065$
   $k(x_{new}, x_3) = \exp\left(-\frac{(2-3)^2 + (2-3)^2}{2(1)^2}\right) = \exp(-1) \approx 0.3679$
   $k(x_{new}, x_4) = \exp\left(-\frac{(2-4)^2 + (2-1)^2}{2(1)^2}\right) = \exp\left(-\frac{5}{2}\right) \approx 0.0821$
   The similarities with the training points are 0.3679, 0.6065, 0.3679, 0.0821. The three closest neighbors are $x_2$, $x_1$, and $x_3$, with labels $1$, $-1$, $-1$ respectively. Therefore, the predicted label for $x_{new}$ is $-1$ (majority label).
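These computations can be checked with a short script. The sketch below (for verification only, not part of the exam) recomputes the kernel similarities and the majority vote; note that since a kernel acts as a similarity, the K nearest neighbors are the points with the largest kernel values.

import numpy as np

X = np.array([[1, 1], [2, 3], [3, 3], [4, 1]], dtype=float)
y = np.array([-1, 1, -1, 1])
x_new = np.array([2.0, 2.0])

kernels = {
    'linear':     lambda a, b: a @ b,
    'polynomial': lambda a, b: (a @ b + 1) ** 2,
    'gaussian':   lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0),  # sigma = 1
}

K = 3
for name, k in kernels.items():
    sims = np.array([k(x_new, x_i) for x_i in X])
    # Larger kernel similarity means closer: take the K largest values.
    nn = np.argsort(sims)[-K:]
    pred = 1 if y[nn].sum() > 0 else -1
    print(f"{name}: similarities = {np.round(sims, 4)}, prediction = {pred}")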