
DIPARTIMENTO DI ELETTRONICA, INFORMAZIONE E BIOINGEGNERIA
Politecnico di Milano
Machine Learning (Code: 097683)
January 8, 2025

Surname: _______________  Name: _______________  Student ID: _______________  Row: ____  Column: ____
Time: 2 hours. Maximum Marks: 33

- The following exam is composed of 8 exercises (one per page). The first page must be filled with your name, surname, and student ID. The following pages should be used only in the large squares present on each page. Any solution provided either outside these spaces or without a motivation will not be considered for the final mark.
- During this exam, you are not allowed to use electronic devices, such as laptops, smartphones, tablets, and/or similar. Likewise, you are not allowed to bring with you any notes, books, written schemes, and/or similar. You are also not allowed to communicate with other students during the exam.
- The first reported violation of the above-mentioned rules will be annotated on the exam and will be considered for the final mark decision. The second reported violation will imply the immediate expulsion of the student from the exam room and the annulment of the exam.
- You can write the exam either with a pen (black or blue) or a pencil. It is your responsibility to provide a readable solution. We will not be held accountable for accidental partial or total cancellation of the exam.
- The exam can be written either in English or Italian.
- You are allowed to withdraw from the exam at any time without any penalty. You are allowed to leave the room no earlier than half the duration of the exam. You are not allowed to keep the exam text with you while leaving the room.
- Three of the points will be given based on how quickly you solve the exam. If you finish earlier than 30 min before the end of the exam, you will get 3 points; if you finish earlier than 20 min, you will get 2 points; and if you finish earlier than 10 min, you will get 1 point (the points cannot be accumulated).
- The box on Page 10 can only be used to complete Exercises 7 and/or 8.

Ex. 1   Ex. 2   Ex. 3   Ex. 4   Ex. 5   Ex. 6   Ex. 7   Ex. 8   Time   Tot.
 / 7     / 7     / 2     / 2     / 2     / 2     / 4     / 4     / 3    / 33

Student's name: _______________          Please go on to the next page. . .

Exercise 1 (7 marks)
Derive the Bias-Variance Decomposition for regression under the squared error loss, considering a target variable y, a prediction model f̂(x), and the true function f(x). Explain how bias and variance contribute to the overall expected error of the model.

Exercise 2 (7 marks)
What is the VC dimension of a hypothesis space? What can it be used for?

Exercise 3 (3 marks)
Consider the following snippet of code:

    1   X1 = zscore(dataset[['sepal-length', 'sepal-width']].values)
    2   X2 = zscore(dataset[['sepal-length', 'sepal-width', 'petal-length']].values)
    3   t = dataset['class'].values == 'Iris-setosa'
    4
    5   model1 = LogisticRegression(penalty='none')
    6   model1.fit(X1, t)
    7   accuracy1 = model1.score(X1, t)
    8
    9   model2 = LogisticRegression(penalty='none')
    10  model2.fit(X2, t)
    11  accuracy2 = model2.score(X2, t)
    12
    13  best_model = model1 if accuracy1 > accuracy2 else model2

1. Describe the procedure performed in the code above, detailing line-by-line the operations and their purpose.
2. Is the process used in the code correct? Explain why or why not.
3. Can Ordinary Least Squares (OLS) be applied to the same problem? Discuss whether OLS is suitable for classification tasks and explain its limitations compared to logistic regression.
1. The code fits two logistic regression models, one using two input variables (sepal-length and sepal-width) and the other using three (sepal-length, sepal-width, and petal-length). Both input matrices are normalized using zscore (lines 1 and 2). The target variable t encodes a binary classification problem: identifying Iris-setosa (line 3). Each model is trained on its respective dataset, and its training accuracy is computed (lines 5-7 and 9-11). The model with the higher training accuracy is selected as the best model (line 13).
2. The process of using training-set accuracy to select the best model is flawed: accuracy on the training set typically overestimates the true performance of the model on unseen data, and the model with more features can fit the training set better without generalizing better. Cross-validation or a separate validation set should be used to select the best model.
3. OLS can technically be applied to the same problem by treating the binary classification as a regression problem with target values 0 and 1. However, OLS is not well suited to classification, as it does not model the probability distribution of the target variable correctly. Logistic regression is specifically designed for binary classification and provides probabilistic outputs, making it the better choice.

Exercise 4 (2 marks)
Determine whether the following statements about the Principal Component Analysis (PCA) procedure are true or false. Motivate your answers.
1. The set of the Principal Component vectors forms an orthonormal basis for the original feature space.
2. Projecting the original features onto the Principal Components and using these projections as features for regression or classification reduces the likelihood of overfitting.
3. The percentage of the variance explained by a Principal Component is inversely proportional to the value of the corresponding eigenvalue.
4. The procedure to apply PCA to a dataset is deterministic.
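Statements 1 and 3 can also be checked numerically; a minimal NumPy sketch on a hypothetical random dataset:

```python
# Sketch: checking PCA properties numerically (NumPy only; the data matrix
# here is hypothetical random data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated features
Xc = X - X.mean(axis=0)                                  # center the data

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # columns of eigvecs are the PCs
order = np.argsort(eigvals)[::-1]            # sort PCs by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Statement 1: the PCs form an orthonormal basis (V^T V = I).
orthonormal = np.allclose(eigvecs.T @ eigvecs, np.eye(3))

# Statement 3: the explained-variance fraction of the i-th PC is
# lambda_i / sum_j lambda_j, i.e. directly proportional to its eigenvalue.
explained = eigvals / eigvals.sum()
print(orthonormal, explained)
```

The eigen-decomposition of a symmetric covariance matrix always yields orthonormal eigenvectors, which is exactly why statement 1 holds.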
1. TRUE: PCA looks for the direction of maximum variance in the dataset and extracts the first Principal Component (PC) as the unit vector identifying that direction. Iteratively, it finds the direction of maximum variance orthogonal to the previous ones and extracts the next PC. The process iterates until the number of PCs equals the number of dimensions of the dataset. Since the PCs are mutually orthogonal unit vectors, they form an orthonormal basis for the original feature space.
2. FALSE: Only by selecting the first K PCs does one have a chance to remove noise from the data and avoid overfitting. If we keep all the PCs, we obtain a linear transformation of the original dataset, which is likely to perform just as well (or as poorly) as the original one on the supervised task.
3. FALSE: The variance explained by each PC is directly proportional to the corresponding eigenvalue. Indeed, the fraction of variance explained by the i-th PC is λ_i / Σ_j λ_j, where λ_i is the eigenvalue corresponding to the i-th PC.
4. TRUE: The procedure consists of an input normalization step and the computation of the eigenvectors of the covariance matrix, both of which are deterministic procedures.

Exercise 5 (2 marks)
Determine whether the following statements about the Multi-Armed Bandit (MAB) problem are true or false. Provide adequate motivations.
1. In a stochastic MAB setting, the reward distribution of each arm remains fixed throughout the learning procedure.
2. MAB techniques can be directly applied to problems where there are dependencies between arms.
3. The exploration-exploitation trade-off is a fundamental challenge in MAB problems.
4. The Thompson Sampling (TS) algorithm requires the knowledge of the reward distributions of the arms in advance.

1. TRUE: In the stochastic MAB setting, the reward distribution of each arm does not change over time, and the learner's task is to learn these distributions.
2. FALSE: Standard MAB techniques assume independent arms.
Dependencies between arms require specialized algorithms, such as those for structured bandit problems.
3. TRUE: The core challenge in MAB is balancing exploration (to discover better arms) and exploitation (to maximize reward based on current knowledge).
4. FALSE: Thompson Sampling does not require prior knowledge of the reward distributions; instead, it maintains a prior belief that is updated with the observed rewards and samples actions accordingly.

Exercise 6 (2 marks)
You are an ML specialist working for an agricultural robotics company that develops drones to monitor and spray crops in a large 2D field. Consider the following problems:
1. Predict the time required for a drone to reach a specific area in the field based on its current position and wind speed, using a dataset of past drone flights.
2. Learn a drone navigation policy that minimizes the time and fuel consumption needed to spray all the crops in the field, assuming you can simulate the drone's operation.
Model the two problems described above as ML problems, specifying the data (features and target, or states, actions, and reward), the problem class, and a specific method to solve them.

1. SUPERVISED LEARNING - REGRESSION. The features are the current position of the drone (x, y), the wind speed and direction, and the position of the target area. The target is the time needed to reach the target area. A method to solve it is Support Vector Regression (SVR).
2. REINFORCEMENT LEARNING - CONTROL. The state consists of the drone's current position, the remaining fuel, and the positions of the unsprayed crops. The actions are the direction and velocity of movement. The reward could be a negative constant for each time step, plus a larger penalty for running out of fuel. The problem is episodic, ending when all crops are sprayed. Possible methods to solve it are Q-Learning or SARSA.
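The two tabular methods just mentioned share the same update skeleton; a minimal sketch, applied for concreteness to the trajectory of Exercise 7 below (the action set {yes, no} and the function name run are illustrative choices, not part of the exam):

```python
# Sketch: tabular Q-learning and SARSA updates over a logged trajectory.
from collections import defaultdict

def run(trajectory, alpha=0.2, gamma=1.0, sarsa=False):
    """trajectory: list of (state, action, reward) tuples; the next state
    and next action of each step are read from the following tuple, so the
    last tuple produces no update."""
    Q = defaultdict(float)
    for (s, a, r), (s2, a2, _) in zip(trajectory, trajectory[1:]):
        if sarsa:
            target = r + gamma * Q[(s2, a2)]    # on-policy: action taken
        else:                                   # off-policy: greedy action
            target = r + gamma * max(Q[(s2, b)] for b in ("yes", "no"))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    return dict(Q)

traj = [("X", "yes", 3), ("Y", "no", -1), ("Z", "yes", 2), ("X", "no", 4),
        ("Y", "yes", 0), ("Z", "no", 1), ("Y", "no", -2), ("X", "yes", 5)]
q = run(traj)                # Q-learning values
s = run(traj, sarsa=True)    # SARSA values
```

Running it reproduces the hand computations of Exercise 7, e.g. Q(Z, no) = 0.2208 for Q-learning and 0.16 for SARSA.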
Exercise 7 (4 marks)
Consider the following trajectory obtained by an agent interacting with an MDP having three states S = {X, Y, Z} and two actions A = {yes, no}:

(X, yes, 3) → (Y, no, −1) → (Z, yes, 2) → (X, no, 4) → (Y, yes, 0) → (Z, no, 1) → (Y, no, −2) → (X, yes, 5).

Consider initial state-action values Q(s, a) = 0 for every state-action pair, a learning rate α = 0.2, and a discount factor γ = 1. Answer the following questions, providing adequate motivations.
1. Execute the Q-learning algorithm on the given trajectory (break ties using the first action listed in the action set).
2. Execute the SARSA algorithm on the given trajectory.
3. Derive the best policy for each state according to the output of SARSA. Is it different from the one provided by Q-learning?

1. The Q-learning update rule is Q(s, a) ← Q(s, a) + α(R + γ max_a' Q(s', a') − Q(s, a)). Applying it step by step:
Q(X, yes) = 0 + 0.2 (3 + 1 · 0 − 0) = 0.6
Q(Y, no) = 0 + 0.2 (−1 + 1 · 0 − 0) = −0.2
Q(Z, yes) = 0 + 0.2 (2 + 1 · 0.6 − 0) = 0.52
Q(X, no) = 0 + 0.2 (4 + 1 · 0 − 0) = 0.8
Q(Y, yes) = 0 + 0.2 (0 + 1 · 0.52 − 0) = 0.104
Q(Z, no) = 0 + 0.2 (1 + 1 · 0.104 − 0) = 0.2208
Q(Y, no) = −0.2 + 0.2 (−2 + 1 · 0.8 − (−0.2)) = −0.4
2. The SARSA update rule is Q(s, a) ← Q(s, a) + α(R + γ Q(s', a') − Q(s, a)), where a' is the action actually taken in the next step. Applying it step by step:
Q(X, yes) = 0 + 0.2 (3 + 1 · 0 − 0) = 0.6
Q(Y, no) = 0 + 0.2 (−1 + 1 · 0 − 0) = −0.2
Q(Z, yes) = 0 + 0.2 (2 + 1 · 0 − 0) = 0.4
Q(X, no) = 0 + 0.2 (4 + 1 · 0 − 0) = 0.8
Q(Y, yes) = 0 + 0.2 (0 + 1 · 0 − 0) = 0
Q(Z, no) = 0 + 0.2 (1 + 1 · (−0.2) − 0) = 0.16
Q(Y, no) = −0.2 + 0.2 (−2 + 1 · 0.6 − (−0.2)) = −0.44
3. The policy prescribed by SARSA is: X → no, Y → yes, Z → yes. The policy from Q-learning is also: X → no, Y → yes, Z → yes. Therefore, the two policies coincide.

Exercise 8 (4 marks)
Consider a linear binary SVM classifier that, after training, has weights w = [2, −1] and bias b = 1.
Motivate all answers and write down any necessary computation.
1. How is the point [2, 3] classified?
2. Provide an example of a support vector for this SVM.
3. Suppose the point [3, 1] is added to the dataset with a positive label. Do we need to retrain the SVM?
4. Can we discard the point [0, −2] from the dataset if all we care about is predicting the class of new points?

1. f([2, 3]) = 2 · 2 − 1 · 3 + 1 = 2 > 0, so the point is classified as positive.
2. An example is [0, 1], since f([0, 1]) = 2 · 0 − 1 · 1 + 1 = 0, so it lies on or within the margin. Any vector v with |f(v)| ≤ 1 is a valid example.
3. No: f([3, 1]) = 2 · 3 − 1 · 1 + 1 = 6 > 1, so the point is correctly classified and lies outside the margin. It is not a support vector, so the classifier would not change.
4. With a linear SVM, we can discard the whole dataset: the weights are enough to classify new points. If we were using a non-parametric version (e.g., a kernel-based implementation with a linear kernel), we would have to check whether the point is a support vector. Since f([0, −2]) = 2 · 0 − 1 · (−2) + 1 = 3 > 1, it is not a support vector, and it can be discarded either way.

End of exam