Computer Engineering - Machine Learning

Full exam with solutions, January 25, 2024

DIPARTIMENTO DI ELETTRONICA, INFORMAZIONE E BIOINGEGNERIA
Politecnico di Milano
Machine Learning (Code: 097683), January 25, 2024
Surname: ______  Name: ______  Student ID: ______  Row: ____  Column: ____
Time: 2 hours    Maximum Marks: 33

• The exam is composed of 8 exercises (one per page). The first page must be filled in with your name, surname, and student ID. The following pages should be used only within the large squares present on each page. Any solution provided outside these spaces, or without a motivation, will not be considered for the final mark.

• During this exam you are not allowed to use electronic devices, such as laptops, smartphones, tablets, and/or similar. Likewise, you are not allowed to bring any kind of note, book, written scheme, and/or similar, nor to communicate with other students during the exam.

• The first reported violation of the above rules will be annotated on the exam and considered in the final mark decision. The second reported violation will imply the immediate expulsion of the student from the exam room and the annulment of the exam.

• You may write the exam with either a pen (black or blue) or a pencil. It is your responsibility to provide a readable solution. We will not be held accountable for accidental partial or total cancellation of the exam.

• The exam can be written in either English or Italian.

• You are allowed to withdraw from the exam at any time without penalty. You may leave the room no earlier than half the duration of the exam. You are not allowed to keep the text of the exam with you when leaving the room.

• Three of the points are awarded based on how quickly you solve the exam: finishing more than 30 minutes before the end earns 3 points, more than 20 minutes earns 2 points, and more than 10 minutes earns 1 point (the points cannot be accumulated).

• The box on Page 10 can only be used to complete Exercises 7 and/or 8.

Ex. 1  Ex. 2  Ex. 3  Ex. 4  Ex. 5  Ex. 6  Ex. 7  Ex. 8  Time  Tot.
  /7     /7     /2     /2     /2     /2     /4     /4    /3    /33

Exercise 1 (7 marks)

Define PAC-Learning in the context of supervised learning and explain how it is related to the concept of sample complexity. Make sure to also define each relevant term introduced in your answer.

Exercise 2 (7 marks)

Explain the principles behind boosting, how it works, and when it is useful.

Exercise 3 (2 marks)

Consider the following snippet of code:

    X = zscore(dataset[['sepal-length', 'sepal-width']].values)
    t = dataset['class'].values == 'Iris-setosa'

    model = LogisticRegression(penalty='none')
    model.fit(X, t)

1. Tell which procedure is performed in the previous lines of code. Detail line by line the operations that have been performed. Do you think there are any errors in the code?

2. Which problem are we solving with this procedure? Are there other methods that can solve the same problem? List at least 3 other methods suitable for such a problem.
3. Which kind of metrics would you use to evaluate the performance of the previous method? Describe the metrics and the procedure you would use to understand whether the adopted approach is effective.

Solution:

1. We are training a logistic regression classifier. The input features are standardized (z-scored), the binary targets are extracted (Iris setosa vs. all other classes), then an unregularized logistic regression model is initialized and fit on the entire dataset. The procedure provided is correct.

2. We are solving a binary classification problem. Other methods suitable for binary classification are the perceptron, K-nearest neighbours, and Naive Bayes.

3. For such problems we might use accuracy, precision, recall, or F1 score. These metrics should be evaluated on an independent set, so that they correspond to the values we would obtain on new data (test-set approach), or computed using cross-validation or a leave-one-out approach.

Exercise 4 (2 marks)

Tell whether the following statements about Multi-Armed Bandits (MAB) are true or false. Provide adequate motivations.

1. After pulling an arm, the learner gets feedback on the reward from all the available arms.
2. An algorithm designed for the MAB setting should minimize the regret suffered during the learning procedure.
3. A problem modeled as an MDP with a single state and an infinite number of actions can be handled with MAB techniques.
4. UCB1 and TS are designed to solve the same class of problems.

Solution:

1. FALSE: we get feedback only about the arm we pull. If we had information about the other arms as well, we would be in an expert setting.
2. TRUE: the goal is to minimize the loss due to the lack of knowledge about the optimal option. This can be done by maximizing the expected reward or, equivalently, by minimizing the regret.
3. FALSE: the MAB problem has a finite set of arms (corresponding to actions), so an infinite action space cannot be handled.
4. TRUE: they both provide theoretical guarantees on the stochastic MAB problem.

Exercise 5 (2 marks)

Indicate whether the following statements are true or false. Justify your answers.

1. Backward feature selection evaluates a number of models that is at most quadratic in the number of features.
2. Forward feature selection is guaranteed to deliver the best subset of features.
3. Wrapper methods should use the training error to guide the selection of the features.
4. Filter methods are more appropriate than wrapper methods when training the underlying model is expensive.

Solution:

1. TRUE. Backward feature selection removes one feature at a time. In the first iteration it evaluates m models (one for each of the m candidate features to remove), in the second iteration m−1, and so on. The maximum number of evaluated models is therefore $\sum_{i=1}^{m} i = m(m+1)/2 = O(m^2)$.
2. FALSE. Just like backward feature selection, it evaluates $O(m^2)$ models, while the number of all subsets of features is $2^m \gg m(m+1)/2$.
3. FALSE. Like any model selection technique, it should avoid overfitting by using a proper model evaluation approach, such as a validation set, cross-validation, or adjustment techniques.
4. TRUE. Filter methods do not require training the underlying model, while wrapper methods do, for every candidate subset of features.
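The counting argument in point 1 above can be made concrete with a short sketch (illustrative only, not part of the exam: the Iris dataset, the logistic regression estimator, and cross-validated accuracy as the selection score are all assumptions):

    # Minimal sketch of backward (wrapper) feature selection.
    # Assumptions: scikit-learn estimator, 5-fold CV accuracy as the score.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, t = load_iris(return_X_y=True)
    features = list(range(X.shape[1]))  # start from the full feature set
    evaluated = 0                       # counts how many models we train

    while len(features) > 1:
        scores = {}
        for f in features:
            candidate = [g for g in features if g != f]  # drop one feature
            scores[f] = cross_val_score(LogisticRegression(max_iter=1000),
                                        X[:, candidate], t, cv=5).mean()
            evaluated += 1
        # remove the feature whose removal hurts the score the least
        features.remove(max(scores, key=scores.get))

    # With m features, the loop evaluates m + (m-1) + ... + 2 models: O(m^2).
    print(f"models evaluated: {evaluated}")

With the 4 Iris features this trains 4 + 3 + 2 = 9 models, in line with the quadratic bound, whereas exhaustive search would require 2^4 = 16 subsets.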
Exercise 6 (2 marks)

General anesthesia is a reversible condition of unconsciousness induced by the systemic administration of one or more drugs to make the patient insensible to the stress of surgical trauma. Consider the following problems.

1. Predicting whether the patient's blood pressure goes below a fixed, known threshold as an effect of the concentration levels of the administered drugs.
2. Learning the interventions (increase or decrease the amount of drugs administered) the anesthesiologist should perform in order to keep the patient's blood pressure within a certain known range during the whole surgical procedure.

Classify the problems and, for each of them, specify all the elements needed to fully characterize it (e.g., input, output, states, actions, reward, ...). Furthermore, propose one technique to solve it.

Solution:

1. The problem is classified as supervised learning, specifically classification. The input variables are the concentrations of the administered drugs and the output variable is 1 if the blood pressure is below the threshold (0 otherwise). A technique to solve it is SVM.

2. The problem is classified as reinforcement learning (control). The state is (at least) the blood pressure (reasonably, all indicators available to the anesthesiologist through the monitors), the action is the amount of each drug administered at the current time instant, the reward is −1 if the current blood pressure falls outside the range (other solutions are possible), and the discount factor is 1 since the surgical procedure has finite duration. We can solve the problem of optimizing the drug dosage using Q-learning or SARSA.

Exercise 7 (4 marks)

Given the following dataset:

x1 = (2, 3, 4)⊤, y1 = 1
x2 = (0, 1, 2)⊤, y2 = 0
x3 = (1, 2, 5)⊤, y3 = 1
x4 = (1, 4, 3)⊤, y4 = 0
x5 = (0, 3, 1)⊤, y5 = 0
x6 = (1, 2, 2)⊤, y6 = 0
x7 = (3, 1, 4)⊤, y7 = 1
x8 = (4, 2, 5)⊤, y8 = 1
x9 = (1, 3, 3)⊤, y9 = 0
x10 = (1, 2, 4)⊤, y10 = 1

• Classify the point x11 = (0, 1, 2)⊤ according to a KNN classifier trained on the given dataset with K = 3, using the Manhattan distance $d(x_i, x_j) := \sum_h |x_{ih} - x_{jh}|$;
• What happens if we use K = 2 instead? Do you think it is a good idea to choose such a value for the parameter K?
• What would you do if you wanted to apply regularization to this classifier?

Solution:

1. The distances of x11 from the training points are:

d1 = |2−0| + |3−1| + |4−2| = 6
d2 = 0 + 0 + 0 = 0
d3 = 1 + 1 + 3 = 5
d4 = 1 + 3 + 1 = 5
d5 = 0 + 2 + 1 = 3
d6 = 1 + 1 + 0 = 2
d7 = 3 + 0 + 2 = 5
d8 = 4 + 1 + 3 = 8
d9 = 1 + 2 + 1 = 4
d10 = 1 + 1 + 2 = 4

The 3 closest points to x11 according to the Manhattan distance are x2, x6, and x5. Therefore, the class predicted by the KNN is the negative class (0).

2. If we used K = 2, we would obtain the same outcome for the point x11. However, it is not a good idea to choose an even value for the parameter K, since there would be cases in which we need to break ties arbitrarily. This does not happen with two classes and an odd value of K.

3. K can be regarded as a regularization parameter of the model: the larger the value of K, the stronger the regularization.
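As a quick check of the distance computation in point 1, the prediction can be reproduced programmatically (illustrative only, not part of the exam; the use of scikit-learn's KNeighborsClassifier is an assumption):

    # Sketch verifying the KNN prediction of Exercise 7.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[2, 3, 4], [0, 1, 2], [1, 2, 5], [1, 4, 3], [0, 3, 1],
                  [1, 2, 2], [3, 1, 4], [4, 2, 5], [1, 3, 3], [1, 2, 4]])
    y = np.array([1, 0, 1, 0, 0, 0, 1, 1, 0, 1])
    x11 = np.array([[0, 1, 2]])

    knn = KNeighborsClassifier(n_neighbors=3, metric='manhattan')
    knn.fit(X, y)

    dist, idx = knn.kneighbors(x11)  # 3 nearest neighbours of x11
    print(dist, idx + 1)             # distances [[0. 2. 3.]]; points x2, x6, x5
    print(knn.predict(x11))          # -> [0], the negative class

The three returned neighbours carry label 0, so the majority vote matches the hand computation above.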
Exercise 8 (4 marks)

Consider the following Markov reward process (i.e., a Markov decision process with a fixed policy) over the states A (the start state), B, C, and D: from A the process stays in A with probability 1/2 and moves to B with probability 1/2; from B it stays in B with probability 1/2 and moves to C with probability 1/2; from C it stays in C with probability 1/2 and moves to D with probability 1/2; D is absorbing (it stays in D with probability 1). The discount factor is $\gamma \in [0, 1)$ and the reward function R is defined as $R(A) = R(B) = R(C) = 0$ and $R(D) = \frac{1-\gamma}{1-\gamma/2}$.

1. Compute the value function $V^{\pi}(s)$ for every state $s \in \{A, B, C, D\}$.
2. Suppose that the agent observes the (infinite-length) episode A→B→B→C→C→C→C→D→D→.... Compute the probability that such an episode occurs.
3. Say whether Monte Carlo (both first-visit and every-visit) and Temporal Difference evaluation can be employed for estimating the value function in this problem. If yes, provide the value function estimate that the technique would deliver when executed on the episode of question 2, with the estimated value function initialized to zero and learning rate $\alpha > 0$. Motivate your answers.

Solution:

1. Given the simple structure of the MRP, we can write down the Bellman expectation equations:

\[
\begin{cases}
V(A) = 0 + \frac{\gamma}{2}\,V(A) + \frac{\gamma}{2}\,V(B)\\
V(B) = 0 + \frac{\gamma}{2}\,V(B) + \frac{\gamma}{2}\,V(C)\\
V(C) = 0 + \frac{\gamma}{2}\,V(C) + \frac{\gamma}{2}\,V(D)\\
V(D) = \frac{1-\gamma}{1-\gamma/2} + \gamma\,V(D)
\end{cases}
\quad\Longrightarrow\quad
\begin{cases}
V(A) = \frac{(\gamma/2)^3}{(1-\gamma/2)^4}\\
V(B) = \frac{(\gamma/2)^2}{(1-\gamma/2)^3}\\
V(C) = \frac{\gamma/2}{(1-\gamma/2)^2}\\
V(D) = \frac{1}{1-\gamma/2},
\end{cases}
\]

by solving from the last equation up.

2. We multiply the probabilities of the individual transitions, which are all equal to 1/2 except those of the form D→D, which have probability 1:

\[
\Pr(A \to B \to B \to C \to C \to C \to C \to D \to D \to \dots) = \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot 1 \cdot 1 \cdots = \left(\tfrac{1}{2}\right)^7.
\]

3. Monte Carlo evaluation (both first-visit and every-visit) cannot be applied, since the MRP is infinite-horizon: the episode never terminates, so the return of a visit can never be computed. Temporal Difference evaluation can be applied, since it works for infinite-horizon problems too. TD applied to the episode of question 2, with the estimated value function initialized to zero and learning rate $\alpha > 0$, delivers $\hat{V}(A) = \hat{V}(B) = \hat{V}(C) = 0$, since the immediate reward in these states is zero and the estimated value of the next state in the episode is always zero at update time. For state D, instead, we have the sequence of updates:

\[
\hat{V}(D) \leftarrow \hat{V}(D) + \alpha\bigl(R(D) + \gamma\,\hat{V}(D) - \hat{V}(D)\bigr) = \hat{V}(D)\,(1-\alpha+\alpha\gamma) + \alpha\,\frac{1-\gamma}{1-\gamma/2}.
\]

Note that for $\gamma \in [0, 1)$ and $\alpha \in (0, 1]$, the quantity $1-\alpha+\alpha\gamma = 1-\alpha(1-\gamma)$ lies in $[0, 1)$, so the repeated D→D updates drive $\hat{V}(D)$ towards the fixed point $\frac{\alpha R(D)}{1-(1-\alpha+\alpha\gamma)} = \frac{R(D)}{1-\gamma} = \frac{1}{1-\gamma/2}$, which is exactly $V(D)$.
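The TD analysis above can be checked numerically with a short simulation (illustrative only, not part of the exam; the values gamma = 0.9 and alpha = 0.1, and the 1000 replayed D→D steps standing in for the infinite tail, are arbitrary choices):

    # TD(0) evaluation on the episode A, B, B, C, C, C, C, D, D, ... of Exercise 8.
    # gamma, alpha, and the number of replayed D->D steps are illustrative choices.
    gamma, alpha = 0.9, 0.1

    R = {'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': (1 - gamma) / (1 - gamma / 2)}
    V = {s: 0.0 for s in 'ABCD'}  # estimates initialized to zero

    episode = list('ABBCCCCD') + ['D'] * 1000  # finite prefix of the infinite episode
    for s, s_next in zip(episode, episode[1:]):
        # TD(0) update: V(s) <- V(s) + alpha * (R(s) + gamma * V(s') - V(s))
        V[s] += alpha * (R[s] + gamma * V[s_next] - V[s])

    print(V['A'], V['B'], V['C'])       # all remain 0, as argued above
    print(V['D'], 1 / (1 - gamma / 2))  # V(D) approaches 1/(1 - gamma/2) = 1.818...

The estimates for A, B, and C stay at zero because every update in those states sees a zero reward and a zero next-state estimate, while the repeated D→D updates contract towards the true value $V(D) = 1/(1-\gamma/2)$.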