Biomedical Engineering - Model Identification and Machine Learning

Machine Learning

Lecture 1: Data preparation

Business intelligence systems and mathematical models for decision making can achieve accurate and effective results only when the input data are highly reliable. However, the data extracted from the available primary sources may have several anomalies which analysts must identify and correct. Several techniques are employed to reach this goal.

Data validation
The quality of input data may prove unsatisfactory due to:
• Incompleteness: some records may contain missing values for one or more attributes. Data may be missing because of malfunctioning recording devices. It is also possible that some data were deliberately removed during previous stages of the gathering process. Incompleteness may also derive from a failure to transfer data from the operational databases to the data mart used for a specific business analysis, or simply from the fact that the data were not mandatory.
• Noise: data may contain erroneous or anomalous values, usually referred to as outliers. Other possible causes of noise are malfunctioning devices for data measurement, recording and transmission.
• Inconsistency: sometimes data contain discrepancies due to changes in the coding system used for their representation, and therefore may appear inconsistent.
The purpose of data validation techniques is to identify and implement corrective actions for incomplete or inconsistent data and for data affected by noise.

Incomplete data
Several techniques can be used to partially correct incomplete data:
• Elimination: it is possible to discard all records for which the values of one or more attributes are missing; if at least 10% of the values of a row or column are missing, the whole row or column can be eliminated. This is a drastic and not very thorough method, because data are erased from a row or column without knowing what caused the missing values.
• Inspection: it is possible to opt for an inspection of each missing value. This approach reveals why the data are missing, but it suffers from a high degree of arbitrariness and subjectivity and, in some cases, takes too much time.
• Identification: a conventional value may be used to encode and identify missing values, making it unnecessary to remove entire records from the dataset. Sometimes a missing value itself carries important information.
• Substitution: several criteria exist for the automatic replacement of missing data, although most of them appear somewhat arbitrary (see the sketch below). For instance, missing values of an attribute may be replaced with:
o the mean of the attribute calculated over the remaining observations; this technique can only be applied to numerical attributes;
o the mean of the attribute calculated only over the records having the same target class; this is a rather complex and time-consuming procedure for a large dataset with a high percentage of missing data;
o a value estimated using statistical inference.
By using these methods we introduce noise into the data, because an average is used instead of the actual value. Once the table contains complete values, the next step deals with abnormalities and strange values, which always exist due to errors.

Data affected by noise
The term noise refers to a random perturbation within the values of a numerical attribute, usually resulting in noticeable anomalies. First, the outliers in a dataset need to be identified, so that subsequently either they can be corrected and regularized or the entire records containing them can be eliminated.
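Before moving on to noise handling, here is a minimal sketch of the incomplete-data strategies described above (elimination, identification and class-conditional mean substitution), assuming a pandas DataFrame with a categorical target column; the column names are illustrative, while the 10% threshold follows the notes:

```python
import numpy as np
import pandas as pd

def handle_missing(df: pd.DataFrame, target: str, drop_threshold: float = 0.10) -> pd.DataFrame:
    """Illustrative pipeline for incomplete data: elimination, identification, substitution."""
    out = df.copy()

    # Elimination: drop columns whose fraction of missing values exceeds the threshold.
    missing_frac = out.isna().mean()
    cols_to_drop = missing_frac[missing_frac > drop_threshold].index.drop(target, errors="ignore")
    out = out.drop(columns=cols_to_drop)

    # Identification: keep a flag column so the fact that a value was missing is not lost.
    for col in out.columns.drop(target):
        out[f"{col}_was_missing"] = out[col].isna()

    # Substitution: replace remaining missing numerical values with the mean of the
    # attribute computed over the records of the same target class.
    num_cols = out.select_dtypes(include=np.number).columns
    out[num_cols] = out.groupby(target)[num_cols].transform(lambda s: s.fillna(s.mean()))
    return out
```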
In this section we will describe a few simple techniques for identifying and regularizing data affected by noise.

Dispersion
The easiest way to identify outliers is based on the statistical concept of dispersion. The sample mean $\bar{\mu}_j$ and the sample variance $\bar{\sigma}_j^2$ of the numerical attribute $a_j$ are calculated. If the attribute follows a distribution that is not too far from the normal (bell-shaped), the values falling outside an appropriate interval centered around the mean value $\bar{\mu}_j$ are identified as outliers. More precisely, with a confidence of $100(1-\alpha)\%$ it is possible to consider as outliers those values that fall outside the interval
$$\big(\bar{\mu}_j - z_{\alpha/2}\,\bar{\sigma}_j,\; \bar{\mu}_j + z_{\alpha/2}\,\bar{\sigma}_j\big),$$
where $z_{\alpha/2}$ is the quantile of the standard normal distribution. In practice, a value is considered suspicious if it lies beyond $\pm 2\bar{\sigma}_j$ from the mean and an outlier if it lies beyond $\pm 3\bar{\sigma}_j$. This technique is simple to use, although it has the drawback of relying on the critical assumption that the distribution of the values of the attribute is bell-shaped and roughly normal.

Chebyshev's theorem
With Chebyshev's theorem it is possible to obtain analogous bounds independent of the distribution, with intervals that are only slightly less stringent. Chebyshev's theorem states that at least a proportion $1 - 1/k^2$ of the observations, with $k > 1$, falls within the interval $\bar{\mu} \pm k\bar{\sigma}$; values outside this interval are treated as outliers. Once the outliers have been identified, it is possible to correct them with values that are deemed more plausible, or to remove the entire records containing them.

Z-index method for outlier detection
If the distribution of the data is unknown, we cannot directly judge whether a certain value $x_{ij}$ is an outlier, so we normalize it:
$$z_{ij} = \frac{x_{ij} - \bar{\mu}_j}{\bar{\sigma}_j}.$$
If $|z_{ij}| > 3$ the value is considered suspicious, and if $|z_{ij}| \gg 3$ it is considered an outlier. This is an automatic way of assessing an outlier using a single standardized deviation.

Clustering method
An alternative technique is based on the distance between observations and the use of clustering methods. Once the clusters have been identified, representing sets of records whose mutual distance is less than the distance from the records included in other groups, the observations that are not placed in any of the clusters are identified as outliers. Clustering techniques offer the advantage of simultaneously considering several attributes. A variant of clustering methods, also based on the distances between the observations, detects the outliers through two parametric values, $p$ and $d$, to be assigned by the user: an observation $\mathbf{x}_i$ is identified as an outlier if at least a percentage $p$ of the observations in the dataset are found at a distance greater than $d$ from $\mathbf{x}_i$.

Unlike the above methods, whose aim is to identify and correct each single anomaly, there also exist regularization techniques, which automatically correct anomalous data. For example, simple or multiple regression models predict the value of the attribute $a_j$ that one wishes to regularize based on other variables existing in the dataset. Once the regression model has been developed, and the corresponding confidence interval around the prediction curve has been calculated, the values of the attribute $a_j$ that fall outside the interval can be replaced by the value computed along the prediction curve.
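A minimal sketch of the z-index rule above, assuming a NumPy array of values for a single numerical attribute; the 3-sigma threshold follows the convention in the notes, while the example data are illustrative:

```python
import numpy as np

def z_index_outliers(x: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask marking values whose standardized deviation exceeds the threshold."""
    mu = x.mean()
    sigma = x.std(ddof=1)      # sample standard deviation
    z = (x - mu) / sigma       # z-index of each observation
    return np.abs(z) > threshold

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(loc=10.0, scale=1.0, size=100), [25.0]])
print(np.flatnonzero(z_index_outliers(x)))   # flags index 100, the injected anomalous value
```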
Data transformation
In most data mining analyses it is appropriate to apply a few transformations to the dataset in order to improve the accuracy of the learning models subsequently developed.

Standardization
Most learning models benefit from a preventive standardization of the data, also called normalization. The most popular standardization techniques include the decimal scaling method, the min-max method and the z-index method (a code sketch of the three transformations is given further below).
• Decimal scaling: it is based on the transformation
$$x'_{ij} = \frac{x_{ij}}{10^h},$$
where $h$ is a given parameter which determines the scaling intensity. In practice, decimal scaling corresponds to shifting the decimal point by $h$ positions toward the left. In general, $h$ is fixed at a value that gives transformed values in the range $[-1, 1]$.
• Min-max: it is achieved through the transformation
$$x'_{ij} = \frac{x_{ij} - x_{\min,j}}{x_{\max,j} - x_{\min,j}}\,(x'_{\max,j} - x'_{\min,j}) + x'_{\min,j},$$
where $x_{\min,j}$ and $x_{\max,j}$ are the minimum and maximum values of the attribute $a_j$ before the transformation, while $x'_{\min,j}$ and $x'_{\max,j}$ are the minimum and maximum values that we wish to obtain after the transformation. In general, the extreme values of the range are defined so that $x'_{\min,j} = -1$ or $x'_{\min,j} = 0$ and $x'_{\max,j} = 1$. This is the most commonly used technique.
• Z-index: it uses the transformation
$$x'_{ij} = \frac{x_{ij} - \bar{\mu}_j}{\bar{\sigma}_j},$$
where $\bar{\mu}_j$ and $\bar{\sigma}_j$ are respectively the sample mean and the sample standard deviation of the attribute $a_j$, but this transformation gives a less predictable range of results. If the distribution of values of the attribute $a_j$ is roughly normal, the z-index based transformation generates values that are almost certainly within the range $(-3, 3)$.

Feature extraction
The aim of standardization techniques is to replace the values of an attribute with values obtained through an appropriate transformation. However, there are situations in which more complex transformations are used to generate new attributes that represent a set of additional columns in the matrix $\mathbf{X}$ representing the dataset $\mathcal{D}$. Transformations of this kind are usually referred to as feature extraction. Attribute extraction may also consist of the creation of new variables that summarize within themselves the relevant information contained in a subset of the original attributes.

Data reduction
When dealing with a small dataset, the transformations described above are usually adequate to prepare input data for a data mining analysis. However, when facing a large dataset, it is also appropriate to reduce its size, in order to make learning algorithms more efficient without sacrificing the quality of the results obtained. There are three main criteria to determine whether a data reduction technique should be used: efficiency, accuracy and simplicity of the models generated.
• Efficiency: the application of learning algorithms to a dataset smaller than the original one usually means shorter computation times, and the reduction in processing time allows the analyses to be carried out more quickly.
• Accuracy: the accuracy of the models generated represents a critical success factor, and it is therefore the main criterion followed in order to select one class of learning methods over another. As a consequence, data reduction techniques should not significantly compromise the accuracy of the model generated.
• Simplicity: in some data mining applications, concerned more with interpretation than with prediction, it is important that the models generated be easily translated into simple rules that can be understood.
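Going back to the standardization techniques described above, a minimal sketch of the three transformations (decimal scaling, min-max and z-index), assuming a one-dimensional NumPy array per attribute; the function names are illustrative:

```python
import numpy as np

def decimal_scaling(x: np.ndarray) -> np.ndarray:
    """Divide by 10^h, with h chosen so that all transformed values fall in [-1, 1]."""
    h = int(np.ceil(np.log10(np.abs(x).max()))) if np.abs(x).max() > 0 else 0
    return x / (10 ** h)

def min_max(x: np.ndarray, new_min: float = 0.0, new_max: float = 1.0) -> np.ndarray:
    """Map the interval [x_min, x_max] linearly onto [new_min, new_max]."""
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_index(x: np.ndarray) -> np.ndarray:
    """Subtract the sample mean and divide by the sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

x = np.array([120.0, 250.0, 310.0, 95.0, 400.0])
print(decimal_scaling(x))   # scaled into [-1, 1]
print(min_max(x))           # rescaled into [0, 1]
print(z_index(x))           # roughly within (-3, 3) for near-normal data
```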
Data reduction often represents an effective technique for deriving models that are more easily interpretable. Data reduction can be pursued in three distinct directions: a reduction in the number of observations through sampling, a reduction in the number of attributes through selection and projection, and a reduction in the number of values through discretization and aggregation.
• Sampling: a reduction in the size of the original dataset can be achieved by extracting a sample of observations that is significant from a statistical standpoint (fewer rows). This type of reduction is based on classical inferential reasoning. It is therefore necessary to determine the size of the sample that guarantees the level of accuracy required by the subsequent learning algorithms and to define an adequate sampling procedure. Generally speaking, a sample comprising a few thousand observations is adequate to train most learning models. It is also useful to set up several independent samples, each of a predetermined size, to which the learning algorithms are applied. In this way computation times increase linearly with the number of samples, and it is possible to compare the different models generated, in order to assess the robustness of each model and the quality of the knowledge extracted from the data against the random fluctuations existing in the sample. The conclusions obtained can be regarded as robust when the models and the rules generated remain relatively stable as the sample set used for training varies. We speak of simple sampling if the observations are selected at random, and of stratified sampling if the data are selected so as to also preserve the proportions of a relevant variable (for example, the target classes).
• Feature selection: the purpose of feature selection, also called feature reduction, is to eliminate from the dataset a subset of variables which are not deemed relevant for the purpose of the data mining activities. Feature reduction has several potential advantages. Due to the presence of fewer columns, learning algorithms can be run more quickly on the reduced dataset than on the original one. Moreover, the models generated after the elimination of uninfluential attributes from the dataset are often more accurate and easier to understand. Feature selection methods can be classified into three main categories: filter methods, wrapper methods and embedded methods.
o Filter methods: they select the relevant attributes before moving on to the subsequent learning phase, and are therefore independent of the specific algorithm being used. The attributes deemed most significant are selected for learning, while the rest are excluded. The simplest filter method for supervised learning involves assessing each single attribute based on its level of correlation with the target; this leads to the selection of the attributes that appear most correlated with the target.
o Wrapper methods: if the purpose of the data mining investigation is classification or regression, and consequently performance is assessed mainly in terms of accuracy, the selection of predictive variables should be based not only on the level of relevance of each single attribute but also on the specific learning algorithm being utilized. Wrapper methods are able to meet this need, since they assess a group of variables using the same classification or regression algorithm used to predict the value of the target variable.
Each time, the algorithm uses a different subset of attributes for learning, identified by a search engine that works on the entire set of possible combinations of variables, and it selects the set of attributes that guarantees the best result in terms of accuracy. Wrapper methods are usually burdensome from a computational standpoint, since the assessment of every possible combination identified by the search engine requires the entire training phase of the learning algorithm to be repeated.
o Embedded methods: for these methods the attribute selection process lies inside the learning algorithm, so that the selection of the optimal set of attributes is made directly during the phase of model generation.
Filter methods are the best choice when dealing with very large datasets, whose observations are described by a large number of attributes. In these cases the application of wrapper methods is inappropriate due to very long computation times. Moreover, filter methods are flexible and in principle can be associated with any learning algorithm. However, when the size of the problem at hand is moderate, it is preferable to turn to wrapper or embedded methods, which in most cases afford higher accuracy levels than filter methods.
As described above, wrapper methods select the attributes according to a search scheme that inspects in sequence several subsets of attributes and applies the learning algorithm to each subset in order to assess the resulting accuracy of the corresponding model. The procedure for selecting the attributes in wrapper methods is usually of a heuristic nature, based in most cases on a greedy logic which evaluates for each attribute an adequately defined relevance indicator and then selects the attributes based on their level of relevance. In particular, three distinct myopic search schemes can be followed: forward, backward and forward-backward search (a sketch of the forward scheme is given below):
o Forward: the exploration starts with an empty set of attributes and subsequently introduces the attributes one at a time, based on the ranking induced by the relevance indicator. The algorithm stops when the relevance index of all the attributes still excluded is lower than a prefixed threshold.
o Backward: the exploration starts by selecting all the attributes and then eliminates them one at a time, based on the preferred relevance indicator. The algorithm stops when the relevance index of all the attributes still included in the model is higher than a prefixed threshold.
o Forward-backward: this method represents a trade-off between the previous schemes, in the sense that at each step the best attribute among those excluded is introduced and the worst attribute among those included is eliminated. Also in this case, threshold values for the included and excluded attributes determine the stopping criterion.
The various wrapper methods differ in the choice of the relevance measure as well as in the threshold values preset for the stopping rule of the algorithm.
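A minimal sketch of the forward search scheme for wrapper selection described above, assuming scikit-learn is available; the classifier, the cross-validated accuracy measure and the stopping threshold are illustrative choices, not prescribed by the notes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X: np.ndarray, y: np.ndarray, min_gain: float = 0.01) -> list:
    """Greedy forward wrapper: repeatedly add the attribute whose inclusion gives the
    largest gain in cross-validated accuracy; stop when the gain drops below min_gain."""
    selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
    while remaining:
        scores = {
            j: cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [j]], y, cv=5).mean()
            for j in remaining
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_score < min_gain:
            break                          # no excluded attribute is relevant enough
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = scores[j_best]
    return selected
```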
• Principal component analysis (PCA): PCA is the most widely known technique of attribute reduction by means of projection. The purpose of this method is to obtain a projective transformation that replaces a subset of the original numerical attributes with a lower number of new attributes obtained as their linear combination, without this change causing a loss of information. Experience shows that a transformation of the attributes may lead in many instances to better accuracy in the learning models subsequently developed.
Before applying this method, it is expedient to standardize the data, so as to obtain for all the attributes the same range of values, usually represented by the interval [-1, 1]. Moreover, the mean of each attribute $a_j$ is made equal to 0 by applying the transformation
$$\tilde{x}_{ij} = x_{ij} - \frac{1}{m}\sum_{i=1}^{m} x_{ij}.$$
Let $\mathbf{X}$ denote the matrix resulting from applying the transformation above to the original data, and let $\mathbf{V} = \mathbf{X}'\mathbf{X}$ be the covariance matrix of the attributes. If the correlation matrix is used to develop the PCA method instead of the covariance matrix, the transformation above is not required.
Starting from the $n$ attributes in the original dataset, represented by the matrix $\mathbf{X}$, the PCA method derives $n$ orthogonal vectors, namely the principal components, which constitute a new basis of the space $\mathbb{R}^n$. Principal components are better suited than the original attributes to explain fluctuations in the data, in the sense that usually a subset consisting of $q$ principal components, with $q < n$, has an information content that is almost equivalent to that of the original dataset. Therefore, the original data are projected into a lower-dimensional space of dimension $q$ having the same explanatory capabilities.
Principal components are generated in sequence by means of an iterative algorithm. The first component is determined by solving an appropriate optimization problem, in order to explain the highest percentage of variation in the data. At each iteration the next principal component is selected, among those vectors that are orthogonal to all components already determined, as the one which explains the maximum percentage of variance not yet explained by the previously generated components. At the end of the procedure the principal components are ranked in non-increasing order with respect to the amount of variance that they are able to explain.
Let $\mathbf{p}_j$, $j \in \mathcal{N}$, denote the $n$ principal components, each of them obtained as a linear combination $\mathbf{p}_j = \mathbf{X}\mathbf{w}_j$ of the available attributes, where the weights $\mathbf{w}_j$ have to be determined. The projection of a generic example $\mathbf{x}_i$ in the direction of the weights vector $\mathbf{w}_j$ is given by $\mathbf{w}_j'\mathbf{x}_i$. It can easily be seen that its variance is given by
$$E\big[\mathbf{w}_j'\mathbf{x}_i - E[\mathbf{w}_j'\mathbf{x}_i]\big]^2 = E\big[\big(\mathbf{w}_j'(\mathbf{x}_i - E[\mathbf{x}_i])\big)^2\big] = \mathbf{w}_j'\,E\big[(\mathbf{x}_i - E[\mathbf{x}_i])(\mathbf{x}_i - E[\mathbf{x}_i])'\big]\,\mathbf{w}_j = \mathbf{w}_j'\mathbf{V}\mathbf{w}_j.$$
The first principal component $\mathbf{p}_1$ represents a vector in the direction of maximum variance in the space of the original attributes, and therefore its weights may be obtained by solving the quadratic constrained maximization problem
$$\max_{\mathbf{w}_1}\{\mathbf{w}_1'\mathbf{V}\mathbf{w}_1 : \mathbf{w}_1'\mathbf{w}_1 = 1\},$$
where the unit norm constraint for $\mathbf{w}_1$ is introduced in order to derive a well-posed problem. By introducing the Lagrangian function
$$L(\mathbf{w}_1, \lambda_1) = \mathbf{w}_1'\mathbf{V}\mathbf{w}_1 - \lambda_1(\mathbf{w}_1'\mathbf{w}_1 - 1)$$
and applying the Karush-Kuhn-Tucker conditions, the solution of the maximization problem reduces to the solution of the system $\partial L/\partial \mathbf{w}_1 = 2\mathbf{V}\mathbf{w}_1 - 2\lambda_1\mathbf{w}_1 = \mathbf{0}$, which can be rewritten as
$$(\mathbf{V} - \lambda_1\mathbf{I})\mathbf{w}_1 = \mathbf{0},$$
subject to the unit norm condition $\mathbf{w}_1'\mathbf{w}_1 = 1$, where $\mathbf{I}$ is the identity matrix. The solution of the maximization problem is therefore given by $\mathbf{w}_1 = \mathbf{u}_1$, where $\mathbf{u}_1$ is the eigenvector of unit norm associated with the maximum eigenvalue $\lambda_1$ of $\mathbf{V}$, and the first component is obtained through the relation $\mathbf{p}_1 = \mathbf{X}\mathbf{u}_1$. The second principal component may be determined by solving an analogous optimization problem, adding the condition of orthogonality to the previously obtained principal component, expressed by the constraint $\mathbf{w}_2'\mathbf{u}_1 = 0$.
Proceeding in an iterative way, it is possible to derive the $n$ principal components starting from the eigenvectors $\mathbf{u}_j$, $j \in \mathcal{N}$, of $\mathbf{V}$, ordered by non-increasing eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, through the equalities $\mathbf{p}_j = \mathbf{X}\mathbf{u}_j$. The variance of the principal component $\mathbf{p}_j$ is given by $\mathrm{var}(\mathbf{p}_j) = \lambda_j$. The $n$ principal components constitute a new basis of the space $\mathbb{R}^n$, since the vectors are orthogonal to each other; therefore they are also uncorrelated and can be ordered according to a relevance indicator expressed by the corresponding eigenvalue. The index
$$I_q = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_q}{\lambda_1 + \lambda_2 + \cdots + \lambda_n}$$
expresses the percentage of total variance explained by the first $q$ principal components and provides an indication of the amount of information preserved by the first $q$ components. In order to determine the number of principal components to be used, it is possible to proceed until the level of overall importance $I_q$ of the considered components exceeds a threshold $I_{\min}$ deemed reasonable in relation to the properties of the dataset. The number of principal components is therefore determined as the smallest value $q$ such that $I_q > I_{\min}$ (for example, 85%). Advantages: it is a very accurate method. Disadvantages: the new attributes are hard to explain and interpret.
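A minimal sketch of the PCA procedure derived above (centering, covariance matrix, eigendecomposition, selection of the smallest q whose explained-variance index exceeds a threshold), using NumPy; the 85% threshold is the example value mentioned in the notes:

```python
import numpy as np

def pca(X: np.ndarray, i_min: float = 0.85):
    """Project the data onto the first q principal components, with q chosen from I_q."""
    X_tilde = X - X.mean(axis=0)                 # center each attribute (zero mean)
    # The notes define V = X'X; rescaling by 1/(m - 1) does not change the principal directions.
    V = X_tilde.T @ X_tilde / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(V)         # eigh: V is symmetric
    order = np.argsort(eigvals)[::-1]            # sort by non-increasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    I_q = np.cumsum(eigvals) / eigvals.sum()     # explained-variance index I_q
    q = int(np.searchsorted(I_q, i_min) + 1)     # smallest q with I_q >= i_min
    P = X_tilde @ eigvecs[:, :q]                 # principal components p_j = X u_j
    return P, eigvals[:q], I_q[q - 1]
```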
• Data discretization: the general purpose of data reduction methods is to obtain a decrease in the number of distinct values assumed by one or more attributes. Data discretization is the primary reduction method. On the one hand, it reduces continuous attributes to categorical attributes characterized by a limited number of distinct values. On the other hand, its aim is to significantly reduce the number of distinct values assumed by categorical attributes. The models generated on the reduced dataset are likely to be more intuitive and less arbitrary. Among the most popular discretization techniques are subjective subdivision, subdivision into classes and hierarchical discretization (a sketch of subdivision into classes is given after this list):
o Subjective subdivision: this is the most popular and intuitive method. Classes are defined based on the experience and judgment of experts in the application domain.
o Subdivision into classes: subdivision into categorical classes may be achieved in an automated way. The automated procedure consists of ordering the values of the attribute $a_j$ in a non-decreasing way and grouping them into a predetermined number $K$ of contiguous classes; the classes can be formed either with an equal number of observations each (which produces a flat histogram of class frequencies) or with equal width.
o Hierarchical discretization: it is based on hierarchical relationships between concepts and may be applied to categorical attributes, as with the hierarchical relationship between provinces and regions. In general, given a hierarchical relationship of the one-to-many kind, it is possible to replace each value of an attribute with the corresponding value found at a higher level in the hierarchy of concepts.
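A minimal sketch of subdivision into classes, assuming a pandas Series of numerical values; equal-width classes use pd.cut, while equal-size (equal-frequency) classes use pd.qcut:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ages = pd.Series(rng.integers(18, 90, size=500), name="age")

# Subdivision into K = 5 classes of equal width: each interval spans the same range of values.
equal_width = pd.cut(ages, bins=5)

# Subdivision into K = 5 classes of equal size: each class holds roughly the same number of
# observations, so the histogram of class frequencies is approximately flat.
equal_size = pd.qcut(ages, q=5)

print(equal_width.value_counts().sort_index())
print(equal_size.value_counts().sort_index())
```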
Lecture 2: Exploratory data analysis

The primary purpose of exploratory data analysis is to highlight the relevant features of each attribute contained in a dataset, using graphical methods and calculating summary statistics, and to identify the intensity of the underlying relationships among the attributes. Exploratory data analysis includes three main phases:
• Univariate analysis, in which the properties of each single attribute of a dataset are investigated.
• Bivariate analysis, in which pairs of attributes are considered, to measure the intensity of the relationship existing between them.
• Multivariate analysis, in which the relationships holding within a subset of attributes are investigated.

Univariate analysis
Univariate analysis is used to study the behavior of each attribute, considered as an entity independent of the other variables of the dataset. It is of interest to assess the tendency of the values of a given attribute to arrange themselves around a specific central value (location), to measure the propensity of the variable to assume a more or less wide range of values (dispersion), and to extract information on the underlying probability distribution. On the one hand, some learning models make specific statistical hypotheses regarding the distribution of the variables being examined, and it is therefore necessary to verify the validity of such assumptions before proceeding with the subsequent investigation. On the other hand, univariate analysis intuitively suggests conclusions concerning the information content that each attribute may provide. Moreover, univariate analysis plays a key role in pointing out anomalies and non-standard values, that is, in identifying the outliers.
Suppose that a given dataset $\mathcal{D}$ contains $m$ observations, and denote by $a_j$ the generic attribute being analyzed. Since for the purpose of univariate analysis a single attribute at a time is considered, we will denote $a_j$ simply by $a$ (the index $j$ has been suppressed). The vector of the $m$ observations will therefore be denoted by $a = (x_1, x_2, \ldots, x_m)$. Graphical representations are often a starting point for exploratory data analysis, in order to gain insights into the properties of a single attribute. For the purpose of a graphical representation, it is necessary to make a distinction between categorical and numerical attributes.

Graphical analysis of categorical attributes
A categorical attribute may be graphically analyzed by resorting to various representations of the empirical distribution of the observations, that is, of the relative frequencies with which the different values occur. Denote by $V = \{v_1, v_2, \ldots, v_H\}$ the set of $H$ distinct values taken by the categorical attribute $a$, and let $\mathcal{H} = \{1, 2, \ldots, H\}$. The most natural representation for the graphical analysis of a categorical attribute is a vertical bar chart, which indicates along the vertical axis (ordinate) the empirical frequencies, that is, the number of observations of the dataset corresponding to each of the values assumed by the attribute $a$, indicated along the horizontal axis. Sometimes a horizontal bar chart is preferable to a vertical bar chart; here the positions of the vertical and horizontal axes are swapped. It is also possible to calculate the relative empirical frequency, or empirical density,
$$f_h = \frac{t_h}{m}, \qquad h \in \mathcal{H},$$
with which each value $v_h$ is assumed by the attribute $a$, where $t_h$ is the number of observations taking the value $v_h$. For a sample of adequate size, by virtue of the central limit theorem, the relative empirical frequency represents a good approximation of the probability density of the attribute $a$. More precisely, it can be claimed that for a sample of sufficiently large size the approximation
$$f_h \approx p_h = \Pr\{x = v_h\}, \qquad h \in \mathcal{H},$$
holds, where $p_h$ represents the probability that the attribute $a$ will assume the value $v_h$ at a new observation $x$. The empirical density function may be represented by a chart similar to that used for the frequencies, the only difference being that the heights of the bars are divided by the number $m$ of observations in the sample.
As a consequence, it is necessary to introduce a change of scale along the vertical axis. Sometimes it is preferable to use a pie chart to represent the empirical density function. In this case the amplitude of each sector of the circle is equal to the density associated with the corresponding value assumed by the categorical attribute. However, notice that the representation by pie charts is not so effective when one wishes to catch the differences between relative frequencies.

Measures of heterogeneity for categorical attributes
For categorical attributes, the measures of central tendency, dispersion and relative location defined for numerical attributes cannot be used. For a categorical attribute $a$ it is preferable to define measures that express the regularity of the arrangement of the data $\{x_1, x_2, \ldots, x_m\}$ within the set of $H$ distinct values taken by the attribute. We now describe two measures of heterogeneity for categorical attributes: the Gini index and the entropy index.
• Gini index: it is defined as
$$G = 1 - \sum_{h=1}^{H} f_h^2.$$
Its value is equal to 0 in the case of lowest heterogeneity, that is, when one class occurs with relative frequency 1 and all the other classes never occur. In contrast, when all classes have the same relative empirical frequency and the highest heterogeneity is recorded, the Gini index achieves its maximum value $(H-1)/H$.
• Entropy index: it is defined as
$$E = -\sum_{h=1}^{H} f_h \log_2 f_h.$$
Its value is equal to 0 in the case of lowest heterogeneity, that is, when one class occurs with relative frequency 1 and all the other classes never occur. In contrast, when all classes have the same relative empirical frequency and the highest heterogeneity is recorded, the entropy index achieves its maximum value $\log_2 H$.

Graphical analysis of numerical attributes
For discrete numerical attributes assuming a finite and limited number of values, it is possible to resort to a bar chart representation, just as in the case of categorical attributes. In the presence of continuous attributes, or of discrete attributes that might assume infinitely many distinct values, this type of representation cannot be used, as it would require an infinite number of vertical bars. We must therefore subdivide the horizontal axis, corresponding to the values assumed by the attribute, into a finite and moderate number of intervals, usually of equal width, which in practice are considered as distinct classes. In essence, this is a discretization procedure which inevitably introduces a degree of approximation, since all the observations that fall within the same interval are considered equivalent and indistinguishable among themselves. As a consequence, it is appropriate to carry out a partition into $R$ intervals which should be as narrow as possible, while at the same time keeping low the number of distinct classes so generated. Once the partition has been performed, the number of observations $t_r$, $r = 1, 2, \ldots, R$, falling into each interval is counted, and a chart consisting of contiguous rectangles is generated. The following procedure describes the generation of the intervals for the calculation of the empirical density:
• Histogram for the empirical density:
o The number $R$ of classes loosely depends on the number $m$ of observations in the sample and on the uniformity of the data. Usually the goal is to obtain between 5 and 20 classes, making sure that the frequency in each class is higher than 5.
o The total range and the width $l_r$ of each class are then defined.
Usually the total range, given by the difference between the highest and the lowest values of the attribute, is divided by the number of classes, so as to obtain intervals of equal width.
o The boundaries of each class are properly assigned so as to keep the classes disjoint, making sure that no value falls simultaneously into contiguous classes.
o Finally, the number of observations in each interval is counted and the corresponding rectangle is assigned a height equal to the empirical density, defined as
$$p_r = \frac{t_r}{m\,l_r}.$$
Observe that the total area of the rectangles, included under the empirical density curve, is equal to 1.
The main difference between a vertical bar chart and a histogram lies in the order of the classes along the horizontal axis: it is exactly determined by the sequence of adjacent intervals in the case of histograms, while it is arbitrary in bar charts. The properties already highlighted with regard to the relative frequencies of categorical attributes also apply to the empirical densities calculated for numerical attributes. These properties provide two interpretations for the areas of the rectangles of the histogram: on the one hand, they express for each interval the percentage of observations of the dataset falling inside the interval itself; on the other hand, by virtue of the central limit theorem, the area of a rectangle approximates the probability that a new observation extracted from the population will fall within the associated interval. (Figure: an example of an empirical density histogram.)

Measures of central tendency for numerical attributes
We now describe the main location statistics, also called measures of central tendency (a code sketch follows this list):
• Mean: the best-known measure of location used to describe a numerical attribute is certainly the sample arithmetic mean, defined by the expression
$$\bar{\mu} = \frac{x_1 + x_2 + \cdots + x_m}{m} = \frac{1}{m}\sum_{i=1}^{m} x_i.$$
Since all the observations of the attribute are used to calculate the sample mean, its value is strongly affected by the extreme values in the sample. As a consequence, the sample mean reflects the presence of outliers and is therefore not very robust. It may happen that the mean differs significantly from every observation existing in the sample, making it appear a somewhat abstract concept. If the size of the dataset is sufficiently large, the sample arithmetic mean $\bar{\mu}$ approximates the theoretical mean $\mu$ of the attribute $a$ for the entire population.
• Median: the median of $m$ observations can be defined as the central value, assuming that the observations have been ordered in a non-decreasing way. The median is affected by the number of elements in the series but not by the extreme values, and consequently it is more robust than the sample mean. Therefore, the median is suitable for asymmetric distributions and for distributions with open extreme classes. However, if the data are not concentrated in the central part of the distribution, the median value loses any statistical significance.
• Mode: it is defined as the value that corresponds to the peak of the empirical density curve for the attribute $a$. If the empirical density curve has been calculated by a partition into intervals, as shown for the graphical methods, each value of the interval that corresponds to the maximum empirical frequency can be taken as the mode.
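A minimal sketch of the three location statistics above, assuming a one-dimensional NumPy array; for the mode, the histogram-based definition from the notes is used (the centre of the most frequent class):

```python
import numpy as np

def location_statistics(x: np.ndarray, bins: int = 10) -> dict:
    """Sample mean, median and histogram-based mode of a numerical attribute."""
    counts, edges = np.histogram(x, bins=bins)     # partition into intervals
    peak = np.argmax(counts)                       # class with the maximum empirical frequency
    mode = (edges[peak] + edges[peak + 1]) / 2     # centre of the most frequent class
    return {"mean": x.mean(), "median": np.median(x), "mode": mode}

rng = np.random.default_rng(2)
x = rng.normal(loc=50.0, scale=5.0, size=1_000)
print(location_statistics(x))
```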
Measures of dispersion for numerical attributes
The location measures give an indication of the central part of the observed values of a numerical attribute. However, it is also necessary to define other indicators that describe the dispersion of the data, representing the level of variability expressed by the observations with respect to the central values.
• Range: the simplest measure of dispersion is the range, which is defined as the difference between the maximum and the minimum of the observations, $x_{\max} - x_{\min}$.

Measures of relative location for numerical attributes
Measures of relative location for a numerical attribute are used to examine the localization of a value with respect to the other values in the sample.
• Quantiles: suppose we arrange the $m$ values $\{x_1, x_2, \ldots, x_m\}$ of an attribute $a$ in non-decreasing order. Given any value $p$, with $0 \le p \le 1$, the $p$-order quantile is the value $q_p$ such that $pm$ observations fall to the left of $q_p$ and the remaining $(1-p)m$ fall to its right. Sometimes $p$-quantiles are called $100p$-th percentiles. It should be clear that the 0.5-order quantile coincides with the median. Other noteworthy quantiles include the 0.25- and 0.75-order quantiles, respectively called the lower and upper quartiles and denoted by $q_L$ and $q_U$. The two quartiles and the median divide the observations into four portions of equal size, except for the small rounding required when $m$ is not divisible by 4. Therefore, in a density histogram the quartiles and the median split the axis of the observations into four parts, such that the area under the density curve for each portion is equal to 0.25. Sometimes it is useful to refer to the interquartile range, which is defined as the difference between the upper and the lower quartiles:
$$D_q = q_U - q_L = q_{0.75} - q_{0.25}.$$

Measures of central tendency based on quantiles
In order to reduce the dependence of the mean on the extreme values of the sample, one may resort to measures of central tendency based on quantiles, which in general are more robust.
• Mid-mean: the mid-mean is obtained by calculating the mean of the values of the attribute $a$ falling between the lower and the upper quartiles.
• Trimmed mean: the trimmed mean is a generalization of the mid-mean, since only the values falling between the quantiles of order $p$ and $(1-p)$ are used to calculate the mean. Usually $p = 0.05$ is the preferred choice, in order to exclude from the calculation the lowest 5% and the highest 5% of the values of $a$.
• Winsorized mean: the winsorized mean modifies the values falling in the tails according to an intuitive rule: the values lower than the $p$-order quantile are increased to $q_p$, while the values higher than the $(1-p)$-order quantile are decreased to $q_{1-p}$.

Identification of outliers for numerical attributes
A way to identify outliers is based on the use of box plots, sometimes called box-and-whisker plots, in which the median and the lower and upper quartiles are represented on the axis where the observations are placed. An observation is identified as an outlier if it falls outside four threshold values, called edges, defined as:
external lower edge $= q_L - 3 D_q$,
internal lower edge $= q_L - 1.5 D_q$,
internal upper edge $= q_U + 1.5 D_q$,
external upper edge $= q_U + 3 D_q$.
Using a box plot to identify outliers has the advantage of being basically independent of the extreme values of the observations, since both the median and the quartiles enjoy such independence. The box lies between the lower and the upper quartiles. The horizontal lines departing from the box end in thick marks, called whiskers, which correspond respectively to the minimum and the maximum values of the attribute falling inside the internal edges. In general, the values lying outside the external edges are considered outliers.
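A minimal sketch of the edge-based box-plot rule above, assuming a NumPy array; treating values between the internal and external edges as merely suspicious is a common convention, while values beyond the external edges are the outliers defined in the notes:

```python
import numpy as np

def boxplot_outliers(x: np.ndarray):
    """Classify observations with respect to the internal and external edges."""
    q_l, q_u = np.quantile(x, [0.25, 0.75])   # lower and upper quartiles
    d_q = q_u - q_l                           # interquartile range D_q
    beyond_internal = (x < q_l - 1.5 * d_q) | (x > q_u + 1.5 * d_q)
    beyond_external = (x < q_l - 3.0 * d_q) | (x > q_u + 3.0 * d_q)
    return beyond_internal, beyond_external   # boolean masks

x = np.array([4.1, 4.5, 5.0, 5.2, 5.4, 5.9, 6.1, 6.3, 30.0])
suspicious, outliers = boxplot_outliers(x)
print(x[suspicious], x[outliers])             # 30.0 lies outside both pairs of edges
```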
Analysis of empirical density
As shown at the beginning of the section, the relative empirical frequency histogram is a valuable tool for the graphical analysis of both categorical and numerical attributes. It is therefore convenient to make use of summary indicators in order to investigate the properties of the empirical density curve.

Asymmetry of the density curve
A density curve is called symmetric if the mean coincides with the median. From a graphical standpoint, in a symmetric density the two halves of the curve on either side of the mean, and therefore of the median, are mirror images. A density curve is said to be asymmetric when the mean and the median do not coincide. When the mean is greater than the median, we say that the curve is skewed to the right, whereas when the median is greater than the mean, the curve is said to be skewed to the left. The symmetry or asymmetry of an empirical density is apparent from a graphical analysis of its histogram. (Figure: three examples of empirical density curves, namely a curve skewed to the left (a), a symmetric curve (b) and a curve skewed to the right (c); the vertical lines in the charts show the quartiles (solid lines) and the mean (dashed line).) Box plots reflect the type and level of asymmetry of an empirical density in a particularly intuitive way: if the density is symmetric (b), the median is equally distant from the lower and upper quartiles; when it is skewed to the left (a), the median is closer to the upper quartile, while when it is skewed to the right (c) the median is closer to the lower quartile. In addition to graphical analysis, we may also make use of an index of asymmetry, the sample skewness, based on the third sample moment. The third sample moment is given by
$$\bar{\mu}_3 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \bar{\mu})^3.$$
The sample skewness is therefore defined as
$$I_{\mathrm{skew}} = \frac{\bar{\mu}_3}{\bar{\sigma}^3}.$$
If the density curve is symmetric, then $I_{\mathrm{skew}} = 0$; if $I_{\mathrm{skew}} > 0$ the density is skewed to the right; finally, if $I_{\mathrm{skew}} < 0$ the density is skewed to the left.

Kurtosis of the density curve
A further significant problem regarding the density histogram is the identification of the type of theoretical probability distribution, usually unknown beforehand, from which the observations are drawn. The problem of estimating an unknown distribution based on sampled data is rather complex in its general form. However, in the case of the normal distribution, a continuous distribution which occurs again and again in applications, easy graphical and summary criteria can be used to assess the level of approximation for a given empirical density. The first criterion is graphical and is based on a visual comparison between the empirical density histogram and a normal curve having mean $\bar{\mu}$ and standard deviation $\bar{\sigma}$ coinciding with those of the given density. (Figure: a histogram with a good approximation to the normal density.) The index of kurtosis expresses in a concise way the level of approximation of an empirical density to the normal curve, and it uses the fourth sample moment
$$\bar{\mu}_4 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \bar{\mu})^4.$$
The kurtosis is therefore defined as
$$I_{\mathrm{kur}} = \frac{\bar{\mu}_4}{\bar{\sigma}^4}.$$

Bivariate analysis

Measures of correlation for numerical attributes
As in the case of univariate analysis, besides graphical methods it is useful to introduce summary indicators that express the nature and intensity of the relationship between numerical attributes.
• Covariance: given the pair of attributes $a_j$ and $a_k$, let $\bar{\mu}_j$ and $\bar{\mu}_k$ be the corresponding sample means. The sample covariance is defined as
$$v_{jk} = \mathrm{cov}(a_j, a_k) = \frac{1}{m-2}\sum_{i=1}^{m}(x_{ij} - \bar{\mu}_j)(x_{ik} - \bar{\mu}_k).$$
The sample covariance can be easily interpreted. Indeed, if values of the attribute $a_j$ greater than the mean $\bar{\mu}_j$ are associated with values of the attribute $a_k$ also greater than the mean $\bar{\mu}_k$, the two factors of the product in the summation above agree in sign and therefore provide a positive contribution to the sum. By the same token, if values of the attribute $a_j$ lower than the mean $\bar{\mu}_j$ are associated with values of the attribute $a_k$ lower than the mean $\bar{\mu}_k$, the two factors still agree in sign and again provide a positive contribution to the sum. In these cases the attributes $a_j$ and $a_k$ are said to be concordant. Conversely, if values of the attribute $a_j$ greater than the mean $\bar{\mu}_j$ are associated with values of the attribute $a_k$ lower than the mean $\bar{\mu}_k$, the two factors do not agree in sign and therefore provide a negative contribution to the sum; the same occurs if values of the attribute $a_j$ lower than the mean $\bar{\mu}_j$ are associated with values of the attribute $a_k$ greater than the mean $\bar{\mu}_k$. In these cases the attributes $a_j$ and $a_k$ are said to be discordant. We can therefore conclude that positive covariance values indicate that the attributes $a_j$ and $a_k$ are concordant, while negative covariance values indicate that the attributes are discordant.
• Correlation: the covariance is usually a number ranging on a variable scale and is therefore inadequate to assess the intensity of the relationship between two attributes. For this reason the linear correlation coefficient between two attributes, also termed the Pearson coefficient, is more useful. It is defined as
$$r_{jk} = \mathrm{corr}(a_j, a_k) = \frac{v_{jk}}{\bar{\sigma}_j\,\bar{\sigma}_k},$$
where $\bar{\sigma}_j$ and $\bar{\sigma}_k$ are the sample standard deviations of $a_j$ and $a_k$, respectively. It can be proven that the maximum value achievable by $\mathrm{cov}(a_j, a_k)$ is equal to the product $\bar{\sigma}_j\bar{\sigma}_k$ of the sample standard deviations, while the minimum value is equal to $-\bar{\sigma}_j\bar{\sigma}_k$. As a result, the linear correlation coefficient $r_{jk}$ always lies in the interval $[-1, 1]$ and represents a relative index expressing the intensity of a possible linear relationship between the attributes $a_j$ and $a_k$. The main properties of the linear correlation coefficient $r_{jk}$ can be summarized as follows:
o If $r_{jk} > 0$ the attributes are concordant. If the pairs of observations are represented on a scatter plot, they will show a trend consisting of a straight line with a positive slope. The approximation to the line increases as $r_{jk}$ gets closer to 1; if $r_{jk} = 1$ the points lie exactly on a straight line.
o If $r_{jk} < 0$ the attributes are discordant. In this case the pairs of observations represented on a scatter plot tend to lie on a line with a negative slope. The approximation to the line increases as $r_{jk}$ gets closer to $-1$; if $r_{jk} = -1$ the points lie exactly on a straight line.
o If $r_{jk} = 0$, or at least $r_{jk} \approx 0$, no linear relationship exists between the two attributes.
In this case the pairs of values on a scatter plot are either placed in a random way or tend to lie on a nonlinear curve. (Figure: linear correlation coefficients for different sets of data.) In particular, notice that the linear correlation coefficient can be close to zero even when a strong nonlinear relationship exists between the two attributes.

Contingency tables for categorical attributes
When dealing with a pair of categorical attributes $a_j$ and $a_k$, let $V = \{v_1, v_2, \ldots, v_J\}$ and $U = \{u_1, u_2, \ldots, u_K\}$ denote the sets of distinct values respectively assumed by each of them. A contingency table is defined as a matrix $\mathbf{T}$ whose generic element $t_{rs}$ indicates the frequency with which the pair of values $\{x_{ij} = v_r\}$ and $\{x_{ik} = u_s\}$ appears in the records of the dataset $\mathcal{D}$. Two attributes $a_j$ and $a_k$ are said to be statistically independent if the following conditions hold:
$$\frac{t_{r1}}{t_{\cdot 1}} = \frac{t_{r2}}{t_{\cdot 2}} = \cdots = \frac{t_{rK}}{t_{\cdot K}}, \qquad r = 1, 2, \ldots, J,$$
where $t_{\cdot s}$ denotes the total of column $s$; these conditions can in turn be used to define the concept of independence between categorical attributes. Intuitively, the two attributes are independent if the analysis of $a_j$ in relation to the second attribute $a_k$ is equivalent to the univariate analysis of $a_j$.

Multivariate analysis
The purpose of multivariate analysis is to extend the concepts introduced for the bivariate case in order to assess the relationships existing among multiple attributes in a dataset.

Graphical analysis
Notice that all the methods for graphical analysis described in this section apply exclusively to numerical attributes.
• Scatter plot matrix: since scatter plots show in an intuitive way the relationships between pairs of numerical attributes, in the case of multivariate analysis it is natural to consider matrices of plots evaluated for every pair of numerical variables. In this way it is possible to visualize the nature and intensity of the pairwise relationships in a single chart.
• Star plots: they belong to the broader class of icon-based charts. They show in an intuitive way the differences among the values of the attributes for the records of a dataset. To be effective, they should be applied to a limited number of observations, say no more than a few dozen, and the comparison should be based on a small number of attributes. The basic concept involves matching each record with a star-shaped icon, from the center of which depart as many rays as the number of attributes. The length of each ray is equal to the value of the corresponding attribute, normalized to fall within the interval [0, 1] so as to give a consistent representation of the various attributes.
• Spider web charts: they are grids whose main rays correspond to the attributes analyzed. For every record, a position is calculated on each ray, based on the value of the corresponding attribute. Finally, the points so obtained on the rays for each record are sequentially connected to each other, thus creating a circuit for every record in the dataset.

Measures of correlation for numerical attributes
For the multivariate analysis of numerical attributes, covariance and correlation matrices are calculated among all pairs of attributes. For notational convenience, we will suppose that all the $n$ attributes of the dataset $\mathcal{D}$ are numerical, having removed for the time being any categorical attribute. Let $\mathbf{V}$ and $\mathbf{R}$ be the two $n \times n$ square matrices whose elements are represented by the covariance values and by the correlations, respectively. Both matrices $\mathbf{V}$ and $\mathbf{R}$ are symmetric and positive definite. Notice that the covariance matrix $\mathbf{V}$ contains on its main diagonal the sample variance of each single attribute, and for this reason it is also called the variance-covariance matrix. In order to devise a summary indicator that expresses the total variability of the dataset, and to compare different datasets, the trace of the matrix $\mathbf{V}$ can be used, defined as the sum of the elements along its main diagonal.
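A minimal sketch of the multivariate summary indicators above, assuming a NumPy data matrix X with m rows (observations) and n columns (numerical attributes):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))        # m = 200 observations, n = 4 attributes

V = np.cov(X, rowvar=False)          # n x n covariance (variance-covariance) matrix
R = np.corrcoef(X, rowvar=False)     # n x n correlation matrix of Pearson coefficients

total_variability = np.trace(V)      # sum of the sample variances on the main diagonal
print(V.shape, R.shape, total_variability)
```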
Lecture 3: Clustering

By defining appropriate metrics, and the induced notions of distance and similarity between pairs of observations, the purpose of clustering methods is the identification of homogeneous groups of records called clusters. With respect to the specific distance selected, the observations belonging to each cluster must be close to one another and far from those included in other clusters.

Clustering methods
The aim of clustering models is to subdivide the records of a dataset into homogeneous groups of observations, called clusters, so that observations belonging to one group are similar to one another and dissimilar from observations included in other groups. Grouping objects by affinity is a typical reasoning pattern applied by the human brain; also for this reason, clustering models have long been used in various disciplines, such as social sciences, biology, astronomy, statistics, image recognition, processing of digital information, marketing and data mining.
Clustering models are useful for various purposes. In some applications the clusters generated may provide a meaningful interpretation of the phenomenon of interest. Furthermore, subdivision into clusters may be the preliminary phase of a data mining project that will be followed by the application of other methodologies within each cluster. Finally, grouping into clusters may prove useful during exploratory data analysis to highlight outliers and to identify an observation that might represent on its own an entire cluster, in order to reduce the size of the dataset.
Clustering methods must fulfil a few general requirements, as indicated below.
• Flexibility: some clustering methods can be applied to numerical attributes only, for which it is possible to use the Euclidean metric to calculate the distances between observations. However, a flexible clustering algorithm should also be able to analyze datasets containing categorical attributes. Algorithms based on the Euclidean metric tend to generate spherical clusters and have difficulty in identifying more complex geometrical forms.
• Robustness: the robustness of an algorithm manifests itself through the stability of the clusters generated with respect to small changes in the values of the attributes of each observation. This property ensures that the given clustering method is basically unaffected by the noise possibly existing in the data. Moreover, the clusters generated must be stable with respect to the order of appearance of the observations in the dataset.
• Efficiency: in some applications the number of observations is quite large, and therefore clustering algorithms must generate clusters efficiently in order to guarantee reasonable computing times for large problems. In the case of massive datasets, one may also resort to the extraction of samples of reduced size in order to generate clusters more efficiently; however, this approach inevitably implies a lower robustness for the clusters so generated. Clustering algorithms must also prove efficient with respect to the number of attributes existing in the dataset.
Taxonomy of clustering methods
Clustering methods can be classified into a few main types based on the logic used for deriving the clusters: partition methods, hierarchical methods, density-based methods and grid methods.
• Partition methods: they develop a subdivision of the given dataset into a predetermined number $K$ of non-empty subsets. They are suited to obtaining groupings of a spherical or at most convex shape, and can be applied to datasets of small or medium size.
• Hierarchical methods: they carry out multiple subdivisions into subsets, based on a tree structure and characterized by different homogeneity thresholds within each cluster and inhomogeneity thresholds between distinct clusters. Unlike partition methods, hierarchical algorithms do not require the number of clusters to be predetermined.
• Density-based methods: whereas the two previous classes of algorithms are founded on the notion of distance between observations and between clusters, density-based methods derive clusters from the number of observations locally falling in a neighborhood of each observation. More precisely, for each record belonging to a specific cluster, a neighborhood with a specified diameter must contain a number of observations not lower than a minimum threshold value. Density-based methods can identify clusters of non-convex shape and effectively isolate any possible outliers.
• Grid methods: they first derive a discretization of the space of the observations, obtaining a grid structure consisting of cells. Subsequent clustering operations are developed with respect to the grid structure and generally achieve reduced computing times, despite a lower accuracy in the clusters generated.
A second distinction concerns the methods for assigning the observations to the clusters. It is possible to include each observation exclusively in a single cluster or to place it by superposition into multiple clusters. Furthermore, fuzzy methods have been developed which assign the observations to the clusters with a weight between 0 (the observation is totally extraneous to the cluster) and 1 (the observation exclusively belongs to the cluster), with the additional condition that the sum of the weights over all clusters be equal to 1. Finally, a distinction should be made between complete clustering methods, which assign each observation to at least one cluster, and partial methods, which may leave some observations outside the clusters; the latter methods are useful for identifying outliers.
Most clustering methods are heuristic in nature, in the sense that they generate a subdivision into clusters of good quality although not necessarily optimal. Actually, an exhaustive method based on a complete enumeration of the possible subdivisions of $m$ observations into $K$ clusters would require
$$\frac{1}{K!}\sum_{h=1}^{K}(-1)^{K-h}\binom{K}{h}h^m$$
possible combinations to be examined, and it would therefore be inapplicable even to small datasets, due to the exponential growth in computing time. In terms of computational complexity, clustering problems actually belong to the class of difficult (NP-hard) problems whenever $K \ge 3$.

Affinity measures
Clustering models are usually based on a measure of similarity between observations. In many instances this can be obtained by defining an appropriate notion of distance between each pair of observations. Given a dataset $\mathcal{D}$, we can represent its $m$ observations by means of $n$-dimensional vectors of attributes, so as to formally represent the data using a matrix $\mathbf{X}$ with $m$ rows and $n$ columns.
Let $\mathbf{D}$ be the symmetric $m \times m$ matrix of distances between pairs of observations, obtained by setting
$$d_{ik} = \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_k) = \mathrm{dist}(\mathbf{x}_k, \mathbf{x}_i), \qquad i, k \in \mathcal{M},$$
where $\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_k)$ denotes the distance between the observations $\mathbf{x}_i$ and $\mathbf{x}_k$. Notice that it is possible to transform the distance $d_{ik}$ between two observations into a similarity measure $s_{ik}$ by alternatively using
$$s_{ik} = \frac{1}{1 + d_{ik}} \qquad \text{or} \qquad s_{ik} = \frac{d_{\max} - d_{ik}}{d_{\max}},$$
where $d_{\max} = \max_{i,k} d_{ik}$ denotes the maximum distance between the observations in the dataset $\mathcal{D}$. The definition of an appropriate notion of distance depends on the nature of the attributes that make up the dataset $\mathcal{D}$, which may be numerical, binary, nominal categorical, ordinal categorical or of mixed composition.

Numerical attributes
If all the $n$ attributes in a dataset are numerical, we may turn to the Euclidean distance between the vectors associated with the pair of observations $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $\mathbf{x}_k = (x_{k1}, x_{k2}, \ldots, x_{kn})$ in $n$-dimensional space, which is defined as
$$\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_k) = \sqrt{\sum_{j=1}^{n}(x_{ij} - x_{kj})^2} = \sqrt{(x_{i1} - x_{k1})^2 + (x_{i2} - x_{k2})^2 + \cdots + (x_{in} - x_{kn})^2}.$$
As an alternative, we may consider the Manhattan distance
$$\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_k) = \sum_{j=1}^{n}|x_{ij} - x_{kj}| = |x_{i1} - x_{k1}| + |x_{i2} - x_{k2}| + \cdots + |x_{in} - x_{kn}|,$$
which is so called because, in order to reach one point from another, we have to travel along two sides of a rectangle having the two points as its opposite vertices. (Figure: the difference between the Euclidean and Manhattan metrics in a two-dimensional example.)
A third option, which generalizes both the Euclidean and Manhattan metrics, is the Minkowski distance, defined as
$$\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_k) = \sqrt[q]{\sum_{j=1}^{n}|x_{ij} - x_{kj}|^q}$$
for some positive integer $q$. The Minkowski distance reduces to the Manhattan distance when $q = 1$ and to the Euclidean distance when $q = 2$.
A further generalization of the Euclidean distance can be obtained through the Mahalanobis distance, defined as
$$\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_k) = \sqrt{(\mathbf{x}_i - \mathbf{x}_k)\,\mathbf{V}_{ik}^{-1}\,(\mathbf{x}_i - \mathbf{x}_k)'},$$
where $\mathbf{V}_{ik}^{-1}$ is the inverse of the covariance matrix of the pair of observations $\mathbf{x}_i$ and $\mathbf{x}_k$. If the observations $\mathbf{x}_i$ and $\mathbf{x}_k$ are independent, so that the covariance matrix is diagonal, the Mahalanobis distance reduces to the Euclidean distance computed on the attributes rescaled by their standard deviations.
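A minimal sketch of the Euclidean, Manhattan and Minkowski distances between two observations, assuming NumPy vectors; the single Minkowski function reproduces the other two for q = 2 and q = 1:

```python
import numpy as np

def minkowski(x_i: np.ndarray, x_k: np.ndarray, q: int = 2) -> float:
    """Minkowski distance of order q; q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return float(np.sum(np.abs(x_i - x_k) ** q) ** (1.0 / q))

x_i = np.array([1.0, 2.0, 3.0])
x_k = np.array([4.0, 6.0, 3.0])

print(minkowski(x_i, x_k, q=2))   # Euclidean distance: sqrt(9 + 16 + 0) = 5.0
print(minkowski(x_i, x_k, q=1))   # Manhattan distance: 3 + 4 + 0 = 7.0
print(minkowski(x_i, x_k, q=3))   # Minkowski distance of order 3
```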