Management Engineering - Quality Data Analysis

Completed notes of the course

Random variables

• RANDOM VARIABLE: a variable characterized by a single (different) numerical value associated with each outcome of an experiment (or a measurement)
• ➔ random variables are stochastic variables described by a statistical distribution
  o Random variables can be of two different types:
    ▪ Continuous (e.g. electric power, length, pressure, temperature, weight)
    ▪ Discrete (e.g. number of scratches on a surface, number of nonconforming parts in a sample)
• PROPERTIES: given X as a random variable:
  o R is the domain of X → $P(X \in R) = 1$
  o The probability that X belongs to any subset of R satisfies $0 \le P(X \in E) \le 1$ for each $E \subseteq R$
  o If $E_1, E_2, E_3, \dots, E_k$ are mutually exclusive, then $P(X \in (E_1 \cup E_2 \cup \dots \cup E_k)) = P(X \in E_1) + P(X \in E_2) + \dots + P(X \in E_k)$

Descriptive statistics

Numerical summaries of data
• ➔ given a sample of observations $x_1, x_2, x_3, \dots, x_n$ with X as a random variable:
• SAMPLE MEAN ➔ $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• SAMPLE VARIANCE ➔ $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
• SAMPLE STANDARD DEVIATION ➔ $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
• MEDIAN (only for continuous probability functions) ➔ $P(X \le m) = P(X \ge m) = 1/2$
• QUARTILES: correspond to 3 points ($Q_1$, median, $Q_3$) that divide the dataset into 4 equal groups (each group contains a quarter of the data)

Other summaries of data
• MOVING AVERAGE: another method of batching data, but instead of considering separate batches that do not overlap, I consider the moving average of windows of size b → I have j overlapping windows of size b, and for each of them I compute the moving average $\bar{x}_j = \frac{1}{b}\sum_{i=1}^{b} x_{(j-1)+i}$
  o Example: I have 1000 observations that I want to summarise with windows of 10 data each → with this method I get $n - b + 1 = 991$ overlapping windows of 10 data each
  o In the new dataset I get the values that correspond to the sample means of the overlapping windows
  o ➔ the moving average does not eliminate autocorrelation at all, but it can give information on the dataset

Graphical representation of data
• HISTOGRAM: values of the variable and their frequency
  o On the y axis we have the frequency → how many times the measure corresponds to that value of X
  o ➔ when we look at a histogram we lose information about the time dimension: we only see the values, not the pattern
  o ➔ the right number of bins should be approximately equal to the square root of the number of observations
• TIME SERIES PLOT: moment in which the value has been measured and values
  o ➔ we need both variability in amplitude and time → the first step in the analysis of process data must always be the construction of a time-series plot (Alwan)
• BOXPLOT: median and quartiles of the distribution
  o UPPER WHISKER: extends to the maximum data point within 1.5 box heights from the top of the box
  o LOWER WHISKER: extends to the minimum data point within 1.5 box heights from the bottom of the box
  o OUTLIERS: values are considered outliers when $x_i > Q_3 + 1.5\,(Q_3 - Q_1)$ or $x_i < Q_1 - 1.5\,(Q_3 - Q_1)$ → outliers are “seen” in this way by the software; sometimes they are not actually outliers
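As a quick illustration of the summaries above, here is a minimal Python sketch (assuming numpy is available; the simulated sample and the window size are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=200)    # illustrative sample

n = x.size
xbar = x.mean()                                   # sample mean
s2 = x.var(ddof=1)                                # sample variance (n - 1 denominator)
s = np.sqrt(s2)                                   # sample standard deviation
q1, med, q3 = np.percentile(x, [25, 50, 75])      # quartiles and median
iqr = q3 - q1

# boxplot rule: points beyond Q3 + 1.5*IQR or below Q1 - 1.5*IQR are flagged as outliers
outliers = x[(x > q3 + 1.5 * iqr) | (x < q1 - 1.5 * iqr)]

# moving average with window b -> n - b + 1 overlapping windows
b = 10
moving_avg = np.convolve(x, np.ones(b) / b, mode="valid")

print(f"n={n}  mean={xbar:.3f}  s={s:.3f}  Q1={q1:.3f}  median={med:.3f}  Q3={q3:.3f}")
print(f"outliers flagged: {outliers.size}  moving-average length: {moving_avg.size}")
```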
Probability distributions

• SAMPLE: a sample is a collection of measurements selected from some larger source or population → statistical methods allow us to study a sample and to draw conclusions about its source (the whole population)
• ➔ the sample must be random in order to represent the population well → the population is the process (quality tests are not done on the whole population but just on a sample of it, in order to see whether the population is good or not, but the sampled items have to be enough to represent the whole)
• PROBABILITY DISTRIBUTION: a probability distribution is a mathematical model that relates the value of the variable to the probability of occurrence of that value in the population
• ➔ we have two types of distributions

Continuous distributions
• CONTINUOUS DISTRIBUTION: the variable is expressed on a continuous scale
• ➔ it is characterised by two probability functions: the probability density function $f(x)$ and the cumulative distribution function $F(x)$
• PARAMETERS ➔ if X is a random variable with probability density $f(x)$:
  o MEAN ➔ $\mu = E(X) = \int_{-\infty}^{+\infty} x\,f(x)\,dx$
  o VARIANCE ➔ $\sigma^2 = V(X) = \int_{-\infty}^{+\infty} (x-\mu)^2 f(x)\,dx$
  o STANDARD DEVIATION ➔ $\sigma = \sqrt{V(X)}$
• ➔ if X is a continuous variable, then for every $x_1 < x_2$ the probability of falling in the interval is obtained from the density: $P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(x)\,dx$
• NORMAL DISTRIBUTION: X is a normal random variable with parameters $\mu$ and $\sigma^2$ ($\sigma > 0$) if its probability density function is $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ for $-\infty < x < \infty$
  o PARAMETERS:
    ▪ $E(x) = \mu$
    ▪ $V(x) = \sigma^2$

Discrete distributions
• DISCRETE DISTRIBUTION: the variable can take only certain values
• ➔ it is characterised by two probability functions:
  o PROBABILITY MASS FUNCTION: for a discrete variable X with possible values $x_1, x_2, \dots, x_n$ the probability mass function is ➔ $f(x_i) = P(X = x_i)$
  o CUMULATIVE DISTRIBUTION FUNCTION: the cumulative distribution function of a discrete random variable X is $F(x) = P(X \le x) = \sum_{x_i \le x} f(x_i)$
• CHEBYSHEV INEQUALITY: the probability that the distance between a value and the mean is larger than a multiple of the standard deviation depends only on the multiple → $P(|x - \mu| \ge k\sigma) \le 1/k^2$

Standardization
• STANDARD NORMAL VARIABLE (Z): a standard normal random variable has specific characteristics
  o PARAMETERS:
    ▪ $\mu_Z = 0$
    ▪ $\sigma_Z^2 = 1$
  o RELATION OF PROBABILITY: suppose X is a normal random variable with mean μ and variance σ²; we have → $P(X \le x) = P\!\left(\frac{X-\mu}{\sigma} \le \frac{x-\mu}{\sigma}\right) = P(Z \le z) = \Phi(z)$, with Z = standard normal variable and $z = \frac{x-\mu}{\sigma}$ = z-value obtained by standardizing x

Properties of the normal distribution
• COMBINATION OF NORMAL VARIABLES:
  o If X is normally distributed with mean μ and variance σ², then the variable Y = aX + b, for any real numbers a and b, is also normally distributed, with:
    ▪ $\mu_Y = a\mu + b$
    ▪ $\sigma_Y^2 = a^2\sigma^2$
  o If X₁ and X₂ are two independent normal random variables, with means μ₁, μ₂ and variances σ₁², σ₂², then their sum Y = X₁ + X₂ is also normally distributed, with:
    ▪ $\mu_Y = \mu_1 + \mu_2$
    ▪ $\sigma_Y^2 = \sigma_1^2 + \sigma_2^2$
  o ➔ any linear combination of independent normal variables is a normal variable
• INDEPENDENT VARIABLES: the random variables $x_1, x_2, \dots, x_i, \dots, x_n$ are independent if $P(x_1 \in E_1, x_2 \in E_2, \dots, x_n \in E_n) = P(x_1 \in E_1)\,P(x_2 \in E_2)\cdots P(x_n \in E_n)$ for any sets $E_1, E_2, \dots, E_n$ → the realization of one does not affect the probability distribution of the others
  o If two variables are independent their covariance is zero (independence → covariance = 0)
  o ➔ if the variables are normally distributed, the condition of zero covariance is sufficient and necessary for independence: independence ⇔ covariance = 0
• CENTRAL LIMIT THEOREM: if $x_1, x_2, \dots, x_i, \dots, x_n$ are independent random variables with mean $\mu_i$ and variance $\sigma_i^2$, and if $y = x_1 + x_2 + \dots + x_n$, then the distribution of $\frac{y - \sum_{i=1}^{n}\mu_i}{\sqrt{\sum_{i=1}^{n}\sigma_i^2}}$ approaches a standard normal distribution N(0,1) as n approaches infinity → $\frac{y - \sum_{i=1}^{n}\mu_i}{\sqrt{\sum_{i=1}^{n}\sigma_i^2}} \to N(0,1)$ for $n \to \infty$
  o IMPLICATIONS: the sum (or average) of a large number n of independently distributed random variables is approximately normal, regardless of the distribution of the individual variables → the sampling distribution of $\bar{X}$ is approximately normal (for a large enough n), regardless of the distribution of X
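A small sketch of standardization and of the central limit theorem, assuming numpy/scipy; all numbers are illustrative:

```python
import numpy as np
from scipy import stats

# Standardization: P(X <= x) for X ~ N(mu, sigma^2) via the standard normal CDF Phi
mu, sigma, x = 50.0, 2.5, 52.0          # illustrative values
z = (x - mu) / sigma
print(f"z = {z:.3f},  P(X <= x) = Phi(z) = {stats.norm.cdf(z):.4f}")

# Central limit theorem: averages of a clearly non-normal variable (exponential)
# look approximately normal once n is large enough
rng = np.random.default_rng(0)
n = 50
sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
# standardize the means: (xbar - mu) / (sigma / sqrt(n)) should be close to N(0, 1)
standardized = (sample_means - 1.0) / (1.0 / np.sqrt(n))
print(f"mean ≈ {standardized.mean():.3f}, std ≈ {standardized.std(ddof=1):.3f}  (close to 0 and 1)")
```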
Inference and introduction to hypothesis testing

Statistical inference
• ➔ we want to infer properties of the source population by analysing data that are sampled from that distribution → we study a sample to get conclusions on the entire population the sample comes from
• POINT ESTIMATORS: a point estimate of some population parameter θ is a single numerical value $\hat{\theta}$ of a statistic $\hat{\Theta}$ → the point estimator $\hat{\Theta}$ is the machine that generates a number $\hat{\theta}$ that should describe the population parameter θ
  o ➔ the point estimator $\hat{\Theta}$ is an UNBIASED ESTIMATOR of the parameter θ if the expected value of the estimator equals the parameter ($E(\hat{\Theta}) = \theta$)
  o ➔ if the estimator is not unbiased, then the difference $E(\hat{\Theta}) - \theta$ is called the bias of the estimator $\hat{\Theta}$
  o DEMONSTRATIONS: given a sample of n independent and identically distributed observations:
    ▪ The sample mean is unbiased ($E(\bar{X}) = \mu$) → $E(\bar{X}) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \frac{1}{n}\,n\mu = \mu$
    ▪ The sample variance is unbiased ($E(S^2) = \sigma^2$) → $E(S^2) = E\!\left(\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right) = \frac{1}{n-1}\,E\!\left(\sum_{i=1}^{n}\left(x_i^2 + \bar{x}^2 - 2x_i\bar{x}\right)\right) = \frac{1}{n-1}\,E\!\left(\sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2n\bar{x}^2\right) = \frac{1}{n-1}\left[\sum_{i=1}^{n} E(x_i^2) - n\,E(\bar{x}^2)\right]$
      → since $V(x_i) = E\!\left((x_i-\mu)^2\right) = \sigma^2 = E(x_i^2) - \mu^2 \Rightarrow E(x_i^2) = \mu^2 + \sigma^2$, and $V(\bar{x}) = \frac{\sigma^2}{n} = E(\bar{x}^2) - \mu^2 \Rightarrow E(\bar{x}^2) = \mu^2 + \frac{\sigma^2}{n}$
      ➔ $E(s^2) = \frac{1}{n-1}\left[n\left(\mu^2+\sigma^2\right) - n\left(\mu^2 + \frac{\sigma^2}{n}\right)\right] = \frac{1}{n-1}\left[(n-1)\sigma^2\right] = \sigma^2$
• PROPERTIES OF POINT STATISTICS: $E(\bar{X}) = \mu$, $V(\bar{X}) = \frac{\sigma^2}{n}$, $E(S^2) = \sigma^2$, $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
• Common estimators:

| Unknown parameter θ | Statistic $\hat{\Theta}$ | Point estimate $\hat{\theta}$ |
|---|---|---|
| $\mu$ | $\bar{X} = \frac{\sum X_i}{n}$ | $\bar{x}$ |
| $\sigma^2$ | $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$ | $s^2$ |
| $\mu_1 - \mu_2$ | $\bar{X}_1 - \bar{X}_2 = \frac{\sum X_{1i}}{n_1} - \frac{\sum X_{2i}}{n_2}$ | $\bar{x}_1 - \bar{x}_2$ |
| $p_1 - p_2$ | $\hat{P}_1 - \hat{P}_2 = \frac{X_1}{n_1} - \frac{X_2}{n_2}$ | $\hat{p}_1 - \hat{p}_2$ |
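A Monte Carlo check of the unbiasedness results above, assuming numpy; μ, σ, n and the number of replications are arbitrary illustrative choices:

```python
import numpy as np

# Over many samples, the averages of xbar and S^2 should be close to the true mu and sigma^2
rng = np.random.default_rng(42)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbars = samples.mean(axis=1)
s2s = samples.var(axis=1, ddof=1)         # n - 1 denominator -> unbiased
s2s_biased = samples.var(axis=1, ddof=0)  # n denominator -> biased downwards

print(f"E(xbar) ≈ {xbars.mean():.3f}   (mu = {mu})")
print(f"E(S^2)  ≈ {s2s.mean():.3f}   (sigma^2 = {sigma**2})")
print(f"n-denominator variance ≈ {s2s_biased.mean():.3f}  (≈ (n-1)/n * sigma^2)")
print(f"V(xbar) ≈ {xbars.var(ddof=1):.3f}  (sigma^2/n = {sigma**2 / n})")
```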
Hypothesis testing
• ➔ a hypothesis testing approach is used when we want to draw conclusions about a parameter we don't know → we assume a value for the parameter and verify whether the hypothesis is statistically acceptable or not

General procedure
• From the problem context, identify the parameter of interest: we have to understand what type of parameter of the population we want to analyse
• State the statistical hypotheses: we have to assume the values of the parameter of interest
  o STATISTICAL HYPOTHESIS: a statement either about the parameters of a probability distribution or the parameters of a model → there are two types of hypothesis in a test:
    ▪ Null hypothesis H0: our initial hypothesis on the parameter (the one we have to verify as statistically acceptable)
    ▪ Alternative hypothesis H1: comprises all the other possible values of the parameter we are testing
      • The real parameter could be >, < or ≠ than our initial hypothesis
• Choose a significance level α: alpha is the probability of committing a 1st type error; it has to be chosen because we can never be 100% sure that the hypothesis we made is true or false, so we set the probability of committing a first type error in order to be confident enough that what we state corresponds to reality
  o TYPES OF ERROR:

| | Reject H0 | Accept H0 |
|---|---|---|
| H0 is true | 1st type error: α | Correct: P = 1 − α |
| H0 is false | Correct: P = 1 − β | 2nd type error: β |

  o 1st TYPE ERROR: $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$ → rejection region in the test
  o 2nd TYPE ERROR: $\beta = P(\text{fail to reject } H_0 \mid H_0 \text{ false})$ → acceptance region in the test
• State an appropriate test statistic: in order to see if H0 is true we have to compute a test statistic
  o TEST STATISTIC: a quantity that you can compute if the null hypothesis is true → under the null hypothesis we have a certain distribution from which we can compute the test statistic for the parameter
• State the rejection region for the statistic: we have to specify the set of values of the test statistic that leads to the rejection of H0 (the confidence interval is the complement of the rejection region)
  o ➔ once I have defined the distribution under the null hypothesis I can compute, given the level α, the rejection region according to the probability given by α
• Compute any necessary sample quantities, substitute these into the equation for the test statistic and compute that value
• Decide whether or not H0 should be rejected and report the decision in the problem context
  o P-VALUE: the p-value is the smallest level of significance that would lead to rejection of the null hypothesis H0 → it is the probability that the test statistic takes on a value that is at least as extreme as the observed value of the statistic when the null hypothesis is true
    ▪ If p-value < α → reject H0
    ▪ If p-value > α → accept H0
• ➔ POWER OF THE TEST: the power of the test is the probability of rejecting the null hypothesis when the alternative hypothesis is true: $\text{power} = 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ false})$
  o ➔ β depends strongly on H1
• EXAMPLE: hypothesis testing on a population mean → we want to design a procedure that, based on a finite sample, allows drawing conclusions about the mean of the source population distribution
  o ➔ in the example we have:
    ▪ H0: μ = 50 cm/s
    ▪ H1: μ ≠ 50 cm/s
    ▪ $X \sim N(\mu, 2.5^2)$ and n = 10
  o Compute the test statistic and the rejection region with the chosen significance level α

Types of tests

| One sample tests | Two sample tests |
|---|---|
| Test for mean (known variance): one sample z-test | Test for mean difference (known variance): two sample z-test |
| Test for mean (unknown variance): one sample t-test | Test for mean difference (unknown variance): two sample t-test |
| Test for variance: chi-squared test (variance) | Test for mean of paired data (unknown variance): paired t-test |
| | Test for equality of variances: F-test (variances) |

Tests

One sample tests

Tests for the mean

One sample z-test (known variance)
• ASSUMPTIONS:
  o $X_1, X_2, \dots, X_n$ is a random sample of size n from a population
    ▪ No dependence in the observations (no autocorrelation)
  o The population is normal (or the central limit theorem applies)
  o The variance of the population is known
• Z-TEST: under these assumptions we have the z-test: $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$ → Z is the standardised distribution of X
• PARAMETERS OF THE TEST (test on the mean, variance known):
  o Null hypothesis → $H_0\!: \mu = \mu_0$
  o Test statistic → $Z_0 = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
  o Confidence interval → $\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
  o Rejection region according to the alternative hypothesis:

| Alternative hypothesis | Rejection region | P-value |
|---|---|---|
| $H_1\!: \mu \ne \mu_0$ | $z_0 < -z_{\alpha/2}$ or $z_0 > z_{\alpha/2}$ | $2\left[1 - \Phi(|z_0|)\right]$ |
| $H_1\!: \mu > \mu_0$ | $z_0 > z_{\alpha}$ | $1 - \Phi(z_0)$ |
| $H_1\!: \mu < \mu_0$ | $z_0 < -z_{\alpha}$ | $\Phi(z_0)$ |

  o Probability of 2nd type error → $\beta = \Phi\!\left(z_{\alpha/2} - \frac{\delta\sqrt{n}}{\sigma}\right) - \Phi\!\left(-z_{\alpha/2} - \frac{\delta\sqrt{n}}{\sigma}\right)$ with $\mu_1 = \mu_0 + \delta$ → the bigger δ is, the smaller β is

One sample t-test (unknown variance)
• ASSUMPTIONS:
  o $X_1, X_2, \dots, X_n$ is a random sample of size n from a population
    ▪ No dependence in the observations (no autocorrelation)
  o The population is normal (or the central limit theorem applies)
  o The variance of the population is unknown
• T-TEST: under those assumptions we have the t-test: $T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$ with:
  o $t_{n-1}$ = Student-t distribution with n − 1 degrees of freedom
  o S = sample standard deviation = $\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
• PARAMETERS OF THE TEST:
  o Null hypothesis → $H_0\!: \mu = \mu_0$
  o Test statistic → $T_0 = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$
  o Confidence interval → $\bar{X} - t_{\alpha/2,n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2,n-1}\frac{S}{\sqrt{n}}$
  o Rejection region according to the alternative hypothesis:

| Alternative hypothesis | Rejection region | P-value |
|---|---|---|
| $H_1\!: \mu \ne \mu_0$ | $t_0 < -t_{\alpha/2,n-1}$ or $t_0 > t_{\alpha/2,n-1}$ | $2\,P(T_{n-1} > |t_0|)$ |
| $H_1\!: \mu > \mu_0$ | $t_0 > t_{\alpha,n-1}$ | $P(T_{n-1} > t_0)$ |
| $H_1\!: \mu < \mu_0$ | $t_0 < -t_{\alpha,n-1}$ | $P(T_{n-1} < t_0)$ |
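A minimal sketch of the one-sample z-test and t-test above, assuming scipy; the simulated data echo the course example (μ0 = 50, σ = 2.5, n = 10) and α = 0.05 is an illustrative choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(51.0, 2.5, size=10)
mu0, sigma, alpha = 50.0, 2.5, 0.05
n, xbar, s = x.size, x.mean(), x.std(ddof=1)

# z-test (variance known): reject H0 if |z0| > z_{alpha/2}
z0 = (xbar - mu0) / (sigma / np.sqrt(n))
z_crit = stats.norm.ppf(1 - alpha / 2)
p_z = 2 * (1 - stats.norm.cdf(abs(z0)))
print(f"z0 = {z0:.3f}, critical = ±{z_crit:.3f}, p-value = {p_z:.4f}")

# t-test (variance unknown): same logic with the Student-t distribution
t0 = (xbar - mu0) / (s / np.sqrt(n))
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
p_t = 2 * (1 - stats.t.cdf(abs(t0), df=n - 1))
print(f"t0 = {t0:.3f}, critical = ±{t_crit:.3f}, p-value = {p_t:.4f}")
print(stats.ttest_1samp(x, popmean=mu0))   # library version of the t-test
```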
Tests for the variance

Chi-squared test for the variance (mean unknown)
• ASSUMPTIONS:
  o $X_1, X_2, \dots, X_n$ is a random sample of size n from a population
    ▪ No dependence in the observations (no autocorrelation)
  o The population is normal (or the central limit theorem applies)
  o The mean of the population is unknown
• CHI-SQUARED TEST: under those assumptions we have the chi-squared test: $X^2 = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ with:
  o $\chi^2_{n-1}$ = chi-squared distribution with n − 1 degrees of freedom → $f(x) = \frac{1}{2^{(n-1)/2}\,\Gamma\!\left(\frac{n-1}{2}\right)}\, x^{\frac{n-1}{2}-1}\, e^{-x/2}$, $x > 0$
  o S = sample standard deviation = $\sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}$
• PARAMETERS OF THE TEST:
  o Null hypothesis → $H_0\!: \sigma^2 = \sigma_0^2$
  o Test statistic → $\chi_0^2 = \frac{(n-1)S^2}{\sigma_0^2}$
  o Confidence interval → $\frac{(n-1)S^2}{\chi^2_{\alpha/2,n-1}} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,n-1}}$ → the chi-squared distribution is not symmetric
  o Rejection region according to the alternative hypothesis:

| Alternative hypothesis | Rejection region |
|---|---|
| $H_1\!: \sigma^2 \ne \sigma_0^2$ | $\chi_0^2 < \chi^2_{1-\alpha/2,n-1}$ or $\chi_0^2 > \chi^2_{\alpha/2,n-1}$ |
| $H_1\!: \sigma^2 > \sigma_0^2$ | $\chi_0^2 > \chi^2_{\alpha,n-1}$ |
| $H_1\!: \sigma^2 < \sigma_0^2$ | $\chi_0^2 < \chi^2_{1-\alpha,n-1}$ |

Two sample tests

Tests for mean difference

Two sample z-test (known variances)
• ASSUMPTIONS:
  o $X_{11}, X_{12}, \dots, X_{1n_1}$ is a random sample of size $n_1$ from population 1
  o $X_{21}, X_{22}, \dots, X_{2n_2}$ is a random sample of size $n_2$ from population 2
  o The two populations are independent
  o Both populations are normal (or the central limit theorem applies)
  o The variances of the populations are known
• TWO SAMPLE Z-TEST: under those assumptions we have the two sample z-test: $Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim N(0,1)$
• PARAMETERS OF THE TEST:
  o Null hypothesis → $H_0\!: \mu_1 - \mu_2 = \Delta_0$
  o Test statistic → $Z_0 = \frac{\bar{X}_1 - \bar{X}_2 - \Delta_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
  o Confidence interval → $\bar{X}_1 - \bar{X}_2 - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{X}_1 - \bar{X}_2 + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
  o Rejection region according to the alternative hypothesis:

| Alternative hypothesis | Rejection region | P-value |
|---|---|---|
| $H_1\!: \mu_1 - \mu_2 \ne \Delta_0$ | $z_0 < -z_{\alpha/2}$ or $z_0 > z_{\alpha/2}$ | $2\left[1 - \Phi(|z_0|)\right]$ |
| $H_1\!: \mu_1 - \mu_2 > \Delta_0$ | $z_0 > z_{\alpha}$ | $1 - \Phi(z_0)$ |
| $H_1\!: \mu_1 - \mu_2 < \Delta_0$ | $z_0 < -z_{\alpha}$ | $\Phi(z_0)$ |

Two sample t-test (unknown variances)

Case of variances unknown and equal ($\sigma_1^2 = \sigma_2^2 = \sigma^2$)
• ASSUMPTIONS:
  o $X_{11}, \dots, X_{1n_1}$ is a random sample of size $n_1$ from population 1
  o $X_{21}, \dots, X_{2n_2}$ is a random sample of size $n_2$ from population 2
  o The two populations are independent
  o Both populations are normal (or the central limit theorem applies)
  o The variances of the populations are unknown and equal
• TWO SAMPLE T-TEST: under those assumptions we have the two sample t-test: $T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1+n_2-2}$ with:
  o $S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$ = pooled variance
• PARAMETERS OF THE TEST:
  o Null hypothesis → $H_0\!: \mu_1 - \mu_2 = \Delta_0$
  o Test statistic → $T_0 = \frac{\bar{X}_1 - \bar{X}_2 - \Delta_0}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
  o Rejection region according to the alternative hypothesis:

| Alternative hypothesis | Rejection region |
|---|---|
| $H_1\!: \mu_1 - \mu_2 \ne \Delta_0$ | $t_0 < -t_{\alpha/2,n_1+n_2-2}$ or $t_0 > t_{\alpha/2,n_1+n_2-2}$ |
| $H_1\!: \mu_1 - \mu_2 > \Delta_0$ | $t_0 > t_{\alpha,n_1+n_2-2}$ |
| $H_1\!: \mu_1 - \mu_2 < \Delta_0$ | $t_0 < -t_{\alpha,n_1+n_2-2}$ |

Case of variances unknown and not equal ($\sigma_1^2 \ne \sigma_2^2$)
• ASSUMPTIONS:
  o $X_{11}, \dots, X_{1n_1}$ is a random sample of size $n_1$ from population 1
  o $X_{21}, \dots, X_{2n_2}$ is a random sample of size $n_2$ from population 2
  o The two populations are independent
  o Both populations are normal (or the central limit theorem applies)
  o The variances of the populations are unknown and not equal
• TWO SAMPLE T-TEST: under those assumptions we have the two sample t-test: $T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \sim t_{\nu}$ with:
  o $\nu = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{\left(S_1^2/n_1\right)^2}{n_1-1} + \frac{\left(S_2^2/n_2\right)^2}{n_2-1}}$ = degrees of freedom
• PARAMETERS OF THE TEST:
  o Null hypothesis → $H_0\!: \mu_1 - \mu_2 = \Delta_0$
  o Test statistic → $T_0 = \frac{\bar{X}_1 - \bar{X}_2 - \Delta_0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$
  o Rejection region according to the alternative hypothesis:

| Alternative hypothesis | Rejection region |
|---|---|
| $H_1\!: \mu_1 - \mu_2 \ne \Delta_0$ | $t_0 < -t_{\alpha/2,\nu}$ or $t_0 > t_{\alpha/2,\nu}$ |
| $H_1\!: \mu_1 - \mu_2 > \Delta_0$ | $t_0 > t_{\alpha,\nu}$ |
| $H_1\!: \mu_1 - \mu_2 < \Delta_0$ | $t_0 < -t_{\alpha,\nu}$ |
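A sketch of the two-sample comparisons above, assuming scipy; the data, the sample sizes and the "known" σ values are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x1 = rng.normal(10.0, 1.0, size=15)
x2 = rng.normal(10.8, 1.5, size=12)

# pooled two-sample t-test (assumes equal variances)
print("pooled t-test:", stats.ttest_ind(x1, x2, equal_var=True))

# Welch t-test (variances not assumed equal) -> uses the nu degrees of freedom above
print("Welch t-test: ", stats.ttest_ind(x1, x2, equal_var=False))

# two-sample z-test written out explicitly, pretending the variances are known
sigma1, sigma2, delta0 = 1.0, 1.5, 0.0
z0 = (x1.mean() - x2.mean() - delta0) / np.sqrt(sigma1**2 / x1.size + sigma2**2 / x2.size)
p_value = 2 * (1 - stats.norm.cdf(abs(z0)))
print(f"z0 = {z0:.3f}, p-value = {p_value:.4f}")
```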
Test for mean of paired data

Paired t-test (unknown variance)
• ASSUMPTIONS:
  o $X_{11}, X_{12}, \dots, X_{1n}$ is a random sample of size n from population 1
  o $X_{21}, X_{22}, \dots, X_{2n}$ is a random sample of size n from population 2
  o The differences between pairs $D_j = X_{1,j} - X_{2,j}$ are normal (or the central limit theorem applies)
  o The variance of the differences between pairs is unknown
• PAIRED T-TEST: under those assumptions we have the paired t-test: $T = \frac{\bar{D} - \mu_D}{S_D/\sqrt{n}} \sim t_{n-1}$ with:
  o $D_j = X_{1,j} - X_{2,j} \sim N(\mu_D, \sigma_D^2)$ → we are assuming the differences to be random normal variables
• PARAMETERS OF THE TEST:
  o Null hypothesis → $H_0\!: \mu_D = \Delta_0$
  o Test statistic → $T_0 = \frac{\bar{D} - \Delta_0}{S_D/\sqrt{n}}$
  o Confidence interval → $\bar{D} - t_{\alpha/2,n-1}\frac{S_D}{\sqrt{n}} \le \mu_D \le \bar{D} + t_{\alpha/2,n-1}\frac{S_D}{\sqrt{n}}$ → $S_D$ is lower than the pooled standard deviation: the paired test is more precise
  o Rejection region according to the alternative hypothesis:

| Alternative hypothesis | Rejection region | P-value |
|---|---|---|
| $H_1\!: \mu_D \ne \Delta_0$ | $t_0 < -t_{\alpha/2,n-1}$ or $t_0 > t_{\alpha/2,n-1}$ | $2\,P(T_{n-1} > |t_0|)$ |
| $H_1\!: \mu_D > \Delta_0$ | $t_0 > t_{\alpha,n-1}$ | $P(T_{n-1} > t_0)$ |
| $H_1\!: \mu_D < \Delta_0$ | $t_0 < -t_{\alpha,n-1}$ | $P(T_{n-1} < t_0)$ |

Test for equality of variances

F-test
• FISHER DISTRIBUTION: assuming W and Y to be independent chi-squared random variables with u and v degrees of freedom ($W \sim \chi^2_u$, $Y \sim \chi^2_v$), the ratio $F = \frac{W/u}{Y/v}$ follows an F distribution with u and v degrees of freedom → $F = \frac{W/u}{Y/v} \sim F_{u,v}$
  o The F distribution has two different degrees of freedom, given by the degrees of freedom of the two chi-squared distributions
  o OBSERVATION:
    ▪ We know from the chi-squared test for the variance that $X^2 = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$
    ▪ Since this quantity follows a chi-squared distribution → if we take a ratio between two such variances we are looking at something that follows an F distribution
    ▪ If we want to know the difference between two variances we can look at their ratio: if it is one they are equal
• ASSUMPTIONS:
  o $X_{11}, \dots, X_{1n_1}$ is a random sample of size $n_1$ from population 1
  o $X_{21}, \dots, X_{2n_2}$ is a random sample of size $n_2$ from population 2
  o The two populations are normal (or the central limit theorem applies)
  o The two populations are independent
• F-TEST: under those assumptions we have the F-test: $F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{(n_1-1),(n_2-1)}$
• PARAMETERS OF THE TEST:
  o Null hypothesis → $H_0\!: \sigma_1^2 = \sigma_2^2$
  o Test statistic → $F_0 = \frac{S_1^2}{S_2^2}$
  o Confidence interval → $\frac{S_1^2}{S_2^2}\, f_{1-\alpha/2,n_2-1,n_1-1} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{S_1^2}{S_2^2}\, f_{\alpha/2,n_2-1,n_1-1}$ (the degrees of freedom are inverted → when plotting in Minitab the degrees of freedom have to be inverted)
  o Rejection region according to the alternative hypothesis:

| Alternative hypothesis | Rejection region |
|---|---|
| $H_1\!: \sigma_1^2 \ne \sigma_2^2$ | $f_0 < f_{1-\alpha/2,n_1-1,n_2-1}$ or $f_0 > f_{\alpha/2,n_1-1,n_2-1}$ |
| $H_1\!: \sigma_1^2 > \sigma_2^2$ | $f_0 > f_{\alpha,n_1-1,n_2-1}$ |
| $H_1\!: \sigma_1^2 < \sigma_2^2$ | $f_0 < f_{1-\alpha,n_1-1,n_2-1}$ |
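A sketch of the paired t-test and of the F-test for equality of variances, assuming scipy; the data and α = 0.05 are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
before = rng.normal(100.0, 5.0, size=12)
after = before - rng.normal(1.5, 2.0, size=12)   # paired measurements on the same items

# paired t-test = one-sample t-test on the differences D_j
d = before - after
t0 = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))
print(f"paired t0 = {t0:.3f}")
print(stats.ttest_rel(before, after))            # library version

# F-test for sigma1^2 = sigma2^2: F0 = S1^2 / S2^2 compared with F_{n1-1, n2-1} quantiles
x1 = rng.normal(0.0, 1.0, size=20)
x2 = rng.normal(0.0, 1.4, size=25)
f0 = x1.var(ddof=1) / x2.var(ddof=1)
alpha = 0.05
lo = stats.f.ppf(alpha / 2, dfn=x1.size - 1, dfd=x2.size - 1)
hi = stats.f.ppf(1 - alpha / 2, dfn=x1.size - 1, dfd=x2.size - 1)
print(f"F0 = {f0:.3f}, reject H0 if outside [{lo:.3f}, {hi:.3f}]")
```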
Introduction

Overlook
• NEW TRENDS IN DATA:
  o Data are becoming more and more important
  o Shifting from quality of the final product to quality of the process → see errors as soon as they appear instead of waiting until the product is finished
  o Passing from the black box of AI to a grey one → need to understand what AI does with data to avoid using wrong methods
• CHARACTERISTICS OF DATA: the introduction of big data
  o Volume: vast amount of data generated every second → data sets too large to store and analyse with traditional database technology
    ➔ with big data technology we can store and use data sets with the help of distributed systems, where parts of the data are stored in different locations and brought together by software
  o Velocity: speed at which data are generated and moved around
    ➔ big data technology allows us to analyse data while they are being generated, without waiting to put them in databases
  o Variety: we can now analyse different types of data and not only structured ones
    ➔ with big data technology we can use and store both structured and unstructured data
  o Veracity: messiness or trustworthiness of data → with different types of data collected we are not able to fully control quality and accuracy
    ➔ the volume of big data makes up for the lack of these characteristics
  o Value: data are so important that they have value for companies and can become a core asset for them

Quality of design and conformance
• CUSTOMER SATISFACTION: the lack of it originates from gaps between the company and the customer → customer expectations ≠ the company's interpretation of them
• QUALITY OF DESIGN: the quality which the supplier intends to offer to the client
• ➔ designers should take into consideration the customer's requirements
  o SPECIFICATIONS: targets and tolerances determined by the designer of a product
    ▪ Targets: ideal values for which the product is expected to strive
    ▪ Tolerances: acceptable deviations from the targets within which we can still consider the product as good (they are intervals that include the target value)
• QUALITY OF CONFORMANCE: level of quality of the product produced and delivered
• ➔ when the quality of a product entirely conforms to the specifications (design), the quality of conformance is excellent
• Example: production line of pins → quality in production described by the diameter of the pins
  o Data model: $D \sim N(\mu, \sigma^2)$; we want the probability of violating LSL or USL (having a defective pin)
  o Probability of violating LSL: $\gamma_L = P\!\left(D \le LSL \mid D \sim N(\mu,\sigma^2)\right) = P\!\left(\frac{D-\mu}{\sigma} \le \frac{LSL-\mu}{\sigma}\right) = P\!\left(Z \le \frac{LSL-\mu}{\sigma} \mid Z \sim N(0,1)\right) = \Phi\!\left(\frac{LSL-\mu}{\sigma}\right)$, with Φ = cumulative distribution function of Z
  o Probability of violating USL: $\gamma_U = 1 - P\!\left(D \le USL \mid D \sim N(\mu,\sigma^2)\right) = 1 - \Phi\!\left(\frac{USL-\mu}{\sigma}\right)$
• ➔ total probability of having a defective pin: $\gamma = 1 - \Phi\!\left(\frac{USL-\mu}{\sigma}\right) + \Phi\!\left(\frac{LSL-\mu}{\sigma}\right)$
• Six sigma quality performance: a process characterised by a very low non-conforming rate
• OBSERVATIONS:
  o To reduce the non-conforming rate the mean has to be on the target value (the process is tuned)
  o To reduce the number of non-conforming pieces we could make the tolerance band larger → but then the design of the product must be perfect
  o We have a trade-off between the cost of improving the process to reduce waste and the cost of producing defective pieces
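A sketch of the defective-pin probability above, assuming scipy; the process parameters and the specification limits are made-up numbers:

```python
from scipy import stats

# Fraction of defective pins for D ~ N(mu, sigma^2) with lower/upper specification limits
mu, sigma = 10.00, 0.02          # process mean and standard deviation (mm), illustrative
lsl, usl = 9.95, 10.05           # specification limits (mm), illustrative

gamma_L = stats.norm.cdf((lsl - mu) / sigma)          # P(D <= LSL)
gamma_U = 1 - stats.norm.cdf((usl - mu) / sigma)      # P(D >= USL)
gamma = gamma_L + gamma_U                             # total non-conforming fraction

print(f"gamma_L = {gamma_L:.6f}, gamma_U = {gamma_U:.6f}, total = {gamma:.6f}")
print(f"about {gamma * 1e6:.0f} defective pins per million produced")

# shifting the mean off target makes the non-conforming rate grow quickly
mu_off = 10.02
gamma_shifted = stats.norm.cdf((lsl - mu_off) / sigma) + 1 - stats.norm.cdf((usl - mu_off) / sigma)
print(f"with the mean at {mu_off} mm the rate becomes {gamma_shifted:.6f}")
```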
Measurement system
• PROCESS: a collection of activities to achieve some results (creating added value for the customer)
• ➔ there are two different ways to describe the output of a process:
  o Output as a function of input and controllable parameters: $Q = f(i, v) + \varepsilon$ with $\varepsilon \sim N(\mu_\varepsilon, \sigma_\varepsilon^2)$ = noise
    ▪ Used for:
      • Design of experiments → decide the right set of experiments and analyse the results to check the effect of the controllable parameters on the quality output
      • Process optimization → select the "best" setting of the controllable parameters to optimize the quality data
  o Output as a function of time, with i, v set at the target value: $Q = f(t, \text{unexpected events}) + \varepsilon$
    ▪ ➔ the output is fixed, since we are not changing any variable in the process, but it can vary in time
    ▪ Used for:
      • Statistical quality monitoring (process monitoring) or statistical process control (SPC), which checks the stability of the system's output over time
• MEASUREMENT SYSTEM PERFORMANCES: there are some indicators for the performances:
  o TRUENESS: difference between the average of repeated measurements and the true value of the measured feature (reference quantity value)
    ➔ trueness has an inverse proportionality relation with the systematic error or bias (≠ random measurement error) → high trueness = low systematic error
  o PRECISION: ability of the measurement system to replicate the reading of the same item
    ➔ precision has an inverse proportionality relation with the standard deviation → high precision = low standard deviation (dispersion)
  o ACCURACY: difference between the measured quantity value and the real value of the measurand
  o REPEATABILITY: dispersion of measurements of the same measurand, when measurements are acquired in the same conditions
  o REPRODUCIBILITY: dispersion when one or more of the usage conditions are changing
• ➔ before performing any analysis it is important to check the measurement system

Data modelling

Main assumptions
• REFERENCE MODEL: we are looking at the mean of different samples collected over time → model of the data at time instant t: $Y_t \sim N(\mu_t, \sigma_t^2)$
  o All distributions are the same
  o We collect the single measurements and create a model based on those
  o ➔ we have distributions because every time we collect a sample of measurements of the same measurand → data should be close to the mean and not on the tails
• ASSUMPTIONS: the assumptions under which we have $Y_t \sim N(\mu_t, \sigma_t^2)$ are:
  o Independence → data must be independent and identically distributed
  o Normally distributed data → we need data following a normal distribution (many different causes of variability, none of them dominant)
  o Constant variance → each sample is supposed to have the same σ²
  o Constant mean → $\mu_t$ must be constant over time
  o Absence of bias → the measurement system has to work properly ($\varepsilon_t \sim N(0, \sigma_\varepsilon^2)$)
  o ➔ if the variance is unstable together with the mean we can't perform the analysis
  o ➔ if the mean is unstable but follows a trend we can make predictions on future data
• OBSERVATIONS: if we call the output of the process Y and we want to measure it, since we know it should always be equal to a mean value (the variables are not changing), we have $Y_t = \mu + \varepsilon_t$ where μ is constant and $\varepsilon_t \sim N(0, \sigma_\varepsilon^2)$ → constant + normal distribution = normal distribution, so $Y_t \sim N(\mu_Y, \sigma_Y^2)$ with:
  o $\mu_Y = E(\mu + \varepsilon_t) = E(\mu) + E(\varepsilon_t) = \mu$
  o $\sigma_Y^2 = V(\mu + \varepsilon_t) = 0 + V(\varepsilon_t) = \sigma_\varepsilon^2$
  o ➔ $Y_t \sim N(\mu, \sigma_\varepsilon^2)$ → best prediction: $\hat{y}_t = \mu$
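A small simulation contrasting the reference model $Y_t = \mu + \varepsilon_t$ with two violations of the constant-mean assumption (a trend and a level shift), assuming numpy; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n, mu, sigma_eps = 200, 20.0, 1.0
t = np.arange(n)

y_stable = mu + rng.normal(0.0, sigma_eps, n)                    # in-control: best prediction is mu
y_trend = mu + 0.03 * t + rng.normal(0.0, sigma_eps, n)          # mean drifts over time
y_shift = mu + 2.0 * (t >= 120) + rng.normal(0.0, sigma_eps, n)  # level shift after t = 120

for name, y in [("stable", y_stable), ("trend", y_trend), ("level shift", y_shift)]:
    first, second = y[: n // 2].mean(), y[n // 2 :].mean()
    print(f"{name:11s}: mean of 1st half = {first:.2f}, mean of 2nd half = {second:.2f}")
```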
Randomness and correlation
• TIME SERIES PLOT: the time series plot gives important information about the data that we would lose by looking at just the histogram → in the histogram we only see the frequencies, and the data are not related to the time they were taken
• NON-RANDOM PROCESSES: there are many different types of non-random processes, which can happen if:
  o The mean of the process is not constant
  o There is a systematic pattern and the process mean is not the best prediction for future data:
    ▪ TREND: the mean moves constantly → regression with time
    ▪ LEVEL SHIFT: there is a neat jump in the value of the data → regression with a dummy variable
    ▪ STATIONARY MEANDERING: the overall mean stays the same, but if an observation is above μ the next will tend to be too → ARMA model
    ▪ NONSTATIONARY MEANDERING: the mean moves, and if an observation is above μ the next will tend to be too → ARIMA model with differencing operator
    ▪ OSCILLATING: one data point above and the next below the mean → negative autocorrelation
    ▪ INCREASING VARIATION: the dispersion around the mean increases over time
  o ➔ the overall mean can be stable but the data are still not random
  o ➔ typical pattern of autocorrelation
  o The dispersion around the mean value is not constant
• AUTOCORRELATION: we have autocorrelation when the value we measure at t influences the value at t + 1 → $y_{t+1} = f(y_t)$
  o Positive autocorrelation → if a measurement is over the mean, the next will tend to be over the mean
  o Negative autocorrelation → if a measurement is below the mean, the next will tend to be over the mean
• LAGGING OF ONE VARIABLE: create a second variable such that the observation at time t is paired with the observation of the same time series at time t − k (with lag k)
• AUTOCOVARIANCE FUNCTION: the autocovariance function (Montgomery) relates the variables $x_t$ and $x_{t-k}$: $\gamma_{t,k} = \mathrm{Cov}(X_t, X_{t-k}) = E\!\left[(X_t - \mu_t)(X_{t-k} - \mu_{t-k})\right]$ with $k \in \mathbb{N}$ = lag
  o ➔ if $X_t$ and $X_{t-k}$ are both larger than μ the factors have positive sign, and the same happens if they are both negative
  o For stationary processes ($\mu_t = \mu$) we have: $\gamma_{t,k} = \gamma_k$
    ▪ For k = 0 → $\gamma_0 = E\!\left[(X_t - \mu)^2\right] = V(X_t) = \sigma^2$
  o SAMPLE AUTOCOVARIANCE FUNCTION: $c_k = \hat{\gamma}_k = \frac{1}{n}\sum_{t=1}^{n-k}(x_t - \bar{x})(x_{t+k} - \bar{x})$
• AUTOCORRELATION FUNCTION: $\rho_k = \frac{\gamma_k}{\gamma_0} = \frac{\mathrm{Cov}(X_t, X_{t-k})}{\mathrm{Var}(X_t)}$ → $-1 \le \rho_k \le 1$
  o We plot the function only for $k \ge 0$
  o Since $\gamma_k = \gamma_{-k}$ → $\rho_k = \rho_{-k}$
  o SAMPLE AUTOCORRELATION FUNCTION: $r_k = \hat{\rho}_k = \frac{\hat{\gamma}_k}{\hat{\gamma}_0} = \frac{\sum_{t=1}^{n-k}(x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2}$ with $k \le n/4$
• OBSERVATIONS:
  o Correlation and independence:
    ▪ If $x_t$ and $x_{t-k}$ are independent → they are uncorrelated ($\rho_k = 0$ for any k)
      • If they are uncorrelated they could still be dependent → correlation is a measure of linear dependence, but there are other types of dependence
    ▪ If $x_t$ and $x_{t-k}$ are correlated ($\rho_k \ne 0$ for some k) → they are dependent
  o Correlation and causality: correlation doesn't necessarily mean that there is a cause-effect relationship between the variables

Tests for the assumptions
• We can perform different types of tests to decide whether the data are appropriate or must be corrected:

| Assumption | Hypothesis test | Remedy in case of violation |
|---|---|---|
| Independence (random pattern) | Runs test, Bartlett's test, LBQ test | Gapping, batching, linear regression, time series (ARIMA) |
| Normal distribution | Normality test | Transform data |

Independence tests

Runs test for independence
• OBJECTIVE: checks whether the pattern is non-random
• It is a non-parametric test → no specific distributional assumption required
• STEPS:
  o The test classifies the data as lying above (+) or below (−) the sample mean
  o Count the number of runs (R = number of runs) → if the number of runs is close to an extreme situation the pattern is not random
    ▪ RUN: sequence of successive equal symbols that precedes a different symbol (example: +++−−−++−− = 4 runs; for the same 10 observations the two extreme situations would be +−+−+−+−+− = 10 runs or +++++−−−−− = 2 runs)
  o Hypothesis testing:
    ▪ H0: the process is random, so the number of runs behaves as expected under randomness → $R \sim N(E(R), V(R))$ with:
      • $E(R) = \frac{2m(n-m)}{n} + 1$ with n = number of data and m = number of +
      • $V(R) = \frac{2m(n-m)\left[2m(n-m)-n\right]}{n^2(n-1)}$
    ▪ H1: the process is not random and the number of runs is probably not compatible with this normal distribution
  o Set the level of α and check whether the hypothesis has to be rejected or not → whether the number of runs is compatible with randomness or not
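A sketch of the sample autocorrelation function and of the runs test above, assuming numpy/scipy; the helper name sample_acf and the simulated data are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.normal(0.0, 1.0, size=200)

def sample_acf(x, k):
    """r_k = sum_{t=1}^{n-k} (x_t - xbar)(x_{t+k} - xbar) / sum_t (x_t - xbar)^2"""
    xc = x - x.mean()
    return np.sum(xc[: len(x) - k] * xc[k:]) / np.sum(xc**2)

print("r_1..r_5:", [round(sample_acf(x, k), 3) for k in range(1, 6)])

# runs test: count runs of +/- around the sample mean and standardize the count
signs = x > x.mean()
runs = 1 + np.sum(signs[1:] != signs[:-1])      # a run ends whenever the symbol changes
n, m = len(x), signs.sum()
e_runs = 2 * m * (n - m) / n + 1
v_runs = 2 * m * (n - m) * (2 * m * (n - m) - n) / (n**2 * (n - 1))
z0 = (runs - e_runs) / np.sqrt(v_runs)
p_value = 2 * (1 - stats.norm.cdf(abs(z0)))
print(f"runs = {runs}, expected = {e_runs:.1f}, z0 = {z0:.2f}, p-value = {p_value:.3f}")
```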
Bartlett's test for autocorrelation
• OBJECTIVE: check whether there is autocorrelation in the data
• CONCEPT: for a random process (iid) the sample autocorrelation function is distributed as a normal distribution with mean 0 and variance 1/n: $r_k \sim N\!\left(0, \frac{1}{n}\right)$ ∀k
• STEPS:
  o Test statistic (sample autocorrelation function): $r_k = \hat{\rho}_k = \frac{\hat{\gamma}_k}{\hat{\gamma}_0} = \frac{\sum_{t=1}^{n-k}(x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2}$
  o Set the hypotheses:
    ▪ H0: $\rho_k = 0$ → there is no autocorrelation
    ▪ H1: $\rho_k \ne 0$ → there is a correlation at that lag
  o Calculate the rejection region: $|r_k| > \frac{z_{\alpha/2}}{\sqrt{n}}$
  o Set α = 0.05 and see whether the value of $r_k$ falls in the rejection region
• OBSERVATION: the test can't be used for different lags at the same time
  o ➔ when conducting multiple analyses on the same dependent variable, the chance of committing a type I error increases, thus increasing the likelihood of obtaining a significant result by pure chance → Bonferroni correction
• BONFERRONI INEQUALITY: we assume that we have N hypothesis tests (i = 1, 2, … N)
  o Each test has its own probability $\alpha_i$ of rejecting $H_{0i}$ when it is true
  o The family-wise first type error is α′
  o ➔ the probability of rejecting at least one null hypothesis when they are all true is $\alpha' \le \sum_{i=1}^{N}\alpha_i$ → Bonferroni inequality
  o ➔ for independent tests it can be shown that $1 - \alpha' = \prod_{i=1}^{N}(1 - \alpha_i)$
    ▪ If we set the same α for all the tests ($\alpha_i = \alpha$ ∀i) → $\alpha' = 1 - (1-\alpha)^N$ → $\alpha = 1 - (1-\alpha')^{1/N}$
  o ➔ we can thus build intervals that constrain the family error rate:
    ▪ Choose the nominal family error rate $\alpha'_{nom}$
    ▪ For each of the N tests to be performed (using the same set of data) choose $\alpha_i = \frac{\alpha'_{nom}}{N}$ ∀i = 1, … N → to have an overall first type error $\alpha' \le \sum_{i=1}^{N}\alpha_i = \alpha'_{nom}$
• BARTLETT'S TEST FOR MORE LAGS: if we have L different lags (k = 1 … L) we set $\alpha_i = \frac{\alpha'_{nom}}{L}$
  o Rejection region: $|r_k| > \frac{z_{\alpha'_{nom}/(2L)}}{\sqrt{n}}$ ∀k = 1 … L

LBQ test for autocorrelation (Ljung-Box-Pierce)
• OBJECTIVE: check whether there is autocorrelation in the data
• CONCEPT: for a random process (iid) we have $Q = n(n+2)\sum_{k=1}^{L}\frac{r_k^2}{n-k} \sim \chi^2_L$ → in general we take $L \le \sqrt{n}$
• STEPS:
  o Set the hypotheses:
    ▪ H0: $\rho_i = 0$ with i = 1 … L → there is no autocorrelation
    ▪ H1: $\exists\, i \in [1, L]$ such that $\rho_i \ne 0$ → there is a correlation for at least one lag
  o Calculate the rejection region: $Q > \chi^2_{\alpha,L}$
• ➔ link between Bartlett's test and LBQ: the Q statistic combines into a single test, over the first L lags, the same sample autocorrelations $r_k$ that Bartlett's test checks one lag at a time
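A sketch of the Bartlett limits with the Bonferroni correction and of the LBQ statistic, assuming numpy/scipy; L = √n and the nominal family error rate 0.05 follow the rules above, everything else is illustrative (the sample_acf helper repeats the one sketched earlier):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(21)
x = rng.normal(size=150)
n = len(x)

def sample_acf(x, k):
    xc = x - x.mean()
    return np.sum(xc[: len(x) - k] * xc[k:]) / np.sum(xc**2)

L = int(np.sqrt(n))                     # number of lags tested, L <= sqrt(n)
alpha_family = 0.05
alpha_i = alpha_family / L              # Bonferroni: alpha' <= sum of the alpha_i
limit = stats.norm.ppf(1 - alpha_i / 2) / np.sqrt(n)

r = np.array([sample_acf(x, k) for k in range(1, L + 1)])
print(f"Bartlett/Bonferroni limit = ±{limit:.3f}, lags exceeding it: {np.sum(np.abs(r) > limit)}")

# Ljung-Box-Pierce: Q ~ chi^2_L under H0 (no autocorrelation at the first L lags)
Q = n * (n + 2) * np.sum(r**2 / (n - np.arange(1, L + 1)))
p_value = 1 - stats.chi2.cdf(Q, df=L)
print(f"Q = {Q:.2f}, p-value = {p_value:.3f}")
```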
Normal distribution test
• We can use graphical tests to get an idea of what type of distribution the data follow:
  o Histogram → check whether it is symmetric
  o Boxplot → to see if the distribution is symmetric
  o ➔ this is not an assurance of normality, it could just be a symmetric distribution → we need quantitative tests
• GOODNESS OF FIT TESTS: the goodness of fit (GOF) tests measure the agreement of a random sample with a theoretical probability distribution function → we can run a test for each type of distribution we think the data might follow
  o ➔ the procedure consists of defining a test statistic: a random variable calculated from the sample data to determine whether to reject the null hypothesis → the test statistic compares your data with what is expected under the null hypothesis
  o ANDERSON-DARLING FOR NORMALITY: very specific and so more precise

Chi-squared test for any distribution
• OBJECTIVE: determine whether the data follow a certain distribution → very flexible
• CONCEPT: the idea is to compare, bin by bin, the height of each bin with the one that comes from the model: the more similar the two heights, the better the fitted model approaches the histogram → this test is applied to binned data with k = number of bins ($k = 1 + \log_2 N$)
• STEPS:
  o Define:
    ▪ $O_i$ = observed frequency in class i (each class is a bin)
    ▪ $E_i$ = expected frequency in class i
      ➔ probability that our variable Y lies within the bin limits (probability of being lower than the upper limit of bin i minus probability of being lower than the lower limit): $E_i = N\left(P(Y \le Y_{u,i}) - P(Y \le Y_{l,i})\right) = N\left(F(Y_{u,i}) - F(Y_{l,i})\right)$ with:
      • N = sample size
      • $Y_{u,i}$ = upper limit of the i-th class
      • $Y_{l,i}$ = lower limit of the i-th class
  o Test statistic: $\chi^2 = \sum_{i=1}^{K}\frac{(O_i - E_i)^2}{E_i}$
  o Set the hypotheses:
    ▪ H0: the data follow the given distribution F (F can be whatever distribution we want) → under H0 the statistic follows $\chi^2_{K-c}$, with c = number of estimated parameters (for a normal distribution c = 2)
    ▪ H1: the data do not follow the F distribution
  o Calculate the rejection region: $\chi^2 > \chi^2_{\alpha,K-c}$
• ➔ graphically: compares binned data (as in a histogram) to the curve of the fitted distribution

Anderson-Darling test for normality
• ➔ very specific and so more precise
• OBJECTIVE: a statistical test of whether or not a dataset comes from the normal distribution
• STEPS:
  o Test statistic: $A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F\!\left(x_{[i]}\right) + \ln\!\left(1 - F\!\left(x_{[n+1-i]}\right)\right)\right]$ with:
    ▪ n = sample size
    ▪ F(x) = cumulative distribution function of the specified distribution
    ▪ [i] = the i-th sample when the data are sorted in ascending order → observations are ordered from smallest to biggest
  o ➔ for small sample sizes the test can be distorted and we need a corrective factor: $A^{2*} = A^2\left(1 + \frac{0.75}{n} + \frac{2.25}{n^2}\right)$
  o Define the hypotheses:
    ▪ H0: the data follow the normal distribution
    ▪ H1: the data do not follow the normal distribution
• ➔ if we use a software to do this we have 2 outputs:
  o Quantitative (value of the test statistic and p-value) → if the p-value is very small we can reject the null hypothesis
  o Qualitative → graphical part: the normal probability plot (special case of the probability plot)
    ▪ On the x axis we have the values in the sample, ordered from smaller to larger
    ▪ On the y axis we have the estimated cumulative probability → if we compute the estimated cumulative probability on numbers that come from a normal distribution, the points will stand on a straight line
    ▪ ➔ the more the observations deviate from the line, the more the graph tells us that the data do not come from a normal distribution
    ▪ If data are below the red line → there are too few data in the corresponding tail of the distribution
    ▪ If data are above the red line → there are too many data in the corresponding part of the distribution

Remedies in case of non-independence

Gapping
• GAPPING: reducing the sampling frequency (subsampling) → if I have autocorrelation between $x_t$ and $x_{t-k}$ (k = lag) I can subsample my dataset by taking one data point every k
  o Example: if I have $x_1, x_2, x_3, \dots, x_n$ with high autocorrelation at lag 10, I can keep one data point out of 10 → the new dataset will be $x_{10}, x_{20}, x_{30}, \dots$
• OBSERVATION: we risk losing some information and losing normality, because we are reducing the dataset (the central limit theorem may no longer apply)

Batching
• BATCHING: to remove autocorrelation we can divide the dataset into sequential batches that do not overlap and for each of them consider the sample mean → I have j batches of b observations, and for each batch I keep just the value $\bar{x}_j = \frac{1}{b}\sum_{i=1}^{b} x_{(j-1)b+i}$
  o Example: I have 1000 observations that I batch every 10 → I get 100 batches of 10 observations each; in the new dataset I have 100 values that correspond to the sample mean of each batch
• OBSERVATIONS:
  o With this method I can get rid of the non-randomness and the non-normality of the data, because we can apply the central limit theorem to the batched dataset
  o Disadvantage: difficulty in defining the appropriate value of b (batch size) → empirical approaches
• Empirical approach to determine the batch size (sketched in code below):
  1. Initialise b = 1
  2. Compute the autocorrelation coefficient at the first lag of the batched series
  3. If the coefficient is smaller than 0.1 go to step 5; else go to step 4
  4. Set b = 2·b and go to step 2
  5. End
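A sketch of batching, gapping and the empirical batch-size rule above, assuming numpy; the AR(1)-style simulated data and the lag-10 gap are illustrative, and the 0.1 threshold comes from the rule above:

```python
import numpy as np

rng = np.random.default_rng(13)
n, phi = 2000, 0.8
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):                      # autocorrelated (AR(1)-like) data
    x[t] = phi * x[t - 1] + rng.normal()

def lag1_acf(y):
    yc = y - y.mean()
    return np.sum(yc[:-1] * yc[1:]) / np.sum(yc**2)

def batch_means(y, b):
    m = len(y) // b                        # non-overlapping batches of size b
    return y[: m * b].reshape(m, b).mean(axis=1)

b = 1
while abs(lag1_acf(batch_means(x, b))) >= 0.1:   # stop when the lag-1 autocorrelation is small
    b *= 2
print(f"chosen batch size b = {b}, lag-1 acf of batch means = {lag1_acf(batch_means(x, b)):.3f}")

gapped = x[::10]                           # gapping: keep one observation every 10
print(f"gapped series length = {gapped.size}, lag-1 acf = {lag1_acf(gapped):.3f}")
```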
• OBSERVATIONS ON BATCHING AND GAPPING:
  o Both approaches are applicable to stationary processes (constant mean)
  o Both approaches induce a loss of information
  o They are approaches which avoid the autocorrelation issue rather than actually dealing with it

Remedies in case of non-normality
• MIXTURE: data come from two different distributions → it can occur with a change in the settings of an operation, or when collecting measurements from different machines or from different vendors (2 hills in the histogram)
  o ➔ if data are not normal it doesn't mean they don't have a distribution → we need a different model for that phenomenon
• CAUSES OF NON-NORMALITY: it can be intrinsically related to the process:
  o Distortion measurements (eccentricity, roughness)
  o Electrical phenomena: capacitance, insulation resistance
  o Small levels of substances in the material: porosity, contaminants
  o Other physical properties (ultimate tensile stress, time to failure)
  o Waiting times
  o Km/day for a sales representative
  o Time to repair
• SOLUTIONS TO NON-NORMALITY:
  o Use the real distribution instead of the normal one
  o Manage data sampling to deal with sample averages instead of single data points → apply the central limit theorem to get a normal distribution
  o Nonparametric methods (e.g. runs test) → use different methods that do not require normality: they do not rely on distributional assumptions
    ▪ Observation: they are robust to outliers and cannot detect them → pay attention to outliers, since an outlier (seen from the graph) could belong to a different population, which means that we are not under the assumption of identically distributed variables
  o Transform data: simply pass from a distribution x that you don't know to a function of the distribution that is normal → $g(X) \sim N$
    ▪ ➔ example: $g(X) = \log(X)$

Box-Cox
• POWER TRANSFORMATION (BOX-COX): g(x) is calculated as a power of x for every λ ≠ 0 → $g(X) = x^{\lambda}$: we just have to find λ, and for this we use the Box-Cox plot
• ➔ in this example, if we use λ …

… α-to-remove) the associated regressor should be removed from the model

Steps
• We can consider a dataset as an example: mistakes in the medication of a medical centre → the target is to see if there is an improvement:
  o Types of errors:
    ▪ Missing medication
    ▪ Wrong medication
    ▪ Incorrect dosage
• POSSIBLE REGRESSORS: the regressors could be $t$, $t^2$, $\frac{1}{t}$ → we need to choose the ones to include
• STEP 1a FORWARD SELECTION: in this case we need to add the first regressor, and we can compute three different models, each with one of the candidate regressors
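The notes break off in the middle of the forward-selection step; what follows is only a hypothetical sketch of that first step, assuming scipy, a made-up monthly error count y and the three candidate regressors t, t², 1/t, each fitted alone and compared through the p-value of its slope. The α-to-enter threshold of 0.05 is an assumption, not taken from the course:

```python
import numpy as np
from scipy import stats

# Hypothetical data: monthly medication-error counts that improve over time
rng = np.random.default_rng(2)
t = np.arange(1, 25, dtype=float)
y = 40 - 8 * np.log(t) + rng.normal(0, 3, size=t.size)

candidates = {"t": t, "t^2": t**2, "1/t": 1 / t}
alpha_to_enter = 0.05                       # assumed threshold

# Step 1a of forward selection: fit one simple regression per candidate regressor
results = {}
for name, x in candidates.items():
    fit = stats.linregress(x, y)            # returns slope, intercept, rvalue, pvalue, stderr
    results[name] = fit.pvalue
    print(f"y ~ {name:4s}: slope p-value = {fit.pvalue:.4g}")

best = min(results, key=results.get)
if results[best] < alpha_to_enter:
    print(f"enter '{best}' into the model first (smallest p-value below alpha-to-enter)")
else:
    print("no regressor enters the model")
```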