# Engineering Probability Class 25 Mon 2019-04-15

## 1 iClicker

All these questions concern math SAT scores, which we assume have a mean of 500 and standard deviation of 100.

- What is the probability that one particular score is between 400 and 600?
  - .34
  - .68
  - .96
  - .98
  - .9974

- I take a random sample of 4 students, and compute the mean of their 4 scores.
  What is the probability that that mean is between 400 and 600?
  - .34
  - .68
  - .96
  - .98
  - .9974

- I take a random sample of 9 students, and compute the mean of their 9 scores.
  What is the probability that that mean is between 400 and 600?
  - .34
  - .68
  - .96
  - .98
  - .9974

- What is the standard deviation of the mean of the 4-student sample?
  - 25
  - 33
  - 50
  - 100
  - 200

- What is the standard deviation of the mean of the 9-student sample?
  - 25
  - 33
  - 50
  - 100
  - 200
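
To check answers to questions like these numerically, here is a minimal sketch. It assumes that the scores are Gaussian (the questions state only the mean and standard deviation) and that the students are sampled independently, so the mean of n scores has standard deviation $\sigma/\sqrt{n}$. The function names `normal_cdf` and `prob_mean_between` are made up for this illustration.

```python
from math import erf, sqrt

def normal_cdf(x, mean=0.0, sd=1.0):
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

def prob_mean_between(lo, hi, n, mu=500.0, sigma=100.0):
    """P(lo < mean of n scores < hi), assuming Gaussian scores (an assumption)."""
    sd_of_mean = sigma / sqrt(n)  # standard deviation of the sample mean
    return normal_cdf(hi, mu, sd_of_mean) - normal_cdf(lo, mu, sd_of_mean)

if __name__ == "__main__":
    for n in (1, 4, 9):
        sd_of_mean = 100.0 / sqrt(n)
        p = prob_mean_between(400, 600, n)
        print(f"n={n}: sd of mean = {sd_of_mean:.1f}, P(400 < mean < 600) = {p:.4f}")
```

Averaging more scores shrinks the standard deviation of the mean, so the same 400 to 600 interval covers more standard deviations and the probability goes up.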

## 2 Statistics

Now we learn **statistics**. That means determining the parameters of a population by sampling it. In **probability**, we already know the parameters, and calculate things from them.

We'll start with Leon-Garcia Chapter 8, and add stuff to it.

This course module fits with RPI's goal of a data dexterity requirement for undergrads. Pres Jackson mentioned this at the spring town meeting; see https://president.rpi.edu/speeches/2019/remarks-spring-town-meeting .

Disclosure: Prof Dave Mendonca and I are co-chairs of the SoE Data Dexterity Task Force, working out details of this.

### 2.1 Hypothesis testing, from text (plus extras)

- Say we want to test whether the average height of RPI students (called the population) is 2m.
- We assume that the distribution is Gaussian (normal) and that the standard deviation of heights is, say, 0.2m.
- However we don't know the mean.
- We do an experiment and measure the heights of n=100 random students. Their mean height is, say, 1.9m.
- The question on the table is, is the population mean 2m?
- This is different from the earlier question that we analyzed, which was this: What is the most likely population mean? (Answer: 1.9m.)
- Now we have a hypothesis (that the population mean is 2m) that we're testing.
- The standard way that this is handled is as follows.
- Define a null hypothesis, called H0, that the population mean is 2m.
- Define an alternate hypothesis, called HA, that the population mean is not 2m.
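- Note that, if H0 is true, we observed our sample mean to be 0.1m, i.e., $0.5 \sigma$, below the population mean. However, the mean of n=100 heights has a standard deviation of only $\sigma/\sqrt{n} = 0.02$m, so our sample mean is 5 of those standard deviations below 2m.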
- Each time we rerun the experiment (measure 100 students) we'll observe a different number.
- We compute the probability that, if H0 is true, our sample mean would be this far from 2m.
- Depending on what our underlying model of students is, we might use a 1-tail or a 2-tail probability.
- Perhaps we think that the population mean might be less than 2m but it's not going to be more. Then a 1-tail distribution makes sense.
- That is, our assumptions affect the results.
- The 1-tail probability is Q(5), which is about $3 \times 10^{-7}$; the 2-tail probability is 2Q(5). Either is very small.
- Therefore we reject H0 and accept HA.
- We make a type-1 error if we reject H0 and it was really true. See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
- We make a type-2 error if we accept H0 and it was really false.
- These two errors trade off: by reducing the probability of one we increase the probability of the other, for a given sample size.
- E.g. in a criminal trial we prefer that a guilty person go free to having an innocent person convicted.
- Rejecting H0 says nothing about what the population mean really is, just that it's not likely 2m.
- Enrichment. Random sampling is hard. The US government got it wrong here: http://politics.slashdot.org/story/11/05/13/2249256/Algorithm-Glitch-Voids-Outcome-of-US-Green-Card-Lottery
- The above tests, called **z-tests**, assumed that we know the population variance.
- If we don't know the population variance, we can estimate it by sampling.
- We can combine estimating the population variance with testing the hypothesis into one test, called the **t-test**.
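
The z-test on the height example can be sketched as follows, using only the Python standard library. The numbers (hypothesized mean 2m, standard deviation 0.2m, n=100, sample mean 1.9m) come from the example above. The function names `q_function` and `z_test` and the significance level `ALPHA = 0.05` are assumptions for illustration; the notes do not fix a threshold.

```python
from math import erfc, sqrt

def q_function(z):
    """Q(z): probability that a standard normal exceeds z."""
    return 0.5 * erfc(z / sqrt(2.0))

def z_test(sample_mean, mu0, sigma, n, two_tailed=True):
    """Return (z, p) for H0: population mean == mu0, with known sigma.

    The 1-tail p-value here is the tail on the side where the sample
    mean was actually observed.
    """
    sd_of_mean = sigma / sqrt(n)          # standard deviation of the sample mean
    z = (sample_mean - mu0) / sd_of_mean  # deviation in units of sd_of_mean
    tail = q_function(abs(z))
    p = 2.0 * tail if two_tailed else tail
    return z, p

ALPHA = 0.05  # hypothetical significance level, not from the notes

if __name__ == "__main__":
    for two_tailed in (False, True):
        z, p = z_test(1.9, 2.0, 0.2, 100, two_tailed=two_tailed)
        kind = "2-tail" if two_tailed else "1-tail"
        verdict = "reject H0" if p < ALPHA else "do not reject H0"
        print(f"{kind}: z = {z:.2f}, p = {p:.3g}, {verdict}")
```

Whether to use the 1-tail or the 2-tail value depends on the model of the population, as noted above; here both are far below any common threshold, so the choice does not change the conclusion.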

### 2.2 Dr Nic's videos

Understanding the Central Limit Theorem https://www.youtube.com/watch?v=_YOr_yYPytM

Variation and Sampling Error https://www.youtube.com/watch?v=y3A0lUkpAko

Understanding Statistical Inference https://www.youtube.com/watch?v=tFRXsngz4UQ

Understanding Hypothesis testing, p-value, t-test - Statistics Help https://www.youtube.com/watch?v=0zZYBALbZgg

### 2.3 Research By Design videos

10-1 Guinness, Student, and the History of t Tests https://www.youtube.com/watch?v=bqfcFCjaE1c