Engineering Probability Class 25 Mon 2019-04-15
1 Iclicker
All these questions concern math SAT scores, which we assume have a mean of 500 and standard deviation of 100.
- What is the probability that one particular score is between 400 and 600?
- .34
- .68
- .96
- .98
- .9974
- I take a random sample of 4 students, and compute the mean of their 4 scores.
What is the probability that that mean is between 400 and 600?
- .34
- .68
- .96
- .98
- .9974
- I take a random sample of 9 students, and compute the mean of their 9 scores.
What is the probability that that mean is between 400 and 600?
- .34
- .68
- .96
- .98
- .9974
- What is the standard deviation of the mean of the 4-student sample?
- 25
- 33
- 50
- 100
- 200
- What is the standard deviation of the mean of the 9-student sample?
- 25
- 33
- 50
- 100
- 200
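The questions above can be checked numerically. This is a minimal sketch, assuming (the questions do not say so explicitly) that the scores are Gaussian, so that the mean of n independent scores is Gaussian with standard deviation 100/sqrt(n). The helper names `normal_cdf` and `prob_mean_between` are made up for this illustration.

```python
from math import erf, sqrt

def normal_cdf(x, mean=0.0, sd=1.0):
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

def prob_mean_between(lo, hi, n, mean=500.0, sd=100.0):
    """P(lo < mean of n scores < hi), ASSUMING Gaussian, independent scores.

    The mean of n independent scores has standard deviation sd / sqrt(n).
    """
    sd_of_mean = sd / sqrt(n)
    return normal_cdf(hi, mean, sd_of_mean) - normal_cdf(lo, mean, sd_of_mean)

for n in (1, 4, 9):
    sd_of_mean = 100.0 / sqrt(n)
    p = prob_mean_between(400, 600, n)
    print(f"n={n}: sd of mean = {sd_of_mean:.1f}, P(400 < mean < 600) = {p:.4f}")
```

The interval 400 to 600 is 1, 2, and 3 standard deviations of the mean on each side for n = 1, 4, and 9, which is why the probability rises with the sample size.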
2 Statistics
Now, we learn statistics. That means determining parameters of a population by sampling it. In probability, we already know those parameters, and calculate things from them.
We'll start with Leon-Garcia Chapter 8, and add stuff to it.
This course module fits with RPI's goal of a data dexterity requirement for undergrads. Pres Jackson mentioned this at the spring town meeting; see https://president.rpi.edu/speeches/2019/remarks-spring-town-meeting .
Disclosure: Prof Dave Mendonca and I are co-chairs of the SoE Data Dexterity Task Force, working out details of this.
2.1 Hypothesis testing, from text (plus extras)
- Say we want to test whether the average height of RPI students (called the population) is 2m.
- We assume that the distribution is Gaussian (normal) and that the standard deviation of heights is, say, 0.2m.
- However we don't know the mean.
- We do an experiment and measure the heights of n=100 random students. Their mean height is, say, 1.9m.
- The question on the table is, is the population mean 2m?
- This is different from the earlier question that we analyzed, which was this: What is the most likely population mean? (Answer: 1.9m.)
- Now we have a hypothesis (that the population mean is 2m) that we're testing.
- The standard way that this is handled is as follows.
- Define a null hypothesis, called H0, that the population mean is 2m.
- Define an alternate hypothesis, called HA, that the population mean is not 2m.
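- Note that, if H0 is true, we observed our sample mean to be $0.5 \sigma$ below the population mean, where $\sigma$ = 0.2m is the population standard deviation. However, the standard deviation of the mean of n=100 samples is $\sigma/\sqrt{n}$ = 0.02m, so our sample mean is 5 of those standard deviations below 2m.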
- Each time we rerun the experiment (measure 100 students) we'll observe a different number.
- We compute the probability that, if H0 is true, our sample mean would be this far from 2m.
- Depending on what our underlying model of students is, we might use a 1-tail or a 2-tail probability.
- Perhaps we think that the population mean might be less than 2m but it's not going to be more. Then a 1-tail test makes sense.
- That is, our assumptions affect the results.
- The 1-tail probability is Q(5), about $2.9\times 10^{-7}$, which is very small. The 2-tail probability is 2Q(5), which is also very small.
- Therefore we reject H0 and accept HA.
- We make a type-1 error if we reject H0 and it was really true. See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
- We make a type-2 error if we accept (that is, fail to reject) H0 and it was really false.
- These two errors trade off: by reducing the probability of one we increase the probability of the other, for a given sample size.
- E.g. in a criminal trial we prefer that a guilty person go free to having an innocent person convicted.
- Rejecting H0 says nothing about what the population mean really is, just that it's not likely 2m.
- Enrichment. Random sampling is hard. The US government got it wrong here: http://politics.slashdot.org/story/11/05/13/2249256/Algorithm-Glitch-Voids-Outcome-of-US-Green-Card-Lottery
- The above tests, called z-tests, assumed that we know the population variance.
- If we don't know the population variance, we can estimate it by sampling.
- We can combine estimating the population variance with testing the hypothesis into one test, called the t-test.
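The z-test from the height example can be sketched in a few lines. This is a minimal illustration under the assumptions stated above (Gaussian heights, known population standard deviation 0.2m, n=100, observed sample mean 1.9m). The function names `q_function` and `z_test` are invented here, not taken from the text.

```python
from math import erfc, sqrt

def q_function(z):
    """Upper-tail probability of a standard Gaussian: Q(z) = P(Z > z)."""
    return 0.5 * erfc(z / sqrt(2.0))

def z_test(sample_mean, h0_mean, sigma, n):
    """Return (z, one_tail_p, two_tail_p) for a z-test with KNOWN sigma.

    z is how many standard deviations of the sample mean (sigma / sqrt(n))
    the observed sample mean lies from the mean claimed by H0.
    one_tail_p is the tail probability in the direction of the observed
    deviation; two_tail_p counts deviations at least this large either way.
    """
    sd_of_mean = sigma / sqrt(n)
    z = (sample_mean - h0_mean) / sd_of_mean
    one_tail_p = q_function(abs(z))
    two_tail_p = 2.0 * one_tail_p
    return z, one_tail_p, two_tail_p

# Numbers from the height example: H0 says the population mean is 2m.
z, p1, p2 = z_test(sample_mean=1.9, h0_mean=2.0, sigma=0.2, n=100)
print(f"z = {z:.2f}, 1-tail p = {p1:.3g}, 2-tail p = {p2:.3g}")
```

Either probability is far below any commonly used threshold such as 0.05 or 0.01, so H0 is rejected whichever tail convention we pick. The t-test is not shown, since it needs the t distribution, which the Python standard library does not provide.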
2.2 Dr Nic's videos
Understanding the Central Limit Theorem https://www.youtube.com/watch?v=_YOr_yYPytM
Variation and Sampling Error https://www.youtube.com/watch?v=y3A0lUkpAko
Understanding Statistical Inference https://www.youtube.com/watch?v=tFRXsngz4UQ
Understanding Hypothesis testing, p-value, t-test - Statistics Help https://www.youtube.com/watch?v=0zZYBALbZgg
2.3 Research By Design videos
10-1 Guinness, Student, and the History of t Tests https://www.youtube.com/watch?v=bqfcFCjaE1c