Engineering Probability Class 25 Mon 2019-04-15

1   iClicker

All these questions concern math SAT scores, which we assume are normally distributed with a mean of 500 and a standard deviation of 100.

  1. What is the probability that one particular score is between 400 and 600?
    1. .34
    2. .68
    3. .96
    4. .98
    5. .9974
  2. I take a random sample of 4 students, and compute the mean of their 4 scores. What is the probability that that mean is between 400 and 600?
    1. .34
    2. .68
    3. .96
    4. .98
    5. .9974
  3. I take a random sample of 9 students, and compute the mean of their 9 scores. What is the probability that that mean is between 400 and 600?
    1. .34
    2. .68
    3. .96
    4. .98
    5. .9974
  4. What is the standard deviation of the mean of the 4-student sample?
    1. 25
    2. 33
    3. 50
    4. 100
    5. 200
  5. What is the standard deviation of the mean of the 9-student sample?
    1. 25
    2. 33
    3. 50
    4. 100
    5. 200
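
To check answers to the questions above, here is a minimal sketch using only Python's standard library. It assumes the scores are normal, so the mean of n scores is normal with standard deviation $\sigma/\sqrt{n}$. The function name prob_mean_between is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA = 500.0, 100.0  # SAT mean and standard deviation from the questions above


def prob_mean_between(n, lo=400.0, hi=600.0):
    """P(lo < mean of n scores < hi), assuming normally distributed scores."""
    # The mean of n independent scores has standard deviation SIGMA / sqrt(n).
    sample_mean = NormalDist(MU, SIGMA / sqrt(n))
    return sample_mean.cdf(hi) - sample_mean.cdf(lo)


for n in (1, 4, 9):
    std_of_mean = SIGMA / sqrt(n)
    print(f"n={n}: std of mean = {std_of_mean:.1f}, "
          f"P(400 < mean < 600) = {prob_mean_between(n):.4f}")
```

The interval 400 to 600 is 1, 2, and 3 standard deviations of the mean on either side for n = 1, 4, and 9 respectively, so the printed probabilities are the usual 1-, 2-, and 3-sigma values. The multiple-choice options are rounded.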

2   Statistics

Now we learn statistics, that is, determining the parameters of a population by sampling it. In probability, we already know those parameters and calculate things from them.

We'll start with Leon-Garcia Chapter 8, and add stuff to it.

This course module fits with RPI's goal of a data dexterity requirement for undergrads. Pres Jackson mentioned this at the spring town meeting; see https://president.rpi.edu/speeches/2019/remarks-spring-town-meeting .

Disclosure: Prof Dave Mendonca and I are co-chairs of the SoE Data Dexterity Task Force, working out details of this.

2.1   Hypothesis testing, from text (plus extras)

  1. Say we want to test whether the average height of RPI students (all RPI students together are called the population) is 2m.
  2. We assume that the distribution is Gaussian (normal) and that the standard deviation of heights is, say, 0.2m.
  3. However we don't know the mean.
  4. We do an experiment and measure the heights of n=100 random students. Their mean height is, say, 1.9m.
  5. The question on the table is, is the population mean 2m?
  6. This is different from the earlier question that we analyzed, which was this: What is the most likely population mean? (Answer: 1.9m.)
  7. Now we have a hypothesis (that the population mean is 2m) that we're testing.
  8. The standard way that this is handled is as follows.
  9. Define a null hypothesis, called H0, that the population mean is 2m.
  10. Define an alternate hypothesis, called HA, that the population mean is not 2m.
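  11. Note that our observed sample mean is 0.1m, i.e., $0.5 \sigma$, below the hypothesized population mean. However, the mean of n=100 students has standard deviation $\sigma/\sqrt{n} = 0.02$m, so, if H0 is true, we observed a sample mean 5 of its standard deviations below the population mean.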
  12. Each time we rerun the experiment (measure 100 students) we'll observe a different number.
  13. We compute the probability that, if H0 is true, our sample mean would be this far from 2m.
  14. Depending on what our underlying model of students is, we might use a 1-tail or a 2-tail probability.
  15. Perhaps we think that the population mean might be less than 2m but it's not going to be more. Then a 1-tail distribution makes sense.
  16. That is, our assumptions affect the results.
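  17. The 1-tail probability is Q(5), about $2.9 \times 10^{-7}$, which is very small. The 2-tail probability is twice that, which is still very small.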
  18. Therefore we reject H0 and accept HA.
  19. We make a type-1 error if we reject H0 and it was really true. See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
  20. We make a type-2 error if we accept (more precisely, fail to reject) H0 and it was really false.
  21. These two errors trade off: by reducing the probability of one we increase the probability of the other, for a given sample size.
  22. E.g. in a criminal trial we prefer that a guilty person go free to having an innocent person convicted.
  23. Rejecting H0 says nothing about what the population mean really is, just that it's not likely 2m.
  24. Enrichment. Random sampling is hard. The US government got it wrong here: http://politics.slashdot.org/story/11/05/13/2249256/Algorithm-Glitch-Voids-Outcome-of-US-Green-Card-Lottery
  25. The above test, called a z-test, assumed that we know the population variance.
  26. If we don't know the population variance, we can estimate it by sampling.
  27. We can combine estimating the population variance with testing the hypothesis into one test, called the t-test.
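
The z-test in the steps above can be sketched in a few lines of Python, using the numbers from the height example (hypothesized mean 2m, $\sigma$ = 0.2m, n = 100, sample mean 1.9m). This is only an illustration: the names q_function and z_test are made up here, and the significance level alpha = 0.05 is an assumed, conventional threshold that the notes above do not specify.

```python
from math import erfc, sqrt


def q_function(z):
    """Gaussian tail probability Q(z) = P(Z > z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2.0))


def z_test(sample_mean, mu0, sigma, n, two_tailed=True):
    """Return (z, p) for H0: population mean == mu0, with known sigma.

    The 1-tail p-value is the tail on the side where the sample mean fell.
    """
    std_of_mean = sigma / sqrt(n)          # standard deviation of the sample mean
    z = (sample_mean - mu0) / std_of_mean  # distance from mu0 in those units
    p = q_function(abs(z))
    if two_tailed:
        p *= 2.0
    return z, p


alpha = 0.05  # assumed significance level, not from the notes

z, p_two = z_test(1.9, 2.0, 0.2, 100, two_tailed=True)
_, p_one = z_test(1.9, 2.0, 0.2, 100, two_tailed=False)

print(f"z = {z:.2f}")
print(f"1-tail p = {p_one:.3g}, 2-tail p = {p_two:.3g}")
print("reject H0" if p_two < alpha else "fail to reject H0")
```

Here z is about -5, so both the 1-tail probability Q(5) and the 2-tail probability 2Q(5) are far below any usual threshold, and H0 is rejected either way. When the sample mean is closer to 2m, the choice of 1-tail versus 2-tail can change the decision.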

2.2   Dr Nic's videos

Understanding the Central Limit Theorem https://www.youtube.com/watch?v=_YOr_yYPytM

Variation and Sampling Error https://www.youtube.com/watch?v=y3A0lUkpAko

Understanding Statistical Inference https://www.youtube.com/watch?v=tFRXsngz4UQ

Understanding Hypothesis testing, p-value, t-test - Statistics Help https://www.youtube.com/watch?v=0zZYBALbZgg

2.3   Research By Design videos

10-1 Guinness, Student, and the History of t Tests https://www.youtube.com/watch?v=bqfcFCjaE1c