Engineering Probability Exam 3 - Sat 2019-05-04

W Randolph Franklin (WRF), RPI

2019-05-04 00:00

Name, RPI email:

.



.

Rules:

You have 80 minutes.
You may bring three 2-sided 8.5"x11" papers with notes.
You may bring a calculator.
You may not share material with each other during the exam.
No collaboration or communication (except with the staff) is allowed.
Check that your copy of this test has all nine pages.
Each part of a question is worth 5 points.
When answering a question, don't just state your answer, prove it.
You may write FREE as your answer for two questions, and get the 5 points.

Questions:

These few questions are about the population of adult males, which has a mean of 70 inches and a standard deviation of 4 inches.
1. What is the probability that a particular person's height is between 68 and 74?
```
.











.
```
2. If we take a sample of 100, what is its mean?
```
.











.
```
3. What is its standard deviation?
```
.











.
```
These questions are about tossing 3 fair dice and looking at the 3 numbers that show. However, these dice have only 2 faces (to make this question easier). The faces are labeled 1 and 2.
1. What's the expected value of the number showing on the first die?
```
.











.
```
2. What's the pmf of the smallest die?
```
.











.
```
3. What's the expected value of the smallest die?
```
.











.
```
4. What's the probability that the smallest number is 1 given that the first number is 2?
```
.











.
```
5. What's the probability that the first number is 2 given that the smallest number is 1?
```
.











.
```
6. Are the first number and the smallest number independent? Prove your answer.
```
.











.
```
7. What is the MAP estimator for the smallest number, given that the first number is 2?
```
.











.
```
This question is about a continuous probability distribution on 2 variables.

$$f_{XY}(x,y) = \begin{cases} c (x+y) & \text{ if } (0\le x) \ \& \ (0\le y)\ \& \ (0\le x+y \le 1) \\ 0 & \text{ otherwise}\end{cases}$$

The nonzero region is the triangle with vertices (0,0), (1,0) and (0,1).

c is some constant, but I didn't tell you what it is.
1. What is c?
```
.











.
```
2. What is $F_{XY}(x,y)$?
```
.











.
```
3. What is $f_X(x)$?
```
.











.
```
4. Are X and Y independent?
```
.











.
```
5. What is $P[X\le Y]$ ?
```
.











.
```
6. What is $E[X]$?
```
.











.
```
7. What is $COV[X,Y]$?
```
.











.
```
8. What is $\rho_{X,Y}$?
```
.











.
```
9. What is $f_Y(y|x)$?
```
.











.
```
10. What is $E[Y|x]$?
```
.











.
```
I'm comparing two types of widgets, red and blue. Assume that the probability of each widget dying in a small interval dt, given that it was alive at the start, is independent of its age. Assume that the probability of the red widget dying in the next hour is 0.1%, for the blue, it's 0.01%.
1. Give the pdf for the red widget's lifetime. (You have enough info to do this; there is only one possible probability distribution.)
```
.











.
```
2. If you have 100 red widgets, what's the probability that their mean lifetime is within 10% of the population mean?
```
.











.
```
3. If you start two widgets at the same time, what's the probability that the red widget will last longer?
```
.











.
```

Normal distribution:

x          f(x)      F(x)      Q(x)                x          f(x)      F(x)      Q(x)
-3.0000    0.0044    0.0013    0.9987             0.1000    0.3970    0.5398    0.4602
-2.9000    0.0060    0.0019    0.9981             0.2000    0.3910    0.5793    0.4207
-2.8000    0.0079    0.0026    0.9974             0.3000    0.3814    0.6179    0.3821
-2.7000    0.0104    0.0035    0.9965             0.4000    0.3683    0.6554    0.3446
-2.6000    0.0136    0.0047    0.9953             0.5000    0.3521    0.6915    0.3085
-2.5000    0.0175    0.0062    0.9938             0.6000    0.3332    0.7257    0.2743
-2.4000    0.0224    0.0082    0.9918             0.7000    0.3123    0.7580    0.2420
-2.3000    0.0283    0.0107    0.9893             0.8000    0.2897    0.7881    0.2119
-2.2000    0.0355    0.0139    0.9861             0.9000    0.2661    0.8159    0.1841
-2.1000    0.0440    0.0179    0.9821             1.0000    0.2420    0.8413    0.1587
-2.0000    0.0540    0.0228    0.9772             1.1000    0.2179    0.8643    0.1357
-1.9000    0.0656    0.0287    0.9713             1.2000    0.1942    0.8849    0.1151
-1.8000    0.0790    0.0359    0.9641             1.3000    0.1714    0.9032    0.0968
-1.7000    0.0940    0.0446    0.9554             1.4000    0.1497    0.9192    0.0808
-1.6000    0.1109    0.0548    0.9452             1.5000    0.1295    0.9332    0.0668
-1.5000    0.1295    0.0668    0.9332             1.6000    0.1109    0.9452    0.0548
-1.4000    0.1497    0.0808    0.9192             1.7000    0.0940    0.9554    0.0446
-1.3000    0.1714    0.0968    0.9032             1.8000    0.0790    0.9641    0.0359
-1.2000    0.1942    0.1151    0.8849             1.9000    0.0656    0.9713    0.0287
-1.1000    0.2179    0.1357    0.8643             2.0000    0.0540    0.9772    0.0228
-1.0000    0.2420    0.1587    0.8413             2.1000    0.0440    0.9821    0.0179
-0.9000    0.2661    0.1841    0.8159             2.2000    0.0355    0.9861    0.0139
-0.8000    0.2897    0.2119    0.7881             2.3000    0.0283    0.9893    0.0107
-0.7000    0.3123    0.2420    0.7580             2.4000    0.0224    0.9918    0.0082
-0.6000    0.3332    0.2743    0.7257             2.5000    0.0175    0.9938    0.0062
-0.5000    0.3521    0.3085    0.6915             2.6000    0.0136    0.9953    0.0047
-0.4000    0.3683    0.3446    0.6554             2.7000    0.0104    0.9965    0.0035
-0.3000    0.3814    0.3821    0.6179             2.8000    0.0079    0.9974    0.0026
-0.2000    0.3910    0.4207    0.5793             2.9000    0.0060    0.9981    0.0019
-0.1000    0.3970    0.4602    0.5398             3.0000    0.0044    0.9987    0.0013
      0    0.3989    0.5000    0.5000

End of exam 3, total 100 points.

Engineering Probability Exam 3 Solutions - Sat 2019-05-04

W Randolph Franklin (WRF), RPI

2019-05-04 00:00

Name, RPI email:

WRF solutions

OK to give a formula w/o working it out.

Rules:

You have 80 minutes.
You may bring three 2-sided 8.5"x11" papers with notes.
You may bring a calculator.
You may not share material with each other during the exam.
No collaboration or communication (except with the staff) is allowed.
Check that your copy of this test has all nine pages.
Each part of a question is worth 5 points.
When answering a question, don't just state your answer, prove it.

Questions:

These few questions are about the population of adult males, which has a mean of 70 inches and a standard deviation of 4 inches.
1. What is the probability that a particular person's height is between 68 and 74?
  
  68 is mean - std/2
  
  74 is mean + std
  
  Q(-.5) - Q(1) = .69 - .16 = .53
2. If we take a sample of 100, what is its mean?
  
  70. The expected value of the sample mean equals the population mean.
3. What is its standard deviation?
  
  4/sqrt(100) = .4
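
The table lookups for these height questions can be checked numerically. This is a minimal sketch using Python's standard `statistics.NormalDist`; it assumes, as the solution implicitly does, that heights are normally distributed.

```python
from statistics import NormalDist

# Population parameters from the question; normality is an assumption.
mu, sigma = 70.0, 4.0
height = NormalDist(mu, sigma)

# P[68 < height < 74] = F(74) - F(68), the same as Q(-0.5) - Q(1).
p_between = height.cdf(74) - height.cdf(68)

# Standard deviation of the mean of n = 100 independent heights.
n = 100
std_of_mean = sigma / n ** 0.5

print(round(p_between, 4))   # 0.5328
print(std_of_mean)           # 0.4
```
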
These questions are about tossing 3 fair dice and looking at the 3 numbers that show. However, these dice have only 2 faces (to make this question easier). The faces are labeled 1 and 2.
1. What's the expected value of the number showing on the first die?
  
  (1 + 2)/2 = 1.5
2. What's the pmf of the smallest die?
  
  Enumeration works. 8 cases: 111, 112, 121, 122, 211, 212, 221, 222
  
  Let X = smallest die.
  
  p(X=1) = 7/8, p(X=2)=1/8
3. What's the expected value of the smallest die?
  
  1/8 * 2 + 7/8 * 1 = 9/8
4. What's the probability that the smallest number is 1 given that the first number is 2?
  
  Enumerate. p= 3/4
5. What's the probability that the first number is 2 given that the smallest number is 1?
  
  3/7
6. Are the first number and the smallest number independent? Prove your answer.
  
  p(1st is 2) = 1/2. p(smallest is 2) = 1/8. p(1st is 2 and smallest is 2) = 1/8 (only 222).
  
  1/2 * 1/8 = 1/16 ne 1/8. Not independent
7. What is the MAP estimator for the smallest number, given that the first number is 2?
  
  P[smallest is 1|first is 2] = 3/4, so MAP = 1
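
With only 8 outcomes, the dice questions can be checked by brute-force enumeration. The sketch below uses exact fractions; the helper name `prob` is made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely outcomes of three 2-faced dice (faces 1 and 2).
outcomes = list(product((1, 2), repeat=3))

def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

# pmf and expected value of the smallest die.
pmf_min = {k: prob(lambda o, k=k: min(o) == k) for k in (1, 2)}
e_min = sum(k * p for k, p in pmf_min.items())

p_first2 = prob(lambda o: o[0] == 2)
p_min1 = prob(lambda o: min(o) == 1)
p_min1_and_first2 = prob(lambda o: min(o) == 1 and o[0] == 2)

# Conditional probabilities from the definition P[A|B] = P[A and B] / P[B].
p_min1_given_first2 = p_min1_and_first2 / p_first2
p_first2_given_min1 = p_min1_and_first2 / p_min1

# Independence would require P[A and B] == P[A] P[B].
independent = p_min1_and_first2 == p_min1 * p_first2

print(e_min, p_min1_given_first2, p_first2_given_min1, independent)
```
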
This question is about a continuous probability distribution on 2 variables.

$$f_{XY}(x,y) = \begin{cases} c (x+y) & \text{ if } (0\le x) \ \& \ (0\le y)\ \& \ (0\le x+y \le 1) \\ 0 & \text{ otherwise}\end{cases}$$

The nonzero region is the triangle with vertices (0,0), (1,0) and (0,1).

c is some constant, but I didn't tell you what it is.
1. What is c?
  
  The integral is 1/3.
  
  c=3
2. What is $F_{XY}(x,y)$?
  
  $$F_{XY}(x,y)=\begin{cases} 0 & \text{ if } x\le0 \cup y\le0 \\ 1 & \text{ if } x\ge 1 \cap y\ge1 \\ 3/2 (x^2y+xy^2) & \text{ if } 0\le x \cap 0\le y \cap x+y\le1 \\ (\int_0^x\int_0^{1-x} + \int_0^{1-y}\int_{1-x}^y + \int_{1-y}^x\int_{1-x}^{1-x_0}) (3(x_0+y_0) dy_0dx_0) & \text{ otherwise}\end{cases}$$
  
  The last case above splits the nonzero integration region into two rectangles and a triangle.
  
  It's also acceptable to draw a figure and say something intelligent w/o being explicit about all the details.
3. What is $f_X(x)$?
  
  $f_X(x)= \int_0^{1-x}f_{XY}(x,y) dy = \int_0^{1-x} 3(x+y) dy = 3/2 (1-x^2), 0\le x\le1$
  
  Note that $\int_0^1 f_X(x) dx=1$, which is correct.
4. Are X and Y independent?
  
  $f_X(x)= 3/2 (1-x^2)$
  
  $f_Y(y)= 3/2 (1-y^2)$
  
  $f_{XY}(x,y)= 3(x+y)\ne f_X(x) f_Y(y)$
  
  no.
5. What is $P[X\le Y]$ ?
  
  Integrate over the part of the triangle where $x\le y$: $\int_0^{1/2} \int_{x_0}^{1-x_0} f_{XY}(x_0,y_0) dy_0 dx_0 = 1/2$
  
  That's reasonable because X and Y are symmetric.
6. What is $E[X]$?
  
  $E[X] = \int_0^1 x f_X(x) dx = 3/8$
7. What is $COV[X,Y]$?
  
  E[XY] = 1/10. E[Y] = E[X] = 3/8. COV = 1/10 - 3/8 * 3/8 = -13/320 = -.041.
8. What is $\rho_{X,Y}$?
  
  OK to write the formula: $\rho_{X,Y} = COV[X,Y] / (\sigma_X \sigma_Y)$
9. What is $f_Y(y|x)$?
  
  In the nonzero triangle, $f_Y(y|x)=f_{XY}(x,y)/f_X(x) = 2(x+y)/(1-x^2)$, for $0\le y\le 1-x$
10. What is $E[Y|x]$?
  
  $E[Y|x] = \int_0^{1-x} y f_Y(y|x) dy = \frac{(1-x)(2+x)}{3(1+x)}$
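
The constant and the moments for this triangle density can be checked exactly. This sketch relies on a standard identity that is not in these notes: the integral of $x^a y^b$ over the triangle $x\ge0, y\ge0, x+y\le1$ is $a!\,b!/(a+b+2)!$. The function names are made up for illustration.

```python
from fractions import Fraction
from math import factorial

def tri_moment(a, b):
    """Integral of x**a * y**b over the triangle x >= 0, y >= 0, x + y <= 1."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 2))

# Normalizing constant: c times the integral of (x + y) must equal 1.
c = 1 / (tri_moment(1, 0) + tri_moment(0, 1))

def expect(a, b):
    """E[X**a * Y**b] under f(x, y) = c (x + y) on the triangle."""
    return c * (tri_moment(a + 1, b) + tri_moment(a, b + 1))

e_x, e_y = expect(1, 0), expect(0, 1)
cov = expect(1, 1) - e_x * e_y
var_x = expect(2, 0) - e_x ** 2
var_y = expect(0, 2) - e_y ** 2
rho = float(cov) / float(var_x * var_y) ** 0.5

print(c, e_x, cov)   # 3 3/8 -13/320
```

The correlation coefficient `rho` comes out negative, which is plausible: inside the triangle, a large x forces a small y.
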
I'm comparing two types of widgets, red and blue. Assume that the probability of each widget dieing in a small interval dt, given that it was alive at the start, is independent of its age. Assume that the probability of the red widget dieing in the next hour is 0.1%, for the blue, it's 0.01%.
1. Give the pdf for the red widget's lifetime. (You have enough info to do this; there is only one possible probability distribution.)
  
  exponential.
  
  $f(x) = l e^{-lx}$ for $x\ge0$
  
  From the section on reliability, l=0.001 per hour.
2. If you have 100 red widgets, what's the probability that their mean lifetime is within 10% of the population mean?
  
  Normal approx works.
  
  Pop variance: $1/l^2$
  
  Variance of the sample mean: $1/(100 l^2)$
  
  Std of the sample mean: $1/(10 l)$
  
  Pop mean and expected sample mean: $1/l$
  
  sample mean w/i 10% of pop mean = w/i one std of the sample mean
  
  Prob: Q(-1) - Q(1) = .68
3. If you start two widgets at the same time, what's the probability that the red widget will last longer?
  
  let x be lifetime of a red widget, y blue.
  
  $l_1=0.001, l_2=0.0001$
  
  joint pdf (the two lifetimes are independent): $f(x,y) = l_1 l_2 \exp(-l_1x -l_2y)$
  
  $P[X>Y] = \int_0^\infty \int_0^x f(x,y) dy dx = l_2/(l_1+l_2) = 1/11$
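
The widget answers can be sanity-checked numerically. This is a rough sketch, assuming independent exponential lifetimes; the seed and the number of trials are arbitrary choices, not part of the exam.

```python
import random
from statistics import NormalDist

l_red, l_blue = 0.001, 0.0001   # failure rates per hour, from the question

# Closed form of the double integral for independent exponential lifetimes.
p_red_longer = l_blue / (l_red + l_blue)

# Monte Carlo check: draw both lifetimes and count how often red outlasts blue.
rng = random.Random(2019)
trials = 200_000
hits = sum(rng.expovariate(l_red) > rng.expovariate(l_blue) for _ in range(trials))
p_simulated = hits / trials

# Part 2: normal approximation, mean of 100 lifetimes within one std of the mean.
p_within_10_percent = NormalDist().cdf(1) - NormalDist().cdf(-1)

print(round(p_red_longer, 4), round(p_simulated, 4), round(p_within_10_percent, 4))
```

The simulated fraction should land close to 1/11, and the last number is the .68 from the normal table.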

Normal distribution:

x          f(x)      F(x)      Q(x)
-3.0000    0.0044    0.0013    0.9987
-2.9000    0.0060    0.0019    0.9981
-2.8000    0.0079    0.0026    0.9974
-2.7000    0.0104    0.0035    0.9965
-2.6000    0.0136    0.0047    0.9953
-2.5000    0.0175    0.0062    0.9938
-2.4000    0.0224    0.0082    0.9918
-2.3000    0.0283    0.0107    0.9893
-2.2000    0.0355    0.0139    0.9861
-2.1000    0.0440    0.0179    0.9821
-2.0000    0.0540    0.0228    0.9772
-1.9000    0.0656    0.0287    0.9713
-1.8000    0.0790    0.0359    0.9641
-1.7000    0.0940    0.0446    0.9554
-1.6000    0.1109    0.0548    0.9452
-1.5000    0.1295    0.0668    0.9332
-1.4000    0.1497    0.0808    0.9192
-1.3000    0.1714    0.0968    0.9032
-1.2000    0.1942    0.1151    0.8849
-1.1000    0.2179    0.1357    0.8643
-1.0000    0.2420    0.1587    0.8413
-0.9000    0.2661    0.1841    0.8159
-0.8000    0.2897    0.2119    0.7881
-0.7000    0.3123    0.2420    0.7580
-0.6000    0.3332    0.2743    0.7257
-0.5000    0.3521    0.3085    0.6915
-0.4000    0.3683    0.3446    0.6554
-0.3000    0.3814    0.3821    0.6179
-0.2000    0.3910    0.4207    0.5793
-0.1000    0.3970    0.4602    0.5398
      0    0.3989    0.5000    0.5000
 0.1000    0.3970    0.5398    0.4602
 0.2000    0.3910    0.5793    0.4207
 0.3000    0.3814    0.6179    0.3821
 0.4000    0.3683    0.6554    0.3446
 0.5000    0.3521    0.6915    0.3085
 0.6000    0.3332    0.7257    0.2743
 0.7000    0.3123    0.7580    0.2420
 0.8000    0.2897    0.7881    0.2119
 0.9000    0.2661    0.8159    0.1841
 1.0000    0.2420    0.8413    0.1587
 1.1000    0.2179    0.8643    0.1357
 1.2000    0.1942    0.8849    0.1151
 1.3000    0.1714    0.9032    0.0968
 1.4000    0.1497    0.9192    0.0808
 1.5000    0.1295    0.9332    0.0668
 1.6000    0.1109    0.9452    0.0548
 1.7000    0.0940    0.9554    0.0446
 1.8000    0.0790    0.9641    0.0359
 1.9000    0.0656    0.9713    0.0287
 2.0000    0.0540    0.9772    0.0228
 2.1000    0.0440    0.9821    0.0179
 2.2000    0.0355    0.9861    0.0139
 2.3000    0.0283    0.9893    0.0107
 2.4000    0.0224    0.9918    0.0082
 2.5000    0.0175    0.9938    0.0062
 2.6000    0.0136    0.9953    0.0047
 2.7000    0.0104    0.9965    0.0035
 2.8000    0.0079    0.9974    0.0026
 2.9000    0.0060    0.9981    0.0019
 3.0000    0.0044    0.9987    0.0013

End of exam 3, total 100 points.

Engineering Probability Class 28 Thu 2019-04-25

W Randolph Franklin (WRF), RPI

2019-04-25 00:00

2 Grade to date

Lingyu computed piazza and iclicker grades.

For iclickers:

They were used in 13 classes.
You got 1 point for each class that you used your iclicker.
Multiply the total by 10/13.

For piazza:

For each of 3 months, 1 point per contribution, up to 2 points.
Then multiply by 10/6.

I computed a percent grade to date.

I uploaded it in column Tot1 to LMS.

It cannot fall, but may rise, because:

You got knowitall points. I haven't yet included them.
Your homework 11 grade is higher than your lowest grade from hw1-10.
You write exam 3 and it helps.

If the class wishes, I can lower the weight of the piazza grade from 10% to 5%, and scale everything else up. Do you wish?

The letter grades will be at least as generous as the syllabus shows. I may lower the cutoffs.

I believe my courses to have higher GPAs than average.

3 Piazza and iclicker

They're a mess to compute.

I use them because I believe them to be pedagogically good.

However what do you, the class, think?

Engineering Probability Class 27 Mon 2019-04-22

W Randolph Franklin (WRF), RPI

2019-04-22 00:00

1 Grades

The 3rd exam will be the same length as the first two: 80 minutes.
I will distribute a guaranteed minimum grade at the end of the semester. If you are satisfied with that, you do not need to write the third exam.

2 Statistics videos

Regression: Crash Course Statistics #32 (12:40) https://www.youtube.com/watch?v=WWqE7YHR4Jc

3 8.4 Confidence intervals, p 430

The earlier videos introduced you to this.

4 Worked out problems

7.14a, p 403.
8.4 p 471.

Normal probability tables were given in class 20.
8.10, p 472.
8.24, p 474.

TABLE 3.1 Discrete random variables is page 115.

TABLE 4.1 Continuous random variables is page 164.

Engineering Probability Class 26 Thu 2019-04-18

W Randolph Franklin (WRF), RPI

2019-04-18 00:00

1 Homework 11

is online.

Sample book problems.

8.3, p 471.

2 Statistics

Here's a sampling of this large topic. There are many other tests, each for a particular purpose.

2.1 Statistics videos

10-1 Guinness, Student, and the History of t Tests (16:58) https://www.youtube.com/watch?v=bqfcFCjaE1c
12-2 ANOVA – Variance Between and Within (12:51) https://www.youtube.com/watch?v=fK_l63PJ7Og
15-1 Why Non Parametric Statistics? (6:52) https://www.youtube.com/watch?v=xA0QcbNxENs
Regression: Crash Course Statistics #32 (12:40) https://www.youtube.com/watch?v=WWqE7YHR4Jc

Engineering Probability Homework 11 due Thu 2019-04-25

W Randolph Franklin (WRF), RPI

2019-04-18 00:00

All questions are from the text.

Each part of a question is worth 5 points, except for 8.101.

8.1 (a-e) p 471. You decide how to generate the random samples, perhaps with Matlab or Mathematica.
8.2 (a-e).
8.49 (a-b), p 478.
8.101, p 486. 10 points.

Total: 70 points.

Engineering Probability Class 25 Mon 2019-04-15

W Randolph Franklin (WRF), RPI

2019-04-15 00:00

1 Iclicker

All these questions concern math SAT scores, which we assume have a mean of 500 and standard deviation of 100.

What is the probability that one particular score is between 400 and 600?
1. .34
2. .68
3. .96
4. .98
5. .9974
I take a random sample of 4 students, and compute the mean of their 4 scores. What is the probability that that mean is between 400 and 600?
1. .34
2. .68
3. .96
4. .98
5. .9974
I take a random sample of 9 students, and compute the mean of their 9 scores. What is the probability that that mean is between 400 and 600?
1. .34
2. .68
3. .96
4. .98
5. .9974
What is the standard deviation of the mean of the 4-student sample?
1. 25
2. 33
3. 50
4. 100
5. 200
What is the standard deviation of the mean of the 9-student sample?
1. 25
2. 33
3. 50
4. 100
5. 200
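
These iclicker questions all use the fact that the mean of n scores has standard deviation $\sigma/\sqrt{n}$. A minimal sketch for checking them, assuming the scores (and hence their means) are normally distributed; the function name is made up for illustration.

```python
from statistics import NormalDist

mu, sigma = 500.0, 100.0   # SAT mean and standard deviation from the questions

def prob_mean_between(n, lo=400.0, hi=600.0):
    """P[lo < mean of n scores < hi], assuming the mean is normally distributed."""
    sample_mean = NormalDist(mu, sigma / n ** 0.5)
    return sample_mean.cdf(hi) - sample_mean.cdf(lo)

# For each sample size: n, std of the sample mean, P[400 < mean < 600].
for n in (1, 4, 9):
    print(n, round(sigma / n ** 0.5, 1), round(prob_mean_between(n), 4))
```

The interval 400 to 600 is 1, 2 and 3 standard deviations of the mean wide on each side for n = 1, 4 and 9, so the probabilities come out near .68, .95 and .997.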

2 Statistics

Now, we learn statistics. That means determining the parameters of a population by sampling it. In probability, we already know the parameters, and calculate things from them.

We'll start with Leon-Garcia Chapter 8, and add stuff to it.

This course module fits with RPI's goal of a data dexterity requirement for undergrads. Pres Jackson mentioned this at the spring town meeting; see https://president.rpi.edu/speeches/2019/remarks-spring-town-meeting .

Disclosure: Prof Dave Mendonca and I are co-chairs of the SoE Data Dexterity Task Force, working out details of this.

2.1 Hypothesis testing, from text (plus extras)

Say we want to test whether the average height of an RPI student (called the population) is 2m.
We assume that the distribution is Gaussian (normal) and that the standard deviation of heights is, say, 0.2m.
However we don't know the mean.
We do an experiment and measure the heights of n=100 random students. Their mean height is, say, 1.9m.
The question on the table is, is the population mean 2m?
This is different from the earlier question that we analyzed, which was this: What is the most likely population mean? (Answer: 1.9m.)
Now we have a hypothesis (that the population mean is 2m) that we're testing.
The standard way that this is handled is as follows.
Define a null hypothesis, called H0, that the population mean is 2m.
Define an alternate hypothesis, called HA, that the population mean is not 2m.
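Note that we observed our sample mean to be $0.5 \sigma$ below the population mean, if H0 is true. The mean of 100 students has standard deviation $\sigma/\sqrt{100} = 0.02$m, so that is 5 standard deviations of the sample mean.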
Each time we rerun the experiment (measure 100 students) we'll observe a different number.
We compute the probability that, if H0 is true, our sample mean would be this far from 2m.
Depending on what our underlying model of students is, we might use a 1-tail or a 2-tail probability.
Perhaps we think that the population mean might be less than 2m but it's not going to be more. Then a 1-tail distribution makes sense.
That is, our assumptions affect the results.
The probability is Q(5), which is very small.
Therefore we reject H0 and accept HA.
We make a type-1 error if we reject H0 and it was really true. See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
We make a type-2 error if we accept H0 and it was really false.
These two errors trade off: by reducing the probability of one we increase the probability of the other, for a given sample size.
E.g. in a criminal trial we prefer that a guilty person go free to having an innocent person convicted.
Rejecting H0 says nothing about what the population mean really is, just that it's not likely 2m.
Enrichment. Random sampling is hard. The US government got it wrong here: http://politics.slashdot.org/story/11/05/13/2249256/Algorithm-Glitch-Voids-Outcome-of-US-Green-Card-Lottery
The above tests, called z-tests, assumed that we know the population variance.
If we don't know the population variance, we can estimate it by sampling.
We can combine estimating the population variance with testing the hypothesis into one test, called the t-test.
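
The z-test steps for the student-height example can be sketched as follows. The significance level `alpha = 0.05` is an assumed, conventional choice that does not appear in the notes.

```python
from statistics import NormalDist

mu0 = 2.0          # population mean under H0, in metres
sigma = 0.2        # known population standard deviation
n = 100            # sample size
sample_mean = 1.9  # observed mean of the sample

std_error = sigma / n ** 0.5            # standard deviation of the sample mean
z = (sample_mean - mu0) / std_error     # about -5

std_normal = NormalDist()
p_one_tail = std_normal.cdf(z)          # P[sample mean this far below mu0] = Q(5)
p_two_tail = 2 * p_one_tail             # this far from mu0 in either direction

alpha = 0.05       # significance level; an assumed, conventional choice
reject_h0 = p_two_tail < alpha

print(round(z, 3), p_one_tail, reject_h0)
```

Either tail choice gives a probability far below any usual significance level, so H0 is rejected.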

2.2 Dr Nic's videos

Understanding the Central Limit Theorem https://www.youtube.com/watch?v=_YOr_yYPytM

Variation and Sampling Error https://www.youtube.com/watch?v=y3A0lUkpAko

Understanding Statistical Inference https://www.youtube.com/watch?v=tFRXsngz4UQ

Understanding Hypothesis testing, p-value, t-test - Statistics Help https://www.youtube.com/watch?v=0zZYBALbZgg

2.3 Research By Design videos

10-1 Guinness, Student, and the History of t Tests https://www.youtube.com/watch?v=bqfcFCjaE1c

Engineering Probability Class 24 Thu 2019-04-11

W Randolph Franklin (WRF), RPI

2019-04-11 00:00

1 Iclicker questions

2 Material from text

2.1 Chapter 7, p 359, Sums of Random Variables

The long term goal of this section is to summarize information from a large group of random variables. E.g., the mean is one way. We will start with that, and go farther.

The next step is to infer the true mean of a large set of variables from a small sample.

2.2 Sums of random variables ctd

Let Z=X+Y.
$f_Z$ is convolution of $f_X$ and $f_Y$: $$f_Z(z) = (f_X * f_Y)(z)$$ $$f_Z(z) = \int f_X(x) f_Y(z-x) dx$$
Characteristic functions are useful. They are covered in Section 4.7.1 on page 184.

$$\Phi_X(\omega) = E[e^{j\omega X} ]$$
$\Phi_Z = \Phi_X \Phi_Y$.
This extends to the sum of n random variables: if $Z=\sum_i X_i$ then $\Phi_Z (\omega) = \Pi_i \Phi_{X_i} (\omega)$
E.g. Exponential with $\lambda=1$: $\Phi_1(\omega) = 1/(1-j\omega)$ (page 164).
Sum of m exponentials has $\Phi(\omega)= 1/{(1-j\omega)}^m$. That's called an m-Erlang.
Example 2: sum of n iid Bernoullis. Probability generating function is more useful for discrete random variables.
Example 3: sum of n iid Gaussians. $$\Phi_{X_1} = e^{j\mu\omega - \frac{1}{2} \sigma^2 \omega^2}$$ $$\Phi_{Z} = e^{jn\mu\omega - \frac{1}{2}n \sigma^2 \omega^2}$$ I.e., mean and variance sum.
As the number increases, no matter what distribution the initial random variable has (provided that its moments are finite), $\Phi$ for the sum starts looking like that of a Gaussian.
The mean $M_n$ of n random variables is itself a random variable.
As $n\rightarrow\infty$ $M_n \rightarrow \mu$.
That's a law of large numbers (LLN).
$E[ M_n ] = \mu$. It's an unbiased estimator.
$VAR[ M_n ] = \sigma ^2 / n$
Weak law of large numbers $$\forall \epsilon >0 \lim_{n\rightarrow\infty} P[ |M_n-\mu| < \epsilon] = 1$$
How fast does it happen? We can use Chebyshev, though that is very conservative.
Strong law of large numbers $$P [ \lim _ {n\rightarrow\infty} M_n = \mu ] =1$$
As $n\rightarrow\infty$, $F_{M_n}$ becomes Gaussian. That's the Central Limit Theorem (CLT).
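
The behavior of the sample mean $M_n$ can be seen in a small simulation. This is only a sketch: the exponential distribution with $\lambda=1$, the seed, and the number of repetitions are arbitrary choices for illustration.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(24)          # arbitrary seed, for repeatability
lam = 1.0                        # exponential rate, so mu = 1 and sigma^2 = 1

def sample_mean(n):
    """Mean M_n of n iid exponential(lam) variables."""
    return fmean(rng.expovariate(lam) for _ in range(n))

mean_of_mean = {}
var_of_mean = {}
for n in (1, 10, 100):
    means = [sample_mean(n) for _ in range(5000)]
    mean_of_mean[n] = fmean(means)       # should stay near mu = 1
    var_of_mean[n] = pvariance(means)    # should shrink like sigma^2 / n
    print(n, round(mean_of_mean[n], 3), round(var_of_mean[n], 4), 1 / (lam ** 2 * n))
```

The observed mean of $M_n$ stays near $\mu$ while its variance drops roughly in proportion to 1/n, which is what the law of large numbers needs.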

3 Counterintuitive things in statistics

Statistics has some surprising examples, which would appear to be impossible. Here are some.

Average income can increase faster in a whole country than in any part of the country.
1. Consider a country with two parts: east and west.
2. Each part has 100 people.
3. Each person in the west makes \$100 per year; each person in the east \$200.
4. The total income in the west is \$10K, in the east \$20K, and in the whole country \$30K.
5. The average income in the west is \$100, in the east \$200, and in the whole country \$150.
6. Assume that next year nothing changes except that one westerner moves east and gets an average eastern job, so he now makes \$200 instead of \$100.
7. The west now has 99 people @ \$100; its average income didn't change.
8. The east now has 101 people @ \$200; its average income didn't change.
9. The whole country's income is \$30100 for an average of \$150.50; that went up.
College acceptance rate surprise.
1. Imagine that we have two groups of people: Albanians and Bostonians.
2. They're applying to two programs at the university: Engineering and Humanities.
3. Here are the numbers. The fractions are accepted/applied.
  
  city-major    Engin    Human    Total
  Albanians     11/15    2/5      13/20
  Bostonians    4/5      7/15     11/20
  Total         15/20    9/20     24/40
  
  E.g, 15 Albanians applied to Engin; 11 were accepted.
4. Note that in Engineering, a smaller fraction of Albanian applicants was accepted than of Bostonian applicants.
5. Ditto in Humanities.
6. However in all, a larger fraction of Albanian applicants were accepted than Bostonian applicants.
I could go on.
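
The college acceptance surprise (often called Simpson's paradox) can be checked directly from the table. A minimal sketch with exact fractions; the dictionary layout and the function name `rate` are made up for illustration.

```python
from fractions import Fraction

# (accepted, applied) counts from the acceptance table.
table = {
    "Albanians":  {"Engin": (11, 15), "Human": (2, 5)},
    "Bostonians": {"Engin": (4, 5),   "Human": (7, 15)},
}

def rate(group, major=None):
    """Acceptance rate for one major, or over both majors if major is None."""
    cells = [table[group][major]] if major else table[group].values()
    accepted = sum(a for a, _ in cells)
    applied = sum(n for _, n in cells)
    return Fraction(accepted, applied)

for major in ("Engin", "Human", None):
    print(major or "Total", rate("Albanians", major), rate("Bostonians", major))
```

Albanians have the lower rate in each major but the higher rate overall, because most of them applied to the major with the higher acceptance rate.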

4 Relevant Xkcd comics

4.1 Chapter 8, Statistics

We have a population. (E.g., voters in next election, who will vote Democrat or Republican).
We don't know the population mean. (E.g., fraction of voters who will vote Democrat).
We take several samples (observations). From them we want to estimate the population mean and standard deviation. (Ask 1000 potential voters; 520 say they will vote Democrat. Sample mean is .52)
We want error bounds on our estimates. (.52 plus or minus .04, 95 times out of 100)
Another application: testing whether 2 populations have the same mean. (Is this batch of Guinness as good as the last one?)
Observations cost money, so we want to do as few as possible.
This gets beyond this course, but the biggest problems may be non-math ones. E.g., how do you pick a random likely voter? In the past phone books were used. In a famous 1936 Presidential poll, that biased the sample against poor people, who voted for Roosevelt.
In probability, we know the parameters (e.g., mean and standard deviation) of a distribution and use them to compute the probability of some event.

E.g., if we toss a fair coin 4 times what's the probability of exactly 4 heads? Answer: 1/16.
In statistics we do not know all the parameters, though we usually know what type the distribution is, e.g., normal. (We often know the standard deviation.)
1. We make observations about some members of the distribution, i.e., draw some samples.
2. From them we estimate the unknown parameters.
3. We often also compute a confidence interval on that estimate.
4. E.g., we toss an unknown coin 100 times and see 60 heads. A good estimate for the probability of that coin coming up heads is 0.6.
Some estimators are better than others, though that gets beyond this course.
1. Suppose I want to estimate the average height of an RPI student by measuring the heights of N random students.
2. The mean of the highest and lowest heights of my N students would converge to the population mean as N increased.
3. However the median of my sample would converge faster. Technically, the variance of the sample median is smaller than the variance of the sample hi-lo mean.
4. The mean of my whole sample would converge the fastest. Technically, the variance of the sample mean is smaller than the variance of any other estimator of the population mean. That's why we use it.
5. However perhaps the population's distribution is not normal. Then one of the other estimators might be better. It would be more robust.
(Enrichment) How to tell if the population is normal? We can do various plots of the observations and look. We can compute the probability that the observations would be this uneven if the population were normal.
An estimator may be biased. We have a distribution that is U[0,b] for unknown b. We take a sample of size n. The max of the sample has a mean n/(n+1)b though it converges to b as n increases.
Example 8.2, page 413: One-tailed probability. This is the probability that the mean of our sample is at least so far above the population mean. $$\alpha = P[\overline{X_n}-\mu > c] = Q\left( \frac{c}{\sigma_x / \sqrt{n} } \right)$$ Q is defined on page 169: $$Q(x) = \int_x^ { \infty} \frac{1}{\sqrt{2\pi} } e^{-\frac{t^2}{2} } dt$$
Application: You sample n=100 students' verbal SAT scores, and see $ \overline{X} = 550$. You know that $\sigma=100$. If $\mu = 525$, what is the probability that $\overline{X_n} > 550$ ?

Answer: Q(2.5) = 0.006
This means that if we take 1000 random samples of students, each with 100 students, and measure each sample's mean, then, on average, 6 of those 1000 samples will have a mean over 550.
This is often worded as: the probability of the population's mean being under 525 is 0.006. That is a different statement. The problem with saying it is that it presumes some probability distribution for the population mean.
The formula also works for the other tail, computing the probability that our sample mean is at least so far below the population mean.
The 2-tail probability is the probability that our sample mean is at least this far away from the population mean in either direction. It is twice the 1-tail probability.
All this also works when you know the probability and want to know c, the cutoff.
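
The one-tail formula and its inverse (finding the cutoff c from a given probability) can be sketched with the standard library's `NormalDist`. The function names here are made up for illustration, and the 5% level in the last line is just an example value.

```python
from statistics import NormalDist

std_normal = NormalDist()

def Q(x):
    """Upper tail of the standard normal: P[Z > x]."""
    return 1.0 - std_normal.cdf(x)

def one_tail_alpha(c, sigma, n):
    """alpha = P[sample mean - mu > c] = Q(c / (sigma / sqrt(n)))."""
    return Q(c / (sigma / n ** 0.5))

def cutoff(alpha, sigma, n):
    """Inverse problem: the c for which the one-tail probability is alpha."""
    return std_normal.inv_cdf(1.0 - alpha) * sigma / n ** 0.5

# The SAT application: n = 100, sigma = 100, sample mean 25 above mu.
alpha = one_tail_alpha(c=550 - 525, sigma=100, n=100)   # Q(2.5)
c_5_percent = cutoff(alpha=0.05, sigma=100, n=100)

print(round(alpha, 4), round(c_5_percent, 2))
```

Feeding the computed `alpha` back into `cutoff` recovers c = 25, which is a quick consistency check on both functions.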

4.2 Hypothesis testing

Say we want to test whether the average height of an RPI student (called the population) is 2m.
We assume that the distribution is Gaussian (normal) and that the standard deviation of heights is, say, 0.2m.
However we don't know the mean.
We do an experiment and measure the heights of n=100 random students. Their mean height is, say, 1.9m.
The question on the table is, is the population mean 2m?
This is different from the earlier question that we analyzed, which was this: What is the most likely population mean? (Answer: 1.9m.)
Now we have a hypothesis (that the population mean is 2m) that we're testing.
The standard way that this is handled is as follows.
Define a null hypothesis, called H0, that the population mean is 2m.
Define an alternate hypothesis, called HA, that the population mean is not 2m.
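Note that we observed our sample mean to be $0.5 \sigma$ below the population mean, if H0 is true. The mean of 100 students has standard deviation $\sigma/\sqrt{100} = 0.02$m, so that is 5 standard deviations of the sample mean.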
Each time we rerun the experiment (measure 100 students) we'll observe a different number.
We compute the probability that, if H0 is true, our sample mean would be this far from 2m.
Depending on what our underlying model of students is, we might use a 1-tail or a 2-tail probability.
Perhaps we think that the population mean might be less than 2m but it's not going to be more. Then a 1-tail distribution makes sense.
That is, our assumptions affect the results.
The probability is Q(5), which is very small.
Therefore we reject H0 and accept HA.
We make a type-1 error if we reject H0 and it was really true. See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
We make a type-2 error if we accept H0 and it was really false.
These two errors trade off: by reducing the probability of one we increase the probability of the other, for a given sample size.
E.g. in a criminal trial we prefer that a guilty person go free to having an innocent person convicted.
Rejecting H0 says nothing about what the population mean really is, just that it's not likely 2m.
(Enrichment) Random sampling is hard. The US government got it wrong here:

http://politics.slashdot.org/story/11/05/13/2249256/Algorithm-Glitch-Voids-Outcome-of-US-Green-Card-Lottery
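
The trade-off between type-1 and type-2 errors can be made concrete with a one-tail test that rejects H0 when the sample mean falls below some cutoff. This is only a sketch: computing a type-2 error rate requires a specific alternative, so the alternative mean of 1.95m and the three cutoffs are assumptions invented for this illustration.

```python
from statistics import NormalDist

sigma, n = 0.2, 100
std_error = sigma / n ** 0.5
h0 = NormalDist(2.0, std_error)     # sample-mean distribution if H0 is true
ha = NormalDist(1.95, std_error)    # ASSUMED alternative mean, for illustration

def error_rates(cutoff):
    """Reject H0 when the sample mean falls below cutoff (one-tail test)."""
    alpha = h0.cdf(cutoff)          # type-1: reject H0 although it is true
    beta = 1.0 - ha.cdf(cutoff)     # type-2: keep H0 although HA is true
    return alpha, beta

for cutoff in (1.94, 1.96, 1.98):
    alpha, beta = error_rates(cutoff)
    print(cutoff, round(alpha, 4), round(beta, 4))
```

Raising the cutoff makes rejection easier, so the type-1 rate rises while the type-2 rate falls, for this fixed sample size.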

Engineering Probability Homework 10 due Thu 2019-04-18

W Randolph Franklin (WRF), RPI

2019-04-11 00:00

All questions are from the text.

Each part of a question is worth 5 points.

6.68 (c), p 355. Use pmf (i).
6.92 (a-d), p 358.

Total: 25 points.

Engineering Probability Class 23 Mon 2019-04-08

W Randolph Franklin (WRF), RPI

2019-04-08 00:00

Assume that we want to know X but can only see Y, which depends on X.
This is a generalization of our long-running noisy communication channel example. We'll do things a little more precisely now.
Another application would be to estimate tomorrow's price of GOOG (X) given the prices to date (Y).
Sometimes, but not always, we have a prior probability for X.
For the communication channel we do; for GOOG we don't.
If we do, we can use a maximum a posteriori (MAP) estimator.
If we don't, we use a maximum likelihood (ML) estimator. We effectively assume that the prior probability of X is uniform, even though that may not completely make sense.
Example: You toss a fair coin 3 times. X is the number of heads, from 0 to 3. Y is the position of the 1st head, from 0 to 3. If there are no heads, we'll say that the first head's position is 0.

(X,Y) p(X,Y)

(0,0) 1/8

(1,1) 1/8

(1,2) 1/8

(1,3) 1/8

(2,1) 2/8

(2,2) 1/8

(3,1) 1/8

E.g., 1 head can occur 3 ways (out of 8): HTT, THT, TTH. In exactly one of those ways (HTT) the 1st (and only) head is in position 1, so p(1,1)=1/8.
Conditional probabilities:

p(x|y) y=0 y=1 y=2 y=3

x=0 1 0 0 0

x=1 0 1/4 1/2 1

x=2 0 1/2 1/2 0

x=3 0 1/4 0 0

$g_{MAP}(y)$ 0 2 1 or 2 1

$P_{error}(y)$ 0 1/2 1/2 0

p(y) 1/8 1/2 1/4 1/8

The total probability of error is 3/8.
We observe Y and want to guess X from Y. E.g., If we observe $$\small y= \begin{pmatrix}0\\1\\2\\3\end{pmatrix} \text{then } x= \begin{pmatrix}0\\ 2 \text{ most likely} \\ 1, 2 \text{ equally likely} \\ 1 \end{pmatrix}$$
There are different estimators. The one above is the MAP (maximum a posteriori probability) estimator.

$$g_{\text{MAP}} (y) = \arg\max_x p_x(x|y) \text{ or } f_x(x|y)$$

That means the value of $x$ that maximizes $p_x(x|y)$ (or the density $f_x(x|y)$ in the continuous case).
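The coin example can be checked by brute force. The following is a minimal Python sketch, not part of the text: it enumerates the 8 toss sequences, builds the joint pmf, and picks the x with the largest p(x|y). The names `joint`, `posterior`, `g_map` and `p_error` are made up for this illustration.

```python
from fractions import Fraction
from itertools import product
from collections import defaultdict

# Joint pmf of X (number of heads) and Y (position of first head, 0 if none)
# for 3 tosses of a fair coin, built by enumerating all 8 outcomes.
joint = defaultdict(Fraction)
for tosses in product("HT", repeat=3):
    x = tosses.count("H")
    y = tosses.index("H") + 1 if "H" in tosses else 0
    joint[(x, y)] += Fraction(1, 8)

# Marginal pmf of the observation Y.
p_y = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_y[y] += p

def posterior(x, y):
    """p(x|y) = p(x,y) / p(y)."""
    return joint.get((x, y), Fraction(0)) / p_y[y]

def g_map(y):
    """All x values that maximize p(x|y); ties are returned together."""
    best = max(posterior(x, y) for x in range(4))
    return [x for x in range(4) if posterior(x, y) == best]

# Probability of guessing wrong, averaged over the observations y.
p_error = sum(p_y[y] * (1 - max(posterior(x, y) for x in range(4))) for y in p_y)

for y in sorted(p_y):
    print(y, g_map(y))
print("total error:", p_error)
```

Using `Fraction` keeps the probabilities exact, so the tie at y=2 shows up as a true tie rather than a floating-point accident.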
What if we don't know p(x|y)? If we know p(y|x), we can use Bayes. We might measure p(y|x) experimentally, e.g., by sending many messages over the channel.
Bayes requires p(x). What if we don't know even that? E.g. we don't know the probability of the different possible transmitted messages.
Then use the maximum likelihood estimator, ML: the value of $x$ that makes the observed $y$ most probable. $$g_{\text{ML}} (y) = \arg\max_x p_y(y|x) \text{ or } f_y(y|x)$$
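For comparison, here is a sketch of the ML estimator on the same 3-toss coin example (X = number of heads, Y = position of the first head). It is only an illustration; the helper names `likelihood` and `g_ml` are made up, and p(y|x) is computed by counting toss sequences instead of being measured experimentally.

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# Count outcomes (x, y) over the 8 equally likely toss sequences.
counts = Counter()
for tosses in product("HT", repeat=3):
    x = tosses.count("H")
    y = tosses.index("H") + 1 if "H" in tosses else 0
    counts[(x, y)] += 1

def likelihood(y, x):
    """p(y|x) = (# sequences with this x and y) / (# sequences with this x)."""
    n_x = sum(c for (xx, _), c in counts.items() if xx == x)
    return Fraction(counts[(x, y)], n_x)

def g_ml(y):
    """All x values that maximize p(y|x); the prior p(x) is never used."""
    best = max(likelihood(y, x) for x in range(4))
    return [x for x in range(4) if likelihood(y, x) == best]

for y in range(4):
    print(y, g_ml(y))
```

For y=1 this picks x=3, because three heads always put the first head in position 1. The MAP estimator picks x=2 there, because two heads are three times as likely a priori as three heads. The two estimators agree for the other values of y.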
There are other estimators for different applications. E.g., regression using least squares might attempt to predict a graduate's QPA from his/her entering SAT scores. At Saratoga in August we might attempt to predict a horse's chance of winning a race from its speed in previous races. Some years ago, an Engineering Assoc Dean would do that each summer.
Historically, IMO, some of the techniques, like least squares and logistic regression, have been used more because they're computationally easy than because they're logically justified.

3.2 Chapter 7, p 359, Sums of Random Variables

The long term goal of this section is to summarize information from a large group of random variables. E.g., the mean is one way. We will start with that, and go farther.

The next step is to infer the true mean of a large set of variables from a small sample.

3.3 Sums of random variables ctd

Let Z=X+Y.
If X and Y are independent, $f_Z$ is the convolution of $f_X$ and $f_Y$: $$f_Z(z) = (f_X * f_Y)(z)$$ $$f_Z(z) = \int f_X(x) f_Y(z-x) dx$$
Characteristic functions are useful. $$\Phi_X(\omega) = E[e^{j\omega X} ]$$
For independent X and Y, $\Phi_Z = \Phi_X \Phi_Y$.
This extends to the sum of n independent random variables: if $Z=\sum_i X_i$ then $\Phi_Z (\omega) = \prod_i \Phi_{X_i} (\omega)$
E.g. Exponential with $\lambda=1$: $\Phi_1(\omega) = 1/(1-j\omega)$ (page 164).
The sum of m iid exponentials with $\lambda=1$ has $\Phi(\omega)= 1/{(1-j\omega)}^m$. That's called an m-Erlang.
Example 2: sum of n iid Bernoullis. The probability generating function is more useful for discrete random variables.
Example 3: sum of n iid Gaussians. $$\Phi_{X_1} = e^{j\mu\omega - \frac{1}{2} \sigma^2 \omega^2}$$ $$\Phi_{Z} = e^{jn\mu\omega - \frac{1}{2}n \sigma^2 \omega^2}$$ I.e., the means add and the variances add.
As the number of terms increases, no matter what the distribution of the individual random variables is (provided that its mean and variance are finite), $\Phi$ for the suitably normalized sum starts looking like that of a Gaussian.
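For discrete random variables the convolution integral becomes a sum. As a small illustration of Example 2 (not from the text), this sketch convolves the pmf of a fair Bernoulli with itself to get the pmf of the sum of 3 of them. The function name `convolve` is made up here, and independence is assumed.

```python
from fractions import Fraction

def convolve(p, q):
    """pmf of X+Y for independent X, Y whose pmfs p, q live on 0, 1, 2, ..."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj  # P[X=i] P[Y=j] contributes to P[X+Y=i+j]
    return r

bernoulli = [Fraction(1, 2), Fraction(1, 2)]  # fair coin: P[0], P[1]

pmf = [Fraction(1)]  # pmf of the empty sum, which is always 0
for _ in range(3):
    pmf = convolve(pmf, bernoulli)

print(pmf)  # the binomial(3, 1/2) pmf: 1/8, 3/8, 3/8, 1/8
```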
The mean $M_n$ of n iid random variables, each with mean $\mu$ and variance $\sigma^2$, is itself a random variable.
As $n\rightarrow\infty$ $M_n \rightarrow \mu$.
That's a law of large numbers (LLN).
$E[ M_n ] = \mu$. It's an unbiased estimator.
$VAR[ M_n ] = \sigma ^2 / n$, which shrinks as n grows.
Weak law of large numbers $$\forall \epsilon >0 \lim_{n\rightarrow\infty} P[ |M_n-\mu| < \epsilon] = 1$$
How fast does it happen? We can use Chebyshev, though that is very conservative.
Strong law of large numbers $$P [ \lim _ {n\rightarrow\infty} M_n = \mu ] =1$$
As $n\rightarrow\infty$, the cdf of the standardized mean $(M_n-\mu)/(\sigma/\sqrt{n})$ approaches that of a standard Gaussian. That's the Central Limit Theorem (CLT).
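A quick simulation can make the behavior of the sample mean concrete. This is a rough sketch under assumed settings (exponential observations with $\lambda=1$, n=25, 20000 repetitions, a fixed seed); none of those numbers come from the text. For iid observations we expect $E[M_n]=\mu$ and $VAR[M_n]=\sigma^2/n$.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 25          # observations per sample
trials = 20000  # number of independent samples

# Each observation is exponential with lambda = 1, so mu = 1 and sigma^2 = 1.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print(mean_of_means)  # should be close to mu = 1
print(var_of_means)   # should be close to sigma^2 / n = 0.04
```

Increasing n shrinks the spread of `sample_means`, which is the law of large numbers at work. A histogram of `sample_means` would look roughly Gaussian even though the individual observations are exponential.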

4 Counterintuitive things in statistics

Statistics has some surprising examples, which would appear to be impossible. Here are some.

Average income can increase faster in a whole country than in any part of the country.
1. Consider a country with two parts: east and west.
2. Each part has 100 people.
3. Each person in the west makes \$100 per year; each person in the east \$200.
4. The total income in the west is \$10K, in the east \$20K, and in the whole country \$30K.
5. The average income in the west is \$100, in the east \$200, and in the whole country \$150.
6. Assume that next year nothing changes except that one westerner moves east and gets an average eastern job, so he now makes \$200 instead of \$100.
7. The west now has 99 people @ \$100; its average income didn't change.
8. The east now has 101 people @ \$200; its average income didn't change.
9. The whole country's income is \$30100 for an average of \$150.50; that went up.
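The arithmetic in this income example is easy to check mechanically. A minimal sketch, with made-up variable names:

```python
def average(incomes):
    return sum(incomes) / len(incomes)

# Before: 100 people in each part.
west_before = [100] * 100
east_before = [200] * 100
# After: one westerner moves east and now earns the eastern income.
west_after = [100] * 99
east_after = [200] * 101

print(average(west_before), average(west_after))  # west average, unchanged
print(average(east_before), average(east_after))  # east average, unchanged
print(average(west_before + east_before),
      average(west_after + east_after))           # whole country, goes up
```

Neither part's average changes, but the weights of the two parts in the national average do.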
College acceptance rate surprise.
1. Imagine that we have two groups of people: Albanians and Bostonians.
2. They're applying to two programs at the university: Engineering and Humanities.
3. Here are the numbers. The fractions are accepted/applied.
  
  city-major Engin Human Total
  
  Albanians 11/15 2/5 13/20
  
  Bostonians 4/5 7/15 11/20
  
  Total 15/20 9/20 24/40
  
  E.g., 15 Albanians applied to Engin; 11 were accepted.
4. Note that in Engineering, a smaller fraction of Albanian applicants were accepted than Bostonian applicants (11/15 < 4/5).
5. Ditto in Humanities (2/5 < 7/15).
6. However in all, a larger fraction of Albanian applicants were accepted than Bostonian applicants (13/20 > 11/20).
I could go on.
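The college acceptance numbers can be verified with exact fractions. This is only an illustrative sketch; the dictionary layout and function names are made up. The effect is usually called Simpson's paradox.

```python
from fractions import Fraction

# (accepted, applied) for each group and program, from the table.
data = {
    "Albanians": {"Engin": (11, 15), "Human": (2, 5)},
    "Bostonians": {"Engin": (4, 5), "Human": (7, 15)},
}

def rate(group, program):
    accepted, applied = data[group][program]
    return Fraction(accepted, applied)

def overall_rate(group):
    accepted = sum(a for a, _ in data[group].values())
    applied = sum(n for _, n in data[group].values())
    return Fraction(accepted, applied)

# Albanians do worse within each program ...
for program in ("Engin", "Human"):
    print(program, rate("Albanians", program) < rate("Bostonians", program))
# ... but better overall.
print("overall", overall_rate("Albanians") > overall_rate("Bostonians"))
```

The reversal happens because most Albanians applied to the program with the higher acceptance rate, while most Bostonians applied to the one with the lower rate.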

5 Relevant Xkcd comics

5.1 Chapter 8, Statistics

We have a population. (E.g., voters in next election, who will vote Democrat or Republican).
We don't know the population mean. (E.g., fraction of voters who will vote Democrat).
We take several samples (observations). From them we want to estimate the population mean and standard deviation. (Ask 1000 potential voters; 520 say they will vote Democrat. Sample mean is .52)
We want error bounds on our estimates. (.52 plus or minus .04, 95 times out of 100)
Another application: testing whether 2 populations have the same mean. (Is this batch of Guinness as good as the last one?)
Observations cost money, so we want to do as few as possible.
This gets beyond this course, but the biggest problems may be non-math ones. E.g., how do you pick a random likely voter? In the past, phone books were used. In a famous 1936 Presidential poll, that biased the sample against poor people, who voted for Roosevelt.
In probability, we know the parameters (e.g., mean and standard deviation) of a distribution and use them to compute the probability of some event.

E.g., if we toss a fair coin 4 times what's the probability of exactly 4 heads? Answer: 1/16.
In statistics we do not know all the parameters, though we usually know what type the distribution is, e.g., normal. (We often know the standard deviation.)
1. We make observations about some members of the distribution, i.e., draw some samples.
2. From them we estimate the unknown parameters.
3. We often also compute a confidence interval on that estimate.
4. E.g., we toss an unknown coin 100 times and see 60 heads. A good estimate for the probability of that coin coming up heads is 0.6.
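As a sketch of steps 2 and 3 for the coin example, the following computes the estimate and a 95% confidence interval. It assumes the normal approximation to the binomial (sometimes called the Wald interval), which is one common choice and not necessarily the one the text uses. The 95% level is also an assumption.

```python
from math import sqrt
from statistics import NormalDist

heads, n = 60, 100
p_hat = heads / n  # point estimate of P[heads]

# Normal-approximation interval; an assumption of this sketch.
confidence = 0.95
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96
half_width = z * sqrt(p_hat * (1 - p_hat) / n)

low, high = p_hat - half_width, p_hat + half_width
print(p_hat, low, high)  # roughly 0.6, 0.504, 0.696
```

Under these assumptions the interval is about 0.6 plus or minus 0.1. Quadrupling the number of tosses would halve the width.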
Some estimators are better than others, though that gets beyond this course.
1. Suppose I want to estimate the average height of an RPI student by measuring the heights of N random students.
2. The mean of the highest and lowest heights of my N students would converge to the population mean as N increased.
3. However the median of my sample would converge faster. Technically, the variance of the sample median is smaller than the variance of the sample hi-lo mean.
4. The mean of my whole sample would converge the fastest. Technically, for a normal population, the variance of the sample mean is smaller than the variance of any other unbiased estimator of the population mean. That's why we use it.
5. However perhaps the population's distribution is not normal. Then one of the other estimators might be better. It would be more robust.
(Enrichment) How to tell if the population is normal? We can do various plots of the observations and look. We can compute the probability that the observations would be this uneven if the population were normal.
An estimator may be biased. We have a distribution that is U[0,b] for unknown b. We take a sample of size n. The max of the sample has mean $\frac{n}{n+1}b$, which is less than b, though it converges to b as n increases.
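The mean of the sample max can be derived in a few lines, assuming the n observations are iid U[0,b]. The max is at most x exactly when every observation is at most x, so for $0\le x\le b$:

$$F_{\max}(x) = \left(\frac{x}{b}\right)^n$$

$$f_{\max}(x) = \frac{d}{dx} F_{\max}(x) = \frac{n x^{n-1}}{b^n}$$

$$E[\max] = \int_0^b x \frac{n x^{n-1}}{b^n} dx = \frac{n}{b^n} \cdot \frac{b^{n+1}}{n+1} = \frac{n}{n+1} b$$

So multiplying the sample max by $(n+1)/n$ gives an unbiased estimator of b.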
Example 8.2, page 413: One-tailed probability. This is the probability that the mean of our sample is more than c above the population mean. $$\alpha = P[\overline{X_n}-\mu > c] = Q\left( \frac{c}{\sigma_x / \sqrt{n} } \right)$$ Q is defined on page 169: $$Q(x) = \int_x^ { \infty} \frac{1}{\sqrt{2\pi} } e^{-\frac{t^2}{2} } dt$$
Application: You sample n=100 students' verbal SAT scores, and see $\overline{X_n} = 550$. You know that $\sigma=100$. If $\mu = 525$, what is the probability that $\overline{X_n} > 550$?

Answer: here $c = 550-525 = 25$ and $\sigma/\sqrt{n} = 10$, so the probability is Q(2.5) = 0.006.
This means that if we take 1000 random samples of students, each with 100 students, and measure each sample's mean, then, on average, 6 of those 1000 samples will have a mean over 550.
This is often worded as: the probability of the population's mean being under 525 is 0.006. That is a different statement. The problem with saying it is that it presumes some probability distribution for the population mean.
The formula also works for the other tail, computing the probability that our sample mean is at least so far below the population mean.
The 2-tail probability is the probability that our sample mean is at least this far away from the population mean in either direction. It is twice the 1-tail probability.
All this also works when you know the probability and want to know c, the cutoff.
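These tail probabilities can be computed without a table. The following is a minimal sketch using Python's standard library; the helper names `Q`, `one_tail_alpha` and `cutoff` are made up for this illustration. It covers both directions: from c to the probability, and from the probability back to c.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def Q(x):
    """Upper-tail probability of a standard normal: P[Z > x]."""
    return 1 - std_normal.cdf(x)

def one_tail_alpha(c, sigma, n):
    """P[sample mean - mu > c] when the population sd is sigma."""
    return Q(c / (sigma / sqrt(n)))

def cutoff(alpha, sigma, n):
    """Inverse problem: the c for which the one-tail probability is alpha."""
    return std_normal.inv_cdf(1 - alpha) * sigma / sqrt(n)

alpha = one_tail_alpha(c=25, sigma=100, n=100)
print(alpha)                    # one-tail probability for the SAT example, Q(2.5)
print(2 * alpha)                # the corresponding two-tail probability
print(cutoff(alpha, 100, 100))  # recovers c, approximately 25
```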

5.2 Hypothesis testing

Say we want to test whether the average height of an RPI student (called the population) is 2m.
We assume that the distribution is Gaussian (normal) and that the standard deviation of heights is, say, 0.2m.
However we don't know the mean.
We do an experiment and measure the heights of n=100 random students. Their mean height is, say, 1.9m.
The question on the table is, is the population mean 2m?
This is different from the earlier question that we analyzed, which was this: What is the most likely population mean? (Answer: 1.9m.)
Now we have a hypothesis (that the population mean is 2m) that we're testing.
The standard way that this is handled is as follows.
Define a null hypothesis, called H0, that the population mean is 2m.
Define an alternate hypothesis, called HA, that the population mean is not 2m.
Note that we observed our sample mean to be $0.5 \sigma$ below the population mean, if H0 is true.
Each time we rerun the experiment (measure 100 students) we'll observe a different number.
We compute the probability that, if H0 is true, our sample mean would be this far from 2m.
Depending on what our underlying model of students is, we might use a 1-tail or a 2-tail probability.
Perhaps we think that the population mean might be less than 2m but it's not going to be more. Then a 1-tail distribution makes sense.
That is, our assumptions affect the results.
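The standard deviation of the sample mean is $\sigma/\sqrt{n} = 0.2/\sqrt{100} = 0.02$m, so the observed 1.9m is 5 of those standard deviations below 2m. The 1-tail probability of that is Q(5), about $3\times 10^{-7}$, which is very small.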
Therefore we reject H0 and accept HA.
We make a type-1 error if we reject H0 and it was really true. See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
We make a type-2 error if we accept H0 and it was really false.
These two errors trade off: by reducing the probability of one we increase the probability of the other, for a given sample size.
E.g. in a criminal trial we prefer that a guilty person go free to having an innocent person convicted.
Rejecting H0 says nothing about what the population mean really is, just that it's not likely 2m.
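The height test can be written out as a short computation. This is a sketch, not a prescribed procedure: the significance level `alpha = 0.05` is a hypothetical choice that the notes do not specify, and the variable names are made up.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 2.0    # population mean under H0, in metres
sigma = 0.2  # assumed known population standard deviation
n = 100      # sample size
x_bar = 1.9  # observed sample mean

# How many standard deviations of the sample mean are we from mu0?
z = (x_bar - mu0) / (sigma / sqrt(n))  # about -5

p_one_tail = NormalDist().cdf(z)  # P[sample mean this far below mu0 | H0] = Q(5)
p_two_tail = 2 * p_one_tail       # this far away in either direction

alpha = 0.05  # hypothetical significance level, not from the notes
reject_h0 = p_two_tail < alpha
print(z, p_one_tail, p_two_tail, reject_h0)
```

Here `alpha` is the type-1 error probability we are willing to accept. Making it smaller makes rejecting a true H0 less likely but makes accepting a false H0 more likely.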
(Enrichment) Random sampling is hard. The US government got it wrong here:

http://politics.slashdot.org/story/11/05/13/2249256/Algorithm-Glitch-Voids-Outcome-of-US-Green-Card-Lottery