Engineering Probability Class 26 Mon 20180423
Table of contents
1 Grades
1.1 Computation

This will accumulate the total score.

Normalize each homework to 100 points.
Homeworks that have not yet been graded (that's 9 and up) count for 0.

Sum top 10, multiply result by 0.02, and add into total.

Normalize each exam to 30 points.

Add top 2 into total.

Take the number of sessions in which at least one question was answered.

Divide by the total number of sessions minus 2, to help students who missed up to 2 classes.

Normalize that to 10 points and add into total.

Piazza:
 Divide the semester into 3 parts: up to first test, from then to last class, and after.
 Require two contributions for first part, three for second, and one for last.
 Add up the number of contributions (max: 6), normalize to 10 points, add add to total.

Add the number of knowitall points to total.

Convert total to a letter grade per the syllabus.

Upload total and letter grades to LMS.
1.2 Notes
 This is guaranteed; your grade cannot be lower (absent detected cheating).
 You can compute how latest homeworks would raise it.
1.3 LMS

I uploaded 5 columns to LMS.

There are updated iclicker, piazza, and knowitall numbers.
They should include all updates.

Your total numerical grade is in Total423.

Your letter grade is in Grade423.

Ignore other columns with names like total. They are wrong.
2 Iclicker questions
 X and Y are two uniform r.v. on the interval [0,1]. X and Y are independent. Z=X+Y. What is E[Z]?
 0
 1/2
 2/3
 Now let W=max(X,Y). What is E[W]?
 0
 1/2
 2/3
3 Material from text
3.1 Section 6.5, page 332: Estimation of random variables

Assume that we want to know X but can only see Y, which depends on X.

This is a generalization of our longrunning noisy communication channel example. We'll do things a little more precisely now.

Another application would be to estimate tomorrow's price of GOOG (X) given the prices to date (Y).

Sometimes, but not always, we have a prior probability for X.

For the communication channel we do, for GOOG, we don't.

If we do, it's a ''maximum a posteriori estimator''.

If we don't, it's a ''maximum likelihood estimator''. We effectively assume that that prior probability of X is uniform, even though that may not completely make sense.

You toss a fair coin 3 times. X is the number of heads, from 0 to 3. Y is the position of the 1st head. from 0 to 3. If there are no heads, we'll say that the first head's position is 0.
(X,Y) p(X,Y) (0,0) 1/8 (1,1) 1/8 (1,2) 1/8 (1,3) 1/8 (2,1) 2/8 (2,2) 1/8 (3,1) 1/8 E.g., 1 head can occur 3 ways (out of 8): HTT, THT, TTH. The 1st (and only) head occurs in position 1, one of those ways. p=1/8.

Conditional probabilities:
p(xy) y=0 y=1 y=2 y=3 x=0 1 0 0 0 x=1 0 1/4 1/2 1 x=2 0 1/2 1/2 0 x=3 0 1/4 0 0 $g_{MAP}(y)$ 0 2 1 or 2 1 $P_{error}(y)$ 0 1/2 1/2 0 p(y) 1/8 1/2 1/4 1/8 The total probability of error is 3/8.

We observe Y and want to guess X from Y. E.g., If we observe $$\small y= \begin{pmatrix}0\\1\\2\\3\end{pmatrix} \text{then } x= \begin{pmatrix}0\\ 2 \text{ most likely} \\ 1, 2 \text{ equally likely} \\ 1 \end{pmatrix}$$

There are different formulae. The above one was the MAP, maximum a posteriori probability.
$$g_{\text{MAP}} (y) = \max_x p_x(xy) \text{ or } f_x(xy)$$
That means, the value of $x$ that maximizes $p_x(xy)$

What if we don't know p(xy)? If we know p(yx), we can use Bayes. We might measure p(yx) experimentally, e.g., by sending many messages over the channel.

Bayes requires p(x). What if we don't know even that? E.g. we don't know the probability of the different possible transmitted messages.

Then use maximum likelihood estimator, ML. $$g_{\text{ML}} (y) = \max_x p_y(yx) \text{ or } f_y(yx)$$

There are other estimators for different applications. E.g., regression using least squares might attempt to predict a graduate's QPA from his/her entering SAT scores. At Saratoga in August we might attempt to predict a horse's chance of winning a race from its speed in previous races. Some years ago, an Engineering Assoc Dean would do that each summer.

Historically, IMO, some of the techniques, like least squares and logistic regression, have been used more because they're computationally easy than because they're logically justified.
3.2 Central limit theorem etc
 Review: Almost no matter what distribution the random variable X is, $F_{M_n}$ quickly becomes Gaussian as n increases. n=5 already gives a good approximation.
 nice applets:
 http://onlinestatbook.com/stat_sim/normal_approx/index.html This tests how good is the normal approximation to the binomial distribution.
 http://onlinestatbook.com/stat_sim/sampling_dist/index.html This lets you define a distribution, and take repeated samples of a given size. It shows how the means of the samples are distributed. For sample with more than a few observations, they look fairly normal.
 http://www.umd.umich.edu/casl/socsci/econ/StudyAids/JavaStat/CentralLimitTheorem.html This might also be interesting.
 Sample problems.
 Problem 7.1 on page 402.
 Problem 7.22.
 Problem 7.25.
3.3 Chapter 7, p 359, Sums of Random Variables
The long term goal of this section is to summarize information from a large group of random variables. E.g., the mean is one way. We will start with that, and go farther.
The next step is to infer the true mean of a large set of variables from a small sample.
3.4 Sums of random variables ctd
 Let Z=X+Y.
 $f_Z$ is convolution of $f_X$ and $f_Y$: $$f_Z(z) = (f_X * f_Y)(z)$$ $$f_Z(z) = \int f_X(x) f_Y(zx) dx$$
 Characteristic functions are useful. $$\Phi_X(\omega) = E[e^{j\omega X} ]$$
 $\Phi_Z = \Phi_X \Phi_Y$.
 This extends to the sum of n random variables: if $Z=\sum_i X_i$ then $\Phi_Z (\omega) = \Pi_i \Phi_{X_i} (\omega)$
 E.g. Exponential with $\lambda=1$: $\Phi_1(\omega) = 1/(1j\omega)$ (page 164).
 Sum of m exponentials has $\Phi(\omega)= 1/{(1j\omega)}^m$. That's called an mErlang.
 Example 2: sum of n iid Bernoullis. Probability generating function is more useful for discrete random variables.
 Example 3: sum of n iid Gaussians. $$\Phi_{X_1} = e^{j\mu\omega  \frac{1}{2} \sigma^2 \omega^2}$$ $$\Phi_{Z} = e^{jn\mu\omega  \frac{1}{2}n \sigma^2 \omega^2}$$ I.e., mean and variance sum.
 As the number increases, no matter what distribution the initial random variance is (provided that its moments are finite), for the sum $\Phi$ starts looking like a Gaussian.
 The mean $M_n$ of n random variables is itself a random variable.
 As $n\rightarrow\infty$ $M_n \rightarrow \mu$.
 That's a law of large numbers (LLN).
 $E[ M_n ] = \mu$. It's an unbiased estimator.
 $VAR[ M_n ] = n \sigma ^2$
 Weak law of large numbers $$\forall \epsilon >0 \lim_{n\rightarrow\infty} P[ M_n\mu < \epsilon] = 1$$
 How fast does it happen? We can use Chebyshev, though that is very conservative.
 Strong law of large numbers $$P [ \lim _ {n\rightarrow\infty} M_n = \mu ] =1$$
 As $n\rightarrow\infty$, $F_{M_n}$ becomes Gaussian. That's the Central Limit Theorem (CLT).
3.5 Chapter 8, Statistics

We have a population. (E.g., voters in next election, who will vote Democrat or Republican).

We don't know the population mean. (E.g., fraction of voters who will vote Democrat).

We take several samples (observations). From them we want to estimate the population mean and standard deviation. (Ask 1000 potential voters; 520 say they will vote Democrat. Sample mean is .52)

We want error bounds on our estimates. (.52 plus or minus .04, 95 times out of 100)

Another application: testing whether 2 populations have the same mean. (Is this batch of Guiness as good as the last one?)

Observations cost money, so we want to do as few as possible.

This gets beyond this course, but the biggest problems may be nonmath ones. E.g., how do you pick a random likely voter? In the past phone books were used. In a famous 1936 Presidential poll, that biased against poor people, who voted for Roosevelt.

In probability, we know the parameters (e.g., mean and standard deviation) of a distribution and use them to compute the probability of some event.
E.g., if we toss a fair coin 4 times what's the probability of exactly 4 heads? Answer: 1/16.

In statistics we do not know all the parameters, though we usually know that type the distribution is, e.g., normal. (We often know the standard deviation.)
 We make observations about some members of the distribution, i.e., draw some samples.
 From them we estimate the unknown parameters.
 We often also compute a confidence interval on that estimate.
 E.g., we toss an unknown coin 100 times and see 60 heads. A good estimate for the probability of that coin coming up heads is 0.6.

Some estimators are better than others, though that gets beyond this course.
 Suppose I want to estimate the average height of an RPI student by measuring the heights of N random students.
 The mean of the highest and lowest heights of my N students would converge to the population mean as N increased.
 However the median of my sample would converge faster. Technically, the variance of the sample median is smaller than the variance of the sample hilo mean.
 The mean of my whole sample would converge the fastest. Technically, the variance of the sample mean is smaller than the variance of any other estimator of the population mean. That's why we use it.
 However perhaps the population's distribution is not normal. Then one of the other estimators might be better. It would be more robust.

(Enrichment) How to tell if the population is normal? We can do various plots of the observations and look. We can compute the probability that the observations would be this uneven if the population were normal.

An estimator may be biased. We have an distribution that is U[0,b] for unknown b. We take a sample. The max of the sample has a mean n/(n+1)b though it converges to b as n increases.

Example 8.2, page 413: Onetailed probability. This is the probability that the mean of our sample is at least so far above the population mean. $$\alpha = P[\overline{X_n}\mu > c] = Q\left( \frac{c}{\sigma_x / \sqrt{n} } \right)$$ Q is defined on page 169: $$Q(x) = \int_x^ { \infty} \frac{1}{\sqrt{2\pi} } e^{\frac{x^2}{2} } dx$$

Application: You sample n=100 students' verbal SAT scores, and see $ \overline{X} = 550$. You know that $\sigma=100$. If $\mu = 525$, what is the probability that $\overline{X_n} > 550$ ?
Answer: Q(2.5) = 0.006

This means that if we take 1000 random sample of students, each with 100 students, and measure each sample's mean, then, on average, 6 of those 1000 samples will have a mean over 550.

This is often worded as the probability of the population's mean being under 525 is 0.006, which is different. The problem with saying that is that presumes some probability distribution for the population mean.

The formula also works for the other tail, computing the probability that our sample mean is at least so far below the population mean.

The 2tail probability is the probability that our sample mean is at least this far away from the sample mean in either direction. It is twice the 1tail probability.

All this also works when you know the probability and want to know c, the cutoff.
3.6 Hypothesis testing
 Say we want to test whether the average height of an RPI student (called the population) is 2m.
 We assume that the distribution is Gaussian (normal) and that the standard deviation of heights is, say, 0.2m.
 However we don't know the mean.
 We do an experiment and measure the heights of n=100 random students. Their mean height is, say, 1.9m.
 The question on the table is, is the population mean 2m?
 This is different from the earlier question that we analyzed, which was this: What is the most likely population mean? (Answer: 1.9m.)
 Now we have a hypothesis (that the population mean is 2m) that we're testing.
 The standard way that this is handled is as follows.
 Define a null hypothesis, called H0, that the population mean is 2m.
 Define an alternate hypothesis, called HA, that the population mean is not 2m.
 Note that we observed our sample mean to be $0.5 \sigma$ below the population mean, if H0 is true.
 Each time we rerun the experiment (measure 100 students) we'll observe a different number.
 We compute the probability that, if H0 is true, our sample mean would be this far from 2m.
 Depending on what our underlying model of students is, we might use a 1tail or a 2tail probability.
 Perhaps we think that the population mean might be less than 2m but it's not going to be more. Then a 1tail distribution makes sense.
 That is, our assumptions affect the results.
 The probability is Q(5), which is very small.
 Therefore we reject H0 and accept HA.
 We make a type1 error if we reject H0 and it was really true. See http://en.wikipedia.org/wiki/Type_I_and_type_II_errors
 We make a type2 error if we accept H0 and it was really false.
 These two errors trade off: by reducing the probability of one we increase the probability of the other, for a given sample size.
 E.g. in a criminal trial we prefer that a guilty person go free to having an innocent person convicted.
 Rejecting H0 says nothing about what the population mean really is, just that it's not likely 2m.
 (Enrichment) Random sampling is hard. The US government got it wrong here:
 http://politics.slashdot.org/story/11/05/13/2249256/AlgorithmGlitchVoidsOutcomeofUSGreenCardLottery