Normal Approximation and Box Models

STAT 20: Introduction to Probability and Statistics

Adapted by Gaston Sanchez

A Box Model

A box model represents a chance problem as drawing tickets (with replacement) from a box of numbered tickets.


Box models provide an analogy for many chance processes, which helps us analyze chance variability.

Flipping a fair coin

\(X\) = Getting heads when tossing a fair coin (once).


Box with tickets:

\[ \boxed{ \ \fbox{0} \quad \fbox{1} \ } \]

Draw one ticket out of this box.

Flipping a biased coin

\(X\) = Getting heads when tossing a biased coin (2/3 chance of heads).


Box with tickets:

\[ \boxed{ \ \fbox{0} \quad \fbox{1} \quad \fbox{1} \ } \]

Draw one ticket out of this box.

Flipping another biased coin

\(X\) = Getting heads when tossing a biased coin (1/4 chance of heads).


Box with tickets:

\[ \boxed{ \ \fbox{0} \quad \fbox{0} \quad \fbox{0} \quad \fbox{1} \ } \]

Draw one ticket out of this box.

Flipping a fair coin 5 times

\(X\) = Number of heads when tossing a fair coin five times.


Box with tickets:

\[ \boxed{ \ \fbox{0} \quad \fbox{1} \ } \]

Draw five tickets with replacement out of this box, and add them up.
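
This box can be sketched in R with base R's `sample()`. The names `box`, `draws`, and `num_heads` and the seed are illustrative assumptions, not part of the original slides.

```r
# Box for a fair coin: ticket 0 = tails, ticket 1 = heads
box <- c(0, 1)

set.seed(20)  # arbitrary seed, used only to make the draws reproducible

# Draw five tickets with replacement and add them up
draws <- sample(box, size = 5, replace = TRUE)
num_heads <- sum(draws)
num_heads
```

Each run of the last three lines gives one realization of \(X\), the number of heads in five tosses.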

Number of spots when rolling a fair die

\(X\) = number of spots when rolling a die (once).


Box with tickets:

\[ \boxed{ \ \fbox{1} \quad \fbox{2} \quad \fbox{3} \quad \fbox{4} \quad \fbox{5} \quad \fbox{6} \ } \]

Draw one ticket out of this box.

Sum of spots when rolling a pair of fair dice

\(X\) = Sum of dice.


Box with tickets:

\[ \boxed{ \ \fbox{1} \quad \fbox{2} \quad \fbox{3} \quad \fbox{4} \quad \fbox{5} \quad \fbox{6} \ } \]

Draw two tickets with replacement out of this box, and add them.

Determining a box

\(X\) is a random variable with the distribution shown below:

\[ X = \begin{cases} 3, \; \text{ with prob } 1/3\\ 4, \; \text{ with prob } 1/4\\ 5, \; \text{ with prob } 5/12 \end{cases} \]


Box with tickets:

\[ \boxed{ \ \fbox{3} \ \fbox{3} \ \fbox{3} \ \fbox{3} \quad \fbox{4} \ \fbox{4} \ \fbox{4} \quad \fbox{5} \ \fbox{5} \ \fbox{5} \ \fbox{5} \ \fbox{5} \ } \]

Draw one ticket out of this box.
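
One way to see where the 12 tickets come from is to put the probabilities over a common denominator. The sketch below does this in R; the variable names are illustrative, and the choice of 12 tickets assumes we want the smallest box that works.

```r
# Values of X and their probabilities (from the distribution above)
values <- c(3, 4, 5)
probs  <- c(1/3, 1/4, 5/12)

# 12 is the least common denominator of 3, 4, and 12
n_tickets <- 12
counts <- round(probs * n_tickets)   # 4 threes, 3 fours, 5 fives
box <- rep(values, times = counts)
box

# Proportion of each ticket value in the box matches the probabilities
table(box) / length(box)
```

Any multiple of 12 tickets with the same proportions (24, 36, ...) would describe the same random variable.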

5 spins of an American roulette wheel

\(X\) = Number of spins landing on red


Box with 38 tickets:

\[ \boxed{ \ \underset{\text{18 black}}{\fbox{0} \ \fbox{0} \dots \fbox{0}} \quad \underset{\text{18 red}}{\fbox{1} \ \fbox{1} \dots \fbox{1}} \quad \underset{\text{2 green}}{\fbox{0} \ \fbox{0}} \ } \]

Draw five tickets with replacement out of this box, and add them.
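
A minimal sketch of this 38-ticket box in R, assuming base R's `rep()` and `sample()`; the names `box` and `reds` and the seed are illustrative.

```r
# 38 tickets: 18 black (0), 18 red (1), 2 green (0)
box <- rep(c(0, 1, 0), times = c(18, 18, 2))

# Average of the box = chance that a single spin lands on red (18/38)
mean(box)

set.seed(20)  # arbitrary seed for reproducibility
# Five spins: draw five tickets with replacement and add them
reds <- sum(sample(box, size = 5, replace = TRUE))
reds
```

Coding red as 1 and every other color as 0 is what turns "add the tickets" into "count the reds".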

Box Model and Expected Value

Number of spots when rolling a fair die

\(X\) = number of spots when rolling a die (once).


Draw one ticket out of this box:

\[ \boxed{ \ \fbox{1} \quad \fbox{2} \quad \fbox{3} \quad \fbox{4} \quad \fbox{5} \quad \fbox{6} \ } \]

  • E(X) = ?

  • Var(X) = ?

Number of spots when rolling a fair die

\(X\) = number of spots when rolling a die (once).


Draw one ticket out of this box:

\[ \boxed{ \ \fbox{1} \quad \fbox{2} \quad \fbox{3} \quad \fbox{4} \quad \fbox{5} \quad \fbox{6} \ } \]

  • E(X) = Average of tickets in box

  • Var(X) = Variance of tickets in box

Number of spots when rolling a fair die

\(X\) = number of spots when rolling a die (once).


Draw one ticket out of this box:

\[ \boxed{ \ \fbox{1} \quad \fbox{2} \quad \fbox{3} \quad \fbox{4} \quad \fbox{5} \quad \fbox{6} \ } \]

\[ E(X) = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5 \]

\[ Var(X) = \frac{(1-3.5)^2 + (2-3.5)^2 + \dots + (5-3.5)^2 + (6-3.5)^2}{6} = \frac{35}{12} \approx 2.92 \]
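
The same two numbers can be computed in R. This is a sketch with illustrative names (`ev`, `var_box`); note that the variance of a box divides by the number of tickets, whereas base R's `var()` divides by one less than that.

```r
box <- 1:6

# Expected value: average of the tickets in the box
ev <- mean(box)                  # 3.5

# Variance of the box: average squared deviation, dividing by 6 tickets
var_box <- mean((box - ev)^2)    # 35/12, about 2.92

# var() divides by n - 1 = 5, so it does not give the box variance
var(box)                         # 3.5

c(ev = ev, var_box = var_box)
```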

Sum of spots when rolling a pair of fair dice

Draw two tickets with replacement out of this box, and add them.

\[ \boxed{ \ \fbox{1} \quad \fbox{2} \quad \fbox{3} \quad \fbox{4} \quad \fbox{5} \quad \fbox{6} \ } \]

Sum of dice \(S = X_1 + X_2\), where \(X_1\) is the number in first ticket, and \(X_2\) is the number in second ticket.

  • \(E(S) = ?\)

  • \(Var(S) = ?\)

Sum of spots when rolling a pair of fair dice

Draw two tickets with replacement out of this box, and add them.

\[ \boxed{ \ \fbox{1} \quad \fbox{2} \quad \fbox{3} \quad \fbox{4} \quad \fbox{5} \quad \fbox{6} \ } \]

Sum of dice \(S = X_1 + X_2\), where \(X_1\) is the number in first ticket, and \(X_2\) is the number in second ticket.


\(E(S) = E(X_1 + X_2) = E(X_1) + E(X_2) = 2E(X) = 2(3.5) = 7\)

Sum of spots when rolling a pair of fair dice

Draw two tickets with replacement out of this box, and add them.

\[ \boxed{ \ \fbox{1} \quad \fbox{2} \quad \fbox{3} \quad \fbox{4} \quad \fbox{5} \quad \fbox{6} \ } \]

Sum of dice \(S = X_1 + X_2\), where \(X_1\) is the number in first ticket, and \(X_2\) is the number in second ticket.


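\(Var(S) = Var(X_1 + X_2) = Var(X_1) + Var(X_2) = 2 \times Var(X) = 2 \times \frac{35}{12} \approx 5.83\)

The variances add because the two draws are made with replacement and are therefore independent.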
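
The addition rules for \(E(S)\) and \(Var(S)\) can be checked by brute force, listing all 36 equally likely pairs of tickets. The sketch below uses base R's `outer()`; the names `sums`, `e_s`, and `var_s` are illustrative.

```r
box <- 1:6

# All 36 equally likely (first ticket, second ticket) pairs and their sums
sums <- as.vector(outer(box, box, FUN = "+"))

e_s <- mean(sums)                 # 7
var_s <- mean((sums - e_s)^2)     # 35/6, about 5.83

# Compare with 2 * E(X) and 2 * Var(X)
c(e_s = e_s, two_ex = 2 * mean(box))
c(var_s = var_s, two_varx = 2 * mean((box - mean(box))^2))
```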

Important Random Variables

Expected value of \(S\), the sum of \(n\) draws from a box model:

\[ \Large E(S) = (\text{# of draws}) \times (\text{avg of box}) \]


\[ \begin{align} E(S) &= E(X_1 + \dots + X_n) \\ &= E(X_1) + \dots + E(X_n)\\ &= n \times E(X) \end{align} \]

Variance of \(S\), the sum of \(n\) draws from a box model (draws with replacement are independent, so the variances add):

\[ \Large Var(S) = (\text{# of draws}) \times (\text{variance of box}) \]


\[ \begin{align} Var(S) &= Var(X_1 + \dots + X_n) \\ &= Var(X_1) + \dots + Var(X_n)\\ &= n \times Var(X) \end{align} \]

Standard Deviation

How far off do we expect to be from the expected value?

\[ \Large SD(S) = (\text{# of draws})^{1/2} \times (\text{SD of box}) \]


\[ \begin{align} SD(S) &= \left( nVar(X) \right)^{1/2} \\ &= \sqrt{n} \times SD(X) \end{align} \]


  • As we increase the number of draws, the SD of the sum grows, but only in proportion to \(\sqrt{n}\) rather than \(n\): quadrupling the number of draws doubles the SD.
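
To see the square-root growth numerically, the sketch below tabulates \(SD(S)\) for a few values of \(n\), using the fair-die box as an assumed example; the names `sd_box` and `sd_sum` are illustrative.

```r
box <- 1:6

# SD of the box: square root of the box variance (dividing by 6 tickets)
sd_box <- sqrt(mean((box - mean(box))^2))   # about 1.71

n <- c(1, 4, 16, 100)
sd_sum <- sqrt(n) * sd_box

# Each time n is multiplied by 4, the SD of the sum only doubles
data.frame(n, sd_sum)
```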

Important Random Variables

  • \(X\): single draw


  • \(S\): sum of \(n\) draws (sample sum)


  • \(\bar{X}\): average of \(n\) draws (sample mean)
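
Since the sample mean is the sample sum divided by the number of draws, \(\bar{X} = S / n\), its expected value and SD follow from those of \(S\). A short derivation, using only the rules for \(E(S)\) and \(SD(S)\) stated earlier:

\[ \begin{align} E(\bar{X}) &= \frac{E(S)}{n} = \frac{n \times E(X)}{n} = E(X) \\ SD(\bar{X}) &= \frac{SD(S)}{n} = \frac{\sqrt{n} \times SD(X)}{n} = \frac{SD(X)}{\sqrt{n}} \end{align} \]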

Normal Distribution

Normal Distribution

The most important continuous distribution in Statistics.


Also known as the Gaussian distribution.


If a random variable \(X\) follows a normal distribution, we write:

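\[ X \sim N(\mu, \sigma) \quad \text{or} \quad X \sim N(\mu, \sigma^2) \] where \(\mu\) is the mean, and \(\sigma\) is the SD (\(\sigma^2\) is the Var). The R examples in these slides use the first form, \(N(\mu, \sigma)\), so the second parameter is the SD.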

(Figure: Anatomy of Normal Curve)

(Figure: Normal Distribution with \(\mu = 5\), \(\sigma=1\))

(Figure: Normal Distributions with different means)

(Figure: Normal Distributions with different std-devs)

Normal Distribution

Total area under the curve?

Normal Distribution

Total area under the curve is 1

About 68% of the area lies within \(1 \sigma\) of \(\mu\)

About 95% of the area lies within \(2 \sigma\) of \(\mu\)

About 99.7% of the area lies within \(3 \sigma\) of \(\mu\)
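
These three areas can be checked with `pnorm()`, which returns the area to the left of a value. The sketch below uses the standard normal, \(N(0, 1)\); the same areas hold for any \(\mu\) and \(\sigma\). The names `k` and `area` are illustrative.

```r
# Area within k SDs of the mean, for k = 1, 2, 3
k <- 1:3
area <- pnorm(k) - pnorm(-k)
round(area, 4)
# 0.6827 0.9545 0.9973
```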

Functions in R

dnorm() computes the density \(f(x)\) of \(X \sim N(\mu, \sigma)\)


pnorm() computes the CDF \(F(x) = P(X \leq x)\) of \(X\)


qnorm() is the inverse of the CDF; given a probability (or percentile) it returns the value on the x-axis that corresponds to that percentile


rnorm() generates random numbers from a Normal distribution
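
A short sketch of how `pnorm()` and `qnorm()` undo each other, assuming \(X \sim N(5, 2)\) as an arbitrary example; the names `x`, `p`, and `x_back` are illustrative.

```r
x <- 6

# CDF: area to the left of 6
p <- pnorm(x, mean = 5, sd = 2)

# Inverse CDF: the value whose left-tail area is p, which recovers 6
x_back <- qnorm(p, mean = 5, sd = 2)
c(p = p, x_back = x_back)
```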

dnorm()

dnorm() computes the density \(f(x)\) of \(X \sim N(\mu, \sigma)\)


example 1: \(\quad f(0)\) for \(X \sim N(0, 1)\)

dnorm(0, mean = 0, sd = 1)
[1] 0.3989423


example 2: \(\quad f(6)\) for \(X \sim N(5, 2)\)

dnorm(6, mean = 5, sd = 2)
[1] 0.1760327
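
For reference, `dnorm()` evaluates the standard normal density formula \(f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}\). The formula is not stated in the slides; the sketch below simply confirms that computing it by hand matches example 2. The name `by_hand` is illustrative.

```r
x <- 6; mu <- 5; sigma <- 2

# Normal density computed directly from the formula
by_hand <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

# Same value from dnorm()
from_r <- dnorm(x, mean = mu, sd = sigma)
c(by_hand = by_hand, from_r = from_r)
```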

pnorm()

pnorm() computes \(F(x)\) of \(X \sim N(\mu, \sigma)\)


example 1: \(\quad F(0.5) = P(X \leq 0.5)\) for \(X \sim N(1, 2)\)

pnorm(0.5, mean = 1, sd = 2)
[1] 0.4012937


example 2: \(\quad P(X \geq 7)\) for \(X \sim N(5, 2)\)

pnorm(7, mean = 5, sd = 2, lower.tail = FALSE)
[1] 0.1586553

qnorm()

qnorm() computes the inverse of \(F(x)\) for \(X \sim N(\mu, \sigma)\)


example 1: which \(x\) gives \(P(X \leq x) = 0.20\) for \(X \sim N(1, 2)\)

qnorm(0.2, mean = 1, sd = 2)
[1] -0.6832425


example 2: which \(x\) gives \(P(X \geq x) = 0.20\) for \(X \sim N(1, 2)\)

qnorm(0.2, mean = 1, sd = 2, lower.tail = FALSE)
[1] 2.683242
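
Asking for the value with 20% of the area to its right is the same as asking for the value with 80% of the area to its left. A quick check in R, with illustrative names `upper` and `lower`:

```r
# x with P(X >= x) = 0.20
upper <- qnorm(0.2, mean = 1, sd = 2, lower.tail = FALSE)

# x with P(X <= x) = 0.80
lower <- qnorm(0.8, mean = 1, sd = 2)

all.equal(upper, lower)   # TRUE
```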

rnorm()

rnorm() generates random numbers from a Normal distribution


example 1: generate 3 values from \(X \sim N(0, 1)\)

rnorm(n = 3, mean = 0, sd = 1)
[1] -0.2051100 -0.1201549  0.2256920


example 2: generate 7 values from \(X \sim N(5, 2)\)

rnorm(n = 7, mean = 5, sd = 2)
[1] 6.246068 4.345725 4.386438 4.854817 6.257119 8.413961 4.833068
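
With many generated values, the sample mean and SD should land close to the parameters passed to `rnorm()`. The sketch below assumes 10,000 draws and an arbitrary seed; the names `x`, `m`, and `s` are illustrative.

```r
set.seed(2024)  # arbitrary seed for reproducibility

# 10,000 values from N(5, 2)
x <- rnorm(n = 10000, mean = 5, sd = 2)

m <- mean(x)   # should be close to 5
s <- sd(x)     # should be close to 2
c(mean = m, sd = s)
```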

\(P(X \leq 5)\) for \(X \sim N(\mu = 5, \sigma=1)\)

# F(5)
pnorm(5, mean = 5, sd = 1)

\(P(X \geq 6)\) for \(X \sim N(\mu = 5, \sigma=1)\)

# 1 - F(6)
pnorm(6, mean = 5, sd = 1, lower.tail = FALSE)

\(P(-0.5 \leq X \leq 1.5)\) for \(X \sim N(\mu = 0, \sigma=1)\)

# F(1.5) - F(-0.5)
pnorm(1.5) - pnorm(-0.5)

\(P(-2 \leq X \leq 2)\) for \(X \sim N(\mu = 0, \sigma=2)\)

# F(2) - F(-2)
pnorm(2, mean = 0, sd = 2) - pnorm(-2, mean = 0, sd = 2)

\(P(X \leq x) = 0.3\) for \(X \sim N(\mu = 0, \sigma=2)\)

# F(x) = 0.3
qnorm(0.3, mean = 0, sd = 2)
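
Two of the probabilities above can be anticipated before running any code: \(P(X \leq 5)\) for \(N(5, 1)\) is the area to the left of the mean, and \(P(-2 \leq X \leq 2)\) for \(N(0, 2)\) is the area within one SD of the mean. The sketch below confirms both; the names `p_half` and `p_1sd` are illustrative.

```r
# P(X <= 5) for N(5, 1): 5 is the mean, so half the area lies to its left
p_half <- pnorm(5, mean = 5, sd = 1)

# P(-2 <= X <= 2) for N(0, 2): the interval is mean +/- 1 SD, so about 68%
p_1sd <- pnorm(2, mean = 0, sd = 2) - pnorm(-2, mean = 0, sd = 2)

c(p_half = p_half, p_1sd = round(p_1sd, 4))
```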

Normal Approximation and CLT

Important Random Variables

As the number of draws \(n\) increases, the distributions of \(S\) and \(\bar{X}\) become approximately normal.

Central Limit Theorem


Let \(\mu\) be the average of the box, and \(\sigma\) the SD of the box. For large \(n\), approximately:

\(S \sim N(n \times \mu, \ \sqrt{n} \times \sigma)\)

\(\bar{X} \sim N(\mu, \ \sigma / \sqrt{n})\)

Example: American Roulette

Net gain while betting on red on a roulette spin.


If we bet a dollar on red, then our net gain is

\[ \text{gain} = \begin{cases} +1 & \text{with prob } \frac{18}{38} \\ -1 & \text{with prob } \frac{20}{38} \end{cases} \]

Example: American Roulette

# define the gain for a single spin
gain <- c(1, -1)

# define the corresponding probabilities
prob_gain <- c(18/38, 20/38)

exp_gain <- sum(gain * prob_gain)
exp_gain
[1] -0.05263158


# simulate gain from 10 spins of the wheel
set.seed(123)
sample(x = gain, size = 10, prob = prob_gain, replace = TRUE)
 [1] -1  1 -1  1  1 -1  1  1  1 -1
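
As a sketch of how the Central Limit Theorem would be applied here, the code below approximates the chance of coming out ahead after 100 one-dollar bets on red. The choice of 100 spins and the names `mu`, `sigma`, `exp_sum`, `sd_sum`, and `p_ahead` are assumptions for illustration, and no continuity correction is applied.

```r
gain <- c(1, -1)
prob_gain <- c(18/38, 20/38)

mu <- sum(gain * prob_gain)                     # average of the box
sigma <- sqrt(sum((gain - mu)^2 * prob_gain))   # SD of the box

n <- 100
exp_sum <- n * mu           # expected net gain, about -5.26
sd_sum <- sqrt(n) * sigma   # SD of net gain, about 9.99

# Approximate P(S > 0) using S ~ N(n * mu, sqrt(n) * sigma)
p_ahead <- pnorm(0, mean = exp_sum, sd = sd_sum, lower.tail = FALSE)
p_ahead                     # roughly 0.30
```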

Example: American Roulette

library(ggplot2)

# uses gain and prob_gain defined above
gains <- replicate(
  n = 1000, # 1000 repetitions
  expr = {
    # net gain in 10 spins of roulette
    spins <- sample(x = gain, size = 10, prob = prob_gain, replace = TRUE)
    sum(spins)
})

# empirical histogram of the 1000 simulated net gains
data.frame(gains) |> 
  ggplot(aes(x = gains)) +
  geom_histogram(color = "white", binwidth = 2) +
  labs(title = "N = 10",
       x = "net gain") +
  theme_bw()

gains <- replicate(
  n = 1000, # 1000 repetitions
  expr = {
    # net gain in 100 spins of roulette
    spins <- sample(x = gain, size = 100, prob = prob_gain, replace = TRUE)
    sum(spins)
})

# empirical histogram of the 1000 simulated net gains
data.frame(gains) |> 
  ggplot(aes(x = gains)) +
  geom_histogram(color = "white", binwidth = 2) +
  labs(title = "N = 100",
       x = "net gain") +
  theme_bw()

gains <- replicate(
  n = 1000, # 1000 repetitions
  expr = {
    # net gain in 1000 spins of roulette
    spins <- sample(x = gain, size = 1000, prob = prob_gain, replace = TRUE)
    sum(spins)
})

# empirical histogram of the 1000 simulated net gains
data.frame(gains) |> 
  ggplot(aes(x = gains)) +
  geom_histogram(color = "white", binwidth = 8) +
  labs(title = "N = 1000",
       x = "net gain") +
  theme_bw()

gains <- replicate(
  n = 1000, # 1000 repetitions
  expr = {
    # net gain in 5000 spins of roulette
    spins <- sample(x = gain, size = 5000, prob = prob_gain, replace = TRUE)
    sum(spins)
})

# empirical histogram of the 1000 simulated net gains
data.frame(gains) |> 
  ggplot(aes(x = gains)) +
  geom_histogram(color = "white", binwidth = 15) +
  labs(title = "N = 5000",
       x = "net gain") +
  theme_bw()
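
To connect the histograms to the Central Limit Theorem, the simulated center and spread can be compared with \(n \times \mu\) and \(\sqrt{n} \times \sigma\). The sketch below assumes 1000 spins per repetition and an arbitrary seed; the names `mu`, `sigma`, and `sim_gains` are illustrative.

```r
gain <- c(1, -1)
prob_gain <- c(18/38, 20/38)

mu <- sum(gain * prob_gain)                     # average of the box
sigma <- sqrt(sum((gain - mu)^2 * prob_gain))   # SD of the box

n <- 1000
set.seed(123)  # arbitrary seed for reproducibility

# 1000 repetitions of the net gain in n spins
sim_gains <- replicate(
  1000,
  sum(sample(x = gain, size = n, prob = prob_gain, replace = TRUE))
)

# Simulated mean and SD next to the values the CLT predicts
c(simulated = mean(sim_gains), clt = n * mu)
c(simulated = sd(sim_gains), clt = sqrt(n) * sigma)
```

The two pairs of numbers should be close but not identical, since the simulation uses only 1000 repetitions.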