7 Running Even More Simulations
In the previous section we discussed one possible strategy to run three
simulations, each one consisting of five games. From the programming point of
view, the core of this strategy is the utilization of embedded for()
loops.
Now, three simulations and five games is obviously not enough. To get a better feeling of the probability of winning Game-A—approximated by the proportion of wins—we need to take things to the next level: with more simulations, and many more games.
Exactly how many more simulations, and how many more games? There is no
definitive answer to this question. Obviously the more games and simulations,
the better. To keep things simple, but also fairly realistic, I think 100
simulations and 1000 games is a reasonable choice. The following diagram
illustrates what the games
matrix (of 1000 rows and 4 columns) for each of
the 100 simulations could look like, and also de derived matrix simulations
having 1000 rows and 100 columns:
The following code uses 1000 games and 100 simulations. This allows us to play Game-A for a total of \(1,000 \times 100 = 100,000\) times. With this many games, we can safely calculate the average proportion of wins, and use this value to estimate the probability of winning Game-A:
set.seed(753)
# main inputs
= 1:6
die = 1000
number_games = 100
number_sims
# matrix to store simulation outputs
= matrix(0, nrow = number_games, ncol = number_sims)
simulations
for (sim in 1:number_sims) {
= matrix(0, nrow = number_games, ncol = 4)
games for (game in 1:number_games) {
= sample(die, size = 4, replace = TRUE)
games[game, ]
}= apply(games, 1, function(x) any(x == 6))
any_sixes = ifelse(any_sixes, 1, -1)
simulations[ ,sim]
}
rownames(simulations) = paste0("game", 1:number_games)
colnames(simulations) = paste0("sim", 1:number_sims)
= apply(simulations, 2, function(x) sum(x==1))
total_wins = total_wins / number_games
prop_wins
# mean win-proportion
mean(prop_wins)
## [1] 0.51674
As you can tell, the average proportion of wins turns out to be 0.51674 which is pretty close to the theoretical probability of winning Game-A
\[\begin{align*} Prob(\text{at least one six in 4 rolls}) &= 1 - Prob(\text{no six in 4 rolls}) \\ &= 1 - \left( \frac{5}{6} \right)^4 \\ &= 0.517747 \end{align*}\]
Of course, if you change the random seed and rerun this simulation, you will very likely obtain a slightly different proportion of wins. But still close enough to the actual theoretical probability of winning Game-A.
7.1 Graphing Cumulative Gains
Like in chapter Playing Game-A 100 times, let’s visualize what’s going on with the sequence of cumulative gains—assuming you gain $1 if you win a game, and you gain -$1 if you lose a game.
7.1.1 Plotting the first simulation
The main data object that we have is the matrix simulations
. This is a
1000 by 100 matrix. So let’s begin with a ggplot to
visualize the cumulative gains of the games in the first simulation. That is,
the games in the first column of this matrix.
Because we are interested in making a graph with "ggplot2"
, the first step
consists of creating a data frame. Perhaps the simplest thing to do is to
assemble a 2-column table dat_sim1
: column games
for the number of games,
and another column gain
for the cumulative gain in the games of the first
simulation.
= data.frame(
dat_sim1 games = 1:number_games,
gain = cumsum(simulations[ ,1])
)
head(dat_sim1, n = 10)
## games gain
## game1 1 1
## game2 2 2
## game3 3 1
## game4 4 0
## game5 5 -1
## game6 6 -2
## game7 7 -1
## game8 8 -2
## game9 9 -3
## game10 10 -4
The next step is to pass dat_sim1
to ggplot()
and friends to obtain a
timeline, very similar to the one obtain in section
Cumulative Gains with ggplot2
ggplot(data = dat_sim1, aes(x = games, y = gain)) +
geom_hline(yintercept = 0, color = "gray50") +
geom_line() +
theme_minimal() +
labs(title = "First Simulation of Game-A",
subtitle = "Cumulative gains in 1000 games")
7.1.2 Plotting the first four simulations
Let’s take a further step by considering the cumulative gains in the first four simulations.
Because we need to obtain the cumulative gains in each simulation, it’s better
if we apply()
the function cumsum()
on each column of the matrix
simulations
. This will produce another 1000-by-100 matrix that we’ll name
cum_gain_prog
:
# 1000-by-100 matrix of cumulative gains
= apply(simulations, 2, cumsum) cum_gain_prog
As usual, to call ggplot()
we must have the input data into a data.frame
(or a tibble) object. Here’s one way to create a table dat_4sims
containing
three columns: 1) column games
, 2) gain
containing the cumulative gain,
and 3) simulation
for the associated simulation.
# table with first four simulations
= data.frame(
dat_4sims games = rep(1:number_games, times = 4),
gain = as.vector(cum_gain_prog[ ,1:4]),
simulation = rep(paste0("simulation-", 1:4), each = number_games)
)
Now we are ready to graph the timelines for the first four simulations. In this case, we’ll use facets to better distinguish between each simulation:
ggplot(data = dat_4sims, aes(x = games, y = gain)) +
geom_hline(yintercept = 0, color = "gray70") +
geom_line() +
facet_wrap(~ simulation) +
theme_light() +
labs(title = "First 4 Simulations of Game-A",
subtitle = "Cumulative gains in 1000 games")
Notice the different trends in each simulation. The 1000 games played in simulations 1, 2, and 4 resulted in a positive gain. In contrast, the final gain in simulation 3 turned out to be negative.
7.1.3 Plotting all simulations
Out of curiosity let’s graph the cumulative gains of all 100 simulations.
# converting matrix into data.frame
= as.data.frame(cum_gain_prog)
dat_cum_gain $games = 1:number_games
dat_cum_gain
# reshaping table into "long" or "tall" format
= pivot_longer(
tbl_cum_gain
dat_cum_gain,cols = starts_with("sim"),
names_to = "simulation",
values_to = "gain")
We can now proceed with ggplot()
ggplot(data = tbl_cum_gain, aes(x = games, y = gain)) +
geom_line(aes(group = simulation),
color = "gray50", alpha = 0.3, size = 0.3) +
geom_hline(yintercept = 0, color = "gray50") +
theme_minimal() +
labs(title = "100 Simulations of Game-A",
subtitle = "Cumulative gains in 1000 games")
As you can tell, there is a lot going on in this graphic. Essentially, we
have 100 lines (each one corresponds to a given simulation),
and a considerable amount of overlapping between all of them. Despite
adding some transparency to the lines—with the alpha
argument inside
geom_line()
—it’s hard to follow the path of any single line.
One thing to pay attention to is that the majority of lines are above the horizontal zero-baseline. But not all of them, as reflected by a small proportion of lines that end up below the zero-baseline. Nevertheless, this plot allows us to see an undeniable fact: in the long run, playing Game-A gives you a decent chance of ending with a positive gain. No wonder why Chevalier De Méré was able to make some money with this game of chance.