- Introduction
- Binomial distribution (barplots, formula notation, mathematical expressions)
- Log normal distribution as example of a skewed distribution (
`abline()`

for vertical lines,`legend()`

for text labels) - From binomial to normal distribution (multiple plots,
`curve()`

,`legend()`

) - Continuous uniform distribution (line segments with
`segments()`

, area under the curve with`polygon()`

) - Normal distribution
- Normal distributions with different parameters (“empty” plots, styles, legends)
- Standard normal distribution: areas under the curve (area under the curve with
`polygon()`

, annotations with`arrows()`

) - Normal distribution (mathematical expressions in labels using
`axis()`

) - Density function, distribution function, quantile function and random number generator (annotations, plotting random points under a curve)

- Sampling distributions and central limit theorem
- Confidence intervals
- Student’s t distribution
- Hypothesis testing

This notebook contains code and figures for statistics and data science education. It stems from a statistics course I gave a the HTW Berlin. All figures are generated with base R plotting functions and no additional packages are required. This notebook may be useful for people in statistics or data science education and for those who want to learn more about (advanced) base R plotting and mathematical annotations in R plots. Each section title below mentions in parentheses new techniques or concepts which are introduced in the section. Most plots are not self-explanatory – they are intended to be used in class for illustration and discussion.

You can find the source code repository for this project on GitHub. The code is licensed under Apache License 2.0. See LICENSE file.

I start with the binomial distribution. Since it is a discrete probability distribution, I use a bar plot via `barplot()`

. I find the formula notation best when using this function. You simply provide a vector of *x* and *y* values and the formula notation `y ~ x`

will hence plot bars of height \(y_i\) for each corresponding \(x_i\).

I also use mathematical notation in the y-axis label and the plot title. Adding mathematical notation to plots in R is quite a hassle, as R basically requires you to provide the notation in some crude mixture between R and LaTeX notation. The syntax is described on the R documentation page `?plotmath`

and can be showcased using `demo(plotmath)`

in the console.

Pure mathematical expressions can be plotted via `expression(...)`

as seen below for the y-axis label. To mix text and mathematical notation, use `paste()`

inside `expression()`

. For unknown reasons, `paste()`

will *not* produce a space between the text and the mathematical expression, i.e. it works like `paste0()`

when used within `expression()`

.^{1} Also for unknown reasons, the \(\sim\)-operator `%~%`

will add a space *before*, but not after the operator. You will need to use `~`

to add a space after the operator.

I don’t find the typesetting as elegant as LaTeX, but it works for most scenarios.

```
p <- 0.2
x <- 0:10
n <- 10
y <- dbinom(x, n, p)
barplot(y ~ x,
xlab = "x", ylab = expression(P(X == x)),
main = expression(paste("Binomial distribution ", X %~% ~ B(10, 0.2))))
```

It’s not ideal to hard code concrete values for \(n\) and \(p\) in the plot title. In order to pass R variables into mathematical expressions, you can use `substitute(<expr>, <vars>)`

instead of `expression()`

. The first argument provides the mathematical expression with placeholders like `n`

and `p`

in the code below. The second argument must be a list that maps the placeholders to actual values. Note that if you pass a vector instead of a scalar value using `<vars>`

, only the first value will be taken from the vector.

```
p <- 0.2
x <- 0:10
n <- 10
y <- dbinom(x, n, p)
barplot(y ~ x,
xlab = "x", ylab = expression(P(X == x)),
main = substitute(paste("Binomial distribution ", X %~% ~ B(n, p)), list(n=n, p=p)))
```

`abline()`

for vertical lines, `legend()`

for text labels)The following code shows an example of a skewed distribution using the log-normal distribution. To draw a smooth curve, I generate linearly spaced values \(x_i\) using `seq()`

and calculate the respective \(y_i\) value with the `dlnorm()`

function. The function is plotted using a line plot (`plot()`

with parameter `type="l"`

).

Annotations for mean and median are added as vertical lines using `abline()`

with `v`

(for *vertical*) parameter and a helper function that uses `legend()`

. One could also add text annotations to the plot via `text()`

, but only `legend()`

provides a nice box around the text.

You may notice that the font in the plot title is now different than in the above plots (bold below and normal above). This is due to an (unfortunate) default setting in R – mathematical expressions are plotted by default in normal font and titles without mathematical expressions are plotted in bold by default. You can either make plot titles with expressions bold by using `bold(<expr>)`

inside `expression()`

/ `substitute()`

or make all plot titles appear in normal font by passing the graphical parameter `font.main = 1`

to the respective plot function.

```
x <- seq(0, 3, by = 0.001)
unit <- 100000
m <- 0.6
s <- 0.08
y <- dlnorm(x, log(m))
y <- y/sum(y)
mu <- sum(x * y * unit)
med <- max(x[cumsum(y) <= 0.5]) * unit
addlabel <- function(x, y, label, col, xjust = 0.05) {
legend(x, y, label, text.col = col, box.col = "lightgray", xjust = xjust, x.intersp = 0)
}
plot(x * unit, y, type = "l",
xlab = "Income in €", ylab = "Density",
main = "A fictitious but typical income distribution")
abline(v = c(med, mu), lty = "dashed", col = c("red", "blue"))
addlabel(med, 0.001, paste("Median:", med, "€"), "red")
addlabel(mu, 0.0008, paste("Mean:", round(mu), "€"), "blue")
```

`curve()`

, `legend()`

)The following plots illustrate the transition from binomial to normal distribution when increasing \(n\). I draw graphics for increasing values of \(n\) in a `for`

-loop^{2} and additionally overlay the final plot with the corresponding normal distribution density function. I’m also using vertical lines via `plot()`

with `type="h"`

here instead of a bar plot, since the latter causes overplotting at high \(n\). For each different \(n\), I’m setting some graphical properties using the `xlim`

and `lwd`

lists. `curve()`

is used for plotting the density function. It’s very similar to a line plot with `plot()`

and `type="l"`

, but you directly pass the function that should be plotted and it takes care about generating the right *x* values for it.

Usually it’s better to use *ggplot2* or *lattice* when generating multiple graphics, i.e. as *facets* or *small multiples*. In this case, however, I need these plots as separate figures to be included in presentation slides.

```
p <- 0.5
xlim <- list(
"10" = c(0, 10),
"100" = c(35, 65),
"1000" = c(450, 550)
)
lwd <- list(
"10" = 5,
"100" = 3,
"1000" = 1
)
lastn <- 0
for (n in c(10, 100, 1000, 1000)) {
x <- 0:n
y <- dbinom(x, n, p)
k <- as.character(n)
plot(x, y, type = "h", lwd = lwd[[k]],
xlab = 'x – Number of "heads"', ylab = expression(P(X == x)),
main = sprintf("Binomial distribution for %d coin flips", n),
xlim = xlim[[k]])
if (lastn == 1000) { # add the curve for the normal distribution
mu <- n*p
sigma <- n*p/sqrt(n)
f <- function(x) { dnorm(x, mean = mu, sd = sigma) }
curve(f, xlim[[k]][1], xlim[[k]][2], add = TRUE, col = "red")
legend(510, 0.025, substitute(paste("Normal distribution ", N(mu, sigma)),
list(mu=mu, sigma=round(sigma, 2))),
lty = "solid", col = "red", cex = 0.75)
}
lastn <- n
}
```