Introduction

This notebook contains code and figures for statistics and data science education. It stems from a statistics course I gave a the HTW Berlin. All figures are generated with base R plotting functions and no additional packages are required. This notebook may be useful for people in statistics or data science education and for those who want to learn more about (advanced) base R plotting and mathematical annotations in R plots. Each section title below mentions in parentheses new techniques or concepts which are introduced in the section. Most plots are not self-explanatory – they are intended to be used in class for illustration and discussion.

You can find the source code repository for this project on GitHub. The code is licensed under Apache License 2.0. See LICENSE file.

Binomial distribution (barplots, formula notation, mathematical expressions)

I start with the binomial distribution. Since it is a discrete probability distribution, I use a bar plot via barplot(). I find the formula notation best when using this function. You simply provide a vector of x and y values and the formula notation y ~ x will hence plot bars of height \(y_i\) for each corresponding \(x_i\).

I also use mathematical notation in the y-axis label and the plot title. Adding mathematical notation to plots in R is quite a hassle, as R basically requires you to provide the notation in some crude mixture between R and LaTeX notation. The syntax is described on the R documentation page ?plotmath and can be showcased using demo(plotmath) in the console.

Pure mathematical expressions can be plotted via expression(...) as seen below for the y-axis label. To mix text and mathematical notation, use paste() inside expression(). For unknown reasons, paste() will not produce a space between the text and the mathematical expression, i.e. it works like paste0() when used within expression().1 Also for unknown reasons, the \(\sim\)-operator %~% will add a space before, but not after the operator. You will need to use ~ to add a space after the operator.

I don’t find the typesetting as elegant as LaTeX, but it works for most scenarios.

p <- 0.2
x <- 0:10
n <- 10
y <- dbinom(x, n, p)

barplot(y ~ x,
        xlab = "x", ylab = expression(P(X == x)),
        main = expression(paste("Binomial distribution ", X %~% ~ B(10, 0.2))))

It’s not ideal to hard code concrete values for \(n\) and \(p\) in the plot title. In order to pass R variables into mathematical expressions, you can use substitute(<expr>, <vars>) instead of expression(). The first argument provides the mathematical expression with placeholders like n and p in the code below. The second argument must be a list that maps the placeholders to actual values. Note that if you pass a vector instead of a scalar value using <vars>, only the first value will be taken from the vector.

p <- 0.2
x <- 0:10
n <- 10
y <- dbinom(x, n, p)

barplot(y ~ x,
        xlab = "x", ylab = expression(P(X == x)),
        main = substitute(paste("Binomial distribution ", X %~% ~ B(n, p)), list(n=n, p=p)))

Log normal distribution as example of a skewed distribution (abline() for vertical lines, legend() for text labels)

The following code shows an example of a skewed distribution using the log-normal distribution. To draw a smooth curve, I generate linearly spaced values \(x_i\) using seq() and calculate the respective \(y_i\) value with the dlnorm() function. The function is plotted using a line plot (plot() with parameter type="l").

Annotations for mean and median are added as vertical lines using abline() with v (for vertical) parameter and a helper function that uses legend(). One could also add text annotations to the plot via text(), but only legend() provides a nice box around the text.

You may notice that the font in the plot title is now different than in the above plots (bold below and normal above). This is due to an (unfortunate) default setting in R – mathematical expressions are plotted by default in normal font and titles without mathematical expressions are plotted in bold by default. You can either make plot titles with expressions bold by using bold(<expr>) inside expression() / substitute() or make all plot titles appear in normal font by passing the graphical parameter font.main = 1 to the respective plot function.

x <- seq(0, 3, by = 0.001)
unit <- 100000
m <- 0.6
s <- 0.08
y <- dlnorm(x, log(m))

y <- y/sum(y)

mu <- sum(x * y * unit)
med <- max(x[cumsum(y) <= 0.5]) * unit

addlabel <- function(x, y, label, col, xjust = 0.05) {
    legend(x, y, label, text.col = col, box.col = "lightgray", xjust = xjust, x.intersp = 0)
}

plot(x * unit, y, type = "l",
     xlab = "Income in €", ylab = "Density",
     main = "A fictitious but typical income distribution")
abline(v = c(med, mu), lty = "dashed", col = c("red", "blue"))
addlabel(med, 0.001, paste("Median:", med, "€"), "red")
addlabel(mu, 0.0008, paste("Mean:", round(mu), "€"), "blue")

From binomial to normal distribution (multiple plots, curve(), legend())

The following plots illustrate the transition from binomial to normal distribution when increasing \(n\). I draw graphics for increasing values of \(n\) in a for-loop2 and additionally overlay the final plot with the corresponding normal distribution density function. I’m also using vertical lines via plot() with type="h" here instead of a bar plot, since the latter causes overplotting at high \(n\). For each different \(n\), I’m setting some graphical properties using the xlim and lwd lists. curve() is used for plotting the density function. It’s very similar to a line plot with plot() and type="l", but you directly pass the function that should be plotted and it takes care about generating the right x values for it.

Usually it’s better to use ggplot2 or lattice when generating multiple graphics, i.e. as facets or small multiples. In this case, however, I need these plots as separate figures to be included in presentation slides.

p <- 0.5

xlim <- list(
    "10" = c(0, 10),
    "100" = c(35, 65),
    "1000" = c(450, 550)
)

lwd <- list(
    "10" = 5,
    "100" = 3,
    "1000" = 1
)


lastn <- 0
for (n in c(10, 100, 1000, 1000)) {
    x <- 0:n
    y <- dbinom(x, n, p)
    k <- as.character(n)

    plot(x, y, type = "h", lwd = lwd[[k]],
         xlab = 'x – Number of "heads"', ylab = expression(P(X == x)),
         main = sprintf("Binomial distribution for %d coin flips", n),
         xlim = xlim[[k]])

    if (lastn == 1000) {  # add the curve for the normal distribution
        mu <- n*p
        sigma <- n*p/sqrt(n)
        f <- function(x) { dnorm(x, mean = mu, sd = sigma) }
        curve(f, xlim[[k]][1], xlim[[k]][2], add = TRUE, col = "red")
        legend(510, 0.025, substitute(paste("Normal distribution ", N(mu, sigma)),
                                      list(mu=mu, sigma=round(sigma, 2))),
               lty = "solid", col = "red", cex = 0.75)
    }

    lastn <- n
}