Introduction

This notebook contains code and figures for statistics and data science education. It stems from a statistics course I gave at the HTW Berlin. All figures are generated with base R plotting functions; no additional packages are required. This notebook may be useful for people in statistics or data science education and for those who want to learn more about (advanced) base R plotting and mathematical annotations in R plots. Each section title below lists in parentheses the new techniques or concepts introduced in that section. Most plots are not self-explanatory – they are intended to be used in class for illustration and discussion.

You can find the source code repository for this project on GitHub. The code is licensed under the Apache License 2.0 (see the LICENSE file).

Binomial distribution (barplots, formula notation, mathematical expressions)

I start with the binomial distribution. Since it is a discrete probability distribution, I use a bar plot via barplot(). I find the formula notation most convenient with this function: you simply provide vectors of x and y values, and the formula y ~ x plots a bar of height \(y_i\) for each corresponding \(x_i\).

I also use mathematical notation in the y-axis label and the plot title. Adding mathematical notation to plots in R is quite a hassle, as R basically requires you to provide the notation in some crude mixture between R and LaTeX notation. The syntax is described on the R documentation page ?plotmath and can be showcased using demo(plotmath) in the console.

Pure mathematical expressions can be plotted via expression(...) as seen below for the y-axis label. To mix text and mathematical notation, use paste() inside expression(). For unknown reasons, paste() will not produce a space between the text and the mathematical expression, i.e. it works like paste0() when used within expression().1 Also for unknown reasons, the \(\sim\)-operator %~% will add a space before, but not after the operator. You will need to use ~ to add a space after the operator.

I don’t find the typesetting as elegant as LaTeX’s, but it works for most scenarios.

p <- 0.2
x <- 0:10
n <- 10
y <- dbinom(x, n, p)

barplot(y ~ x,
        xlab = "x", ylab = expression(P(X == x)),
        main = expression(paste("Binomial distribution ", X %~% ~ B(10, 0.2))))

It’s not ideal to hard-code concrete values for \(n\) and \(p\) in the plot title. To pass R variables into mathematical expressions, you can use substitute(<expr>, <vars>) instead of expression(). The first argument provides the mathematical expression with placeholders like n and p in the code below. The second argument must be a list that maps the placeholders to actual values. Note that if you pass a vector instead of a scalar value via <vars>, only the first value will be taken from the vector.
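A minimal sketch of this vector behavior (relying on the note above; plot(1) is just a throwaway plot to carry the title):

plot(1, main = substitute(p == v, list(v = c(0.2, 0.5))))  # per the note above, only the first element (0.2) should appear in the title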

p <- 0.2
x <- 0:10
n <- 10
y <- dbinom(x, n, p)

barplot(y ~ x,
        xlab = "x", ylab = expression(P(X == x)),
        main = substitute(paste("Binomial distribution ", X %~% ~ B(n, p)), list(n=n, p=p)))

Log normal distribution as example of a skewed distribution (abline() for vertical lines, legend() for text labels)

The following code shows an example of a skewed distribution using the log-normal distribution. To draw a smooth curve, I generate linearly spaced values \(x_i\) using seq() and calculate the respective \(y_i\) values with the dlnorm() function. The function is plotted as a line plot (plot() with parameter type="l").

Annotations for the mean and median are added as vertical lines using abline() with the v (for vertical) parameter and a helper function that uses legend(). One could also add text annotations to the plot via text(), but only legend() provides a nice box around the text.

You may notice that the font in the plot title is now different from that in the above plots (bold below and normal above). This is due to an (unfortunate) default setting in R: mathematical expressions are rendered in a normal font by default, while plain-text titles are rendered in bold. You can either make plot titles with expressions bold by using bold(<expr>) inside expression() / substitute(), or make all plot titles appear in a normal font by passing the graphical parameter font.main = 1 to the respective plot function.
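A minimal sketch of both options, using throwaway plot(1) calls:

plot(1, main = expression(bold(X %~% B(10, 0.2))))     # option 1: bold() inside expression()
plot(1, main = "Title in normal font", font.main = 1)  # option 2: graphical parameter font.main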

x <- seq(0, 3, by = 0.001)
unit <- 100000
m <- 0.6
y <- dlnorm(x, log(m))

y <- y/sum(y)  # normalize so the discrete y values sum to 1 (used for the weighted mean and median below)

mu <- sum(x * y * unit)
med <- max(x[cumsum(y) <= 0.5]) * unit

addlabel <- function(x, y, label, col, xjust = 0.05) {
    legend(x, y, label, text.col = col, box.col = "lightgray", xjust = xjust, x.intersp = 0)
}

plot(x * unit, y, type = "l",
     xlab = "Income in €", ylab = "Density",
     main = "A fictitious but typical income distribution")
abline(v = c(med, mu), lty = "dashed", col = c("red", "blue"))
addlabel(med, 0.001, paste("Median:", med, "€"), "red")
addlabel(mu, 0.0008, paste("Mean:", round(mu), "€"), "blue")

From binomial to normal distribution (multiple plots, curve(), legend())

The following plots illustrate the transition from the binomial to the normal distribution as \(n\) increases. I draw graphics for increasing values of \(n\) in a for-loop2 and additionally overlay the final plot with the corresponding normal distribution density function. I’m also using vertical lines via plot() with type="h" here instead of a bar plot, since the latter causes overplotting at high \(n\). For each \(n\), I’m setting some graphical properties using the xlim and lwd lists. curve() is used for plotting the density function. It’s very similar to a line plot with plot() and type="l", but you directly pass the function that should be plotted and it takes care of generating the right x values for it.

Usually it’s better to use ggplot2 or lattice when generating multiple related graphics, e.g. as facets or small multiples. In this case, however, I need these plots as separate figures to be included in presentation slides.

p <- 0.5

xlim <- list(
    "10" = c(0, 10),
    "100" = c(35, 65),
    "1000" = c(450, 550)
)

lwd <- list(
    "10" = 5,
    "100" = 3,
    "1000" = 1
)


lastn <- 0
for (n in c(10, 100, 1000, 1000)) {  # n = 1000 runs twice: the second pass overlays the normal curve
    x <- 0:n
    y <- dbinom(x, n, p)
    k <- as.character(n)

    plot(x, y, type = "h", lwd = lwd[[k]],
         xlab = 'x – Number of "heads"', ylab = expression(P(X == x)),
         main = sprintf("Binomial distribution for %d coin flips", n),
         xlim = xlim[[k]])

    if (lastn == 1000) {  # add the curve for the normal distribution
        mu <- n*p
        sigma <- sqrt(n*p*(1-p))  # standard deviation of the binomial distribution
        f <- function(x) { dnorm(x, mean = mu, sd = sigma) }
        curve(f, xlim[[k]][1], xlim[[k]][2], add = TRUE, col = "red")
        legend(510, 0.025, substitute(paste("Normal distribution ", N(mu, sigma)),
                                      list(mu=mu, sigma=round(sigma, 2))),
               lty = "solid", col = "red", cex = 0.75)
    }

    lastn <- n
}

Continuous uniform distribution (line segments with segments(), area under the curve with polygon())

The following plot uses waiting time (waiting for a bus, waiting at the doctor’s, etc.) as an example for a continuous uniform distribution. segments() can be used to draw line segments from point \((x_1, y_1)\) to point \((x_2, y_2)\).

a <- 0
b <- 10
y <- 1/(b-a)

plot(c(a, b), c(y, y), xlim = c(a-1, b+1), ylim = c(0, y*1.1), pch = 16,
     xlab = "Waiting time in minutes", ylab = "Density",
     main = "Density function of the waiting time")
segments(a-1, 0, a, 0)
segments(a, 0, a, y, lty = "dashed")
segments(a, y, b, y)
segments(b, y, b, 0, lty = "dashed")
segments(b, 0, b+1, 0)

To highlight a specific probability as area under the curve, I draw shaded regions with polygon(). The coordinates of the vertices are passed as separate x and y vectors (a list with x and y components or a two-column matrix are also possible). The polygon is automatically closed, i.e. the last point and the first point are joined.
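For instance, the polygon() call in the code below could equivalently receive its vertices as a two-column matrix (a sketch, assuming the variables x1, x2 and y from the code block below):

polygon(cbind(c(x1, x1, x2, x2), c(0, y, y, 0)), col = "lightgray", border = FALSE)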

Please also note the usage of ~ in the mathematical expression used in substitute(), which is necessary to typeset the inequality \(a \le X \le b\). For reasons unknown to me, R will not accept an expression like P(a <= X <= b) in expression() / substitute() and will answer with Error: unexpected '<=' in "expression(P(a <= X <=". Whenever such an error comes up, you can try to add one of the two juxtaposition operators * or ~. The first juxtaposes two symbols without a space, the second with a space. So the solution here is to add the ~-operator before the X.

a <- 0
b <- 10
y <- 1/(b-a)

x1 <- 7.5
x2 <- 10

plot(c(a, b), c(y, y), xlim = c(a-1, b+1), ylim = c(0, y*1.1), pch = 16,
     xlab = "Waiting time in minutes", ylab = "Density",
     main = "Density function of the waiting time")

polygon(c(x1, x1, x2, x2), c(0, y, y, 0), col = "lightgray", border = FALSE)

text(x1 + (x2 - x1) / 2, y/2, substitute(P(a <= ~X <= b), list(a=x1, b=x2)), cex = 0.75)

segments(a-1, 0, a, 0)
segments(a, 0, a, y, lty = "dashed")
segments(a, y, b, y)
segments(b, y, b, 0, lty = "dashed")
segments(b, 0, b+1, 0)

Normal distribution

Normal distributions with different parameters (“empty” plots, styles, legends)

The following code again generates figures intended to be used in slides. I plot the density function of the normal distribution step by step with different values for \(\mu\) and \(\sigma\) using a nested for-loop. The parameters and the different line styles are taken from lists. Each figure is first prepared in the outer for-loop as an empty canvas using plot(NULL, ...), which sets the axis limits and defines the labels. In the inner for-loop, the density functions are added using curve() with parameter add = TRUE. The legend labels and styles are collected in the inner loop, and finally the legend is drawn for each figure via legend().

param <- list(
    c(0, 1),
    c(0, 0.5),
    c(0, 2),
    c(2, 1),
    c(-1, 0.4)
)

styles <- list(
    c("solid", "darkred"),
    c("solid", "coral"),
    c("solid", "red"),
    c("dashed", "darkred"),
    c("dotted", "orange")
)

for (maxshow in 1:length(param)) {  # outer loop: number of curves to show
    # create empty plot
    plot(NULL, xlim = c(-5, 5), ylim = c(0, 1.0),
         main = "Normal distribution with different parameters",
         xlab = "x",
         ylab = "f(x)")

    legend_labels <- character()
    legend_lty <- character()
    legend_col <- character()
    for (i in 1:maxshow) {   # inner loop: draw up to `maxshow` curves
        p <- param[[i]]
        m <- p[1]
        s <- p[2]

        linest <- styles[[i]]

        f <- function(x) { dnorm(x, mean = m, sd = s) }
        curve(f, lwd = 3, lty = linest[1], col = linest[2], n = 1001, add = TRUE)

        legend_labels <- append(legend_labels, sprintf("N(%.1f, %.1f)", m , s))
        legend_lty <- append(legend_lty, linest[1])
        legend_col <- append(legend_col, linest[2])
    }

    legend("topright", legend_labels, lty = legend_lty, col = legend_col)
}

Standard normal distribution: areas under the curve (area under the curve with polygon(), annotations with arrows())

This plot visualizes the 68-95-99.7 rule: within one, two and three standard deviations of the mean lie about 68%, 95% and 99.7% of the values in a normal distribution. I hence plot the areas under the standard normal distribution for different intervals of standard units \([-z, z]\) with \(z \in \{1, 2, 3\}\).

For shading the areas, I again use polygon(). This time though, constructing the x and y coordinates for the vertices requires a little more effort. In the code below, x and y refer to \(x\) and \(f(x)\) of the density function, respectively. I first construct a logical vector iv that indicates which coordinates from x and y lie inside the interval \([-z, z]\). We can refer to these values as x[iv] and y[iv] respectively – they are the part of the density curve between \(-z\) and \(z\) and form the upper edge of the polygon. However, to shade the full area we need to extend the coordinate vectors with the bottom points at \((-z, 0)\) and \((z, 0)\) and hence provide c(-z, x[iv], z) for the \(x\) coordinates and c(0, y[iv], 0) for the \(y\) coordinates of the polygon. The order of these values is important, so that the last and first point can again be joined to close the polygon.

In order to give these areas different colors, I could again use a list of different styles. This time however, I’m simply using a semi-transparent color "#aaaaaa50", where the last two hex digits 50 set the alpha channel to \(80/255 \approx 31\%\) opacity. Since areas with smaller \(z\) are covered by areas with higher \(z\), the semi-transparent gray adds up to darker tones for smaller \(z\).
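If you prefer not to assemble such hex strings by hand, rgb() with maxColorValue = 255 builds the same semi-transparent color:

rgb(170, 170, 170, 80, maxColorValue = 255)  # returns "#AAAAAA50"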

Finally, I’m using abline() again for vertical dashed lines and arrows() to depict the probabilities in each interval.

# plot the standard normal density function
x <- seq(-3.5, 3.5, length = 1001)
y <- dnorm(x)
plot(x, y, ylim = c(0, 0.6), type = "l",
     xlab = "z", ylab = "f(z)", main = "Standard normal distribution N(0, 1)")

for (z in 1:3) {  # iterate through the standard units z
    # add semi-transparent shaded area in [-z, z]
    iv <- x >= -z & x <= z
    polygon(c(-z, x[iv], z),
            c(0, y[iv], 0),
            col = "#aaaaaa50", border = FALSE)
    # add vertical lines at -z and z
    abline(v = c(-z, z), lty = "dashed", col = paste0("#000000", c("FF", "AA", "66")[z]))
    # add arrow and text label annotations to depict the probability
    y_segm <- 0.38 + 0.055 * z
    arrows(-z, y_segm, z, y_segm, code = 3, length = 0.1)
    p <- round((1-pnorm(-z) * 2) * 100,
               ifelse(z < 3, 0, 1))
    text(0, y_segm, paste0(p, "%"), adj = c(0.5, -0.2))
}

Normal distribution (mathematical expressions in labels using axis())

This is the same plot as above, but this time it shows a general normal distribution instead of the standard normal. I’m using axis() to provide mathematical expressions in the x-axis labels.

x <- seq(-3.5, 3.5, length = 1001)
y <- dnorm(x)
plot(x, y, ylim = c(0, 0.6), type = "l",
     xlab = "x", ylab = "f(x)",
     main = expression(paste("Normal distribution ", N(mu, sigma))),
     xaxt = "n")

for (z in 1:3) {
    iv <- x >= -z & x <= z
    polygon(c(x[iv][1], x[iv], max(x[iv])),
            c(0, y[iv], 0),
            col = "#aaaaaa50", border = FALSE)
    abline(v = c(-z, z), lty = "dashed", col = paste0("#000000", c("FF", "AA", "66")[z]))
    y_segm <- 0.38 + 0.055 * z
    arrows(-z, y_segm, z, y_segm, code = 3, length = 0.1)
    text(0, y_segm, paste0(round((1-pnorm(-z) * 2) * 100, 1), "%"), adj = c(0.5, -0.2))
}

xticks <- -3:3
axis(1, at = xticks, labels = expression(mu - 3 * sigma,
                                       mu - 2 * sigma,
                                       mu - sigma,
                                       mu,
                                       mu + sigma,
                                       mu + 2 * sigma,
                                       mu + 3 * sigma))

Density function, distribution function, quantile function and random number generator (annotations, plotting random points under a curve)

Students often confuse the density function \(f(x)\) (aka probability density function – pdf), the distribution function \(F(x)\) (aka cumulative distribution function – cdf), the quantile function \(F^{-1}(x)\) and their respective counterparts in R. I therefore always depict these functions, explain how they relate to each other and how they can be used in R.
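As a quick reference, the counterparts in R follow a common naming scheme with the prefixes d, p, q and r; for the standard normal distribution:

dnorm(0)      # density function f(0), approx. 0.3989
pnorm(1.96)   # distribution function F(1.96), approx. 0.975
qnorm(0.975)  # quantile function F^{-1}(0.975), approx. 1.96
rnorm(5)      # 5 random draws from N(0, 1)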

x <- seq(-4, 4, length = 1000)
y <- dnorm(x)
plot(x, y, type = "l",
     main = "Density function f(x) of the standard normal distribution:\ndnorm(x)",
     xlab = "x", ylab = "f(x)")

x <- seq(-4, 4, length = 1000)
y <- pnorm(x)
plot(x, y, type = "l",
     main = "Distribution function F(x) of the standard normal distribution:\npnorm(x)",
     xlab = "x", ylab = "F(x)")

This plot of the distribution function is used to introduce the quantile function, which some students find difficult to grasp.

x <- seq(-4, 4, length = 1000)
y <- pnorm(x)
plot(x, y, type = "l",
     main = "Distribution function F(x) of the standard normal distribution:\npnorm(x)",
     xlab = "x", ylab = "F(x)")
p <- 0.9
q <- qnorm(p)
segments(-4, p, q, p, lty = "dashed", col = "red")
segments(q, p, q, 0, lty = "dashed", col = "red")
text(-3.5, p, paste("F(x) =", p), pos = 3, col = "red")
text(q, 0, paste("For which x?"), pos = 4, col = "red")

This plot of the quantile function picks up the question about solving for \(x\) in \(F(x) = 0.9\) from the previous plot. Note how you have to write \(F^{-1}(x)\) as F^{-1} * (x). The * is required to juxtapose \(F^{-1}\) and \((x)\) without horizontal spacing (using ~ instead would introduce spacing).

x <- seq(0, 1, length = 1000)
y <- qnorm(x)
p <- 0.9
q <- qnorm(p)

plot(x, y, type = "l",
     main = expression(paste("Quantile function ", F^{-1} * (x),
                             " of the standard normal distribution: qnorm(x)")),
     xlab = "x", ylab = expression(F^{-1} * (x)))
segments(0, q, p, q, lty = "dashed", col = "red")
segments(p, q, p, -3, lty = "dashed", col = "red")
text(p, q + 0.4, substitute(F^{-1} * (p) %~~% q, list(p=p, q=round(q, 2))), col = "red", adj = 1)

After the density, distribution and quantile functions, the list of functions R provides for each distribution family can be completed with the random number generation function. As an example, I’m using rnorm() to generate 1000 values x from the standard normal distribution and density() to approximate the density function of the generated points. Then I construct max_y, a vector that contains for each value in x the height of the density curve at that point, i.e. the maximum y coordinate a point may get while still staying below the curve. The density curve is plotted via plot() and the random points are added via points(). Each point in x is placed at a random y coordinate between 0 and the respective value in max_y so that it is always drawn below the density curve. This works because runif() accepts a vector like max_y for its max parameter (the upper limit of the uniform distribution), so each coordinate in x gets its own correct upper limit for the random \(y\) coordinate. Again, I’m using a semi-transparent color – this time for the points, by setting the fourth parameter (alpha) in rgb() to 0.5.

set.seed(20231206)
x <- rnorm(1000)
n <- length(x)
d <- density(x)

# for each generated x_i, find the corresponding value of the density curve; this allows to generate a random
# y-coordinate that fits under the density curve
max_y <- sapply(x, function(x_i) {
    d$y[min(which(x_i < d$x))]
})

plot(d, main = paste(n, "randomly sampled values from the standard normal distribution"),
     xlab = "x", ylab = "Density")
points(x, y = runif(n, 0, max_y), col = rgb(0, 0, 0, 0.5))

Sampling distributions and central limit theorem

The following figures depict sampling distributions and show practical implications of the central limit theorem (CLT).

Visualizing different sample draws and estimation errors (dotchart(), mathematical expressions, legends)

For the following plots I will draw random samples from data about rents in a German city. The rent data is right-skewed, as we can see in the following plot. For educational reasons, the data is considered to be the “population” from which the samples are drawn. The mean rent from the data is considered the “population mean” \(\mu\).

set.seed(20231024)

d <- read.table('data/miete03.asc', header = TRUE)

rent <- d$nm
mu <- mean(rent)

n <- length(rent)

hist(rent, main = "Histogram of the rent", freq = FALSE,
     sub = paste("n =", n), xlab = "Rent in €", ylab = "Probability")
abline(v = mu, lty = 'dashed', col = 'red')
text(mu, 0.0017, expression(mu), adj = c(-0.5, 0), col = 'red')

I now draw \(m=10\) random samples from the population, each consisting of \(n=30\) data points. For each of the \(m\) samples, the mean \(\hat \mu_i\) is calculated and depicted in a dot plot via dotchart(), together with the mean \(\overline{\hat \mu}\) of all sample means.

n <- 30
m <- 10

hat_mu_i <- sapply(1:m, function(i) {
    mean(sample(rent, n))
})

dotchart(hat_mu_i, labels = as.character(1:m), xlab = "Rent in €", ylab = "Sample",
         main = substitute(
             paste("Sample means ", hat(mu)[i], " for ", m, " samples of size n=", n),
             list(m=m, n=n)
        ))
abline(v = mu, lty = 'dashed', col = 'red')
text(mu, 0.5, expression(mu), adj = c(-0.5, 0), col = 'red')
mean_hat_mu <- mean(hat_mu_i)
abline(v = mean_hat_mu, lty = 'dashed', col = 'blue')
text(mean_hat_mu, 0.5, expression(bar(hat(mu))), adj = c(-0.5, 0), col = 'blue')

The following plot then draws many more random samples (\(m=10,000\)) and shows the resulting sample means as a histogram. The plot is annotated with the (tiny) absolute error \(|\overline{\hat \mu} - \mu|\), showing that the sample means are distributed around the population mean \(\mu\).

n <- 30
m <- 10000

hat_mu_i <- sapply(1:m, function(i) {
    mean(sample(rent, n))
})

hist(hat_mu_i, freq = FALSE, xlab = "Rent in €", ylab = "Probability",
     main = substitute(
         paste("Histogram of sample means for ", m, " samples of size n=", n),
         list(m=m, n=n)
     )
)

abline(v = mu, lty = "solid", col = 'red')
text(mu, 0, expression(mu), adj = c(-0.5, 0), col = 'red')
mean_hat_mu <- mean(hat_mu_i)
abline(v = mean_hat_mu, lty = 'dashed', col = 'blue')
text(mean_hat_mu, 0.0005, expression(bar(hat(mu))), adj = c(-0.5, 0), col = 'blue')

e <- round(abs(mean_hat_mu - mu), 2)
text(mean_hat_mu, 0.0085, expression(abs(paste(" ", bar(hat(mu)) - mu, " "))), adj = c(-1, 0))
text(mean_hat_mu, 0.0085, paste0("= ", e, "€"), adj = c(-2.2, -0.2))

The next plot shows how sample means for different sample sizes \(n\) are distributed. One can see that for low \(n\), the skew of the population is also reflected in the sampling distribution, but as \(n\) increases, this skew disappears. This shows the importance of using a sufficient sample size when a skewed population distribution is suspected.

Each plot is also drawn on the same x-axis scale, so that students see how the standard deviation of the sampling distribution (i.e. the standard error) decreases with increasing sample size \(n\). I also collect the empirical standard errors calculated from the samples in the vector se_per_samplesize to plot them in the next figure.

m <- 10000
se_per_samplesize <- numeric()
samplesizes <- c(15, 30, 100, 1000)

for (n in samplesizes) {
    hat_mu_i <- sapply(1:m, function(i) {
        mean(sample(rent, n))
    })

    hist(hat_mu_i, freq = FALSE, xlab = "Rent in €", ylab = "Probability",
         main = substitute(
             paste("Histogram of sample means for ", m, " samples of size n=", n),
             list(m=m, n=n)
         ),
         breaks = seq(200, 940, by = 5)
    )

    se_per_samplesize <- c(se_per_samplesize, sd(hat_mu_i))
}

The notion of decreasing the standard error, and hence increasing the precision of an estimate, by increasing the sample size can then be formalized for the mean \(\overline X\) of i.i.d. samples as \(\sigma_{\overline X} = \frac \sigma {\sqrt n}\). The points in the plot reflect the collected standard errors from the samples (se_per_samplesize) for each \(n\) in samplesizes used in the previous figures. The red dashed line is calculated from the formula for the standard error of the sample mean.

plot(samplesizes, se_per_samplesize, type = "p",
     main = "Estimation error of the rent by sample size",
     xlab = "Sample size",
     ylab = "Standard error in €")
nrange <- min(samplesizes):max(samplesizes)
se_theoretic <- sd(rent) / sqrt(nrange)
lines(nrange, se_theoretic, lty = 'dashed', col = 'red')
legend("topright", c("Empirical standard error", "Theoretic standard error"),
       pch = c(1, NA_integer_), lty = c(NA_integer_, 2), col = c("black", "red"))

Distribution of the sample mean \(\overline X\)

According to the CLT, the mean \(\overline X\) of an i.i.d. sample \(X\) is approximately normally distributed with expected value \(\mu\) and standard deviation \(\sigma_{\overline X}\) (the standard error). We can reuse the plot of the normal distribution with the 68-95-99.7 rule and apply it to this sampling distribution.

x <- seq(-3.5, 3.5, length = 1001)
y <- dnorm(x)
plot(x, y, ylim = c(0, 0.6), type = "l",
     xlab = "x", ylab = "f(x)",
     main = expression(paste("Distribution of the sample mean ", bar(X) %~% ~ N(mu, sigma[bar(X)]))),
     xaxt = "n")

for (z in 1:3) {
    iv <- x >= -z & x <= z
    polygon(c(x[iv][1], x[iv], max(x[iv])),
            c(0, y[iv], 0),
            col = "#aaaaaa50", border = FALSE)
    abline(v = c(-z, z), lty = "dashed", col = paste0("#000000", c("FF", "AA", "66")[z]))
    y_segm <- 0.38 + 0.055 * z
    arrows(-z, y_segm, z, y_segm, code = 3, length = 0.1)
    text(0, y_segm, paste0(round((1-pnorm(-z) * 2) * 100, 1), "%"), adj = c(0.5, -0.2))
}

xticks <- -3:3
axis(1, at = xticks, labels = expression(mu - 3 * sigma[bar(X)],
                                       mu - 2 * sigma[bar(X)],
                                       mu - sigma[bar(X)],
                                       mu,
                                       mu + sigma[bar(X)],
                                       mu + 2 * sigma[bar(X)],
                                       mu + 3 * sigma[bar(X)]))

To stress again the importance of a sufficient sample size, the following plots sample from a highly right-skewed distribution and depict how the sampling distribution picks up this asymmetry, which is still visible even for a sample size of \(n=100\).

unit <- 100000
income_mean <- 0.6  # not named `mean`, which would shadow base::mean()
x <- seq(0, 3, by = 0.01)
d <- dlnorm(x, log(income_mean)) * unit
d <- d[d < 100000]

hist(d, breaks = seq(0, 1.2, by = 0.1) * unit, probability = TRUE,
     main = "Fictitious income distribution",
     xlab = "Income in €",
     ylab = "Density")

m <- 10000

for (n in c(10, 100, 1000)) {
    means <- sapply(1:m, function(i) {
        X <- rlnorm(n, log(income_mean)) * unit
        mean(X)
    })
    means <- means[means < 250000]
    
    hist(means, probability = TRUE, breaks = seq(0, 2.5, by = 0.02) * unit,
         main = paste("Distribution of the sample mean of incomes with n =", n),
         xlab = "Sample mean of incomes",
         ylab = "Density")
}

Confidence intervals

To show how confidence intervals for the sample mean work, I draw multiple random samples from the rent dataset (considered the “population” here), calculate their confidence intervals and check whether each interval covers the true population mean \(\mu\). This is done for three different confidence levels using the same scale on the x-axis. It shows that increasing the confidence level widens the confidence intervals, so that they more reliably capture the (actually unknown) population mean, but at the same time gives less precise results. The confidence intervals are calculated as \(\overline X \pm z_{1-\alpha/2} \, \sigma / \sqrt n\), under the assumption that the true standard deviation of the population \(\sigma\) is known.

The last legend() call shows another R quirk: If you’re using substitute() in a legend label, you cannot write legend("bottomright", substitute(...)), you have to explicitly set the second parameter to NULL: legend("bottomright", NULL, substitute(...)). Otherwise you will get an argument “legend” is missing, with no default error. This is really odd, because using expression() instead of substitute() works without explicitly setting the second parameter to NULL, as can be seen in the previous legend() call.

set.seed(20231111)

d <- read.table('data/miete03.asc', header = TRUE)

rent <- d$nm
mu <- mean(rent)
sigma <- sqrt(mean((mu-rent)^2))   # population standard deviation (divides by n, not n-1)

m <- 100   # number of draws
n <- 30    # sample size

# function to draw sample of size `n` from `data` and calc. the
# confidence interval using normal distrib. quantile `z`;
# pop. standard dev. is assumed to be `sigma`
sample_conf_interval <- function(i, data, n, sigma, z) {
    X <- sample(data, size = n)
    Xbar <- mean(X)
    E <- z * sigma/sqrt(n)
    c(Xbar - E, Xbar + E)
}

for (gamma in c(0.8, 0.95, 0.99)) {   # confidence levels
    alpha <- 1 - gamma    # significance level

    z <- qnorm(1-alpha/2)

    conf_ints <- t(sapply(1:m, sample_conf_interval, rent, n, sigma, z))

    plot(NULL, xlim = c(300, 850), ylim = c(1, m+1),
         main = sprintf("%d%%-confidence intervals for %d random samples (n=%d)\nof the rent dataset",
                        round(gamma*100), m, n),
         xlab = "Rent in €",
         ylab = "Sample draw")
    abline(v = mu, col = "red", lty = "dashed")
    text(mu, 100, expression(mu), col = "red", adj = c(-0.5, -0.5))

    mu_covered <- conf_ints[,1] < mu &  mu < conf_ints[,2]
    conf_int_col <- ifelse(mu_covered, "black", "red")
    segments(conf_ints[,1], 1:m, conf_ints[,2], 1:m, col = conf_int_col)
    n_covered <- sum(mu_covered)
    legend("topright", c(expression(paste("Confidence interval covers ", mu)),
                         expression(paste("Confidence interval doesn't cover ", mu))),
           lty = "solid",
           col = c("black", "red"))
    legend("bottomright", NULL,
           substitute(paste(x, " confidence intervals out of ", m, " cover ", mu),
                      list(x=n_covered, m=m)),
           x.intersp = 0, text.col = "red", box.col = "red")
}

Student’s t distribution

This plot shows the density function of the t distribution with different degrees of freedom together with the standard normal distribution for comparison.

# first plot the standard normal distribution density function
curve(dnorm, lwd = 2, n = 1001, from = -5, to = 5,
      lty = "dashed",
      main = expression(paste(italic(t), " distributions and the standard normal distribution")),
      xlab = "x",
      ylab = "f(x)")

# add t distribution with different df parameter values;
# collect labels and styles for legend in `legend_*` variables
nvals <- c(2, 5, 15)
legend_labels <- c("N(0, 1)")
legend_col <- c(gray(0))
legend_lty <- c("dashed")
for (i in 1:length(nvals)) {
    n <- nvals[i]
    col <- gray(1-0.2*i)

    f <- function(x) { dt(x, n) }
    curve(f, lwd = 2, n = 1001, from = -5, to = 5, add = TRUE, col = col)
    legend_labels <- c(legend_labels, sprintf("t(%d)", n))
    legend_col <- c(legend_col, col)
    legend_lty <- c(legend_lty, "solid")
}

legend("topright", legend_labels, col = legend_col, lty = legend_lty)

Hypothesis testing

One-sample tests

The final topic is hypothesis testing. For this I start with an example of a machine that produces pencils. The manufacturer of the machine claims it produces pencils of length 17 cm with a standard deviation of \(\sigma = 1.5 \text{ cm}\), but a random sample \(X\) of size \(n=9\) gives \(\overline X = 17.8 \text{ cm}\). We use hypothesis testing to see whether the manufacturer’s claim of \(\mu = 17 \text{ cm}\) holds.
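Before turning to the figures, a minimal numeric sketch of the corresponding one-sided z-test (the names mu0 and z_stat are introduced just for this sketch):

Xbar <- 17.8
mu0 <- 17
n <- 9
z_stat <- (Xbar - mu0) / (1.5 / sqrt(n))  # test statistic: 1.6
1 - pnorm(z_stat)                         # one-sided p-value, approx. 0.055; matches the shaded area in the plots below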

First, I show a plot displaying the values from the sample, the sample mean \(\overline X\), its confidence interval and \(\mu\). Using arrows() with angle = 90 makes the arrowheads appear as vertical lines.

mu <- 17
X <- c(16.4, 19.4, 17.8, 16.4, 18.5, 19.0, 19.5, 17.1, 16.1)
Xbar <- mean(X)
n <- length(X)
sigma <- 1.5 / sqrt(n)   # standard error of the mean, based on the claimed sd of 1.5 cm
z <- qnorm(c(0.025, 0.975), mean = Xbar, sd = sigma)
confint <- paste0(round(z, 2), "cm")
plot(X, runif(n), ylim = c(-0.5, 1.5), yaxt = "n",
     ylab = "", xlab = "Pencil length in cm",
     main = "Random sample of pencil lengths and 95% confidence interval")
abline(v = mu, lty = "dashed")
text(mu, -0.45, expression(mu), adj = -0.5)
arrows(z[1], -0.1, z[2], -0.1, col = "red", length = 0.08, code = 3, angle = 90)
points(Xbar, -0.1, pch = 3, col = "red")
text(c(z, Xbar), 0.075, c(confint, paste0(Xbar, "cm")), col = "red")
text(Xbar, -0.45, expression(bar(X)), col = "red")

The next plot shows the hypothesized distribution of the sample mean under the null hypothesis \(H_0\) that \(\mu = 17 \text{ cm}\) for the given example. I shade the regions of rejection at the tails of the distribution and also display the area under the curve representing the p-value, i.e. the probability that we see a sample mean of \(\overline X = 17.8 \text{ cm}\) or higher, given that the null hypothesis is true.

X <- c(16.4, 19.4, 17.8, 16.4, 18.5, 19.0, 19.5, 17.1, 16.1)
Xbar <- mean(X)
mu <- 17
n <- length(X)
sigma <- 1.5 / sqrt(n)
z <- qnorm(c(0.025, 0.975), mean = mu, sd = sigma)
p <- 1-pnorm(Xbar, mu, sigma)
x <- seq(15, 19, length = 1001)
y <- dnorm(x, mu, sigma)
plot(x, y, type = "l", xlab = "Pencil length in cm", ylab = "f(x)",
     main = expression(paste("Distribution of ", bar(X), " around ", mu, "=17cm under ", H[0])))

iv <- x < z[1]
polygon(c(x[iv][1], x[iv], max(x[iv])),
        c(0, y[iv], 0),
        col = "#aaaaaa50", border = FALSE)
iv <- x > z[2]
polygon(c(x[iv][1], x[iv], max(x[iv])),
        c(0, y[iv], 0),
        col = "#aaaaaa50", border = FALSE)
abline(v = z, lty = "dashed", col = "#00000066")
abline(v = mu, lty = "dashed", col = "#000000DD")
text(mu, 0, expression(mu), adj = -0.25)

iv <- x >= Xbar
polygon(c(x[iv][1], x[iv], max(x[iv])),
        c(0, y[iv], 0),
        col = "#ff000050", border = FALSE)
abline(v = Xbar, lty = "dashed", col = "red")

text(Xbar, 0.15, substitute(P(bar(X) >= Xbar) == p,
                            list(Xbar=Xbar, p=sprintf("%.1f%%", p*100))),
     adj = -0.27, col = "red")
text(c(mu, z[2]), 0.035, c("95%", "2.5%"), adj = -0.25)
text(z[1], 0.035, "2.5%", adj = 1.25)
legend("topright", NULL, substitute(paste(sigma == s, " cm and ", n == nn),
                                    list(s = sigma, nn = n)))

A plot similar to the one above, showing the relationship between hypothesis tests and confidence intervals:

X <- c(16.4, 19.4, 17.8, 16.4, 18.5, 19.0, 19.5, 17.1, 16.1)
Xbar <- mean(X)
mu <- 17
n <- length(X)
sigma <- 1.5 / sqrt(n)
x <- seq(15, 19, length = 1001)
y <- dnorm(x, mu, sigma)
z <- qnorm(c(0.025, 0.975), mean = Xbar, sd = sigma)
confint <- paste0(round(z, 2), "cm")

plot(x, y, type = "l", xlab = "Pencil length in cm", ylab = "f(x)",
     main = expression(paste("Distribution of ", bar(X), " around ", mu, "=17cm under ", H[0])))
abline(v = mu, lty = "dashed")
text(mu, 0.1, expression(mu), adj = -0.5)
arrows(z[1], -0.01, z[2], -0.01, col = "red", length = 0.08, code = 3, angle = 90)
points(Xbar, -0.01, pch = 3, col = "red")
text(c(z, Xbar), 0.025, c(confint, paste0(Xbar, "cm")), col = "red")
text(Xbar, 0.07, expression(bar(X)), col = "red")
legend("topright", NULL, substitute(paste(sigma == s, " cm and ", n == nn),
                                    list(s = sigma, nn = n)))

The following plots visualize the p-value for one-sided and two-sided tests as area under the hypothesized sampling distribution given the null hypothesis and different scenarios.

x <- seq(-4, 4, length = 1001)
y <- dnorm(x)
plot(x, y, type = "l", xlab = "", ylab = "Density",
     main = expression(paste("One-sided test: Alternative hypothesis is ", mu < mu[0])),
     xaxt = "n")
p <- 0.3
z <- qnorm(p)
abline(v = 0, lty = "dashed")
segments(z, 0, z, dnorm(z), lty = "dashed")
axis(1, c(0, z), labels = c(expression(mu[0]), expression(bar(x))))

iv <- x < z
polygon(c(x[iv][1], x[iv], max(x[iv])),
        c(0, y[iv], 0),
        col = "#aaaaaa50", border = FALSE)
text(-1.5, 0.01, expression(P(bar(X) <= bar(x))))

plot(x, y, type = "l", xlab = "", ylab = "Density",
     main = expression(paste("One-sided test: Alternative hypothesis is ", mu > mu[0])),
     xaxt = "n")
abline(v = 0, lty = "dashed")
segments(z, 0, z, dnorm(z), lty = "dashed")
axis(1, c(0, z), labels = c(expression(mu[0]), expression(bar(x))))

iv <- x > z
polygon(c(x[iv][1], x[iv], max(x[iv])),
        c(0, y[iv], 0),
        col = "#aaaaaa50", border = FALSE)
text(1.3, 0.01, expression(P(bar(X) >= bar(x))))

x <- seq(-4, 4, length = 1001)
y <- dnorm(x)
plot(x, y, type = "l", xlab = "", ylab = "Density",
     main = expression(paste("Two-sided test: sample mean is ", bar(x) < mu[0])),
     xaxt = "n")
p <- 0.3
z <- qnorm(p)
abline(v = 0, lty = "dashed")
segments(z, 0, z, dnorm(z), lty = "dashed")
axis(1, c(0, z), labels = c(expression(mu[0]), expression(bar(x))))

iv <- x < z
polygon(c(x[iv][1], x[iv], max(x[iv])),
        c(0, y[iv], 0),
        col = "#aaaaaa50", border = FALSE)
text(-1.5, 0.01, expression(P(bar(X) <= bar(x))))

x <- seq(-4, 4, length = 1001)
y <- dnorm(x)
plot(x, y, type = "l", xlab = "", ylab = "Density",
     main = expression(paste("Two-sided test: sample mean is ", bar(x) > mu[0])),
     xaxt = "n")
abline(v = 0, lty = "dashed")
z <- qnorm(1-p)
segments(z, 0, z, dnorm(z), lty = "dashed")
axis(1, c(0, z), labels = c(expression(mu[0]), expression(bar(x))))

iv <- x > z
polygon(c(x[iv][1], x[iv], max(x[iv])),
        c(0, y[iv], 0),
        col = "#aaaaaa50", border = FALSE)
text(1.3, 0.01, expression(P(bar(X) >= bar(x))))

Exact binomial test

The following two plots visualize how the p-value is calculated using the exact binomial test, which can be used for count data and sometimes for proportions. For this scenario, the example is a machine that is supposed to produce some items. The manufacturer of the machine claims that the machine produces flawless items 90% of the time, so waste items are produced 10% of the time. A random sample of \(n=100\) items is taken and \(x=12\) of these are waste items. The task is to check the manufacturer’s claim: How likely is it to get 12 or more waste items out of 100 items, if the probability of a waste item is 10% as the manufacturer promises? The process can be seen as a Bernoulli experiment where we define a waste item as “success” and hence we can use the exact binomial test with \(x=12\), \(p=0.1\) and \(n=100\).
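As a cross-check (not part of the original figures), the built-in binom.test() performs this exact test directly; its reported p-value should match the shaded probability \(P(X \ge 12)\) computed below:

binom.test(12, 100, p = 0.1, alternative = "greater")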

I first plot the density function of the binomial distribution with the given parameters under the null hypothesis and highlight the probabilities that add up to the p-value.

n <- 100
p <- 0.1
k <- 12
p_less_k <- round(pbinom(k-1, n, p), 3)
p_ge_k <- 1 - p_less_k
x <- 0:100
y <- dbinom(x, n, p)

barcols <- ifelse(x >= k, "orange", "gray")
barplot(y ~ x, axis.lty = 1, col = barcols,
        xlab = "Number x of waste items",
        ylab = "Probability P(X=x)",
        main = "Probability of number of waste items under the null hypothesis")
legend("topright", NULL,
       c(substitute(paste("Number of waste items " < k), list(k = k)),
         substitute(paste("Number of waste items " >= k), list(k = k))),
       fill = c("gray", "orange"))
text(34, 0.04, substitute(P(X >= k) == p, list(k = k, p = p_ge_k)), adj = 0)
arrows(33, 0.039, 20, 0.03, length = 0.1)

I also show the distribution function of the binomial distribution with the given parameters under the null hypothesis and highlight the complement of the p-value.

x <- 0:100
y <- pbinom(x, n, p)

barcols <- ifelse(x == k - 1, "red", "gray")
barplot(y, names.arg = as.character(x), axis.lty = 1,
        col = barcols,
        xlab = "Number x of waste items",
        ylab = expression(paste("Probability ", P(X <= x))),
        main = "Probability of number of flawless items under the null hypothesis")

legend("topright", NULL,
       substitute(paste("Probability ", P(X <= x)), list(x = k - 1)),
       fill = "red")

Two-sample tests

For two-sample hypothesis tests, I’m using the built-in R dataset PlantGrowth, but using only the control group and second treatment data.

pgreduced <- PlantGrowth[PlantGrowth$group != "trt1", ]  # remove data from first treatment
pgreduced$group <- factor(pgreduced$group)
boxplot(weight ~ group, data = pgreduced,
        xlab = "Group", ylab = "Dried weight of plants in kg",
        main = "Plant growth experiment")

The following plot depicts the theoretical distribution of sample means for the treatment data \(X\) and control data \(Y\). I’m using points() with a little random jitter to also show the individual data points.

ctrl <- pgreduced$group == "ctrl"
w_ctrl <- pgreduced$weight[ctrl]
w_trt2 <- pgreduced$weight[!ctrl]

xbar <- mean(w_trt2)
ybar <- mean(w_ctrl)
nx <- length(w_trt2)
ny <- length(w_ctrl)
se_x <- sd(w_trt2) / sqrt(nx)
se_y <- sd(w_ctrl) / sqrt(ny)

x <- seq(4, 6.5, length = 1001)
y_trt2 <- dnorm(x, mean = xbar, sd = se_x)
plot(w_trt2, runif(nx, 0, 0.1), col = "darkred",
     xlim = c(min(x), max(x)), ylim = c(0, 3),
     xlab = "Dried weight of plants in kg", ylab = "Density",
     main = "Plant growth experiment\nDistribution of sample means")
points(w_ctrl, runif(ny, 0, 0.1), col = "blue")
lines(x, y_trt2, col = "darkred")
abline(v = xbar, lty = "dashed", col = "darkred")
text(xbar, 0.15, substitute(bar(X) == xbar, list(xbar = paste0(xbar, "kg"))), pos = 4, col = "darkred")
y_ctrl <- dnorm(x, mean = ybar, sd = se_y)
lines(x, y_ctrl, col = "blue")
abline(v = ybar, lty = "dashed", col = "blue")
text(ybar, 0.15, substitute(bar(Y) == ybar, list(ybar = paste0(ybar, "kg"))), pos = 2, col = "blue")
legend("topright", c("X: treatment", "Y: control"),
       lty = "solid", col = c("darkred", "blue"))

I plot the distribution of the difference in sample means, \(\overline X - \overline Y\), under the null hypothesis that the treatment doesn’t work, i.e. that the treatment produces equal or less yield: \(\mu_X \le \mu_Y\). The density function shown is that of a t distribution with the respective degrees of freedom, but with the x-axis scaled to the difference in kg instead of t-values. The data points for the actual differences are again plotted with a little jitter, and the area under the curve that represents the p-value is shaded in red.

w_ttest_res <- t.test(w_trt2, w_ctrl, alternative = "greater")
se <- w_ttest_res$stderr
d <- w_trt2 - w_ctrl   # elementwise differences of the two (equally sized) samples, used for the jittered points
d_tscale <- d / se
dbar <- mean(d)
dbar_tscale <- dbar / se
xmin <- floor(min(d_tscale)) - 1
xmax <- ceiling(max(d_tscale))

x <- seq(xmin, xmax, length.out = 1001)
y <- dt(x, w_ttest_res$parameter)
plot(x, y, type = "l", xaxt = "n",
     xlab = expression(paste("Difference ", bar(X) - bar(Y), " in kg")),
     ylab = "Density",
     main = expression(paste("Distribution of the difference ", bar(X) - bar(Y),
                             " under ", H[0], ": ", mu[X] - mu[Y] <= 0)))
xaxt_labels <- seq(floor(xmin*se), ceiling(xmax*se), by = 0.5)
xaxt_labels_pos <- xaxt_labels/se
axis(1, at = xaxt_labels_pos, labels = xaxt_labels)
points(d_tscale, runif(length(d_tscale), 0, 0.01), col = "red")
abline(v = 0, lty = "dashed")
text(0, 0.025, expression(mu[X] - mu[Y] == 0), pos = 4)
abline(v = dbar_tscale, lty = "dashed", col = "red")
iv <- x >= dbar_tscale
polygon(c(x[iv][1], x[iv], max(x[iv])),
        c(0, y[iv], 0),
        col = "#ff000080", border = FALSE)
text(dbar_tscale + 0.25, 0.025, substitute(P(bar(X) - bar(Y) > dbar) == p,
                                           list(dbar = paste0(dbar, "kg"),
                                                p = round(w_ttest_res$p.value, 4))),
     pos = 4, col = "red")

The following plot shows the same picture as before, but using t-values / standard units instead of the difference in kg.

x <- seq(xmin, xmax, length.out = 1001)
y <- dt(x, w_ttest_res$parameter)
plot(x, y, type = "l",
     xlab = "t",
     ylab = "Density",
     main = expression(paste(italic(t), " distribution under ", H[0], ": ", mu[X] - mu[Y] <= 0)))
points(d_tscale, runif(length(d), 0, 0.01), col = "red")
abline(v = 0, lty = "dashed")
text(0, 0.025, expression(mu[X] - mu[Y] == 0), pos = 4)
abline(v = dbar_tscale, lty = "dashed", col = "red")
iv <- x >= dbar_tscale
polygon(c(x[iv][1], x[iv], max(x[iv])),
        c(0, y[iv], 0),
        col = "#ff000080", border = FALSE)
text(dbar_tscale + 0.25, 0.025, substitute(1-P(t < tval) == pval,
                                           list(tval = round(w_ttest_res$statistic, 4),
                                                pval = round(w_ttest_res$p.value, 4))),
     pos = 4, col = "red")

Finally, I show a figure that depicts the probability of making a type I error (i.e. rejecting the null hypothesis although it is actually true) when performing multiple tests. Under independence, the probability of at least one false rejection among \(m\) tests at significance level \(\alpha\) is \(1-(1-\alpha)^m\), which is the formula plotted below.

alpha <- 0.05
m <- 1:25
y <- 1-(1-alpha)^m

plot(m, y,
     main = substitute(paste("Error rate for multiple tests with ", alpha==a), list(a=alpha)),
     xlab = "Number of tests", ylab = "Probability of type I error")
lines(m, y)


  1. Interestingly, paste0() doesn’t work at all within expression().↩︎

  2. Yes, there are situations where using a for-loop instead of vectorized operations, lapply(), etc. is fine in R, and this is one of them: the loop doesn’t generate or update values, but only causes side effects (drawing plots).↩︎