library(dplyr)
library(tidyr)
library(ggplot2)
library(ggtext)

theme_set(theme_minimal())

Introduction

Contemplating about the weather, I wondered if I could find out the “most unusual” and “most ideal” years regarding air temperature in Germany, i.e. if I could identify the years in which the daily temperature deviated the most and the least from the expected seasonal temperature. So I decided to look into historical climate data, created an extremely simplified seasonal temperature model and then investigated the deviations from that model. Although it’s all quite simple, this little exploration gives some insights into how and why we can use a linear model for such data.

You can find the source code repository for this project on GitHub. The code is licensed under Apache License 2.0. See LICENSE file.

Data

I retrieved the historical climate data for a weather station in Berlin-Dahlem (a bit outside the city) from 1950 to now from the German Meteorological Service (Deutscher Wetterdienst – DWD). The data come as delimited files with semicolon as column separator. Historical data until 2022 and present data from 2022 to now come as separate files.

raw_hist <- read.delim('data/produkt_klima_tag_19500101_20221231_00403.txt', sep = ';')
head(raw_hist)
##   STATIONS_ID MESS_DATUM QN_3   FX   FM QN_4  RSK RSKF  SDK SHK_TAG  NM VPM
## 1         403   19500101 -999 -999 -999    5  2.2    7 -999       0 5.0 4.0
## 2         403   19500102 -999 -999 -999    5 12.6    8 -999       0 8.0 6.1
## 3         403   19500103 -999 -999 -999    5  0.5    1 -999       0 5.0 6.5
## 4         403   19500104 -999 -999 -999    5  0.5    7 -999       0 7.7 5.2
## 5         403   19500105 -999 -999 -999    5 10.3    7 -999       0 8.0 4.0
## 6         403   19500106 -999 -999 -999    5  7.2    8 -999      12 7.3 5.6
##       PM  TMK UPM  TXK  TNK  TGK eor
## 1 1025.6 -3.2  83 -1.1 -4.9 -6.3 eor
## 2 1005.6  1.0  95  2.2 -3.7 -5.3 eor
## 3  996.6  2.8  86  3.9  1.7 -1.4 eor
## 4  999.5 -0.1  85  2.1 -0.9 -2.3 eor
## 5 1001.1 -2.8  79 -0.9 -3.3 -5.2 eor
## 6  997.5  2.6  79  5.0 -4.0 -4.0 eor
raw_pres1 <- read.delim('data/produkt_klima_tag_20221107_20240509_00403.txt', sep = ';')
head(raw_pres1)
##   STATIONS_ID MESS_DATUM QN_3   FX   FM QN_4 RSK RSKF SDK SHK_TAG  NM  VPM
## 1         403   20221107 -999 -999 -999   10 0.0    6 4.5       0 6.2  9.6
## 2         403   20221108 -999 -999 -999   10 0.2    6 7.5       0 6.0 10.4
## 3         403   20221109 -999 -999 -999   10 1.0    6 3.7       0 6.6 11.4
## 4         403   20221110 -999 -999 -999   10 0.0    0 6.1       0 5.1 10.2
## 5         403   20221111 -999 -999 -999   10 0.0    0 1.9       0 6.3  9.6
## 6         403   20221112 -999 -999 -999   10 0.0    0 7.3       0 4.0  8.8
##       PM  TMK UPM  TXK TNK  TGK eor
## 1 1002.9 10.7  75 15.0 6.4  5.1 eor
## 2 1002.7 12.1  75 16.9 7.9  4.2 eor
## 3 1001.5 11.8  83 15.0 9.0  5.1 eor
## 4 1012.6 11.7  74 14.3 8.6  5.8 eor
## 5 1020.1  8.6  87 12.8 4.0  0.6 eor
## 6 1022.8  6.4  92 13.8 1.8 -0.9 eor
raw_pres2 <- read.delim('data/produkt_klima_tag_20230203_20240805_00403.txt', sep = ';')
head(raw_pres2)
##   STATIONS_ID MESS_DATUM QN_3   FX   FM QN_4 RSK RSKF SDK SHK_TAG  NM VPM
## 1         403   20230203 -999 -999 -999   10 5.1    8 1.8       0 7.4 7.8
## 2         403   20230204 -999 -999 -999   10 0.0    0 5.4       0 3.8 5.2
## 3         403   20230205 -999 -999 -999   10 0.1    7 3.8       0 4.8 4.1
## 4         403   20230206 -999 -999 -999   10 0.8    7 0.0       0 5.4 4.8
## 5         403   20230207 -999 -999 -999   10 0.0    0 8.9       0 1.3 4.0
## 6         403   20230208 -999 -999 -999   10 0.0    0 8.9       0 0.4 4.0
##       PM  TMK UPM TXK  TNK   TGK eor
## 1 1004.3  5.5  87 8.6  1.4   0.4 eor
## 2 1024.5  1.3  76 4.1 -2.1  -5.3 eor
## 3 1028.5 -2.4  82 1.3 -5.3  -7.3 eor
## 4 1029.7 -2.8  95 0.2 -5.9  -9.4 eor
## 5 1032.2 -3.5  85 2.5 -8.0 -10.6 eor
## 6 1031.0 -2.5  80 2.4 -6.5  -8.6 eor

After reading in the files, we merge them, select only the necessary variables, transform the dates and remove duplicates (since the historical and the present data both contain observations from 2022) to generate our final measurements dataset meas:

meas <- bind_rows(raw_hist, raw_pres1, raw_pres2) |>
    select(date = MESS_DATUM, temp = TMK) |>   # TMK is day-time average temperature in °C
    mutate(date = as.POSIXct(strptime(date, "%Y%m%d")),
           year = as.integer(as.numeric(format(date, "%Y"))),
           day = as.integer(as.numeric(format(date, "%j")))) |> # day of the year as decimal number from 1 to 366
    distinct(date, .keep_all = TRUE)   # remove duplicates
rm(raw_hist, raw_pres1, raw_pres2)         # don't need the raw data any more
stopifnot(all(count(meas, date)$n == 1))   # make sure there are no duplicates
head(meas)
##         date temp year day
## 1 1950-01-01 -3.2 1950   1
## 2 1950-01-02  1.0 1950   2
## 3 1950-01-03  2.8 1950   3
## 4 1950-01-04 -0.1 1950   4
## 5 1950-01-05 -2.8 1950   5
## 6 1950-01-06  2.6 1950   6

Visual analysis

Let’s visualize the time series with a simple plot. I will also add a smoothed curve showing an overall trend, which indicates a nearly linear increase in average annual temperature by about 2°C since the 1950’s. I’ll later come back to that. We can also see the typical seasonal changes.

ggplot(meas, aes(date, temp)) +
    geom_line() +
    geom_smooth(method = "gam") +
    labs(title = "Daily day-time average temperature in Berlin-Dahlem over time",
         x = "",
         y = "Temperature in °C")
Daily day-time average temperature in Berlin-Dahlem over time

Figure 1: Daily day-time average temperature in Berlin-Dahlem over time

The periodical temperature changes can be better shown by looking at a smaller time frame:

filter(meas, year >= 2018) |>
    ggplot(aes(date, temp)) +
        geom_line() +
        geom_smooth(span = 0.2, method = "loess") +
        labs(title = "Daily day-time average temperature in Berlin-Dahlem since 2018",
             x = "",
             y = "Temperature in °C")

We can also visualize the annual trend by plotting the temperature against the day of the year. We can see the typical seasonal pattern, but also the slight overall increase in temperature over the years, since more recent years (yellow color) tend to have higher temperatures, especially in the winter.

ggplot(meas, aes(day, temp, color = year)) +
    geom_line(alpha = 0.25) +
    scale_color_binned(name = "Year", type = 'viridis') +
    labs(title = "Daily day-time average temperature in Berlin-Dahlem over time",
             x = "Day of the year",
             y = "Temperature in °C")
Daily day-time average temperature in Berlin-Dahlem over time

Figure 2: Daily day-time average temperature in Berlin-Dahlem over time

Modeling

Naturally, and confirmed with the above plots, we can use a periodic function like the cosine function to model these temperatures. In general this periodic function can be written as

\[\begin{equation} y = c \cos (x + \varphi), \tag{1} \end{equation}\]

where \(c\) controls the amplitude (maximum spikes), \(\varphi\) the phase (shift on the x-axis) and \(x\) the frequency. With a linear model, we can only fit linear terms like \(y = ax + b\), so we have the problem that we can’t estimate the frequency and the phase. Luckily – in our very simple case – the frequency is already known: the seasonal pattern repeats yearly, so we can calculate \(x = 2 \pi D / 366\), where \(D\) is the day of the year. Because of leap years, \(D\) can range from 1 to 366 and so we divide it by 366. This means that over the course of a year, \(x\) makes “a full circle” from above 0 to \(2 \pi\).

The second problem – that we can’t estimate the phase with a linear model directly – can be solved by applying a neat trick that transforms the cosine wave with an amplitude and a phase shift to a linear combination of a cosine and a sine wave:

\[\begin{equation} c \cos (x + \varphi) = a \cos x + b \sin x, \tag{2} \end{equation}\]

where

\[ c = \text{sgn}(a) \sqrt{a^2 + b^2}, \\ \varphi = \arctan \frac{-b} a. \]

This means we can estimate \(a\) and \(b\) as coefficients for the above linear combination that is equivalent to the initially defined cosine wave. Hence we can finally specify our linear model m1 for the temperatures \(Y_t\) as

\[\begin{equation} Y_t = \beta_0 + \beta_1 \cos(x_t) + \beta_2 \sin(x_t) + \epsilon_t, \tag{3} \end{equation}\]

where \(x_t\) is the only regressor – the day of the year transformed to range \((0, 2 \pi]\) as described above –, \(\beta_0\) to \(\beta_2\) are the coefficients we seek to estimate and \(\epsilon_t\) is the error term.

Model estimation

We can now estimate the model m1 in R by first computing \(x_t\) and then using lm to fit the model using our measurements meas:

# compute frequency x
meas$x <- 2 * pi * meas$day/366

# fit the model
m1 <- lm(temp ~ cos(x) + sin(x), meas)
summary(m1)
## 
## Call:
## lm(formula = temp ~ cos(x) + sin(x), data = meas)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.346  -2.561  -0.023   2.585  14.054 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.41585    0.02285   412.0   <2e-16 ***
## cos(x)      -9.01488    0.03234  -278.8   <2e-16 ***
## sin(x)      -2.55510    0.03230   -79.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.772 on 27243 degrees of freedom
## Multiple R-squared:  0.7551, Adjusted R-squared:  0.7551 
## F-statistic: 4.2e+04 on 2 and 27243 DF,  p-value: < 2.2e-16

The model has a decent fit given its simplicity: About 75% of the variation of the temperature can be explained by the seasonal pattern we modeled with the periodic functions. The estimated intercept reflects the mean of the temperature and the coefficients for the periodic functions result in the oscillation around the mean.

We can see that for predicted values near 0°C the errors are a bit larger and also the distribution of the errors is slightly left skewed.

plot(m1, which = 1:2, ask = FALSE)

Let’s plot the model predictions of the temperatures on top of the measured values. For a better overview, we only consider data from 2018 or newer:

meas_fit <- cbind(meas, pred = fitted(m1))

filter(meas_fit, year >= 2018) |>
    ggplot() +
        geom_line(aes(date, temp), alpha = 0.25) +
        geom_line(aes(date, pred), color = 'red') +
        labs(title = "Daily day-time average temperature in Berlin-Dahlem since 2018",
             subtitle = "Model predictions in <span style='color:red'>red</span>.",
             x = "",
             y = "Temperature in °C") +
        theme(plot.subtitle = element_markdown())

Let’s show that we can again apply the transformation in equation (2) so that we can recover the cosine wave form in equation (1) from the linear combination form of the model equation (3). As expected, we get the same predictions from both model representations:

a <- m1$coefficients[2]
b <- m1$coefficients[3]

# calculate c and phi from the coefficients a and b
c <- sign(a) * sqrt(a^2 + b^2)
phi <- atan(-b/a)

# predictions from the cosine wave form of the model equation for model m1
meas_fit$pred2 <- m1$coefficients[1] + c * cos(meas$x + phi)

filter(meas_fit, year >= 2018) |>
    ggplot() +
        geom_line(aes(date, temp), alpha = 0.25) +
        geom_line(aes(date, pred), color = 'red') +
        geom_line(aes(date, pred2), color = 'blue', linetype = "dashed") +
        labs(title = "Daily day-time average temperature in Berlin-Dahlem since 2018",
             subtitle = "Model predictions in <span style='color:red'>red (linear combination form)</span> 
                         and <span style='color:blue'>blue (cosine wave form)</span>.",
             x = "",
             y = "Temperature in °C") +
        theme(plot.subtitle = element_markdown())

Let’s plot the model predictions for the whole time range.

ggplot(meas_fit) +
    geom_line(aes(date, temp), alpha = 0.25) +
    geom_line(aes(date, pred), color = 'red') +
    labs(title = "Daily day-time average temperature in Berlin-Dahlem over time",
         subtitle = "Model predictions in <span style='color:red'>red</span>.",
         x = "",
         y = "Temperature in °C") +
    theme(plot.subtitle = element_markdown())

It’s barely visible at this scale, but the model systematically overestimates temperatures around the beginning of the time range and underestimates temperatures towards the end of the time range. We can see this more clearly with a plot similar to figure 2. Temperatures before 1970 (dark purple) tend to be overestimated while temperatures from 2000 and later tend to be underestimated:

ggplot(meas_fit, aes(day, temp, color = year)) +
    geom_line(alpha = 0.25) +
    geom_line(aes(day, pred), color = 'red') +
    scale_color_binned(name = "Year", type = 'viridis') +
    labs(title = "Daily day-time average temperature in Berlin-Dahlem over time",
         subtitle = "Model predictions in <span style='color:red'>red</span>.",
         x = "Day of the year",
         y = "Temperature in °C") +
    theme(plot.subtitle = element_markdown())

We can confirm this conjecture quantitatively by calculating the mean error of our predictions \(\hat Y_t - Y_t\) for different decades:

(mean_err_per_decade <- mutate(meas_fit, d = round((year - 1900) / 10) * 10,
                 decade = as.ordered(ifelse(d < 100, paste0("19", d, "s"), sprintf("20%02ds", d - 100)))) |>
    group_by(decade) |>
    summarise(mean_error = mean(pred - temp)))
## # A tibble: 8 × 2
##   decade mean_error
##   <ord>       <dbl>
## 1 1950s      0.292 
## 2 1960s      0.832 
## 3 1970s      0.499 
## 4 1980s      0.512 
## 5 1990s     -0.0935
## 6 2000s     -0.215 
## 7 2010s     -0.463 
## 8 2020s     -1.39
ggplot(mean_err_per_decade, aes(x = decade, y = mean_error)) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    geom_point() +
    scale_y_continuous(limits = c(-2, 2)) +
    labs(title = "Mean prediction error per decade for m1",
         x = "Decade",
         y = "Prediction error in °C")
Mean prediction error per decade for m1

Figure 3: Mean prediction error per decade for m1

Improving the model

The reason for these systematic errors is the slight increase of the average temperature per annum as indicated by the trend line in figure 1. To put it bluntly: we didn’t account for global warming! We can do so by updating the model equation from model m1 (3) in order to include a term for a yearly linear change (i.e. increase) in temperature. Of course this is very simplified, but it should improve our previous model. So our updated model m2 is now:

\[ Y_t = \beta_0 + \beta_1 \cos(x_t) + \beta_2 \sin(x_t) + \beta_3 \text{year}_t + \epsilon_t. \]

And we can fit this updated model:

m2 <- lm(temp ~ cos(x) + sin(x) + year, meas)
summary(m2)
## 
## Call:
## lm(formula = temp ~ cos(x) + sin(x) + year, data = meas)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0969  -2.5327  -0.0539   2.5743  13.0428 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -47.489279   2.080318  -22.83   <2e-16 ***
## cos(x)       -9.012241   0.031902 -282.50   <2e-16 ***
## sin(x)       -2.563383   0.031870  -80.43   <2e-16 ***
## year          0.028642   0.001047   27.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.722 on 27242 degrees of freedom
## Multiple R-squared:  0.7617, Adjusted R-squared:  0.7616 
## F-statistic: 2.902e+04 on 3 and 27242 DF,  p-value: < 2.2e-16

We can see that the fit is slightly better. \(R^2\) increased by about \(1\) percentage points, indicating that about \(1\%\) of the variation of the average daily temperature can be attributed to the annual temperature increase. This doesn’t sound much, however, you should keep in mind that we’re talking about daily average temperatures which are influenced by a lot of factors.

The estimated coefficient for the annual change is about \(0.0289\), which means that the we expect the annual average temperature to increase by about \(0.289\text{°C}\) within \(10\) years, or about \(2.89\text{°C}\) within \(100\) years, which is indeed quite much.

Note also that our intercept has changed from about \(9.4\text{°C}\) to about \(-47.3\text{°C}\), because the model needs to account for the annual temperature increase and hence predicts an average temperature of \(-47.3\text{°C}\) in year \(0\), while for example the average temperature in year \(2000\) is predicted as \(-47.3\text{°C} + 2000 \cdot 0.0289\text{°C} = 10.5\text{°C}\) (and \(39.4\text{°C}\) in year \(3000\)!). This of course shows that such simple models should never be used for extrapolation.

The model fit inspection plots didn’t change much:

plot(m2, which = 1:2, ask = FALSE)

We again check the model fit visually for a smaller time span:

meas_fit2 <- cbind(meas, pred = fitted(m2))

filter(meas_fit2, year >= 2018) |>
    ggplot() +
        geom_line(aes(date, temp), alpha = 0.25) +
        geom_line(aes(date, pred), color = 'red') +
        labs(title = "Daily day-time average temperature in Berlin-Dahlem since 2018",
             subtitle = "Model predictions in <span style='color:red'>red</span>.",
             x = "",
             y = "Temperature in °C") +
        theme(plot.subtitle = element_markdown())

And we also check it for the full time range. Here, we also see the linear increase in average temperature introduced in the updated model:

ggplot(meas_fit2) +
    geom_line(aes(date, temp), alpha = 0.25) +
    geom_line(aes(date, pred), color = 'red') +
    labs(title = "Daily day-time average temperature in Berlin-Dahlem over time",
         subtitle = "Model predictions in <span style='color:red'>red</span>.",
         x = "",
         y = "Temperature in °C") +
    theme(plot.subtitle = element_markdown())

ggplot(meas_fit2, aes(day, temp, color = year)) +
    geom_line(alpha = 0.25) +
    geom_line(aes(day, pred, color = year, group = year)) +
    scale_color_binned(type = 'viridis') +
    labs(title = "Daily day-time average temperature in Berlin-Dahlem over time",
         subtitle = "Model predictions as opaque colored curves.",
         x = "Day of the year",
         y = "Temperature in °C") +
    theme(plot.subtitle = element_markdown())

As compared to model m1 (see figure 3), we don’t see the systematic underestimation of the temperature in the later decades anymore:

(mean_err_per_decade2 <- mutate(meas_fit2, d = round((year - 1900) / 10) * 10,
                                           decade = as.ordered(ifelse(d < 100, paste0("19", d, "s"),
                                                                      sprintf("20%02ds", d - 100)))) |>
    group_by(decade) |>
    summarise(mean_error = mean(pred - temp)))
## # A tibble: 8 × 2
##   decade mean_error
##   <ord>       <dbl>
## 1 1950s    -0.704  
## 2 1960s     0.0641 
## 3 1970s     0.0177 
## 4 1980s     0.318  
## 5 1990s    -0.00176
## 6 2000s     0.163  
## 7 2010s     0.201  
## 8 2020s    -0.461
ggplot(mean_err_per_decade2, aes(x = decade, y = mean_error)) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    geom_point() +
    scale_y_continuous(limits = c(-2, 2)) +
    labs(title = "Mean prediction error per decade for m2",
         x = "Decade",
         y = "Prediction error in °C")
Mean prediction error per decade for m2

Figure 4: Mean prediction error per decade for m2

Investigating deviations from the model

The m2 model seems good enough for our purpose to identify the years with the least and the most deviation from the typical seasonal temperatures. We will hence analyze the residuals \(Y_t - \hat Y_t\) from m2 and we’ll start by looking at their distribution. The distribution of residuals is, as confirmed earlier by the model inspection plots, normal around 0°C:

resid <- meas_fit2$temp - meas_fit2$pred
ggplot(data.frame(resid = resid), aes(resid, after_stat(density))) +
    geom_histogram(bins = 20) +
    labs(title = "Distribution of residuals for model m2",
         x = "Residual of predicted temperature in °C",
         y = "Density")

We want to investigate the annual rate of “unusual” day temperatures, i.e. on how many days the temperature was either much lower or much higher than the typical temperature for that season predicted by our model m2. Choosing a threshold for this can be quite arbitrary. We will at least base the threshold on the given historical data, by using the 90th-percentile of the absolute residuals as threshold:

quantile(abs(resid), 0.9)
##      90% 
## 6.010399

This means an “unusual” temperature is one that, from our historical data, is either higher or lower than the seasonal trend plus the global warming trend by around \(6\text{°C}\) at 10% of the days since 1950. We now use this threshold to calculate some statistics from the residuals for each year: The mean deviation \(\overline{Y_t - \hat Y_t}\), mean absolute deviation \(\overline{|Y_t - \hat Y_t|}\) and the proportions of unusually cold or warm days.

thresh_unusal_temp <- 6

resid_stats <- group_by(meas_fit2, year) |>
    summarise(me = mean(temp - pred),
              mae = mean(abs(temp - pred)),
              prop_days_warmer = mean(temp > pred + thresh_unusal_temp),
              prop_days_colder = mean(temp < pred - thresh_unusal_temp))

resid_stats_plt <- pivot_longer(resid_stats, !year, names_to = "measure")

filter(resid_stats_plt, measure %in% c("mae", "me")) |>
    mutate(measure = factor(case_match(measure, "mae" ~ "Mean abs. deviation", "me" ~ "Mean deviation"),
                            levels = c("Mean deviation", "Mean abs. deviation"))) |>
    ggplot(aes(x = year, y = value, fill = measure)) +
        geom_col(position = position_dodge()) +
        scale_fill_discrete(guide = "none") +
        facet_wrap(vars(measure), nrow = 2, scales = "free_y") +
        labs(title = "Deviation from the typical seasonal temperature",
             x = "Year",
             y = "Temperature in °C")

We can see that there’s quite some variation in terms of mean deviation in the different years. Some years like 1996 or 2010 tended to be mostly colder than usual, while 1953 and 2024 (so far) were mostly warmer than usual:

select(resid_stats, year, me) |>
    arrange(me) |>
    mutate(rank = row_number()) |>
    filter(rank %in% c(1:3, (n()-2):n())) |>
    select(rank, everything())
## # A tibble: 6 × 3
##    rank  year    me
##   <int> <int> <dbl>
## 1     1  1996 -1.92
## 2     2  2010 -1.86
## 3     3  1987 -1.58
## 4    73  1951  1.27
## 5    74  2024  1.41
## 6    75  1953  1.57

Exceptionally high and low temperatures may cancel each other out, so we have to look at the mean absolute deviation and at the proportion of unusual cold or warm days. 2017, 1973 and 1958 are the years with lowest mean absolute deviation from the seasonal trend, while 1956, 2010 and 1985 deviated the most.

select(resid_stats, year, mae) |>
    arrange(mae) |>
    mutate(rank = row_number()) |>
    filter(rank %in% c(1:3, (n()-2):n())) |>
    select(rank, everything())
## # A tibble: 6 × 3
##    rank  year   mae
##   <int> <int> <dbl>
## 1     1  2017  2.35
## 2     2  1973  2.45
## 3     3  1958  2.48
## 4    73  1956  3.48
## 5    74  2010  3.52
## 6    75  1985  3.55
filter(resid_stats_plt, measure %in% c("prop_days_warmer", "prop_days_colder")) |>
    mutate(measure = case_match(measure, "prop_days_warmer" ~ "Unusually warm", "prop_days_colder" ~ "Unusually cold")) |>
    ggplot(aes(x = year, y = value, fill = measure)) +
        geom_col(position = position_stack()) +
        scale_fill_discrete(limits = rev, name = "") +
        labs(title = "Proportion of unusually warm or cold days",
             x = "Year",
             y = "")

1958, 2017 and 2023 had the least number of days with unusually low or high temperatures and 1985, 2010 and 2013 had the most. Again, please keep in mind, that our model m2 already takes into account the annual temperature increase caused by global warming, so “unusual” refers only to the seasonal trend.

select(resid_stats, year, prop_days_warmer, prop_days_colder) |>
    mutate(prop_days_unusual = prop_days_warmer + prop_days_colder) |>
    arrange(prop_days_unusual) |>
    mutate(rank = row_number()) |>
    filter(rank %in% c(1:3, (n()-2):n())) |>
    select(rank, year, prop_days_unusual, everything())
## # A tibble: 6 × 5
##    rank  year prop_days_unusual prop_days_warmer prop_days_colder
##   <int> <int>             <dbl>            <dbl>            <dbl>
## 1     1  1958            0.0384           0.0219           0.0164
## 2     2  2017            0.0411           0.0274           0.0137
## 3     3  1951            0.0493           0.0493           0     
## 4    73  2013            0.153            0.0521           0.101 
## 5    74  2010            0.159            0.0329           0.126 
## 6    75  1985            0.195            0.0493           0.145

We identify the years with the least and most deviation using the the mean absolute deviation measure. By this measure, 2017 is the “most ideal year”, i.e. the year following closest the seasonal trend. This is also reflected by a small proportion of unusual day temperatures (ranking second in the above table). In contrast, 1985 is the year with the highest mean absolute deviation. It also has the highest proportion of unusual day temperatures, especially caused by a very cold winter. The following figure compares both years and highlights unusual temperatures:

least_deviation_yr <- slice_min(resid_stats, mae) |> pull(year)
most_deviation_yr <- slice_max(resid_stats, mae) |> pull(year)

least_most_plt <- data.frame(year = c(least_deviation_yr, most_deviation_yr), label = c("least deviation", "most deviation")) |>
    inner_join(meas_fit2, by = 'year') |>
    mutate(label = paste0(year, " (", label, ")"),
           resid = temp - pred,
           transparency = ifelse(abs(resid) > thresh_unusal_temp, 0.5, 0.1))

ggplot(least_most_plt, aes(day, temp, color = label)) +
    geom_point(aes(alpha = transparency)) +
    geom_line(aes(day, pred)) +
    scale_color_discrete(guide = guide_legend(title = NULL)) +
    scale_alpha_identity(guide = NULL) +
    labs(title = "Daily day-time average temperature in Berlin-Dahlem over time for select years",
         x = "Day of the year",
         y = "Temperature in °C")

Looking only at the residuals gives a clearer image regarding the deviations:

ggplot(least_most_plt, aes(day, resid, color = label, alpha = transparency)) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    geom_hline(yintercept = -thresh_unusal_temp, linetype = "dotted") +
    geom_hline(yintercept = thresh_unusal_temp, linetype = "dotted") +
    geom_point() +
    scale_color_discrete(guide = guide_legend(title = NULL)) +
    scale_alpha_identity(guide = NULL) +
    labs(title = "Residuals for select years",
         x = "Day of the year",
         y = "Residual of predicted temperature in °C")

Global warming

In our previous model m2 we included the yearly increasing temperatures due to global warming. This means, we expected higher temperatures in recent years and so they are not “unusual.” But what happens if we don’t account for global warming? For this, we train the model only with data from the first ten years, i.e. 1950 to 19591, and don’t include the year term:

# fit the model
m3 <- lm(temp ~ cos(x) + sin(x), filter(meas, year < 1960))
summary(m3)
## 
## Call:
## lm(formula = temp ~ cos(x) + sin(x), data = filter(meas, year < 
##     1960))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2150  -2.4748  -0.1029   2.4999  11.7294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.90251    0.06101  145.91   <2e-16 ***
## cos(x)      -9.08065    0.08638 -105.12   <2e-16 ***
## sin(x)      -2.76348    0.08619  -32.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.687 on 3649 degrees of freedom
## Multiple R-squared:  0.768,  Adjusted R-squared:  0.7679 
## F-statistic:  6040 on 2 and 3649 DF,  p-value: < 2.2e-16

Again we check the model fit:

meas_fit3 <- cbind(meas, pred = predict(m3, meas))

ggplot(meas_fit3) +
    geom_line(aes(date, temp), alpha = 0.25) +
    geom_line(aes(date, pred), color = 'red') +
    labs(title = "Daily day-time average temperature in Berlin-Dahlem",
         subtitle = "Model predictions in <span style='color:red'>red</span>.",
         x = "",
         y = "Temperature in °C") +
    theme(plot.subtitle = element_markdown())

We can see that since the 1990s, the model underestimates the actual temperatures:

(mean_err_per_decade3 <- mutate(meas_fit3, d = round((year - 1900) / 10) * 10,
                                           decade = as.ordered(ifelse(d < 100, paste0("19", d, "s"),
                                                                      sprintf("20%02ds", d - 100)))) |>
    group_by(decade) |>
    summarise(mean_error = mean(pred - temp)))
## # A tibble: 8 × 2
##   decade mean_error
##   <ord>       <dbl>
## 1 1950s   -0.221   
## 2 1960s    0.318   
## 3 1970s   -0.0144  
## 4 1980s   -0.000892
## 5 1990s   -0.607   
## 6 2000s   -0.728   
## 7 2010s   -0.977   
## 8 2020s   -1.91
ggplot(mean_err_per_decade3, aes(x = decade, y = mean_error)) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    geom_point() +
    scale_y_continuous(limits = c(-2, 2)) +
    labs(title = "Mean prediction error per decade for m3",
         x = "Decade",
         y = "Prediction error in °C")

resid_stats <- group_by(meas_fit3, year) |>
    summarise(me = mean(temp - pred),
              mae = mean(abs(temp - pred)),
              prop_days_warmer = mean(temp > pred + thresh_unusal_temp),
              prop_days_colder = mean(temp < pred - thresh_unusal_temp))

resid_stats_plt <- pivot_longer(resid_stats, !year, names_to = "measure")

filter(resid_stats_plt, measure %in% c("mae", "me")) |>
    mutate(measure = factor(case_match(measure, "mae" ~ "Mean abs. deviation", "me" ~ "Mean deviation"),
                            levels = c("Mean deviation", "Mean abs. deviation"))) |>
    ggplot(aes(x = year, y = value, fill = measure)) +
        geom_col(position = position_dodge()) +
        scale_fill_discrete(guide = "none") +
        facet_wrap(vars(measure), nrow = 2, scales = "free_y") +
        labs(title = "Deviation from the typical seasonal temperature",
             x = "Year",
             y = "Temperature in °C")

Which years feature the most deviation now with this model? As we can see, the years with the highest positive deviations all occurred recently:

select(resid_stats, year, me) |>
    arrange(me) |>
    mutate(rank = row_number()) |>
    filter(rank %in% c(1:3, (n()-2):n())) |>
    select(rank, everything())
## # A tibble: 6 × 3
##    rank  year    me
##   <int> <int> <dbl>
## 1     1  1956 -1.34
## 2     2  1996 -1.15
## 3     3  1962 -1.08
## 4    73  2023  2.26
## 5    74  2019  2.26
## 6    75  2024  3.08
select(resid_stats, year, mae) |>
    arrange(mae) |>
    mutate(rank = row_number()) |>
    filter(rank %in% c(1:3, (n()-2):n())) |>
    select(rank, everything())
## # A tibble: 6 × 3
##    rank  year   mae
##   <int> <int> <dbl>
## 1     1  1958  2.46
## 2     2  1973  2.47
## 3     3  2017  2.48
## 4    73  1956  3.55
## 5    74  2018  3.59
## 6    75  2024  4.05

Since the beginning of the millennium, we can see a high proportion of unusually hot days:

filter(resid_stats_plt, measure %in% c("prop_days_warmer", "prop_days_colder")) |>
    mutate(measure = case_match(measure, "prop_days_warmer" ~ "Unusually warm", "prop_days_colder" ~ "Unusually cold")) |>
    ggplot(aes(x = year, y = value, fill = measure)) +
        geom_col(position = position_stack()) +
        scale_fill_discrete(limits = rev, name = "") +
        labs(title = "Proportion of unusually warm or cold days",
             x = "Year",
             y = "")

The years with the highest proportion of unusual day temperatures all occurred recently, with the current year heading for a record mostly because of unusually hot temperatures.

select(resid_stats, year, prop_days_warmer, prop_days_colder) |>
    mutate(prop_days_unusual = prop_days_warmer + prop_days_colder) |>
    arrange(prop_days_unusual) |>
    mutate(rank = row_number()) |>
    filter(rank %in% c(1:3, (n()-2):n())) |>
    select(rank, year, prop_days_unusual, everything())
## # A tibble: 6 × 5
##    rank  year prop_days_unusual prop_days_warmer prop_days_colder
##   <int> <int>             <dbl>            <dbl>            <dbl>
## 1     1  1958            0.0411          0.0219           0.0192 
## 2     2  1951            0.0411          0.0384           0.00274
## 3     3  1965            0.0548          0.00822          0.0466 
## 4    73  1998            0.175           0.137            0.0384 
## 5    74  2018            0.192           0.164            0.0274 
## 6    75  2024            0.271           0.257            0.0138

When we sort by proportion of unusually warm day temperatures, the top five list includes only recent years. This year so far every fourth day is unusually warm, when using the 1950s as reference:

select(resid_stats, year, prop_days_warmer) |>
    arrange(desc(prop_days_warmer)) |>
    head(5)
## # A tibble: 5 × 2
##    year prop_days_warmer
##   <int>            <dbl>
## 1  2024            0.257
## 2  2023            0.167
## 3  2018            0.164
## 4  2019            0.145
## 5  2015            0.142

We can again plot the year with the least (1958) and most (2024) deviation from the expected day temperatures:

least_deviation_yr <- slice_min(resid_stats, mae) |> pull(year)
most_deviation_yr <- slice_max(resid_stats, mae) |> pull(year)

least_most_plt <- data.frame(year = c(least_deviation_yr, most_deviation_yr), label = c("least deviation", "most deviation")) |>
    inner_join(meas_fit3, by = 'year') |>
    mutate(label = paste0(year, " (", label, ")"),
           resid = temp - pred,
           transparency = ifelse(abs(resid) > thresh_unusal_temp, 0.5, 0.1))

ggplot(least_most_plt) +
    geom_point(aes(day, temp, alpha = transparency, color = label)) +
    geom_line(aes(day, pred)) +
    scale_color_discrete(guide = guide_legend(title = NULL)) +
    scale_alpha_identity(guide = NULL) +
    labs(title = "Daily day-time average temperature in Berlin-Dahlem over time for select years",
         x = "Day of the year",
         y = "Temperature in °C")

Conclusion

We saw how we can model a periodic trend such as seasonal changes in day temperatures using a simple linear model. We could then analyze the deviations between the actual temperatures and the predictions of this model to identify years with low and high deviations, i.e. years with few and with many unusually cold or warm days. The results are quite different depending on whether or not we take into account the annual temperature increase due to global warming.


  1. We can assume that global warming already had an effect in the 1950s, but we don’t have data from earlier.↩︎