1 Introduction

This tutorial focuses on extreme-value modeling using generalized additive models (GAMs) and the Integrated Nested Laplace Approximation (INLA) as implemented in the INLA package, usually referred to as “R-INLA”. We first load a few libraries that will be needed. If R-INLA is not yet installed, please uncomment and run the corresponding install.packages() command.

# install.packages('INLA',repos=c(getOption('repos'),INLA='https://inla.r-inla-download.org/R/stable'),
# dep=TRUE)
library(evgam)  # frequentist GAMs, and Colorado precipitation dataset
library(extRemes)  # for univariate extremes 
library(fields)  #for plotting (image.plot)
# library(geostatsp) # library for geostatistical modeling making use of
# INLA
library(grDevices)  # for color palettes
library(INLA)  # implementation of INLA 
# library(inlabru) # high-level interface to R-INLA with additional modeling
# tools
library(maps)  # mapping tools
library(sp)  # spatial tools
library(viridis)  # viridis color scales

2 Colorado precipitation data

The evgam package provides a data frame with observations of daily precipitation amounts at 64 gauge stations in the state of Colorado, US, during the summer months of April to October (type “?COprcp” for more information about this dataset and the accompanying station metadata COprcp_meta).

data(COprcp)
dim(COprcp)
## [1] 404326      3
head(COprcp)
##         date prcp meta_row
## 1 1990-04-01  0.0        1
## 2 1990-04-02  0.5        1
## 3 1990-04-03  0.0        1
## 4 1990-04-04  0.0        1
## 5 1990-04-05  0.0        1
## 6 1990-04-06  8.1        1
str(COprcp)
## 'data.frame':    404326 obs. of  3 variables:
##  $ date    : Date, format: "1990-04-01" "1990-04-02" ...
##  $ prcp    : num  0 0.5 0 0 0 8.1 0 0 0 0.8 ...
##  $ meta_row: int  1 1 1 1 1 1 1 1 1 1 ...
head(COprcp_meta)
##            id            name       lon     lat   elev
## 1 USC00050263     ANTERO RSVR -105.8919 38.9933 2718.8
## 2 USC00050454          BAILEY -105.4764 39.4053 2362.8
## 3 USC00050848         BOULDER -105.2667 39.9919 1671.5
## 4 USC00050950   BRIGHTON 3 SE -104.8361 39.9436 1528.9
## 5 USC00051060 BUCKHORN MTN 1E -105.2969 40.6158 2255.5
## 6 USC00051179     BYERS 5 ENE -104.1275 39.7406 1554.8
n_sites = nrow(COprcp_meta)
coor = cbind(COprcp_meta$lon, COprcp_meta$lat)
dates = as.POSIXlt(COprcp$date)
dates_unique = sort(unique(dates))
# beginning of observation period:
head(dates_unique)
## [1] "1990-04-01 UTC" "1990-04-02 UTC" "1990-04-03 UTC" "1990-04-04 UTC"
## [5] "1990-04-05 UTC" "1990-04-06 UTC"
# end of observation period:
tail(dates_unique)
## [1] "2019-10-26 UTC" "2019-10-27 UTC" "2019-10-28 UTC" "2019-10-29 UTC"
## [5] "2019-10-30 UTC" "2019-10-31 UTC"
n_days = length(unique(dates))
years = 1900 + dates$year
range(years)
## [1] 1990 2019
months = 1 + dates$mon  # month (1-12)
ydays = 1 + dates$yday  # day within year (1-366)
obs = COprcp$prcp
id_coor = COprcp$meta_row
obs_mat = matrix(NA, nrow = n_days, ncol = n_sites)
# fill the matrix:
for (i in 1:n_sites) {
    rows_i = which(id_coor == i)
    obs_mat[match(dates[rows_i], dates_unique), i] = obs[rows_i]
}
# Are there any missing values?
mean(is.na(obs_mat))  # yes, around 1.6% of all values are missing. 
## [1] 0.01595113
# Show the time series for two locations:
par(mfrow = c(1, 2))
plot(dates_unique, obs_mat[, 1], type = "h", ylim = c(0, max(obs_mat[, c(1, 
    20)], na.rm = TRUE)), xlab = "Year", ylab = "Daily precipitation")
plot(dates_unique, obs_mat[, 20], type = "h", ylim = c(0, max(obs_mat[, c(1, 
    20)], na.rm = TRUE)), xlab = "Year", ylab = "Daily precipitation")

# export as pdf:
# dev.copy(pdf, file = 'precipitation-series.pdf', width = 12)
# dev.off()

For potential later use, we transform the longitude-latitude coordinates to a coordinate system that is optimized for the United States and corresponds to metric distances in the study region. This is advisable for accurate modeling of spatial correlation, since the metric distance covered by one degree of latitude differs (slightly) from that covered by one degree of longitude. We use EPSG codes (European Petroleum Survey Group) to specify the projections:

As spatial distance unit, we use 1 km. Note that some implementations may show numerical stability problems with very large coordinate values (e.g., coordinates expressed in meters).

coor_sp = SpatialPoints(coords = coor, proj4string = CRS("+init=epsg:4326"))
coor_metric = spTransform(coor_sp, CRS("+init=epsg:2163"))@coords/10^3
# We prepare the boundary coordinates of the state of Colorado:
mymap = map("state", regions = "colorado", plot = FALSE)
# Transform coordinates (those that are not 'NA') to EPSG 2163 (in meters)
# using spTransform, then convert to km:
xy_region = SpatialPoints(coords = na.omit(cbind(mymap$x, mymap$y)), proj4string = CRS("+init=epsg:4326"))
xy_region = spTransform(xy_region, CRS("+init=epsg:2163"))@coords  # in m unit
xy_region = xy_region/10^3  # in km unit
mymap$x[!is.na(mymap$x)] = xy_region[, 1]
mymap$y[!is.na(mymap$y)] = xy_region[, 2]
# We plot observation sites and the Colorado boundary:
plot(coor_metric, pch = 19, cex = 0.5, asp = 1, xlab = "x (km)", ylab = "y (km)")
lines(mymap$x, mymap$y, lwd = 2)

3 Part I: Univariate extreme-value theory

3.1 Annual maxima

For modeling annual maxima, we generate a data matrix containing only the annual maxima, where rows correspond to years and columns correspond to observation locations. We use the aggregate function to extract annual maxima (removing missing data within each year). The result of aggregate is a data frame whose first column indicates the year and whose remaining columns report the annual maxima. We transform this data frame into a numeric matrix containing only the maxima. For illustration, we plot the series of annual maxima for two observation locations.

maxima = aggregate(obs_mat, by = list(year = 1900 + dates_unique$year), FUN = max, 
    na.rm = TRUE)  # first column of maxima contains the year
maxima = as.matrix(maxima[, -1])
maxima[maxima == -Inf] = NA
par(mfrow = c(1, 2))
plot(unique(1900 + dates_unique$year), maxima[, 1], type = "h", ylim = c(0, 
    max(maxima[, 1:2])), xlab = "Year", ylab = "Precipitation maximum")
plot(unique(1900 + dates_unique$year), maxima[, 20], type = "h", ylim = c(0, 
    max(maxima[, 1:2])), xlab = "Year", ylab = "Precipitation maximum")

For a (very) simple GEVD model, we can estimate a single set of GEVD parameters from the pooled sample of maxima over all stations and years. For this purpose, we here use the extRemes package. Several other packages implement maximum likelihood estimation of the GEV and GPD (e.g., evd, fExtremes).

max_sample = as.numeric(maxima)
max_sample = na.omit(max_sample)
fit = fevd(max_sample, type = "GEV", method = "MLE", period.basis = "year", 
    use.phi = TRUE)
summary(fit)
## 
## fevd(x = max_sample, use.phi = TRUE, type = "GEV", method = "MLE", 
##     period.basis = "year")
## 
## [1] "Estimation Method used: MLE"
## 
## 
##  Negative Log-Likelihood Value:  7671.267 
## 
## 
##  Estimated parameters:
##   location      scale      shape 
## 27.7792569 10.6739940  0.1021712 
## 
##  Standard Error Estimates:
##   location      scale      shape 
## 0.27200262 0.20443419 0.01583819 
## 
##  Estimated parameter covariance matrix.
##              location         scale         shape
## location  0.073985426  0.0261594943 -0.0012695922
## scale     0.026159494  0.0417933383 -0.0003801474
## shape    -0.001269592 -0.0003801474  0.0002508484
## 
##  AIC = 15348.53 
## 
##  BIC = 15365.21
# plot(fit) # visualizations to check goodness-of-fit

We obtain a slightly positive value of the shape parameter (xi), which is typical for precipitation data.
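
From such a fitted object, we can also derive return-level estimates. As a brief illustrative sketch (keeping in mind that pooling maxima across all stations ignores spatial heterogeneity), the return.level() function of extRemes can be applied to the fitted object:

# return-level estimates for 10-, 50- and 100-year return periods
# (pooled fit, so these are only crude overall values):
return.level(fit, return.period = c(10, 50, 100))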

3.2 Threshold exceedances

For a (very) simple peaks-over-threshold model of precipitation data, we could choose a global threshold as a high quantile of the positive precipitation intensities, for instance at the 95% level. After fixing the threshold, we can explore the resulting exceedances (their number, a histogram of the excesses, etc.).

prob = 0.95
thresh = quantile(obs_mat[obs_mat > 0], prob, na.rm = TRUE)
sum(as.numeric(obs_mat > thresh), na.rm = TRUE)
## [1] 5894
hist((obs_mat - thresh)[obs_mat > thresh], xlab = "Excesses", main = "Histogram of excesses")

Next, we can estimate a global GPD model (where we first remove the missing data).

fit = fevd(x = obs_mat[!is.na(obs_mat)], threshold = thresh, type = "GP", method = "MLE", 
    use.phi = TRUE)
summary(fit)
## 
## fevd(x = obs_mat[!is.na(obs_mat)], threshold = thresh, use.phi = TRUE, 
##     type = "GP", method = "MLE")
## 
## [1] "Estimation Method used: MLE"
## 
## 
##  Negative Log-Likelihood Value:  19178.96 
## 
## 
##  Estimated parameters:
##     scale     shape 
## 8.1710978 0.1533774 
## 
##  Standard Error Estimates:
##      scale      shape 
## 0.16224081 0.01511824 
## 
##  Estimated parameter covariance matrix.
##              scale         shape
## scale  0.026322082 -0.0016223712
## shape -0.001622371  0.0002285611
## 
##  AIC = 38361.92 
## 
##  BIC = 38375.29
# plot(fit) # visualizations to check goodness-of-fit

Again, we obtain a slightly positive value of the shape parameter (xi), though it differs slightly from the GEVD estimate.

Remark: Various tools exist to help with selecting a threshold above which the peaks-over-threshold model becomes stable (but the choice is rarely obvious and often not easy…). To validate our threshold from above (or to find a more appropriate one), we can study a parameter stability plot over a range of candidate thresholds.

threshrange.plot(obs_mat[!is.na(obs_mat)], type = "GP", r = c(10, 80), nint = 8)

Here, the estimates look relatively stable for different thresholds when taking into account the estimation uncertainty given by the error bars. In particular, estimates for lower thresholds are usually within the error bars of estimates at higher thresholds.
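
Another common diagnostic is the mean residual life plot, which should be approximately linear in the threshold above a level where the GPD assumption is adequate. A minimal sketch using the mrlplot() function from extRemes (the dashed vertical line marks the threshold chosen above):

# mean residual life plot as an additional threshold diagnostic:
mrlplot(obs_mat[!is.na(obs_mat)], nint = 50)
abline(v = thresh, lty = 2)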

4 Part II: Generalized additive models for extremes

We now explore GAMs for extremes as implemented in evgam.

4.1 A GAM with GEV response depending on elevation and spatial location

We continue working with the Colorado precipitation data from the evgam package. The package also provides gridded elevation data (COelev) as an auxiliary covariate.

par(lwd = 2, cex.axis = 1.5, cex.lab = 1.5)
brks = pretty(COelev$z, 50)
cols = viridis(length(brks) - 1)
image(COelev$x, COelev$y, COelev$z, breaks = brks, col = cols, asp = 1, xlab = "Longitude", 
    ylab = "Latitude")
col_points = cols[as.integer(cut(COprcp_meta$elev, brks))]
points(COprcp_meta$lon, COprcp_meta$lat, pch = 4)

# export as pdf:
# dev.copy(pdf, file = 'map-colorado.pdf', height = 12)
# dev.off()

Could extreme-value behavior vary with elevation? For a simple exploratory check, let’s plot averages of location-wise annual maxima against elevation.

plot(COprcp_meta$elev, apply(maxima, 2, mean), xlab = "Elevation", ylab = "Average of maxima", 
    pch = 19)

There could be a slight decreasing trend.
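
As a rough quantitative complement to this visual impression, we could regress the station-wise average maxima on elevation (a simple exploratory sketch that ignores spatial dependence between stations; na.rm skips years without data):

avg_max = apply(maxima, 2, mean, na.rm = TRUE)
summary(lm(avg_max ~ COprcp_meta$elev))$coefficients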

We here explore a GEV model developed by Youngman (2020). It includes two spline terms in the location parameter \(\mu\) (one for elevation and one for spatial location) and one spline term in the scale parameter (for spatial location). The shape parameter is kept constant (but is estimated). For evgam, we have to provide all data (response and predictors) in a data frame.

COprcp$year = format(COprcp$date, "%Y")
COprcp_gev = aggregate(prcp ~ year + meta_row, COprcp, max)
COprcp_gev = cbind(COprcp_gev, COprcp_meta[COprcp_gev$meta_row, ])

We can now specify the structure of the linear predictors for the three GEV parameters (location, scale, shape). We use relatively high numbers of basis functions (k = 30, k = 20), such that the penalty mechanism can automatically determine the smoothness/wiggliness required for a good fit. The default basis for two-dimensional splines (in lon and lat) consists of so-called “thin-plate regression splines”; alternatively, we could use tensor product splines by writing “te(…)” instead of “s(…)”. For the one-dimensional spline in elevation, we do not use the default but cubic regression splines (bs = “cr”). See the help pages for more details.

# Uncomment the following two lines to consult the help pages for available
# spline types ?mgcv::s ?mgcv::smooth.terms
form_gev = list(prcp ~ s(lon, lat, k = 30) + s(elev, bs = "cr"), ~s(lon, lat, 
    k = 20), ~1)

Now we can estimate this GEV model and print a summary of the fitted object. The summary shows estimated linear coefficients, and p-values are provided to assess if the nonlinear spline terms improve the goodness-of-fit of the model.

fit_gev = evgam(form_gev, COprcp_gev, family = "gev")
summary(fit_gev)
## 
## ** Parametric terms **
## 
## location
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    28.56       0.26  111.89   <2e-16
## 
## logscale
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     2.24       0.02  118.07   <2e-16
## 
## shape
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     0.08       0.02    5.08 1.92e-07
## 
## ** Smooth terms **
## 
## location
##              edf max.df Chi.sq Pr(>|t|)
## s(lon,lat) 19.27     29 178.23   <2e-16
## s(elev)     5.19      9  19.39  0.00139
## 
## logscale
##              edf max.df Chi.sq Pr(>|t|)
## s(lon,lat) 13.94     19 211.15   <2e-16

Here, the inclusion of the spline terms seems to be beneficial since all p-values are very small.

To explore the results, we can plot the estimated spline functions. The default outputs (with confidence information) are contour plots for two-dimensional splines, and estimated curves for one-dimensional splines.

plot(fit_gev)

We can also plot spatial maps of estimated parameters (and other quantities, such as return levels). For this, the evgam package provides the elevation values over a rectangular grid of longitude and latitude values. We can use the predict function, in which we specify that we want to predict the “response” (i.e., the estimates for location, scale and shape obtained after back-transforming the linear predictor with the link function).

COprcp_plot = expand.grid(lon = COelev$x, lat = COelev$y)
COprcp_plot$elev = as.vector(COelev$z)
pred_gev = predict(fit_gev, COprcp_plot, type = "response")
data2plot = COelev
data2plot$z[] = pred_gev$location
par(mfrow = c(1, 2))  # 1-line 2-columns display
image(data2plot, asp = 1, col = viridis(16))
title("GEV location")
data2plot$z[] = pred_gev$scale
image(data2plot, asp = 1, col = viridis(16))
title("GEV scale")

4.2 Exercise

Implement a GAM with GPD response for threshold exceedances. Note: you can follow the code examples in https://arxiv.org/abs/2003.04067.
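
A possible starting point (a sketch, not a full solution; the reference above goes further, e.g. with a covariate-dependent threshold): reuse the global 95% threshold from Part I and model the positive excesses with a GPD response, where the first formula is for the log-scale and the second for the (constant) shape parameter.

# thresh: global 95% threshold computed in Section 3.2
COprcp_gpd = cbind(COprcp, COprcp_meta[COprcp$meta_row, ])
COprcp_gpd$excess = COprcp_gpd$prcp - thresh
COprcp_gpd = COprcp_gpd[!is.na(COprcp_gpd$excess) & COprcp_gpd$excess > 0, ]
form_gpd = list(excess ~ s(lon, lat, k = 20) + s(elev, bs = "cr"), ~1)
fit_gpd_gam = evgam(form_gpd, COprcp_gpd, family = "gpd")
summary(fit_gpd_gam)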

5 Part III: Integrated Nested Laplace Approximation

We implement two relatively simple extreme-value models in INLA to illustrate how the implementation in R-INLA works.

5.1 A GEV model that is nonstationary in elevation

We here estimate a nonlinear elevation effect for the Colorado precipitation data by using a step function with a Gaussian first-order random walk prior. We start by creating a data frame containing the response variable and the covariates. For the random walk, we partition the elevation range of the study area into 16 equidistant intervals. For the intercept beta_0, we define a covariate “intercept”. If you want to read how the GEV distribution is parametrized in INLA, you can call inla.doc(“gev”) in the console (which will open a pdf file containing the documentation).

names(COprcp_gev)
## [1] "year"     "meta_row" "prcp"     "id"       "name"     "lon"     
## [7] "lat"      "elev"
data_df = COprcp_gev
data_df$intercept = 1
nbreak = 16
breaks = min(COelev$z) + (max(COelev$z) - min(COelev$z)) * (0:nbreak)/nbreak
breaks[1] = breaks[1] - 1  # avoid observations on the boundary
breaks[nbreak + 1] = breaks[nbreak + 1] + 1  # avoid observations on the upper boundary
data_df$elev_group = as.integer(cut(COprcp_gev$elev, breaks))
table(data_df$elev_group)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
##  90 298 240 120  90  89 180 270 120 210 120  90

We will use a Penalized Complexity Prior for the precision (i.e., 1/variance) hyperparameter of the random walk. Here we specify that the prior probability of having standard deviation larger than 1 is 0.5. We then specify the regression formula of the INLA model, and run the estimation. Finally, we can print a summary of the fitted object.
Note: the “-1” in the INLA formula removes the internal “default” intercept, which we replace with our user-defined intercept variable.

# inla.doc('prec')
hyper_prec = list(theta = list(prior = "pc.prec", param = c(1, 0.5)))
my_formula = prcp ~ -1 + intercept + f(elev_group, model = "rw1", hyper = hyper_prec)
fit = inla(my_formula, data = data_df, family = "gev", verbose = FALSE)
summary(fit)
## 
## Call:
##    "inla(formula = my_formula, family = \"gev\", data = data_df, 
##    verbose = FALSE)" 
## Time used:
##     Pre = 1.24, Running = 3.7, Post = 0.0815, Total = 5.02 
## Fixed effects:
##             mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## intercept 28.015 0.275     27.478   28.014     28.554 28.013   0
## 
## Random effects:
##   Name     Model
##     elev_group RW1 model
## 
## Model hyperparameters:
##                                       mean    sd 0.025quant 0.5quant
## precision for GEV observations       0.009 0.000      0.009    0.009
## shape-parameter for gev observations 0.094 0.015      0.065    0.094
## Precision for elev_group             0.136 0.063      0.049    0.125
##                                      0.975quant  mode
## precision for GEV observations            0.010 0.009
## shape-parameter for gev observations      0.123 0.094
## Precision for elev_group                  0.292 0.103
## 
## Expected number of effective parameters(stdev): 10.38(0.56)
## Number of equivalent replicates : 184.61 
## 
## Marginal log-Likelihood:  -7626.43

We can now plot the estimated elevation effect along with 95% credible intervals. To this end, we first define an auxiliary function for plotting random-effect coefficients (betas) estimated with INLA, and then apply it.

plot_random_effect = function(effect) {
    ylim = range(effect$mean, effect$`0.025quant`, effect$`0.975quant`)
    plot(effect$ID, effect$mean, type = "l", lwd = 2, ylim = ylim, xlab = "ID", 
        ylab = "Coefficient")
    lines(effect$ID, effect$`0.025quant`, lwd = 2, col = "blue", lty = 2)
    lines(effect$ID, effect$`0.975quant`, lwd = 2, col = "blue", lty = 2)
}
plot_random_effect(fit$summary.random$elev_group)

We find a nonlinear influence with two modes, similar to what we had estimated in the GAM model. Next, we provide a spatial visualization with a map of the elevation effect at the observed locations.

head(fit$summary.random$elev_group)
##   ID       mean        sd 0.025quant   0.5quant 0.975quant       mode
## 1  1 -0.6011079 0.9096767 -2.4424182 -0.5818653   1.131988 -0.5439435
## 2  2  2.5195943 0.5672077  1.4003237  2.5213039   3.627956  2.5244663
## 3  3  2.4244501 0.5944474  1.2403670  2.4300823   3.576135  2.4412076
## 4  4  4.9889844 0.8439180  3.3148233  4.9950408   6.627486  5.0067052
## 5  5  1.5264130 0.8666065 -0.2103671  1.5379953   3.196831  1.5603000
## 6  6 -2.9848712 0.8622880 -4.7416271 -2.9621440  -1.355059 -2.9171633
##            kld
## 1 2.255227e-07
## 2 1.352467e-07
## 3 6.887784e-08
## 4 5.211328e-08
## 5 1.060195e-07
## 6 5.952874e-08
est_elev_scaled = scale(fit$summary.random$elev_group$mean)/4
size_points = exp(est_elev_scaled)
# assign to each site the scaled coefficient of its elevation class:
site_group = as.integer(cut(COprcp_meta$elev, breaks))
plot(coor_metric, pch = 19, cex = size_points[match(site_group, fit$summary.random$elev_group$ID)], 
    asp = 1, xlab = "x (km)", ylab = "y (km)")
lines(mymap$x, mymap$y, lwd = 2)

Based on the gridded elevation data, we can also map the elevation effect on the grid. Note that we do not have observations at very high altitudes. Therefore, we extrapolate the step function towards high elevations by using the coefficient of the highest elevation class that contains observations.

data2plot = COelev
beta_elev = fit$summary.random$elev_group$mean
beta_elev[13:16] = beta_elev[12]
data2plot$z[] = beta_elev[cut(COelev$z, breaks)]
image(data2plot, asp = 1, col = viridis(15))

Due to the step function approach (and the absence of a smooth longitude-latitude effect), the map here looks less smooth than the GAM estimate from evgam.
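
If a smoother elevation effect is preferred, a second-order random walk prior (“rw2” in INLA) penalizes deviations from a linear trend rather than from a constant. A minimal sketch (not run here), reusing the same data frame and PC prior on the precision:

my_formula_rw2 = prcp ~ -1 + intercept + f(elev_group, model = "rw2", hyper = hyper_prec)
fit_rw2 = inla(my_formula_rw2, data = data_df, family = "gev", verbose = FALSE)
plot_random_effect(fit_rw2$summary.random$elev_group)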

5.2 A time series model for threshold exceedances of temperatures

Next, we illustrate a latent AR1 time series model for threshold exceedances. Since temporal dependence among threshold exceedances of daily precipitation is only very weak, we instead consider a daily time series of temperature maxima at Fort Collins (Colorado), and we focus on the hot summer months of July and August.

data(FCtmax)
obs_ts = FCtmax$tmax
any(is.na(obs_ts))
## [1] FALSE
plot(obs_ts, type = "h")

months_ts = 1 + as.POSIXlt(FCtmax$date)$mon
year_ts = 1900 + as.POSIXlt(FCtmax$date)$year
idx2keep = which(months_ts %in% 7:8)
obs_ts = obs_ts[idx2keep]
months_ts = months_ts[idx2keep]
year_ts = year_ts[idx2keep]
table(months_ts)
## months_ts
##    7    8 
## 1545 1549
n_ts = length(obs_ts)
plot(obs_ts, type = "h")

Now we fix a threshold and generate exceedance indicators and positive excesses. As threshold, we use the empirical 90% quantile of the series.

prob = 0.9
thresh = quantile(obs_ts, prob, na.rm = TRUE)
thresh
##  90% 
## 34.4
excess = pmax(obs_ts - thresh, 0)
mean(excess[excess > 0])  # average excess
## [1] 1.375472
is_exceed = as.numeric(obs_ts > thresh)
sum(is_exceed)
## [1] 212

Our estimation procedure here consists of two steps. We construct and fit two separate models: the first for the probability of exceeding the threshold, the second for the size of the positive excesses above the threshold.

Note: we could also fit the two models simultaneously with INLA, using two likelihood types (binomial and GP) in the same INLA model, with random effects that are in common or have correlation between the two submodels. However, for simplicity, we here fit the models for exceedance probability and for the size of the excess separately.

We generate a data frame with the response variable (called “y” generically) and covariates to be included in the model. To check for seasonal effects, we include the month as a covariate. We also generate a daily index for the AR1-model. For simplicity, we here treat the transition between years (August of year a-1 to July of year a) as “continuous” (alternatively, we would have to work with a “replicated” AR1-effect, with a separate AR1-series for each year).

data_df = data.frame(y = is_exceed, intercept = 1, month = months_ts, year = year_ts, 
    idx = 1:n_ts)

5.2.1 Step 1: Modeling the exceedance probability

To learn about the formulation of the AR1 model, we can consult the documentation. The following command (if you remove the comment sign “#”) opens a pdf file (or several pdf files if there are several related models):

# inla.doc('ar1')

We will use Penalized Complexity priors for the hyperparameters (standard deviation and autocorrelation); see the inla.doc() calls below for their documentation.

hyper.prec = list(prior = "pc.prec", param = c(1, 0.5), initial = 1)
# inla.doc('pccor1')
hyper.cor = list(prior = "pccor1", param = c(0.75, 0.5), initial = log(0.75/(1 - 
    0.75)))

We write the model formula with the f(…)-terms for the AR1 random effect.

form = y ~ -1 + intercept + f(idx, model = "ar1", hyper = list(theta1 = hyper.prec, 
    theta2 = hyper.cor))

We here have Bernoulli variables (i.e., binary variables corresponding to a binomial distribution with 1 trial), so we will set “Ntrials = rep(1, n_ts)” in the inla() command.

Remark (missing response data): If there were missing response values (NA), we could keep them during the estimation. INLA will automatically calculate the predictions (=fitted values) for the missing responses. By default, an identity link function is used for missing values (but this is not what we want here). By specifying “link=1” in the control.predictor-argument, we will use the correct link function (i.e., the link function of the first likelihood “family” - here, we have only one likelihood family “binomial”).
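
To make this concrete, here is a small optional sketch (using the objects defined above; it refits the model and therefore takes about as long as the estimation below): we set a few responses to NA and inspect the corresponding fitted values, which are then posterior predictions of the exceedance probability.

data_df_na = data_df
data_df_na$y[1:5] = NA  # pretend the first five responses are missing
fit_na = inla(form, data = data_df_na, family = "binomial", Ntrials = rep(1, n_ts), 
    control.predictor = list(compute = TRUE, link = 1), num.threads = 2)
fit_na$summary.fitted.values$mean[1:5]  # predicted exceedance probabilities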

Running the following estimation with INLA takes one to several minutes:

fit_p=inla(form,
         data = data_df,
         family = "binomial",
         Ntrials = rep(1, n_ts), 
         control.predictor = list(compute=TRUE, link=1),
         control.fixed = list(prec = 1),
         #control.inla=list(strategy="simplified.laplace",int.strategy="ccd"),
         verbose=FALSE,
         num.threads = 2 #number of CPU threads that INLA can run in parallel
)
summary(fit_p)
## 
## Call:
##    c("inla(formula = form, family = \"binomial\", data = data_df, 
##    Ntrials = rep(1, ", " n_ts), verbose = FALSE, control.predictor = 
##    list(compute = TRUE, ", " link = 1), control.fixed = list(prec = 
##    1), num.threads = 2)" ) 
## Time used:
##     Pre = 1.16, Running = 18.7, Post = 0.25, Total = 20.1 
## Fixed effects:
##             mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## intercept -3.924 0.271     -4.491   -3.911     -3.427 -3.885   0
## 
## Random effects:
##   Name     Model
##     idx AR1 model
## 
## Model hyperparameters:
##                    mean    sd 0.025quant 0.5quant 0.975quant  mode
## Precision for idx 0.266 0.060      0.167     0.26      0.402 0.248
## Rho for idx       0.888 0.024      0.835     0.89      0.929 0.894
## 
## Expected number of effective parameters(stdev): 257.49(34.80)
## Number of equivalent replicates : 12.02 
## 
## Marginal log-Likelihood:  -697.72 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

We estimate a posterior mean of the precision parameter of approximately 1/4, i.e., the variance of the latent Gaussian AR1 series is approximately 4. Moreover, we obtain a correlation coefficient Rho of 0.89 in the series, which indicates quite strong temporal correlation.
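
To report this on a more interpretable scale, we can transform the posterior marginal of the precision into that of the standard deviation of the AR1 series and summarize it (a short sketch; the hyperparameter name is taken from the summary above):

marg_prec = fit_p$marginals.hyperpar[["Precision for idx"]]
marg_sd = inla.tmarginal(function(x) 1/sqrt(x), marg_prec)
inla.zmarginal(marg_sd)  # posterior summary of the AR1 standard deviation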

We first plot the estimated latent AR1 effect. Then, we plot the data for a single year (2019) along with the fitted exceedance probabilities.

plot_random_effect(fit_p$summary.random$idx)

id_year = which(data_df$year == 2019)
plot((1:n_ts)[id_year], data_df$y[id_year], pch = 4, xlab = "Day", ylab = "Exceedance probability")
lines((1:n_ts)[id_year], fit_p$summary.fitted.values$mean[id_year], col = "blue", 
    lwd = 2)

5.2.2 Step 2: Modeling the size of the excesses

Here we keep only the observations with a positive excess for fitting the generalized Pareto distribution. Then, we write the INLA formula as before (it is the same formula here).

idx2keep = which(!is.na(excess) & excess > 0)
n.exc = length(idx2keep)
n.exc
## [1] 212
data_df = data.frame(y = excess[idx2keep], intercept = 1, month = months_ts[idx2keep], 
    idx = idx2keep)
# hyper.prec = list(prior = 'pc.prec', param=c(1,.5), initial=1)
form = y ~ -1 + intercept + f(idx, model = "ar1", hyper = list(theta1 = hyper.prec, 
    theta2 = hyper.cor))

For the GPD in INLA, the log link function is designed to model a certain (fixed) quantile level of the generalized Pareto distribution. We here use the median, and we will specify this level in the control.family argument of inla(…). You can consult inla.doc(“loggamma”) to see how the approximate PC prior of the shape parameter, corresponding to an exponential distribution (i.e., a gamma distribution with shape 1), can be implemented. Since temperatures are usually not heavy-tailed (i.e., xi is not strongly positive), we here set the penalization rate to a relatively high value of 10. Running the estimation with INLA then takes several seconds.

q_gpd = 0.5
hyper.shape = list(prior = "loggamma", param = c(1, 10), initial = log(0.05))
fit_gp = inla(form, data = data_df, family = "gp", control.family = list(list(control.link = list(quantile = q_gpd), 
    hyper = list(theta = hyper.shape))), control.compute = list(waic = TRUE), 
    verbose = FALSE, num.threads = 2)
summary(fit_gp)
## 
## Call:
##    c("inla(formula = form, family = \"gp\", data = data_df, verbose = 
##    FALSE, ", " control.compute = list(waic = TRUE), control.family = 
##    list(list(control.link = list(quantile = q_gpd), ", " hyper = 
##    list(theta = hyper.shape))), num.threads = 2)" ) 
## Time used:
##     Pre = 1.1, Running = 0.39, Post = 0.08, Total = 1.57 
## Fixed effects:
##            mean    sd 0.025quant 0.5quant 0.975quant  mode kld
## intercept -0.05 0.109     -0.285   -0.049      0.172 -0.05   0
## 
## Random effects:
##   Name     Model
##     idx AR1 model
## 
## Model hyperparameters:
##                                                   mean       sd 0.025quant
## Shape parameter for the genPareto observations   0.012    0.012      0.001
## Precision for idx                              965.504 5138.520     23.662
## Rho for idx                                      0.635    0.482     -0.700
##                                                0.5quant 0.975quant   mode
## Shape parameter for the genPareto observations    0.009      0.044  0.002
## Precision for idx                               218.938   6287.211 52.507
## Rho for idx                                       0.869      1.000  1.000
## 
## Expected number of effective parameters(stdev): 3.03(2.01)
## Number of equivalent replicates : 69.94 
## 
## Watanabe-Akaike information criterion (WAIC) ...: 563.82
## Effective number of parameters .................: 1.11
## 
## Marginal log-Likelihood:  -290.56 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

Due to the high precision of almost 1000 (i.e., a standard deviation of approximately 1/sqrt(1000) ≈ 0.03) estimated for the AR1 series, the AR1 random effect is not important in the estimated model. We refit the model without this effect and compare the WAIC values.

form = y ~ -1 + intercept 
fit_gpd = inla(form,
         data = data_df,
         family = "gp",
         control.family = list(list(control.link = list(quantile = q_gpd), hyper = list(theta = hyper.shape))),
         control.compute = list(waic = TRUE),
         #control.inla=list(strategy="simplified.laplace",int.strategy="ccd"),
         verbose = FALSE,
         num.threads = 2
)
summary(fit_gpd)
## 
## Call:
##    c("inla(formula = form, family = \"gp\", data = data_df, verbose = 
##    FALSE, ", " control.compute = list(waic = TRUE), control.family = 
##    list(list(control.link = list(quantile = q_gpd), ", " hyper = 
##    list(theta = hyper.shape))), num.threads = 2)" ) 
## Time used:
##     Pre = 0.822, Running = 0.171, Post = 0.071, Total = 1.06 
## Fixed effects:
##             mean    sd 0.025quant 0.5quant 0.975quant   mode kld
## intercept -0.048 0.069     -0.181   -0.049      0.091 -0.051   0
## 
## Model hyperparameters:
##                                                 mean    sd 0.025quant
## Shape parameter for the genPareto observations 0.013 0.013       0.00
##                                                0.5quant 0.975quant mode
## Shape parameter for the genPareto observations    0.009      0.048 0.00
## 
## Expected number of effective parameters(stdev): 1.00(0.00)
## Number of equivalent replicates : 211.73 
## 
## Watanabe-Akaike information criterion (WAIC) ...: 562.20
## Effective number of parameters .................: 0.374
## 
## Marginal log-Likelihood:  -287.83 
## Posterior marginals for the linear predictor and
##  the fitted values are computed

The WAIC value is essentially the same (even slightly lower and therefore better), so we could decide to keep the simpler model here.
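
As a small follow-up sketch: since the log link models the quantile at level q_gpd = 0.5, exponentiating the intercept gives an estimate of the median excess, which we can put back on the temperature scale by adding the threshold.

median_excess = exp(fit_gpd$summary.fixed["intercept", "0.5quant"])
median_excess  # estimated median excess above the threshold
thresh + median_excess  # back on the original temperature scale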

6 Part IV: Spatial modeling with INLA-SPDE

We consider a relatively simple model for a spatial trend. We use a Gaussian SPDE field as prior for the trend surface.

We first extract the exceedances above the global 95%-quantile from the Colorado precipitation data.

prob = 0.95
thresh = quantile(COprcp$prcp[COprcp$prcp > 0], prob, na.rm = TRUE)
sum(as.numeric(COprcp$prcp > thresh), na.rm = TRUE)
## [1] 5894
CO_exc = COprcp[COprcp$prcp > thresh, ]
CO_exc$excess = CO_exc$prcp - thresh

We have to create a mesh (i.e., triangulation of space). The Gaussian variables will be defined on the mesh nodes and are then interpolated linearly between the nodes. We here use a spatial distance unit of 1km. We also define a vector of indices for the nodes of the mesh.

bound = inla.nonconvex.hull(coor_metric, convex = -0.05)
bound2 = inla.nonconvex.hull(coor_metric, convex = -0.25)
plot(rbind(bound2$loc, bound2$loc[1, ]), type = "l", asp = 1, xlab = "x (km)", 
    ylab = "y (km)")
lines(rbind(bound$loc, bound$loc[1, ]))
points(coor_metric, pch = 19, cex = 0.5)

# Define minimum edge length as 10km and maximum edge length as 25km (within
# the study region) and 100km (in the extension zone).
mesh = inla.mesh.2d(loc = coor_metric, boundary = list(bound, bound2), cutoff = 10, 
    max.edge = c(25, 100), min.angle = c(21, 21))
cat("Number of mesh nodes:", mesh$n, "\n")
## Number of mesh nodes: 394
plot(mesh, asp = 1)  #looks OK

idx.spatial = inla.spde.make.index("spatial", n.spde = mesh$n)

We define an SPDE object with PC priors. PC priors are specified through the probability of deviating from a simpler baseline model. For the variance parameter, the baseline corresponds to \(\sigma^2=0\). For the spatial range parameter, the baseline has infinite range, since this corresponds to perfect dependence (hence less variability). Below, we set the probability of the range being \(<\) 10 km to 5%, and that of the standard deviation being larger than 10 to 50%. See the documentation on PC priors for more details.

myspde = inla.spde2.pcmatern(mesh = mesh, alpha = 2, prior.range = c(10, 0.05), 
    prior.sigma = c(10, 0.5))

Next, we need an observation matrix to make the connection between mesh nodes and observation sites.

A = inla.spde.make.A(mesh, loc = coor_metric, index = CO_exc$meta_row)

Next, we create a data frame with the covariates used as linear effects, here only the intercept term 1, since in R-INLA it is more convenient to include the intercept explicitly.

covar.df = data.frame(intercept = rep(1, nrow(CO_exc)))
mystack = inla.stack(data = list(excess = CO_exc$excess), A = list(A, 1), effects = list(idx.spatial, 
    covar.df))
form = excess ~ -1 + intercept + f(spatial, model = myspde)

Note: since we handle the intercept explicitly, we have to include “-1”. Recall that we have given the name “spatial” to the index of the spatial effect.

Let’s run the inla function. To prevent inla from using too much memory and CPU, we can fix the maximum number of parallel threads to 2.

The following INLA estimation takes approximately 1 minute.

num.thr = 2  #number of threads that INLA is allowed to run in parallel
q_gpd = 0.5
hyper.shape = list(prior = "loggamma", param = c(1, 10), initial = log(0.1))
fit = inla(form, data = inla.stack.data(mystack), family = "gp", control.family = list(list(control.link = list(quantile = q_gpd), 
    hyper = list(theta = hyper.shape))), control.predictor = list(A = inla.stack.A(mystack), 
    compute = FALSE), verbose = FALSE, num.threads = num.thr)
summary(fit)
## 
## Call:
##    c("inla(formula = form, family = \"gp\", data = 
##    inla.stack.data(mystack), ", " verbose = FALSE, control.predictor 
##    = list(A = inla.stack.A(mystack), ", " compute = FALSE), 
##    control.family = list(list(control.link = list(quantile = q_gpd), 
##    ", " hyper = list(theta = hyper.shape))), num.threads = num.thr)" 
##    ) 
## Time used:
##     Pre = 1.5, Running = 5.92, Post = 0.0791, Total = 7.49 
## Fixed effects:
##            mean    sd 0.025quant 0.5quant 0.975quant  mode kld
## intercept 1.745 0.672      0.286    1.754      3.169 1.765   0
## 
## Random effects:
##   Name     Model
##     spatial SPDE2 model
## 
## Model hyperparameters:
##                                                  mean      sd 0.025quant
## Shape parameter for the genPareto observations   0.12   0.014      0.093
## Range for spatial                              362.22 247.052    120.461
## Stdev for spatial                                0.41   0.167      0.202
##                                                0.5quant 0.975quant    mode
## Shape parameter for the genPareto observations    0.119       0.15   0.119
## Range for spatial                               291.256    1018.92 205.408
## Stdev for spatial                                 0.371       0.84   0.306
## 
## Expected number of effective parameters(stdev): 25.37(4.13)
## Number of equivalent replicates : 232.31 
## 
## Marginal log-Likelihood:  -19100.84

The range of the spatial effect is approximately 360km, and its standard deviation is around 0.4.

We can now plot the fitted spatial trend. Since standard plotting functions in R work on regular grids, we have to interpolate the estimated trend to such a grid, using the inla.mesh.project mechanism.

proj_grid = inla.mesh.projector(mesh, xlim = range(bound$loc[, 1]), ylim = range(bound$loc[, 
    2]), dims = c(200, 200))
est_on_mesh = fit$summary.random$spatial$mean
image.plot(proj_grid$x, proj_grid$y, inla.mesh.project(proj_grid, field = est_on_mesh), 
    asp = 1, xlab = "x (km)", ylab = "y (km)", col = viridis(16))
points(coor_metric, pch = 19, cex = 0.25)
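
On the response scale, an approximate map of the median excess can be obtained by adding the posterior mean of the intercept to the projected spatial effect and applying the exponential (the log link models the median since q_gpd = 0.5). This is only a rough sketch since it combines posterior means and ignores their posterior dependence.

eta_grid = fit$summary.fixed["intercept", "mean"] + inla.mesh.project(proj_grid, 
    field = est_on_mesh)
image.plot(proj_grid$x, proj_grid$y, exp(eta_grid), asp = 1, xlab = "x (km)", 
    ylab = "y (km)", col = viridis(16))
points(coor_metric, pch = 19, cex = 0.25)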

The posterior standard deviation (=uncertainty) of the spatial effect can also be mapped.

sd_on_mesh = fit$summary.random$spatial$sd
image.plot(proj_grid$x, proj_grid$y, inla.mesh.project(proj_grid, field = sd_on_mesh), 
    asp = 1, xlab = "x (km)", ylab = "y (km)", col = viridis(16))
points(coor_metric, pch = 19, cex = 0.25)
lines(rbind(bound$loc, bound$loc[1, ]), lwd = 2)

Note: as expected, uncertainty is usually lower where we have many observation sites.