32  混合效应模型

library(nlme)         # 线性混合效应模型
library(GLMMadaptive) # 广义线性混合效应模型
library(mgcv)         # 广义线性/可加混合效应模型

library(splines)   # 样条
library(cmdstanr)  # 编译采样
library(ggplot2)   # 作图
library(bayesplot) # 后验分布
library(loo)       # LOO-CV

最好找 3 个真实数据集,其中数据集 sleepstudycbpp 均来自 lme4 包。

32.1 线性混合效应模型

32.1.1 频率派

32.1.1.1 nlme

data(sleepstudy, package = "lme4")
library(nlme)
fm1 <- lme(Reaction ~ Days, random = ~ Days | Subject, data = sleepstudy)
summary(fm1)
Linear mixed-effects model fit by REML
  Data: sleepstudy 
       AIC      BIC    logLik
  1755.628 1774.719 -871.8141

Random effects:
 Formula: ~Days | Subject
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 24.740241 (Intr)
Days         5.922103 0.066 
Residual    25.591843       

Fixed effects:  Reaction ~ Days 
                Value Std.Error  DF  t-value p-value
(Intercept) 251.40510  6.824516 161 36.83853       0
Days         10.46729  1.545783 161  6.77151       0
 Correlation: 
     (Intr)
Days -0.138

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-3.95355735 -0.46339976  0.02311783  0.46339621  5.17925089 

Number of Observations: 180
Number of Groups: 18 

32.1.1.2 lme4

或使用 lme4 包,可以得到同样的结果

fm2 <- lme4::lmer(Reaction ~ Days + (Days|Subject), data = sleepstudy)
summary(fm2)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
   Data: sleepstudy

REML criterion at convergence: 1743.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9536 -0.4634  0.0231  0.4634  5.1793 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.10   24.741       
          Days         35.07    5.922   0.07
 Residual             654.94   25.592       
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825  36.838
Days          10.467      1.546   6.771

Correlation of Fixed Effects:
     (Intr)
Days -0.138

32.1.2 贝叶斯派

32.1.2.1 cmdstanr

library(cmdstanr)

32.1.2.2 brms

bm <- brms::brm(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(bm)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: Reaction ~ Days + (Days | Subject) 
   Data: sleepstudy (Number of observations: 180) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~Subject (Number of levels: 18) 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)          27.01      6.91    15.27    42.82 1.00     1655     2080
sd(Days)                6.54      1.51     4.15     9.93 1.00     1359     1917
cor(Intercept,Days)     0.08      0.30    -0.49     0.67 1.00      972     1465

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   251.17      7.65   235.76   266.45 1.00     1958     2029
Days         10.37      1.72     7.01    13.81 1.00     1351     1999

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    25.93      1.59    23.07    29.29 1.00     3005     2780

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
brms::loo(bm)
Computed from 4000 by 180 log-likelihood matrix

         Estimate   SE
elpd_loo   -861.7 22.6
p_loo        34.6  8.7
looic      1723.4 45.2
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     170   94.4%   592       
 (0.5, 0.7]   (ok)         7    3.9%   571       
   (0.7, 1]   (bad)        1    0.6%   46        
   (1, Inf)   (very bad)   2    1.1%   7         
See help('pareto-k-diagnostic') for details.
Warning message:
Found 3 observations with a pareto_k > 0.7 in model 'bm'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations.  
# Display Conditional Effects of Predictors
brms::conditional_effects(bm, effects = "Days")
# Non-Linear Hypothesis Testing
brms::hypothesis(bm, hypothesis = "Days > 10")

32.2 广义线性混合效应模型

二项分布

32.2.1 频率派

32.2.1.1 GLMMadaptive

data(cbpp, package = "lme4")
library(GLMMadaptive)
fgm1 <- mixed_model(
  fixed = cbind(incidence, size - incidence) ~ period,
  random = ~ 1 | herd, data = cbpp, family = binomial(link = "logit")
)
summary(fgm1)

Call:
mixed_model(fixed = cbind(incidence, size - incidence) ~ period, 
    random = ~1 | herd, data = cbpp, family = binomial(link = "logit"))

Data Descriptives:
Number of Observations: 56
Number of Groups: 15 

Model:
 family: binomial
 link: logit 

Fit statistics:
   log.Lik      AIC     BIC
 -91.98337 193.9667 197.507

Random effects covariance matrix:
               StdDev
(Intercept) 0.6475934

Fixed effects:
            Estimate Std.Err z-value    p-value
(Intercept)  -1.3995  0.2335 -5.9923    < 1e-04
period2      -0.9914  0.3068 -3.2316 0.00123091
period3      -1.1278  0.3268 -3.4513 0.00055793
period4      -1.5795  0.4276 -3.6937 0.00022101

Integration:
method: adaptive Gauss-Hermite quadrature rule
quadrature points: 11

Optimization:
method: EM
converged: TRUE 

32.2.1.2 lme4

或使用 lme4 包,可以得到同样的结果

fgm2 <- lme4::glmer(
  cbind(incidence, size - incidence) ~ period + (1 | herd),
  family = binomial("logit"), data = cbpp
)
summary(fgm2)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: cbind(incidence, size - incidence) ~ period + (1 | herd)
   Data: cbpp

     AIC      BIC   logLik deviance df.resid 
   194.1    204.2    -92.0    184.1       51 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3816 -0.7889 -0.2026  0.5142  2.8791 

Random effects:
 Groups Name        Variance Std.Dev.
 herd   (Intercept) 0.4123   0.6421  
Number of obs: 56, groups:  herd, 15

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.3983     0.2312  -6.048 1.47e-09 ***
period2      -0.9919     0.3032  -3.272 0.001068 ** 
period3      -1.1282     0.3228  -3.495 0.000474 ***
period4      -1.5797     0.4220  -3.743 0.000182 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
        (Intr) perid2 perid3
period2 -0.363              
period3 -0.340  0.280       
period4 -0.260  0.213  0.198

32.2.1.3 mgcv

或使用 mgcv 包,可以得到近似的结果。随机效应部分可以看作可加的惩罚项

library(mgcv)
fgm3 <- gam(
  cbind(incidence, size - incidence) ~ period + s(herd, bs = "re"),
  data = cbpp, family = binomial(link = "logit"), method = "REML"
)
summary(fgm3)

Family: binomial 
Link function: logit 

Formula:
cbind(incidence, size - incidence) ~ period + s(herd, bs = "re")

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.3670     0.2358  -5.799 6.69e-09 ***
period2      -0.9693     0.3040  -3.189 0.001428 ** 
period3      -1.1045     0.3241  -3.407 0.000656 ***
period4      -1.5519     0.4251  -3.651 0.000261 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
         edf Ref.df Chi.sq  p-value    
s(herd) 9.66     14  32.03 3.21e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.515   Deviance explained =   53%
-REML = 93.199  Scale est. = 1         n = 56

下面给出随机效应的标准差的估计及其上下限,和前面 GLMMadaptive 包和 lme4 包给出的结果也是接近的。

gam.vcomp(fgm3)

Standard deviations and 0.95 confidence intervals:

          std.dev     lower    upper
s(herd) 0.6818673 0.3953145 1.176135

Rank: 1/1

32.2.2 贝叶斯派

32.2.2.1 cmdstanr

library(cmdstanr)

32.2.2.2 brms

bgm <- brms::brm(
  incidence | trials(size) ~ period + (1 | herd),
  family = binomial("logit"), data = cbpp
)

32.3 广义可加混合效应模型

从线性到可加,意味着从线性到非线性,可加模型容纳非线性的成分,比如高斯过程、样条。

32.3.1 频率派

32.3.1.1 mgcv (gam)

# 加载数据
rongelap <- readRDS(file = "data/rongelap.rds")
rongelap_coastline <- readRDS(file = "data/rongelap_coastline.rds")

近似高斯过程、高斯过程的核函数,mgcv 包的函数 s() 帮助文档参数的说明,默认值是梅隆型相关函数及默认的范围参数,作者自己定义了一套符号约定

library(nlme)
library(mgcv)
fit_rongelap_gam <- gam(
  counts ~ s(cX, cY, bs = "gp", k = 50), offset = log(time), 
  data = rongelap, family = poisson(link = "log")
)
# 模型输出
summary(fit_rongelap_gam)

Family: poisson 
Link function: log 

Formula:
counts ~ s(cX, cY, bs = "gp", k = 50)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.976815   0.001642    1204   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
           edf Ref.df Chi.sq p-value    
s(cX,cY) 48.98     49  34030  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.876   Deviance explained = 60.7%
UBRE = 153.78  Scale est. = 1         n = 157
# 随机效应
gam.vcomp(fit_rongelap_gam)
s(cX,cY) 
2543.376 
# 球型相关函数及范围参数为 0.5
fit_rongelap_gam <- gam(
  counts ~ s(cX, cY, bs = "gp", k = 50, m = c(1, .5)),
  offset = log(time), data = rongelap, family = poisson(link = "log")
)

参数 m 接受一个向量, m[1] 取值为 1 至 5,分别代表球型 spherical, 幂指数 power exponential 和梅隆型 Matern with \(\kappa\) = 1.5, 2.5 or 3.5 等 5 种相关/核函数。

library(sf)
Linking to GEOS 3.11.1, GDAL 3.6.4, PROJ 9.1.1; sf_use_s2() is TRUE
library(abind)
library(stars)
# 类型转化
rongelap_sf <- st_as_sf(rongelap, coords = c("cX", "cY"), dim = "XY")
rongelap_coastline_sf <- st_as_sf(rongelap_coastline, coords = c("cX", "cY"), dim = "XY")
rongelap_coastline_sfp <- st_cast(st_combine(st_geometry(rongelap_coastline_sf)), "POLYGON")
# 添加缓冲区
rongelap_coastline_buffer <- st_buffer(rongelap_coastline_sfp, dist = 50)
# 构造带边界约束的网格
rongelap_coastline_grid <- st_make_grid(rongelap_coastline_buffer, n = c(150, 75))
# 将 sfc 类型转化为 sf 类型
rongelap_coastline_grid <- st_as_sf(rongelap_coastline_grid)
rongelap_coastline_buffer <- st_as_sf(rongelap_coastline_buffer)
rongelap_grid <- rongelap_coastline_grid[rongelap_coastline_buffer, op = st_intersects]
# 计算网格中心点坐标
rongelap_grid_centroid <- st_centroid(rongelap_grid)
# 共计 1612 个预测点
rongelap_grid_df <- as.data.frame(st_coordinates(rongelap_grid_centroid))
colnames(rongelap_grid_df) <- c("cX", "cY")

# 预测
rongelap_grid_df$ypred <- as.vector(predict(fit_rongelap_gam, newdata = rongelap_grid_df, type = "response")) 
# 整理预测数据
rongelap_grid_sf <- st_as_sf(rongelap_grid_df, coords = c("cX", "cY"), dim = "XY")
rongelap_grid_stars <- st_rasterize(rongelap_grid_sf, nx = 150, ny = 75)
rongelap_stars <- st_crop(x = rongelap_grid_stars, y = rongelap_coastline_sfp)

核辐射强度的空间分布

代码
library(ggplot2)
ggplot() +
  geom_stars(data = rongelap_stars, aes(fill = ypred), na.action = na.omit) +
  geom_sf(data = rongelap_coastline_sfp, fill = NA, color = "gray50", linewidth = 0.5) +
  scale_fill_viridis_c(option = "C") +
  theme_bw() +
  labs(x = "横坐标(米)", y = "纵坐标(米)", fill = "预测值")

图 32.1: 核辐射强度的预测分布

32.3.1.2 mgcv (INLA)

mgcv 包的函数 ginla() 实现简化版 INLA

rongelap_gam <- gam(
  counts ~ s(cX, cY, bs = "gp", k = 50), offset = log(time), 
  data = rongelap, family = poisson(link = "log"), fit = FALSE
)
# 简化版 INLA
fit_rongelap_ginla <- ginla(G = rongelap_gam)

其中, \(k = 50\) 表示 50 个参数

plot(
  fit_rongelap_ginla$beta[1, ], fit_rongelap_ginla$density[1, ],
  type = "l", xlab = "intercept", ylab = "density"
)

32.3.2 贝叶斯派

32.3.2.1 brms

参考 brms 包的函数 gp() / s()mgcv 包的函数 gamm() 的帮助文档,先用模拟数据检测

library(cmdstanr)

mgcv 包的函数 gamSim() 专门用于模拟数据。

sim_dat <- mgcv::gamSim(eg = 4, dist = "normal", n = 90, scale = 2)
Factor `by' variable example

参数 \(n=90\) 设置样本量,参数 dist = "normal" 设置变量分布,参数 eg 设置模拟数据的生成模型,取值如下

  1. Gu and Wahba 4 univariate term example.
  2. A smooth function of 2 variables.
  3. Example with continuous by variable.
  4. Example with factor by variable.
  5. An additive example plus a factor variable.
  6. Additive + random effect.
  7. As 1 but with correlated covariates.
str(sim_dat)
'data.frame':   90 obs. of  8 variables:
 $ y  : num  3.993 -1.332 -2.834 -0.915 0.848 ...
 $ x0 : num  0.304 0.184 0.806 0.578 0.936 ...
 $ x1 : num  0.0687 0.6264 0.6771 0.1502 0.2628 ...
 $ x2 : num  0.6083 0.3385 0.0791 0.8044 0.5966 ...
 $ fac: Factor w/ 3 levels "1","2","3": 3 2 2 2 2 2 3 1 1 3 ...
 $ f1 : num  1.885 1.748 0.492 1.153 1.909 ...
 $ f2 : num  -0.383 -1.791 -2.587 1.238 -0.461 ...
 $ f3 : num  3.24 6.34 2.17 1.02 3.18 ...

模拟分组的高斯过程

# 按变量 fac 的不同水平
fit_sim_dat <- brms::brm(y ~ gp(x2, by = fac), data = sim_dat, chains = 2)
summary(fit_sim_dat)
plot(brms::conditional_effects(fit_sim_dat), points = TRUE)

32.3.2.2 INLA

INLA 近似贝叶斯计算

library(INLA)
library(splancs)
# 构造非凸的边界
boundary <- list(
  inla.nonconvex.hull(
    points = as.matrix(rongelap_coastline[,c("cX", "cY")]), 
    convex = 100, concave = 150, resolution = 100),
  inla.nonconvex.hull(
    points = as.matrix(rongelap_coastline[,c("cX", "cY")]), 
    convex = 200, concave = 200, resolution = 200)
)
# 构造非凸的网格
mesh <- inla.mesh.2d(
  loc = as.matrix(rongelap[, c("cX", "cY")]), offset = 100,
  max.edge = c(300, 600), boundary = boundary
)
spde <- inla.spde2.matern(mesh = mesh, alpha = 3/2, constr = TRUE)
indexs <- inla.spde.make.index(name = "s", n.spde = spde$n.spde)
lengths(indexs)

A <- inla.spde.make.A(mesh = mesh, loc = as.matrix(rongelap[, c("cX", "cY")]) )
coop <- as.matrix(rongelap_grid_df[, c("cX", "cY")])
Ap <- inla.spde.make.A(mesh = mesh, loc = coop)
dim(Ap)

# 在采样点的位置上估计 estimation stk.e
stk.e <- inla.stack(
  tag = "est",
  data = list(y = rongelap$counts, E = rongelap$time),
  A = A,
  effects = list(s = indexs)
)

# 在新生成的位置上预测 prediction stk.p
stk.p <- inla.stack(
  tag = "pred",
  data = list(y = NA, E = NA),
  A = Ap,
  effects = list(s = indexs)
)

# 合并数据 stk.full has stk.e and stk.p
stk.full <- inla.stack(stk.e, stk.p)

formula <- y ~ f(s, model = spde)

res <- inla(formula,
  data = inla.stack.data(stk.full),
  E = E, # E 已知漂移项
  control.family = list(link = "log"),
  control.predictor = list(
    compute = TRUE, 
    link = 1, # 与 control.family 联系函数相同
    A = inla.stack.A(stk.full)
  ),
  control.compute = list(
    cpo = TRUE, 
    waic = TRUE, # WAIC 统计量 通用信息准则
    dic = TRUE   # DIC 统计量 偏差信息准则
  ),
  family = "poisson"
)

summary(res)

res$summary.fixed
res$summary.hyperpar

index <- inla.stack.index(stk.full, tag = "pred")$data

pred_mean <- res$summary.fitted.values[index, "mean"]
pred_ll <- res$summary.fitted.values[index, "0.025quant"]
pred_ul <- res$summary.fitted.values[index, "0.975quant"]

dpm <- rbind(
  data.frame(
    cX = rongelap_grid_df[, 1], cY = rongelap_grid_df[, 2],
    value = pred_mean, variable = "pred_mean"
  ),
  data.frame(
    cX = rongelap_grid_df[, 1], cY = rongelap_grid_df[, 2],
    value = pred_ll, variable = "pred_ll"
  ),
  data.frame(
    cX = rongelap_grid_df[, 1], cY = rongelap_grid_df[, 2],
    value = pred_ul, variable = "pred_ul"
  )
)

dpm_sf <- st_as_sf(dpm, coords = c("cX", "cY"), dim = "XY")

ggplot(dpm_sf[dpm_sf$variable =="pred_mean", ]) +
  geom_sf(aes(color = value)) +
  scale_color_viridis_c(option = "C", name = expression(lambda)) +
  theme_bw()

32.4 总结

通过对频率派和贝叶斯派方法的比较,发现一些有意思的结果。

32.5 习题

  1. MASS 包的地形数据集 topo 为例,高斯过程回归模型

    data(topo, package = "MASS")
    bgamm1 <- brms::brm(
      z ~ gp(x, y, cov = "exp_quad"), 
      # family = Gamma(link = "inverse"),
      data = topo, chains = 2, seed = 20232023,
      warmup = 1000, iter = 2000, thin = 1, refresh = 0,
      control = list(adapt_delta = 0.99)
    )
    summary(bgamm1)
    # 高斯过程近似计算
    bgamm2 <- brms::brm(
      z ~ gp(x, y, cov = "exp_quad", c = 5 / 4, k = 50),
      data = topo, chains = 2, seed = 20232023,
      warmup = 1000, iter = 2000, thin = 1, refresh = 0,
      control = list(adapt_delta = 0.99)
    )
    summary(bgamm2)

    brms 包的函数 gp() 的参数 \(k\) 表示近似高斯过程 GP 所用的基函数的数目

    me3 <- brms::conditional_effects(bgamm1, ndraws = 200, spaghetti = TRUE)
    plot(me3, ask = FALSE, points = TRUE)