Suggest GAM smoothing parameters for epidemic curve models

A helper function to guide selection of the k_smooth, k_trt, k_env, and gamma parameters used by functional_curves. It can infer the effective replication from a data frame, or accept scalar values directly when the data are not yet available.

For sparse epidemic data, use rule = "minimum" so that the k values are based on the least-replicated treatment-by-environment combination, which guards against over-fitting.

Usage

suggest_k(
  data = NULL,
  time = NULL,
  treatment = NULL,
  environment = NULL,
  n_time = NULL,
  n_env = NULL,
  rule = c("minimum", "median"),
  smoothness = c("conservative", "moderate", "flexible")
)

Arguments

data: Optional data frame. If supplied, time, treatment, and optionally environment identify the columns used to count unique time points per treatment-by-environment combination and unique environments per treatment.
time: Unquoted column name for the time variable, or a character string naming the column.
treatment: Unquoted column name for the treatment / cultivar variable, or a character string naming the column.
environment: Optional unquoted column name for the environment variable, or a character string naming the column.
n_time: Integer. Number of unique time points to use directly (ignored when data is supplied).
n_env: Integer. Number of unique environments to use directly (ignored when data is supplied, or when there is no environment variable).
rule: Character string. How to summarise the replication counts (time points per treatment-by-environment combination, environments per treatment): "minimum" (default, conservative) or "median".
smoothness: Character string controlling how liberal the recommendations are: "conservative" (default), "moderate", or "flexible".
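The vector defaults shown under Usage for rule and smoothness follow a common R idiom in which the first element acts as the default. A minimal sketch of that idiom, assuming suggest_k resolves these arguments with match.arg() (the source does not confirm this); resolve_rule is a hypothetical stand-in, not a package function:

```r
# Hypothetical stand-in for how suggest_k() might resolve `rule`.
# Assumes the conventional match.arg() idiom; not taken from the package.
resolve_rule <- function(rule = c("minimum", "median")) {
  match.arg(rule)
}

resolve_rule()          # argument omitted: the first choice is used
resolve_rule("median")  # explicit choice
```

Under this idiom, a value outside the listed choices raises an error rather than being silently ignored.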

Value

A named list with the following elements:

time_summary: Named numeric vector with minimum, median, and maximum unique time points per treatment-by-environment combination.
environment_summary: Named numeric vector with minimum, median, and maximum unique environments per treatment (or NULL when no environment variable is given).
effective_n_time: The effective number of time points chosen by rule.
effective_n_env: The effective number of environments chosen by rule, or NULL when no environment variable is given.
k_smooth: Recommended basis dimension for the global time smooth.
k_trt: Recommended basis dimension for the treatment-specific smooth.
k_env: Recommended basis dimension for the environment random effect.
gamma: Recommended penalisation multiplier.
message: A short interpretation message.
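Because the result is a plain named list, the recommended values can be extracted by name. The sketch below does not call the package; it uses a hand-built stand-in list whose values are copied from the printed output of suggest_k(n_time = 5, n_env = 8) on this page, and the names res and gam_args are illustrative only:

```r
# Stand-in for a suggest_k() result, hand-built from the printed output of
# suggest_k(n_time = 5, n_env = 8). Only elements whose values appear in
# that output are included; the summaries and message are omitted.
res <- list(
  effective_n_time = 5,
  effective_n_env  = 8,
  k_smooth = 4,
  k_trt    = 3,
  k_env    = 3,
  gamma    = 1.8
)

# Keep only the four tuning parameters named in the suggested call.
gam_args <- res[c("k_smooth", "k_trt", "k_env", "gamma")]
str(gam_args)
```

With the package loaded, a list like gam_args could be combined with the remaining functional_curves arguments through do.call().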

Details

The function computes, for every treatment-by-environment combination, the number of unique time points at which observations are available. It then summarises these counts using the chosen rule to obtain effective_n_time. Similarly, it computes the number of unique environments per treatment and summarises those counts with the same rule to obtain effective_n_env.
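The counting and summarising steps can be sketched in base R. This is an illustrative reconstruction under stated assumptions, not the package's implementation: the unbalanced data frame and the helper summarise_counts are hypothetical, chosen so that "minimum" and "median" give different answers.

```r
# Hypothetical unbalanced design: cultivar B was scored only 3 times in E2.
df <- rbind(
  expand.grid(time = 1:6, cultivar = c("A", "B"), env = "E1",
              stringsAsFactors = FALSE),
  expand.grid(time = 1:6, cultivar = "A", env = "E2",
              stringsAsFactors = FALSE),
  expand.grid(time = 1:3, cultivar = "B", env = "E2",
              stringsAsFactors = FALSE)
)

# Unique time points per treatment-by-environment combination.
combo <- interaction(df$cultivar, df$env, drop = TRUE)
time_counts <- as.vector(tapply(df$time, combo,
                                function(x) length(unique(x))))

# Unique environments per treatment.
env_counts <- as.vector(tapply(df$env, df$cultivar,
                               function(x) length(unique(x))))

# Hypothetical helper: minimum, median, and maximum of a count vector.
summarise_counts <- function(x) {
  c(minimum = min(x), median = median(x), maximum = max(x))
}

time_summary        <- summarise_counts(time_counts)
environment_summary <- summarise_counts(env_counts)

# Pick the effective values according to the rule.
rule <- "minimum"
effective_n_time <- time_summary[[rule]]
effective_n_env  <- environment_summary[[rule]]
```

In this design the sparse combination has 3 unique time points while the others have 6, so rule = "minimum" yields an effective_n_time of 3, whereas rule = "median" would yield 6 and ignore the sparse combination.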

Recommended values for the k parameters and gamma follow these heuristics:

k_smooth: basis dimension of the global smooth over time; capped at effective_n_time - 1.
k_trt: basis dimension of the treatment-specific smooth; kept below k_smooth.
k_env: basis dimension of the environment random effect; capped at effective_n_env.
gamma: penalty multiplier; higher values encourage smoother fits.
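These caps can also be used to sanity-check hand-picked values. The sketch below reads the caps as k_smooth <= effective_n_time - 1, k_trt < k_smooth, and k_env <= effective_n_env, which is an interpretation consistent with the example outputs on this page rather than a statement of the package's exact rules. check_k is a hypothetical helper, not part of the package.

```r
# Hypothetical helper, not part of the package: tests candidate values
# against the caps as interpreted above and returns one logical per check.
check_k <- function(k_smooth, k_trt, k_env = NULL,
                    effective_n_time, effective_n_env = NULL) {
  ok <- c(
    k_smooth = k_smooth <= effective_n_time - 1,
    k_trt    = k_trt < k_smooth
  )
  # The environment cap only applies when an environment term is present.
  if (!is.null(k_env) && !is.null(effective_n_env)) {
    ok <- c(ok, k_env = k_env <= effective_n_env)
  }
  ok
}

# Values printed in the two examples on this page.
check_k(k_smooth = 4, k_trt = 3, k_env = 3,
        effective_n_time = 5, effective_n_env = 8)
check_k(k_smooth = 5, k_trt = 4, k_env = 2,
        effective_n_time = 6, effective_n_env = 2)
```

Any element that comes back FALSE points to the value that exceeds its cap for the given replication.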

Examples

# Using explicit values
suggest_k(n_time = 5, n_env = 8)
#> -- suggest_k: GAM parameter recommendations ----------------------
#>   effective_n_time = 5 (minimum across combinations)
#>   effective_n_env = 8 (minimum across treatments)
#>   smoothness = 'conservative'
#>   k_smooth = 4, k_trt = 3, k_env = 3, gamma = 1.8 
#> 
#> Suggested call to functional_curves():
#>   functional_curves(..., k_smooth = 4, k_trt = 3, k_env = 3, gamma = 1.8)

# Inferring from a data frame (unquoted column names)
set.seed(1)  # make the simulated severities reproducible
df <- data.frame(
  time      = rep(1:6, times = 6),
  cultivar  = rep(c("A", "B", "C"), each = 12),
  env       = rep(rep(c("E1", "E2"), each = 6), times = 3),
  severity  = runif(36, 0, 0.5)
)
suggest_k(
  data        = df,
  time        = time,
  treatment   = cultivar,
  environment = env,
  rule        = "minimum",
  smoothness  = "conservative"
)
#> -- suggest_k: GAM parameter recommendations ----------------------
#>   effective_n_time = 6 (minimum across combinations)
#>   effective_n_env = 2 (minimum across treatments)
#>   smoothness = 'conservative'
#>   k_smooth = 5, k_trt = 4, k_env = 2, gamma = 1.8 
#> 
#> Suggested call to functional_curves():
#>   functional_curves(..., k_smooth = 5, k_trt = 4, k_env = 2, gamma = 1.8)
#> 
#> Time points per treatment-by-environment:
#>   min = 6, median = 6, max = 6
#> Environments per treatment:
#>   min = 2, median = 2, max = 2