3 Fitting models

In this tutorial you will learn how to fit models to multiple actual disease progress curves (DPCs) data obtained from the literature. I will demonstrate how to fit and select the models using a new R package called epifitter. A few user friendly functions will help us decide which model to choose to obtain the parameters of interest and further compare the epidemics.

To illustrate, I will use two datasets available from Chapter 3 from the book, Study of Plant Disease Epidemics (Madden, Hughes, and van den Bosch 2017). In the book, SAS codes are presented to perform a few analysis. We then provide an alternative code for performing similar analysis, although not perfectly reproducing the results from the book.

3.1 Non-replicated

Here we will compare three DPCs of the incidence of tobacco etch, a virus disease, in peppers. Evaluations of incidence were evaluated at a 7-day interval up to 49 days.The data are available in chapter 4 (page 93). Let’s input the data manually and create a data frame. First column is the assessment time and the other columns correspond to the treatments, called groups in the book, from 1 to 3.

3.1.1 Initial setup

Load essential packages and set parameters recursively.

library(tidyverse)
library(knitr)
library(patchwork)
library(ggthemes)
theme_set(theme_few())
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
options(digits = 3)

3.1.2 Entering data

pepper <- 
  tibble::tribble(
   ~t,  ~`1`,  ~`2`,  ~`3`,
   0,  0.08, 0.001, 0.001,
   7,  0.13,  0.01, 0.001,
  14,  0.78,  0.09,  0.01,
  21,  0.92,  0.25,  0.05,
  28,  0.99,   0.8,  0.18,
  35, 0.995,  0.98,  0.34,
  42, 0.999,  0.99,  0.48,
  49, 0.999, 0.999,  0.74
  ) 

3.1.3 Visualize the DPCs

Before proceeding with model selection and fitting, let’s visualize the three epidemics. The code below reproduces quite exactly the top plot of Fig. 4.15 ((Madden, Hughes, and van den Bosch 2017) page 94). The appraisal of the curves might give us a hint on which models are the best candidates.

Because the data was entered in the wide format (each DPCs in a different columns) we need to reshape it to the tidyverse-suitable format, which is the long format. The pivot_longer function will do the job of reshaping from wide to long format so we can finally use the ggplot function to produce the plot.

pepper %>% 
  pivot_longer(2:4, names_to ="treat", values_to = "inc") %>% 
  ggplot (aes(t, inc, 
              linetype = treat, 
              shape = treat, 
              group = treat))+
  geom_point(size =2)+
  geom_line()+
  annotate(geom = "text", x = 15, y = 0.84, label = "1")+
  annotate(geom = "text", x = 23, y = 0.6, label = "2")+
  annotate(geom = "text", x = 32, y = 0.33, label = "3")+
  labs(y = "Disease incidence (y)",
       x = "Time (days)")+
  theme(legend.position = "none")

3.2

Most of the three curves appear to follow a sigmoid shape with the exception of group 3 that resembles an exponential growth, not reaching the maximum value, and thus suggesting an incomplete epidemic. We can easily discard the monomolecular and exponential models and decided on the other two non-flexible models: logistic or Gompertz. To do that, let’s proceed to model fitting and evaluate the statistics for supporting a final decision. There are two modeling approaches for model fitting in epifitter: the linear or nonlinear parameter-estimation methods.

3.2.1 Fitting: single epidemics

Among the several options offered by epifitter we start with the simplest one, which is fit a model to a single epidemics using the linear regression approach. For such, the fit_lin() requires two arguments: time (time) and disease intensity (y) each one as a vector stored or not in a dataframe.

Since we have three epidemics, fit_lin() will be use three times. The function produces a list object with six elements. Let’s first look at the Stats dataframe of each of the three lists named epi1 to epi3.

library(epifitter)

epi1 <- fit_lin(time = pepper$t,  
  y = pepper$`1` )
epi1$Stats
##                 CCC r_squared   RSE
## Gompertz      0.985     0.970 0.591
## Monomolecular 0.984     0.968 0.543
## Logistic      0.978     0.957 0.824
## Exponential   0.784     0.645 0.670
epi2 <- fit_lin(time = pepper$t,  
  y = pepper$`2` )
epi2$Stats
##                 CCC r_squared   RSE
## Logistic      0.996     0.992 0.452
## Gompertz      0.971     0.943 0.841
## Monomolecular 0.925     0.860 1.068
## Exponential   0.897     0.813 1.202
epi3 <- fit_lin(time = pepper$t,  
  y = pepper$`3` )
epi3$Stats
##                 CCC r_squared   RSE
## Logistic      0.983     0.967 0.605
## Gompertz      0.983     0.966 0.226
## Exponential   0.964     0.930 0.771
## Monomolecular 0.859     0.753 0.253

The statistics of the model fit confirms our initial guess that the predictions by the logistic or the Gompertz are closer to the observations than predictions by the other models. There is no much difference between them based on these statistics. However, to pick one of the models, it is important to inspect the curves with the observed and predicted values to check which model is best for all curves.

3.2.2 Fitting: multiple epidemics

Before looking at the prediction, let’s use another handy function that allows us to simultaneously fit the models to multiple DPC data. Different from fit_lin(), fit_multi() requires the data to be structured in the long format where there is a column specifying each of the epidemics.

Let’s then create a new data set called pepper2 using the data transposing functions of the tidyr package.

pepper2 <- pepper %>% 
  pivot_longer(2:4, names_to ="treat", values_to = "inc")

Now we fit the models to all DPCs. Note that the name of the variable indicating the DPC code needs to be informed in strata_cols argument.

epi_all <- fit_multi(
  time_col = "t",
  intensity_col = "inc",
  data = pepper2,
  strata_cols = "treat",
  nlin = FALSE
)

Now let’s select the statistics of model fitting. Again, Epifitter ranks the models based on the CCC (the higher the better) but it is important to check the RSE as well - the lower the better. In fact, the RSE is more important when the goal is prediction.

epi_all$Parameters %>% 
  select(treat, model, best_model, RSE, CCC)
##    treat         model best_model   RSE   CCC
## 1      1      Gompertz          1 0.591 0.985
## 2      1 Monomolecular          2 0.543 0.984
## 3      1      Logistic          3 0.824 0.978
## 4      1   Exponential          4 0.671 0.784
## 5      2      Logistic          1 0.452 0.996
## 6      2      Gompertz          2 0.841 0.971
## 7      2 Monomolecular          3 1.068 0.925
## 8      2   Exponential          4 1.202 0.897
## 9      3      Logistic          1 0.605 0.983
## 10     3      Gompertz          2 0.226 0.982
## 11     3   Exponential          3 0.771 0.964
## 12     3 Monomolecular          4 0.253 0.859

To be more certain about our decision, let’s advance to the final step which is to produce the plots with the observed and predicted values for each assessment time by calling the Data dataframe of the `epi_all list.

epi_all$Data %>%
 filter(model %in% c("Gompertz", "Logistic")) %>% 
  ggplot(aes(time, predicted, shape = treat)) +
  geom_point(aes(time, y)) +
  geom_line() +
  facet_wrap(~ model) +
 coord_cartesian(ylim = c(0, 1)) + # set the max to 0.6
  labs(
    y = "Disease incidence",
    x = "Time (days after emergence)"
  )

Overall, the logistic model seems a better fit for all the curves. Let’s produce a plot with the prediction error versus time.

epi_all$Data %>%
 filter(model %in% c("Gompertz", "Logistic")) %>% 
  ggplot(aes(time, predicted -y, shape = treat)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0, linetype =2)+
  facet_wrap(~ model) +
 coord_cartesian(ylim = c(-0.4, 0.4)) + # set the max to 0.6
  labs(
    y = "Prediction error",
    x = "Time (days after emergence)"
  )

The plots above confirms the logistic model as good fit overall because the errors for all epidemics combined are more scattered around the non-error line.

  epi_all$Parameters %>%
    filter(model == "Logistic") %>%
    select(treat, y0, y0_ci_lwr, y0_ci_upr, r, r_ci_lwr, r_ci_upr 
)
##   treat       y0 y0_ci_lwr y0_ci_upr     r r_ci_lwr r_ci_upr
## 1     1 0.093504  0.027321   0.27473 0.210    0.166    0.255
## 2     2 0.001373  0.000672   0.00280 0.278    0.254    0.303
## 3     3 0.000813  0.000313   0.00211 0.175    0.143    0.208

We can produce a plot for visual inference on the differences in the parameters.

p1 <- epi_all$Parameters %>%
  filter(model == "Logistic") %>%
  ggplot(aes(treat, r)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = r_ci_lwr, ymax = r_ci_upr),
    width = 0,
    size = 1
  ) +
  labs(
    x = "Time",
    y = "r"
  )

p2 <- epi_all$Parameters %>%
  filter(model == "Logistic") %>%
  ggplot(aes(treat, 1 - exp(-y0))) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = y0_ci_lwr, ymax = y0_ci_upr),
    width = 0,
    size = 1
  ) +
  labs(
    x = "Time",
    y = "y0"
  )

p1 | p2

3.3 Designed experiments

In this next exercise, we will work with disease data collected over time in the same plot unit (also called repeated measures) from a designed experiment for evaluating and comparing treatment effects.

Again, we will use a dataset of progress curves shown in page 98 (Madden, Hughes, and van den Bosch 2017). The curves represent the incidence of soybean plants symptomatic for bud blight caused by tobacco streak virus. Four treatments (different planting dates) were evaluated in randomized complete block design with four replicates. There are four assessment in time for each curve. The data was stored as a csv file and will be loaded using read_csv() function and stored as dataframe called budblight.

3.3.1 Loading data

budblight <- read_csv("data/bud-blight-soybean.csv")

Let’s have a look at the first six rows of the dataset and check the data type for each column. There is an additional column representing the replicates, called block.

head(budblight)
## # A tibble: 6 x 4
##   treat  time block     y
##   <chr> <dbl> <dbl> <dbl>
## 1 PD1      30     1  0.1 
## 2 PD1      30     2  0.3 
## 3 PD1      30     3  0.1 
## 4 PD1      30     4  0.1 
## 5 PD1      40     1  0.3 
## 6 PD1      40     2  0.38

3.3.2 Visualizing the DPCs

Let’s have a look at the curves and produce a combo plot figure similar to Fig. 4.17 of the book, but without the line of the predicted values.

p3 <- budblight %>%
  ggplot(aes(
    time, y,
    group = block,
    shape = factor(block)
  )) +
  geom_point(size = 1.5) +
  ylim(0, 0.6) +
  theme(legend.position = "none")+
  facet_wrap(~treat, ncol =1)+
  labs(y = "Disease incidence",
       x = "Time (days after emergence)")
p4 <- budblight %>%
  ggplot(aes(
    time, log(1 / (1 - y)),
    group = block,
    shape = factor(block)
  )) +
  geom_point(size = 2) +
  facet_wrap(~treat, ncol = 1) +
  theme(legend.position = "none")+
  labs(y = "Transformed incidence", x = "Time (days after emergence)")

p3 | p4

3.3.3 Model fitting

Remember that the first step in model selection is the visual appraisal of the curve data linearized with the model transformation. In the case the curves represent complete epidemics (close to 100%) appraisal of the absolute rate (difference in y between two times) over time is also helpful.

For the treatments above, it looks like the curves are typical of a monocyclic disease (the case of soybean bud blight), for which the monomolecular is usually a good fit, but other models are also possible as well. For this exercise, we will use both the linear and the nonlinear estimation method.

3.3.3.1 Linear regression

For convenience, we use the fit_multi() to handle multiple epidemics. The function returns a list object where a series of statistics are provided to aid in model selection and parameter estimation. We need to provide the names of columns (arguments): assessment time (time_col), disease incidence (intensity_col), and treatment (strata_cols).

lin1 <- fit_multi(
  time_col = "time",
  intensity_col = "y",
  data = budblight,
  strata_cols = "treat",
  nlin = FALSE
)

Let’s look at how well the four models fitted the data. Epifitter suggests the best fitted model (1 to 4, where 1 is best) for each treatment. Let’s have a look at the statistics of model fitting.

  lin1$Parameters %>% 
    select(treat, best_model, model, CCC, RSE)
##    treat best_model         model   CCC    RSE
## 1    PD1          1 Monomolecular 0.935 0.0981
## 2    PD1          2      Gompertz 0.904 0.2223
## 3    PD1          3      Logistic 0.871 0.4475
## 4    PD1          4   Exponential 0.828 0.3612
## 5    PD2          1 Monomolecular 0.955 0.0700
## 6    PD2          2      Gompertz 0.931 0.1794
## 7    PD2          3      Logistic 0.906 0.3877
## 8    PD2          4   Exponential 0.880 0.3268
## 9    PD3          1 Monomolecular 0.939 0.0683
## 10   PD3          2      Gompertz 0.929 0.1716
## 11   PD3          3      Logistic 0.909 0.3905
## 12   PD3          4   Exponential 0.890 0.3388
## 13   PD4          1      Gompertz 0.923 0.1747
## 14   PD4          2 Monomolecular 0.895 0.0649
## 15   PD4          3      Logistic 0.891 0.5241
## 16   PD4          4   Exponential 0.874 0.4977

And now we extract values for each parameter estimated from the fit of the monomolecular model.

  lin1$Parameters %>%
  filter(model == "Monomolecular") %>%
  select(treat, y0, r)
##   treat     y0      r
## 1   PD1 -0.573 0.0220
## 2   PD2 -0.522 0.0190
## 3   PD3 -0.449 0.0159
## 4   PD4 -0.362 0.0112

Now we visualize the fit of the monomolecular model (using filter function - see below) to the data together with the observed data and then reproduce the right plots in Fig. 4.17 from the book.

lin1$Data %>%
  filter(model == "Monomolecular") %>%
  ggplot(aes(time, predicted)) +
  geom_point(aes(time, y)) +
  geom_line(size = 0.5) +
  facet_wrap(~treat) +
  coord_cartesian(ylim = c(0, 0.6)) + # set the max to 0.6
  labs(
    y = "Disease incidence",
    x = "Time (days after emergence)"
  )

Now we can plot the means and respective 95% confidence interval of the apparent infection rate (\(r\)) and initial inoculum (\(y_0\)) for visual inference.

p5 <- lin1$Parameters %>%
  filter(model == "Monomolecular") %>%
  ggplot(aes(treat, r)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = r_ci_lwr, ymax = r_ci_upr),
    width = 0,
    size = 1
  ) +
  labs(
    x = "Time",
    y = "r"
  )

p6 <- lin1$Parameters %>%
  filter(model == "Monomolecular") %>%
  ggplot(aes(treat, 1 - exp(-y0))) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = y0_ci_lwr, ymax = y0_ci_upr),
    width = 0,
    size = 1
  ) +
  labs(
    x = "Time",
    y = "y0"
  )

p5 | p2

3.3.3.2 Non-linear regression

To estimate the parameters using the non-linear approach, we repeat the same arguments in the fit_multi function, but include an additional argument nlin set to TRUE.

nlin1 <- fit_multi(
  time_col = "time",
  intensity_col = "y",
  data = budblight,
  strata_cols = "treat",
  nlin = TRUE
)

Let’s check statistics of model fit.

nlin1$Parameters %>%
  select(treat, model, CCC, RSE, best_model)
##    treat         model   CCC    RSE best_model
## 1    PD1 Monomolecular 0.938 0.0613          1
## 2    PD1      Gompertz 0.917 0.0699          2
## 3    PD1      Logistic 0.896 0.0770          3
## 4    PD1   Exponential 0.854 0.0880          4
## 5    PD2 Monomolecular 0.967 0.0421          1
## 6    PD2      Gompertz 0.935 0.0573          2
## 7    PD2      Logistic 0.908 0.0666          3
## 8    PD2   Exponential 0.870 0.0767          4
## 9    PD3 Monomolecular 0.957 0.0427          1
## 10   PD3      Gompertz 0.926 0.0544          2
## 11   PD3      Logistic 0.900 0.0620          3
## 12   PD3   Exponential 0.870 0.0689          4
## 13   PD4 Monomolecular 0.918 0.0460          1
## 14   PD4      Gompertz 0.909 0.0479          2
## 15   PD4      Logistic 0.894 0.0508          3
## 16   PD4   Exponential 0.884 0.0527          4

And now we obtain the two parameters of interest. Note that the values are not the sames as those estimated using linear regression, but they are similar and highly correlated.

  nlin1$Parameters %>%
    filter(model == "Monomolecular") %>%
    select(treat, y0, r)
##   treat     y0      r
## 1   PD1 -0.707 0.0238
## 2   PD2 -0.634 0.0206
## 3   PD3 -0.505 0.0167
## 4   PD4 -0.350 0.0109
p7 <- nlin1$Parameters %>%
  filter(model == "Monomolecular") %>%
  ggplot(aes(treat, r)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = r_ci_lwr, ymax = r_ci_upr),
    width = 0,
    size = 1
  ) +
  labs(
    x = "Time",
    y = "r"
  )

p8 <- nlin1$Parameters %>%
  filter(model == "Monomolecular") %>%
  ggplot(aes(treat, y0)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = y0_ci_lwr, ymax = y0_ci_upr),
    width = 0,
    size = 1
  ) +
  labs(
    x = "Time",
    y = "y0"
  )

p7 | p8
Madden, Laurence V., Gareth Hughes, and Frank van den Bosch, eds. 2017. “CHAPTER 4: Temporal Analysis i: Quantifying and Comparing Epidemics.” In, 63–116. The American Phytopathological Society. https://doi.org/10.1094/9780890545058.004.