Book review: Dynamical Biostatistical Models by Commenges and Jacqmin-Gadda
Most statistical models are purely phenomenological, which is a polite way of saying that they are made up. While such models may draw on theoretical and commonsensical ideas, their mathematical forms are not derived from existing principles in any precise way, but are merely posited on the basis of goodness of fit to data, ease of interpretation, analytical and computational tractability, and other practical considerations. The statistical paradigm of scientific modeling has the advantage of being widely applicable, since it does not depend on a deep understanding of the system under study. In many fields of social science, whose systems defy neat theoretical explanation, statistical modeling is the dominant form of scientific modeling.
Especially in the physical and life sciences, there is another tradition, much older than the statistical paradigm,1 that puts mechanistic models at the fore. Mechanistic models attempt to identify the unobserved physical processes, or mechanisms, that drive and explain observed phenomena. The prototypical form of a mechanistic model is a system of ordinary or partial differential equations, whose dynamical laws are derived exactly or approximately from general physical principles. The advantage of this paradigm is that, at least when it works, mechanistic models can be expected to have greater generalizability than statistical models, since valid physical laws latch onto fundamental and persistent regularities in nature. A purely phenomenological model, by contrast, usually cannot be expected to generalize beyond the range of the data to which it is fit.
There seems to be fairly little interaction between the statistical and mechanistic cultures of scientific modeling.2 This is surprising because many mechanistic models have free parameters that are unknown and must be estimated from noisy data, giving rise to all the usual problems of statistical inference. The book under review, Dynamical Biostatistical Models (Commenges and Jacqmin-Gadda 2015), is a rare example of a work that tries to bring together both viewpoints. One can imagine at least two different approaches to writing such a book: starting from statistics and showing how familiar statistical concepts relate and apply to mechanistic models, or starting from dynamical systems and showing how statistical inference may be incorporated. The present book takes the former path. The authors, Daniel Commenges and Hélène Jacqmin-Gadda, are biostatisticians and they present a good deal of conventional statistics before making contact with dynamical systems in the final chapters of the book.
The book has three main themes, to be explored in subsequent sections: statistical models for longitudinal data, using random effects to capture within-subject correlation; survival analysis, or models for time-to-event data; and extensions of these ideas to multistate, ODE, and SDE models, including aspects of causal inference. All of the models treated are “dynamical” in that they model phenomena which evolve over time. They are illustrated with data from biomedical studies, especially epidemiology, but are applicable to other domains. With the exception of one example, nearly the last in the book, the statistical methodology is thoroughly frequentist, favoring the likelihood school of inference.3
Models of longitudinal data
Longitudinal data, or panel data, consists of repeated measurements over time on a set of subjects. According to the authors, longitudinal data is distinguished from time series data by having a relatively large number of subjects or observational units but relatively few measurements per subject, which may be taken at irregular times (Commenges and Jacqmin-Gadda 2015, 63). To illustrate, the response to treatment over time of patients with a chronic disease is longitudinal data, whereas the daily number of new COVID-19 cases in the United States is time series data.
In this text, the workhorse models for longitudinal data are mixed effects models, where the fixed effects capture population-level trends and the random effects capture individual variation. The simplest version of this model is the linear mixed model

$$Y_{ij} = X_{ij}^T \beta + Z_{ij}^T b_i + \epsilon_{ij}, \qquad i = 1, \dots, N, \quad j = 1, \dots, n_i,$$

where

- $N$ is the number of subjects,
- $n_i$ is the number of measurements taken on subject $i$,
- $Y_{ij}$ is the response at measurement $j$ on subject $i$,
- $X_{ij}$ are predictors corresponding to global parameters,
- $Z_{ij}$ are predictors corresponding to individual effects, and
- $\epsilon_{ij}$ is the measurement error.

The parameters of this model are the fixed effects $\beta$ and the variance components $(B, \sigma^2)$, where the random effects and errors are distributed as

$$b_i \sim \mathcal{N}(0, B), \qquad \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2),$$

independently of each other.
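To make this concrete, here is a minimal sketch of fitting such a model in Python with simulated data, assuming the `statsmodels` library; the column names (`subject`, `time`, `y`) and parameter values are illustrative, not from the book.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
N, n_i = 50, 6  # number of subjects, measurements per subject

# Simulate from a random-intercept, random-slope linear mixed model:
# Y_ij = (beta0 + b0_i) + (beta1 + b1_i) * t_ij + eps_ij
beta0, beta1, sigma = 2.0, 0.5, 0.3
b = rng.multivariate_normal([0, 0], [[0.5, 0.1], [0.1, 0.2]], size=N)
rows = []
for i in range(N):
    for t in range(n_i):
        y = beta0 + b[i, 0] + (beta1 + b[i, 1]) * t + rng.normal(0, sigma)
        rows.append({"subject": i, "time": t, "y": y})
df = pd.DataFrame(rows)

# Fit a linear mixed model with random intercepts and slopes per subject.
model = smf.mixedlm("y ~ time", df, groups=df["subject"], re_formula="~time")
result = model.fit()
print(result.summary())
```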
Despite being a standard topic in statistics, mixed effects are a notorious source of confusion for beginners and professionals alike. Commenges and Jacqmin-Gadda do an admirable job of explaining why the model takes this form. First, in a standard linear model, all the variables $Y_{ij}$ would be mutually independent, which is implausible for longitudinal data: repeated measurements on the same subject tend to be correlated. The linear mixed model induces exactly this kind of dependence, since two distinct measurements on subject $i$ have covariance

$$\operatorname{Cov}(Y_{ij}, Y_{ik}) = Z_{ij}^T B\, Z_{ik}, \qquad j \neq k,$$

which is nonzero and, for typical choices of $Z_{ij}$ such as polynomials in time, increasing with time. Moreover, making the individual effects $b_i$ random, rather than treating them as fixed parameters, keeps the number of parameters fixed as the number of subjects grows and shrinks the individual estimates toward the population average.
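As a quick check of this covariance formula, one can simulate many subjects and compare the empirical covariance of two repeated measurements against $Z_{ij}^T B Z_{ik}$; the sketch below reuses the simulated setup above, with the fixed part omitted since it does not affect the covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
B = np.array([[0.5, 0.1], [0.1, 0.2]])  # covariance of random effects
sigma = 0.3
t1, t2 = 1.0, 4.0                        # two measurement times
Z1, Z2 = np.array([1.0, t1]), np.array([1.0, t2])

# Simulate Y_i(t) = Z(t)^T b_i + eps for many subjects.
b = rng.multivariate_normal([0, 0], B, size=200_000)
Y1 = b @ Z1 + rng.normal(0, sigma, size=len(b))
Y2 = b @ Z2 + rng.normal(0, sigma, size=len(b))

print(np.cov(Y1, Y2)[0, 1])  # empirical covariance
print(Z1 @ B @ Z2)           # theoretical value Z1^T B Z2
```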
Chapters 4 and 5 present a wide array of extensions to the linear mixed model for longitudinal data, namely
- multivariate linear mixed models, for correlated multivariate responses;
- generalized linear mixed models (GLMMs),4 such as the mixed logistic model for binary responses and the mixed Poisson model for count data;
- non-linear5 and curvilinear mixed models;
- mixed models with latent classes, a type of mixture model; and
- handling of missing data.
Concepts like generalized linear models (GLMs), mixture models, and missing data are mainstays of statistics, but it is still helpful to see how they manifest for mixed models of longitudinal data.
Among the less familiar ideas is the “curvilinear model,” drawn from work by the authors and their collaborators.6 A curvilinear mixed model has the form

$$Y_{ij} = g(X_{ij}^T \beta + Z_{ij}^T b_i + \epsilon_{ij};\, \eta),$$

where $g(\cdot\,; \eta)$ is a monotone transformation belonging to a parametric family whose parameters $\eta$ are estimated along with the other parameters of the model.
Curvilinear mixed models should be compared with GLMMs, which have the form

$$\mathbb{E}[Y_{ij} \mid b_i] = g^{-1}(X_{ij}^T \beta + Z_{ij}^T b_i),$$

where $g$ is a fixed link function and the conditional distribution of $Y_{ij}$ belongs to an exponential family. Thus, in a GLMM the nonlinearity acts on the conditional mean and the randomness enters through the response distribution, whereas in a curvilinear model an estimated transformation is applied to the latent linear predictor after Gaussian noise is added.
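The difference in where the noise enters can be seen by simulating one observation from each model. This sketch assumes, purely for illustration, a logit link for the GLMM and a softplus transformation standing in for the curvilinear model's estimated family; neither choice comes from the book.

```python
import numpy as np

rng = np.random.default_rng(2)
x, beta, b_i, sigma = 1.5, 0.8, 0.3, 0.2
eta = beta * x + b_i  # linear predictor including a random effect

# Mixed logistic model (GLMM): the nonlinearity acts on the conditional
# mean, and the noise comes from the Bernoulli response distribution.
p = 1 / (1 + np.exp(-eta))
y_glmm = rng.binomial(1, p)

# Curvilinear mixed model: Gaussian noise is added to the latent linear
# predictor first, then a monotone transformation g is applied.
g = lambda u: np.log1p(np.exp(u))  # softplus, a stand-in for g(.; eta)
y_curvilinear = g(eta + rng.normal(0, sigma))

print(y_glmm, y_curvilinear)
```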
Survival analysis
Another common type of temporal data is time-to-event data: the time, for each subject, at which an event of interest occurs. The analysis of time-to-event data is called survival analysis because the event is prototypically death or something equally morbid. But happier events are also allowed. For example, the event could be recovery from a nonlethal infectious disease. This interpretation is the source of an analogy between survival analysis and compartmental models in epidemiology, as we will see later. Although survival analysis is very important in certain application areas, such as medical trials and industrial reliability experiments, it is often not treated as a core topic in statistics curricula.
Modeling in survival analysis is centered around two crucial functions. Letting the time of death (or other event of interest) be a continuous random variable $T$ with density $f$ and distribution function $F$, the two functions are the survival function

$$S(t) := P(T > t) = 1 - F(t)$$

and the hazard function

$$\alpha(t) := \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}.$$

Thus, $\alpha(t) = -\frac{d}{dt} \log S(t)$, so that the survival function can be recovered from the hazard as

$$S(t) = \exp\left( -\int_0^t \alpha(u)\, du \right).$$

The cumulative hazard $A(t) := \int_0^t \alpha(u)\, du$ is another useful quantity.
The hazard function plays the role in survival analysis that the probability density function plays in other parts of statistics. The hazard function is in some ways more convenient for specifying models, since any nonnegative function $\alpha$ whose integral diverges defines a valid hazard, whereas a probability density must be normalized to integrate to one.
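For instance, taking the Weibull hazard $\alpha(t) = \lambda k t^{k-1}$, a few lines of Python (a sketch with illustrative parameter values, not an example from the book) confirm numerically that $S(t) = \exp(-\int_0^t \alpha(u)\,du)$ reproduces the closed-form Weibull survival function:

```python
import numpy as np
from scipy.integrate import quad

lam, k = 0.5, 1.8                          # Weibull rate and shape parameters
alpha = lambda t: lam * k * t ** (k - 1)   # Weibull hazard function

t = 2.0
A, _ = quad(alpha, 0, t)                   # cumulative hazard A(t)
print(np.exp(-A))                          # S(t) computed via the hazard
print(np.exp(-lam * t ** k))               # closed-form Weibull survival
```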
Chapters 3 and 6 present methods for survival analysis, beginning with the Kaplan-Meier estimator, a nonparametric estimator of the survival function, and the Cox proportional hazards model, a semiparametric model defined through the hazard function. They are among the most widely used statistical methods in medicine.7 We focus on the Cox model since it is more strongly connected to other themes of the book.
The Cox proportional hazards model is used to study how the hazard depends upon covariates of interest. It postulates that the hazard functions for different subjects are proportional to one another, the subject with covariates $X_i$ having hazard

$$\alpha_i(t) = \alpha_0(t)\, e^{\beta^T X_i},$$

where both the baseline hazard function $\alpha_0$ and the regression coefficients $\beta$ are unknown.8 The model is semiparametric because the baseline hazard is an arbitrary function while $\beta$ is a finite-dimensional parameter, estimated by maximizing Cox's partial likelihood.
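As a sketch of how the Cox model is used in practice, assuming the `lifelines` library and made-up covariates and coefficients (nothing here is from the book):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 500
age = rng.normal(60, 10, n)
treated = rng.binomial(1, 0.5, n)

# Simulate event times with hazard alpha_0(t) * exp(beta^T X), using a
# constant baseline hazard so that event times are exponential.
hazard = 0.02 * np.exp(0.03 * (age - 60) - 0.5 * treated)
T = rng.exponential(1 / hazard)
censoring = rng.exponential(50, n)
df = pd.DataFrame({
    "duration": np.minimum(T, censoring),
    "event": (T <= censoring).astype(int),
    "age": age,
    "treated": treated,
})

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event")
cph.print_summary()  # estimated log hazard ratios near (0.03, -0.5)
```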
Chapter 6 surveys extensions to survival analysis and the proportional hazards model, such as
- competing risks models, having not a single event of interest but a set of mutually exclusive alternatives;
- frailty models, accounting for grouped data, such as subjects belonging to the same family, or recurrent events, such as successive asthma attacks, through multiplicative random effects in the hazard function;
- cure models, for when a nonnegligible portion of the population is not at risk, perhaps because of immunity to a disease.
In addition, Chapter 8 on “joint models” connects survival analysis with the previous theme, explaining how to accommodate both time-to-event data and longitudinal data within a single model.
Of these many topics we discuss only the competing risks model. As a typical scenario, elderly patients with cancer have nonnegligible risk of dying for unrelated reasons, so when estimating the risk due to cancer, the event “death from cancer” should be distinguished from “death from other causes” (Commenges and Jacqmin-Gadda 2015, sec. 6.2.5). Here there are two competing risks or causes. In general, a competing risks model has an event time $T$ together with a cause $D \in \{1, \dots, K\}$, and each cause has its own cause-specific hazard function

$$\alpha_k(t) := \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t,\, D = k \mid T \ge t)}{\Delta t}, \qquad k = 1, \dots, K,$$

where the causes are mutually exclusive, so that the overall hazard decomposes as the sum

$$\alpha(t) = \sum_{k=1}^K \alpha_k(t),$$

where the constraint that exactly one cause occurs at the event time ensures that the cause-specific hazards add up without double counting.
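A simple way to simulate from a competing risks model, sketched below with two constant cause-specific hazards (illustrative values, not the book's example): draw the event time from the overall hazard, then draw the cause with probabilities proportional to the cause-specific hazards.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = np.array([0.03, 0.01])  # constant cause-specific hazards (per year)
alpha_total = alpha.sum()

n = 100_000
# With constant hazards, T is exponential with rate alpha_total, and the
# cause D has probabilities alpha_k / alpha_total, independent of T.
T = rng.exponential(1 / alpha_total, size=n)
D = rng.choice(len(alpha), size=n, p=alpha / alpha_total)

print(T.mean())              # ~ 1 / 0.04 = 25 years
print(np.bincount(D) / n)    # ~ [0.75, 0.25]
```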
Multistate and mechanistic models
Multistate models, the topic of Chapter 7, make a link between survival analysis and the sprawling field of dynamical stochastic processes. Survival analysis models are multistate models of a particularly simple kind, having two states, “alive” and “dead,” and a single transition, “death.”
Generalizing this, competing risks models with $K$ causes are multistate models with $K + 1$ states: an initial “alive” state and $K$ absorbing states, one for each cause, with a single transition from “alive” to each of them.
Another simple but important multistate model is the irreversible illness-death model, which inserts a new state “ill” between “healthy” and “dead” (Commenges and Jacqmin-Gadda 2015, secs. 7.2.6 & 7.5.6).
In this model, it is also possible to transition directly from “healthy” to “dead,” accounting for the possibility of death by causes other than the illness under study.
As mathematical entities, multistate models are continuous-time Markov chains, which may be characterized by their transition probabilities or transition intensities. The transition probabilities of a multistate process $(X(t))_{t \ge 0}$ with state space $\{1, \dots, K\}$ are

$$p_{hj}(s, t) := P(X(t) = j \mid X(s) = h), \qquad s \le t.$$

These assemble into a stochastic matrix $P(s, t) := (p_{hj}(s, t))_{h,j}$, whose rows sum to one. Given the transition probabilities, the transition intensities are defined by

$$\alpha_{hj}(t) := \lim_{\Delta t \to 0} \frac{p_{hj}(t, t + \Delta t)}{\Delta t}, \qquad h \neq j.$$

The quantity $\alpha_{hj}(t)\, \Delta t$ is approximately the probability that a process in state $h$ at time $t$ transitions to state $j$ within the short interval $[t, t + \Delta t]$. For given transition intensities, assembled into the matrix $A(t) := (\alpha_{hj}(t))$ with diagonal entries $\alpha_{hh}(t) := -\sum_{j \neq h} \alpha_{hj}(t)$, the transition probabilities may be recovered by solving the Kolmogorov forward equation

$$\frac{\partial}{\partial t} P(s, t) = P(s, t)\, A(t), \qquad P(s, s) = I.$$
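To illustrate, here is a sketch (with made-up constant intensities) that recovers the transition probabilities of the irreversible illness-death model by numerically solving the Kolmogorov forward equation with SciPy:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Constant transition intensities for the illness-death model with states
# 0 = healthy, 1 = ill, 2 = dead (illustrative values).
a01, a02, a12 = 0.05, 0.02, 0.10
A = np.array([
    [-(a01 + a02), a01, a02],
    [0.0, -a12, a12],
    [0.0, 0.0, 0.0],
])

# Kolmogorov forward equation dP/dt = P A with P(0) = I.
def forward(t, P_flat):
    P = P_flat.reshape(3, 3)
    return (P @ A).ravel()

sol = solve_ivp(forward, (0, 10), np.eye(3).ravel(), rtol=1e-8)
P = sol.y[:, -1].reshape(3, 3)
print(P)                # transition probabilities P(0, 10)
print(P.sum(axis=1))    # rows sum to one
```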
For modeling purposes, a multistate model is most conveniently specified through its transition intensities. The transition intensity for each permitted transition $h \to j$ can be given, for example, a proportional hazards form

$$\alpha_{hj}(t) = \alpha_{hj,0}(t)\, e^{\beta_{hj}^T X_{hj}},$$

where the covariates $X_{hj}$ and regression coefficients $\beta_{hj}$ are allowed to differ from one transition to another.
Multistate models are reminiscent of compartmental models in epidemiology. If, in the illness-death model, you replace “healthy” with “susceptible,” “ill” with “infectious,” and “dead” with “recovered,” you obtain an individual-level analogue of the famous SIR model.9 A crucial difference is that in the SIR model the intensity of the transition from “susceptible” to “infectious” depends on the current number of infectious individuals, an assumption about the mechanism of transmission, and compartmental models are traditionally analyzed as deterministic dynamical systems without any statistical component.
However, statistics is relevant to mechanistic models such as the SIR model when, as often happens, they are only partially and noisily observed. Chapter 9, the last in the book, contains a brief introduction to statistical inference for mechanistic models based on ordinary or stochastic differential equations. An ODE model starts from a model of the system,

$$\frac{dx}{dt}(t) = f(x(t), \psi),$$

where $x(t)$ is the state of the system at time $t$ and $\psi$ is a vector of unknown parameters. The system model is paired with a separate model of the observations,10 such as

$$Y_j = g(x(t_j), \psi) + \epsilon_j, \qquad j = 1, \dots, n,$$

where the $\epsilon_j$ are independent measurement errors and the function $g$ extracts the observable part of the state. Just as for other kinds of longitudinal data, the within-subject correlation is captured by a mixed model, such as

$$\psi_i = h(\beta + b_i), \qquad b_i \sim \mathcal{N}(0, B),$$

where $\psi_i$ is the parameter vector for subject $i$, the fixed effects $\beta$ describe the population, the random effects $b_i$ describe individual deviations, and $h$ is a known transformation ensuring that the parameters lie in their admissible ranges.
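As a minimal sketch of statistical inference for an ODE model, here is a one-parameter exponential-decay system fit by least squares with SciPy; the system, observation times, and parameter values are all illustrative assumptions, not an example from the book.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

rng = np.random.default_rng(5)

# System model: dx/dt = -psi * x, observed noisily at a few time points.
def solve_system(psi, times, x0=10.0):
    sol = solve_ivp(lambda t, x: -psi * x, (0, times[-1]), [x0],
                    t_eval=times, rtol=1e-8)
    return sol.y[0]

times = np.array([0.5, 1, 2, 4, 8])
psi_true, sigma = 0.3, 0.2
y = solve_system(psi_true, times) + rng.normal(0, sigma, len(times))

# Observation model residuals: Y_j - g(x(t_j), psi).
fit = least_squares(lambda psi: solve_system(psi[0], times) - y,
                    x0=[1.0], bounds=(1e-6, np.inf))
print(fit.x)  # estimate of psi, close to 0.3
```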
Another topic of Chapter 9 is stated in its title: “The dynamic approach to causality.” I will not discuss this here, intriguing though it is, because the presentation is somewhat sketchy and I am not competent to compare the dynamic approach with other approaches to causality, such as potential outcomes. The material on causality is drawn from research by the first author, Commenges, in collaboration with Gégout-Petit (Commenges and Gégout-Petit 2009).
Summary
As suggested by the breadth of this review, Dynamical Biostatistical Models covers an extraordinary amount of material for a book that is, excluding appendices, less than three hundred pages long. Inevitably, this means that the book is more a survey than an encyclopedia, but it serves that purpose well. The prose, like the exposition, is succinct and unadorned. The text is translated from the original French and at times this shows, most obviously in the several places where the authors forget to translate “et” as “and.” However, the occasional infelicity never inhibits the readability of the text. I would recommend the book to those with basic training in statistics who wish to learn about dynamical statistical methods, particularly as used in medicine and epidemiology.
References

Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures.” Statistical Science 16 (3): 199–231.

Commenges, Daniel, and Anne Gégout-Petit. 2009. “A General Dynamical Statistical Model with Causal Interpretation.” Journal of the Royal Statistical Society: Series B 71 (3): 719–36.

Commenges, Daniel, and Hélène Jacqmin-Gadda. 2015. Dynamical Biostatistical Models. Chapman & Hall/CRC.

Lindstrom, Mary J., and Douglas M. Bates. 1990. “Nonlinear Mixed Effects Models for Repeated Measures Data.” Biometrics 46 (3): 673–87.

Pawitan, Yudi. 2001. In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press.

Proust, Cécile, Hélène Jacqmin-Gadda, Jeremy M. G. Taylor, Julien Ganiayre, and Daniel Commenges. 2006. “A Nonlinear Model with Latent Process for Cognitive Evolution Using Multivariate Longitudinal Data.” Biometrics 62 (4): 1014–24.

Proust-Lima, Cécile, Luc Letenneur, and Hélène Jacqmin-Gadda. 2007. “A Nonlinear Latent Class Model for Joint Analysis of Multivariate Longitudinal Data and a Binary Outcome.” Statistics in Medicine 26 (10): 2229–45.

Reluga, Timothy C., and Jan Medlock. 2007. “Resistance Mechanisms Matter in SIR Models.” Mathematical Biosciences and Engineering 4 (3): 553–63.

Stroup, Walter W. 2012. Generalized Linear Mixed Models: Modern Concepts, Methods and Applications. CRC Press.

Sundberg, Rolf. 2019. Statistical Modelling by Exponential Families. Cambridge University Press.
Footnotes
While rudimentary statistical analysis was already used by the astronomers of the late 18th century, the statistical paradigm in its modern form is barely a hundred years old.↩︎
In a famous essay (Breiman 2001), Leo Breiman draws a different cultural distinction, between the statistical mainstream (the “data modeling culture”) and machine learning (the “algorithmic modeling culture”). From the viewpoint of this review, the culture of statistical modeling is the intermediate position, taking a view of models that is more realistic than the instrumentalism of machine learning but less realistic than in the mechanistic tradition.↩︎
According to the likelihood school, statistical inference is performed by maximizing the likelihood or one of its innumerable variants: marginal likelihood, conditional likelihood, partial likelihood, profile likelihood, penalized likelihood, and so on. Chapter 2 of Dynamical Biostatistical Models briskly reviews these ideas but for a fuller account the reader should consult another text, such as (Pawitan 2001).↩︎
See, for example, (Stroup 2012).↩︎
See (Lindstrom and Bates 1990).↩︎
Be warned that the term “curvilinear” has no definite technical meaning in statistics and that the usage here is idiosyncratic. Versions of the curvilinear mixed model are featured in research papers by the authors (Proust et al. 2006; Proust-Lima, Letenneur, and Jacqmin-Gadda 2007).↩︎
According to Google Scholar, the original papers on the Kaplan-Meier estimator and the Cox proportional hazards model have each been cited over 50,000 times.↩︎
As the form of the hazard function suggests, the proportional hazards model is related to exponential families and GLMs (Sundberg 2019, sec. 9.7.1).↩︎
The standard SIR model has no direct transition from “susceptible” to “recovered,” but one may be added to model control measures or vaccination, as in (Reluga and Medlock 2007).↩︎

The distinction between models for the system and for the observations is emphasized by the authors, who remark that it is commonplace in stochastic processes and control theory but not in biostatistics (Commenges and Jacqmin-Gadda 2015, sec. 9.1.3.1). It also figures prominently in the work of Patrick Suppes, as noted in my review of Representation and Invariance.↩︎