New website and blog

I’ve mode to a new website: All future posts will appear only at


Biomarker-guided clinical trial designs

A biomarker is a measured characteristic of a patient that relates to biological processes. Genetic markers, antigen presence, and chemical concentrations are all examples of biomarkers. Biomarkers may be classified as prognostic, predictive, surrogate endpoints, or some combination of the three. Prognostic biomarkers give information about the expected course of a disease absent intervention, while predictive biomarkers indicate how various treatments are expected to perform.
Surrogate endpoints are endpoints that are not the endpoint of primary interest, but for which the treatment effect predicts the treatment effect on the primary endpoint. I’ll focus here on predictive biomarkers.

In clinical practice, predictive biomarkers are attractive for determining how to treat particular patients. In the clinical trial setting, then, a common goal is to use predictive biomarkers in order to determine to whom a treatment should be given. Specifically, if a predictive biomarker has been identified, a trial may seek to determine whether a treatment should be developed for the entire population, or only for a subset defined by that biomarker. Such a determination is the goal of most biomarker-guided clinical trial designs.

Stratified designs

Perhaps the most direct approach to determining whether a treatment should be developed for the entire population versus a subset is to enroll a patients from the broad population and then examine the treatment effect stratified by biomarker. Trials using this strategy are said to use stratified designs. The difference between a stratified design and the simple randomized control trial is that patients are not enrolled in the stratified design if their biomarker status cannot be determined (Mandrekar et al, 2013). This ensures that subgroup membership is not subject to missing data.

At the conclusion of the trial, treatment effect is tested in strata defined by the biomarker. In a sequential testing approach, a single hypothesis (e.g., a treatment effect with respect to a single endpoint) is either tested first in a subgroup and then in the broad population if the first test is significant(Bauer, 1991). This approach is called a closed testing procedure and is especially useful when there is substantial prior evidence that the treatment effect is strongest in the subgroup and that the the subgroup is sufficiently large in the population to provide adequate power. Alternatively, when the treatment is expected to be broadly effective, the order of the tests may be reversed.
Both of these sequential testing procedures maintain the familywise type I error rate at the nominal level of the individual tests.

Another approach to stratified designs is the marker-by-treatment interaction design (Sargent et al, 2005), in which the biomarker is used to partition the population into two groups. Patients with valid (but not necessarily positive) biomarker test results are enrolled and then randomized separately within each biomarker stratum. The sample size of the study is determined in order to provide sufficient power to test the primary hypothesis independently in each stratum.

Finally, treatment-by-biomarker interaction tests may be carried out prior to the primary hypothesis tests, and subset-specific hypotheses tested only if the corresponding interaction test is significant (Pocock et al, 2002). However, it must be noted that the sample size necessary to power such interaction tests are often only feasible in phase III settings. Continue reading “Biomarker-guided clinical trial designs”

A restaurant drawing setup

A few weeks ago a restaurant offered my wife and I entry into a weekly drawing for a gift certificate. They mentioned a little twist by which a first-time entrant would have their ticket remain in the pot until they win. I wondered how such a drawing would play out and if it gave us a reasonably higher chance of winning over a normal drawing in which the pot is cleared each week.

I don’t know exactly how this drawing is carried out, but let’s hypothesize the following setup: first-time entrants place a red ticket in the pot and repeat entrants place a blue ticket in the pot. Each week a ticket is drawn from the pot (the winner), and then all of the blue tickets are removed from the pot. Notice that if the blue ticket of a repeat customer is drawn, their red ticket remains in the pot. This would be the case if the restaurant didn’t bother to go through all the red tickets to find the one corresponding to the drawn blue ticket, which seems reasonable if they were doing this by hand.

Imagine that n red tickets and m blue tickets are added each week, and that Ni red tickets have built up from previous weeks. The probability that a red ticket will win the drawing is (Nin) / (Ninm), and the probability that a blue ticket will win is m / (Ninm). If a red ticket wins, Ni+1 = Nin – 1 red tickets are carried over to the next week, otherwise Ni+1Nin are carried over.

Clearly, if n > 1 then the Ni increase to infinity because Ni+1Ni every week. For the case of n = 1, we know that Ni never decreases, and that for each week there is a positive probability that it increases. Since there is always a positive probability that Ni increases, it will do so infinitely many times, and since it never decreases, it goes to infinity.

Adaptive randomization: motivation and issues

Randomized controlled trials (or randomized comparative trials) are the gold standard for scientific inference in medicine. However, not all RCTs have balanced randomization ratios, i.e. patients may not be equally likely to be randomized to each treatment. Motivations for unbalancing the randomization ratio may include giving a promising treatment rather than the control to the majority of the trial participants, which is often considered ethically desirable.

However if there are two treatments and neither is known to be more promising than the other, adaptive randomization (AR) techniques are sometimes used to unbalance the randomization ratio as information about the treatments accumulates. One intuitive Bayesian approach to AR is Thompson sampling, which randomizes patients to each treatment with probability equal to the posterior probability that the treatment maximizes the expected benefit. A modification of Thompson sampling is to shrink the randomization probabilities toward 1/2 (or 1/k for k treatments).

A competing method to AR is the group sequential (GS) trial, which has balanced randomization and stops the trial early at one of the pre-specified interim points if it is clear that one treatment is superior. Group sequential trials don’t try to randomize more patients to the superior treatment, but rather ends randomization as soon as possible so that future patients may be given the superior treatment.

A recent paper by Thall, Fox, and Wathen pits AR (both shrunken and unshrunken) against GS, with some disappointing results for AR. Their simulations yield the following results.

  1. AR causes more variability in the sample sizes of each arm than equal randomization (including GS).
  2. AR can have an unacceptably high probability of assigning significantly more patients to the inferior treatment.
  3. AR substantially biases the treatment effect estimator.
  4. AR can lead to very high type I error rates.
  5. AR has much less power to detect treatment effects than GS for a given type I error rate.

More details, including a discussion of parameter drift (a situation in which the prognosis changes over time but not the comparative treatment effect, exacerbating points 3 and 4), are discussed in Thall et al., and it’s quite an accessible and interesting read.

Derivation of the Brier scoring rule from basic requirements

The Brier score is a way of quantifying the accuracy of probabilistic forecasts of binary events, for example the statement that there is an 80% chance that it will rain in a certain area tomorrow. The Brier score for n predictions and events is defined as

S = \sum_{i=1}^{n} (y_i - p_i)^2

where pi is the predicted probability that an event will occur, and yi is 1 if the event occurs and 0 otherwise.

The Brier score can be viewed as a loss function with the pi being the decisions and the yi the outcomes, and is in fact a form of squared error loss. While the squared error loss is a natural choice to a statistician, I wondered if there might be other reasonable choices for scoring this sort of prediction, or if a small set of desirable properties of a scoring function yields the Brier score as the unique solution. It turns out that three reasonable requirements on a loss function l(yx) are enough to specify the Brier score up to a multiplicative constant:

  1. Integrity: if the decision-maker believes that an event has a probability p of occurring, then p should be the unique prediction that minimizes the risk r(xp), or expected loss E[l(Yx)].
  2. Symmetry: l(0; x) = l(1; 1-x).
  3. No penalty for perfect predictions: l(1; 1) = 0 and l(0, 0) = 0 (it turns out one of these implies the other given points 1 and 2).

We start with the integrity requirement. Because the outcome is binary, the risk when the decision-maker believes that an event has probability p of occurring is r(xp) = (1-pl(0; x) + p l(1; x), we can find all the minima by finding the points x at which the first derivative with respect to xr‘(xp) = (1-pl_x(0; x) + p l_x(1; x), is equal to zero, and the second derivative is non-positive.

By the symmetry requirement, r‘(xp) = (1-plx(0; x) + p lx(1; x) = 0 can be rewritten as r‘(xp) = (1-plx(1; 1-x) + p lx(1; x) = 0, and we have (1-plx(0; x) = –p lx(1; x). Thus lx(1; 1-x) = –p and lx(1; x) = c (1-p). Since the second derivative must be non-positive, c must be non-negative. Since the minimizer is to be unique, c cannot be zero. A similar argument yields the requirements that  lx(0; x) = –p and lx(0; 1-x) = c (1-p). Since p is to be the risk minimizer, we have that l(y, p) = c (yp)^2 + k for all p. The lack of penalty for perfect predictions implies that k = 0.

Thus we the Brier score up to a multiplicative constant. Such a factor doesn’t change any meaningful properties of the loss function, so this isn’t really a problem. Why not set it to 1?

Note: additivity is a standard property of loss functions, so I didn’t include it as a separate requirement for this particular case.

A visual guide to the stick-breaking construction of Dirichlet processes

I’ve seen the stick-breaking construction for Dirichlet processes presented several times and read about it in a few places, but what really helped me understand what goes on was visually drawing out the algorithm.

A Dirichlet process D(Fα) centered around distribution F with concentration parameter α is a distribution of probability distributions with expected value F. The concentration parameter α controls how discretized the realizations of the processes are. In the limit α → 0, each of the distributions are concentrated at a single point (but not the same point in every draw). When α → ∞, the drawn distributions are continuous and look like F.

The stick-breaking process expresses the pdf of the realized distribution as f(x) = \sum_{k=1}^{\infty} \beta_k \delta_{x_k}(x) , where the \delta_{x_k} are Dirac delta-distributions centered at xk. The xk are independently identically distributed, and the βk are determined by a “stick-breaking” algorithm: \beta_k = \beta_k^\prime \prod_{i=1}^{k-1} (1-\beta_i^\prime) where the βk‘ are iid Beta(1, α) distributed.

For better or worse, I often picture distributions as probability density functions (pdfs) or histograms, but here it’s especially useful to think of them as cumulative distribution functions (cdfs).

Suppose F is the standard normal distribution, which has the cdf below.

The normal cdf
The normal cdf

If we draw a sample of size n=5 from this distribution and construct the empirical cdf, we get something like the plot below.

The empirical cdf from n=5 observations from the standard normal distribution
The empirical cdf from n=5 observations from the standard normal distribution

Strictly speaking, a Dirichlet process is an infinite process, but for practical purposes the realizations are truncated when the βk become sufficiently small. For the purposes of illustration I’ll simplify even further and stop at k=5. The stick-breaking algorithm can be thought of as starting with a stick of length 1, and breaking it at a proportion β1‘ along its length, and repeating with the remainder of the stick. A visualization is below.

Stick-breaking illustration with β' draws 0.21, 0.16, 0.14, 0.25, 0.24.
Stick-breaking illustration

To use these βs to construct the realized cdf, we note that the empirical cdf has a jump of 0.2 at each observation, and simply replace 0.2 with βk for each observation k:

The realized cdf of the Dirichlet process

An R snippet for eliciting priors via quantiles

It’s fine to use a vague prior for a Bayesian analysis when we really don’t have any idea about what a parameter is (or if we want to pretend we don’t), but when there is prior knowledge available, it’s often a waste to not draw from it. This prior knowledge might come from a non-statistician scientist, and so it’s important to have a way for them to communicate their knowledge to us in a way that allows us to construct a prior. However, unless the prior we want to construct is a Bernoulli or maybe a normal distribution, eliciting distribution parameters directly is unlikely to work.

Perhaps a more reliable and intuitive way is to specify quantiles, for example “we’re 50% sure that the proportion is less than 0.01 and 99% sure that it’s less than 0.05.” Giving two quantiles in this way is enough to determine a distribution from a two-parameter family, and is much easier to understand than saying that our prior knowledge follows a Beta(a, b) distribution.

Unless you’re much better at guessing parameters from quantiles than I am, it’s helpful to have a function that takes in the quantiles and returns the corresponding parameter values. Here’s the function I use for normal, beta, and gamma distributions, which are the three I most commonly use for priors (along with inverse-gamma priors, which can be obtained by inverting the desired quantiles in the gamma case).

params.from.quantiles <- function(q, p, family="normal",
                                  start=c(0.5, 0.5)) {
  # Finds parameter values of distributions that closely match
  # the specified quantiles.
  # Args:
  #   q: A vector of quantiles (length 2, strictly increasing)
  #   p: A vector of percentiles (length 2, strictly increasing)
  #   family: One of "normal", "beta", or "gamma" specifying
  #           which family of distributions to use
  #   start: starting values for parameter search
  # Returns:
  #    A list containing a named vector of parameter values (par),
  #    a named vector containing the fitted quantiles (cdf),
  #    and the family used (family)
  # basic input validation
  stopifnot(length(p) == 2,
              length(q) == 2,
              p[1] < p[2],
              q[1] < q[2],
              0 < p[1],
              p[2] < 1)
  # select distribution function
  g <- switch(family,
  # a wrapper of the cdf for optim
  f <-function(theta) {
    # if parameter values are invalid, return NA
      sum((g(q, theta[1], theta[2]) - p)^2)
    }, warning=function(w) {
  # find best parameter values
  sol <- optim(start, f)
  # prepare output
  out <- list()
  out$par <- sol$par
  names(out$par) <- switch(family,
                           normal=c("mean", "sd"),
                           beta=c("a", "b"),
                           gamma=c("shape", "rate"))
  out$cdf <- g(q, out$par[1], out$par[2])
  names(out$cdf) <- q
  out$family <- family

An example:

> params.from.quantiles(q=c(0.01, 0.05), p=c(0.5, 0.99), family="beta")
         a          b 
  1.377637 105.506531 

     0.01      0.05 
0.4999679 0.9899850 

[1] "beta"

The function is also available as a gist on Github.

Parallel Gibbs sampling using conditional independence and graph colorings

A Gibbs sampler is a sequential method for producing samples from the joint distribution of several variables by sampling from the simpler conditional distributions of each variable. That is, to sample from a vector-valued random variable X, we can begin with an initial value X(0), and in the ith step of the algorithm sample the coordinate xj(i) from the conditional distribution p(x_j | x_1^{(i)}, \ldots, x_{j-1}^{(i)}, x_{j+1}^{(i-1)}, \ldots, x_n^{(i-1)}). A (correlated) sample is thus obtained from the approximate joint distribution of X.

This iterative procedure can take a long time, though, especially when there are many variables or when sampling from even the conditional distributions is computationally expensive. However, parallelization may be able to help. Naively sampling as many coordinates as possible in parallel usually won’t work because, in the general case, the conditional distribution of each coordinate may depend on every other coordinate. However, in cases where two or more coordinates are independent given the rest (think hierarchical models) they can be sampled in parallel.

We can represent the dependence structure of the joint distribution by a Markov random field (MRF), that is, an undirected graph which has the property that any two subsets of variables are conditionally independent given a separating subset. Now consider a graph coloring (of vertices), which is a labeling (coloring) of the vertices in such a way that no two vertices of the same color share an edge between them. The definition of an MRF then implies that all variables represented by vertices of the same color are independent given the variables represented by other colors.

Thus we can revise the standard Gibbs sampler to, instead of sampling from p(x_j | x_1^{(i)}, \ldots, x_{j-1}^{(i)}, x_{j+1}^{(i-1)}, \ldots, x_n^{(i-1)}), sample from p(x_{c_j,1}, \ldots, x_{c_j,n_j} | x_{c_1,1}^{(i)}, \ldots, x_{c_{j-1}}^{(i)}, x_{c_{j+1}}^{(i-1)}, \ldots, x_{c_k,n_k}^{(i-1)}) = \prod_{l=1}^{n_j} p(x_{c_j, l} |x_{c_1,1}^{(i)}, \ldots, x_{c_{j-1}}^{(i)}, x_{c_{j+1}}^{(i-1)}, \ldots, x_{c_k,n_k}^{(i-1)}) where cj is the jth color and nj is the number of variables with color cj. Then, since all of these variables are conditionally independent, each coordinate can be sampled in parallel.

Does this look normal to you? (Q-Q plots)

One question that always seems to come up in introductory statistics classes is how to tell if a sample is “normal enough” from a normal quantile (or Q-Q) plot. Students are told that the points on the plot should form a more or less straight line, and that they should be especially wary of bow shapes and S shapes, which signify departures from normality in the forms of skew and tail weight. When there are plenty of sample points this often isn’t a problem, but when there are relatively few (say, around 30, often cited as a rough minimum for making normality assumptions) there will usually be noticeable deviations from the diagonal, even when the underlying population really is normal. So, how can we help students use Q-Q plots before their intuition has been properly built up?

Can you tell which plot does not come from a normal sample?
Can you tell which plot does not come from a normal sample? Answer: the bottom right is from a gamma(4, 4) distribution.

We could compare the Q-Q plot to some of the same size sampled from a normal population with mean and standard deviation equal to those of the sample, but perhaps we could do one better by seeing if we can identify the sample under study from among normal samples.

Here’s a function qqfind() that generates randomly hides the sample Q-Q plot among three Q-Q plots of samples from the hypothetical normal population. We could say that if we’re unable to clearly identify the real sample, then it’s “normal enough”.

qqfind <- function(x) {
  mean <- mean(x)
  sd <- sd(x)
  par(mfrow=c(2, 2))
  real <- sample(1:4, 1)
  for (i in 1:4) {
    if (i == real) {
    } else {
      y <- rnorm(length(x), mean, sd)
  par(mfrow=c(1, 1))

Continue reading “Does this look normal to you? (Q-Q plots)”

Closures in R

In my earlier post Building a Bayesian counterpart to R’s lm() function, I wrote a blm() function that took arguments similar to those passed to the usual lm() function and returned information about the posterior distribution of linear model parameters. The marginal posterior of the regression coefficients in that model is a multivariate t distribution with a non-centrality vector, dispersion matrix, and degrees of freedom as parameters. In order to sample from this distribution given a fit object, we could load the mvtnorm package and pass the parameters to the rmvt() function as follows:

sample <- rmvt(n=10,

This isn’t so bad, but it could be a lot easier if we could sample it by calling the following:

sample <- fit$rmarginal(n=10)

Simpler, right? The fit object would include a function rmarginal() which samples from the t distribution with the correct parameters. In order to do this we use an R concept called a closure, which is a function packaged along with an environment, basically a list of all the variables in the scope in which the function was declared, along with their values. To create the rmarginal() function, we find the code near the end of the fit.blm() function:

posterior.params <- list(mean=mean,


and change it to the following:

posterior.params <- list(mean=mean,

  rmarginal <- function(n) {


When we declare the rmarginal() function we reference posterior.params$t.cov, along with the other distribution parameters. Since rmarginal() relies on these variables which are defined in the scope of the fit.blm() function, it carries them along with itself. This combination of the function and the data attached to it is the closure. In order to draw a sample of size ten from the marginal posterior we can then simply call fit$rmarginal(10) and get the appropriate result!