Computationally reproducible analyses could be especially useful to subjective Bayesians

When using prior distributions to summarize subjective prior beliefs going into a statistical analysis, we ought to make it easy for people with differing prior beliefs to come to their own conclusions using the same data. Ideally, we could simply report the likelihood function of the data under our model and allow others to multiply it by their own prior distributions. In practice, however, it’s more realistic to provide software that can “reproduce” the analysis under different priors. As for the original presentation of results, a starting point could be to show results under a few different priors that readers might reasonably hold (perhaps including reference priors).
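As a concrete illustration, suppose the analysis were a simple conjugate normal model for a mean with known variance. A small function like the sketch below could be distributed alongside the results so that a reader supplies their own prior mean and variance and gets back the corresponding posterior; the function name, interface, and toy data here are hypothetical, not from any particular package or paper.

# A minimal sketch of prior-swapping for a conjugate normal-mean analysis.
# Model: y_i ~ Normal(mu, sigma2) with sigma2 known; prior: mu ~ Normal(prior_mean, prior_var).
reanalyze <- function(y, sigma2, prior_mean, prior_var) {
  n <- length(y)
  post_var <- 1 / (1 / prior_var + n / sigma2)                        # conjugate update
  post_mean <- post_var * (prior_mean / prior_var + sum(y) / sigma2)
  c(mean = post_mean, sd = sqrt(post_var))
}

y <- c(1.2, 0.8, 1.5, 1.1)                                # toy data standing in for the published data
reanalyze(y, sigma2 = 1, prior_mean = 0, prior_var = 100) # a vague prior
reanalyze(y, sigma2 = 1, prior_mean = 0, prior_var = 0.5) # a more skeptical prior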

Building a Bayesian counterpart to R’s lm() function

One of my recent projects used a simple Bayesian version of the linear model. While the function I wrote to fit the model certainly worked, I was interested in making it more robust and easier to use by imitating the form of R’s built-in lm() function, which handles the frequentist linear model. In this post I’ll take apart the existing lm() function, describe a simple Bayesian linear model with conjugate priors, and build a counterpart blm() function. Along the way I’ll go through some properties of the R language that don’t come up in everyday data analysis, but can be immensely helpful when writing your own methods. Specifically, I’ll explain how to take a full data frame and use a model formula and a subsetting criterion to produce a response vector and a design matrix that can be used directly for model fitting.

Dissecting R’s lm() function

The source code of any function in R can be viewed by entering the name of the function into the console. Sometimes the source code isn’t terribly helpful, especially if it just sends you to a C call, but we get lucky in the case of lm(), which provides a good explanation of what’s going on:


function (formula, data, subset, weights, na.action, method = "qr",
          model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
          contrasts = NULL, offset, ...)
{
  ret.x <- x
  ret.y <- y
  cl <- match.call()
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("formula", "data", "subset", "weights", "na.action",
               "offset"), names(mf), 0L)
  mf <- mf[c(1L, m)]
  mf$drop.unused.levels <- TRUE
  mf[[1L]] <- quote(stats::model.frame)
  mf <- eval(mf, parent.frame())
  if (method == "model.frame")
    return(mf)
  else if (method != "qr")
    warning(gettextf("method = '%s' is not supported. Using 'qr'",
                     method), domain = NA)
  mt <- attr(mf, "terms")
  y <- model.response(mf, "numeric")
  w <- as.vector(model.weights(mf))
  if (!is.null(w) && !is.numeric(w))
    stop("'weights' must be a numeric vector")
  offset <- as.vector(model.offset(mf))
  if (!is.null(offset)) {
    if (length(offset) != NROW(y))
      stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
                    length(offset), NROW(y)), domain = NA)
  }
  if (is.empty.model(mt)) {
    x <- NULL
    z <- list(coefficients = if (is.matrix(y)) matrix(, 0, 3) else numeric(),
              residuals = y,
              fitted.values = 0 * y,
              weights = w,
              rank = 0L,
              df.residual = if (!is.null(w)) sum(w != 0)
                            else if (is.matrix(y)) nrow(y)
                            else length(y))
    if (!is.null(offset)) {
      z$fitted.values <- offset
      z$residuals <- y - offset
    }
  }
  else {
    x <- model.matrix(mt, mf, contrasts)
    z <- if (is.null(w))
      lm.fit(x, y, offset = offset, singular.ok = singular.ok,
             ...)
    else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
                 ...)
  }
  class(z) <- c(if (is.matrix(y)) "mlm", "lm")
  z$na.action <- attr(mf, "na.action")
  z$offset <- offset
  z$contrasts <- attr(x, "contrasts")
  z$xlevels <- .getXlevels(mt, mf)
  z$call <- cl
  z$terms <- mt
  if (model)
    z$model <- mf
  if (ret.x)
    z$x <- x
  if (ret.y)
    z$y <- y
  if (!qr)
    z$qr <- NULL
  z
}

If you look closely, it becomes apparent that lm() doesn’t perform any of the model fitting itself; it’s really a wrapper function that converts its arguments into a friendly format for lm.fit(), which is called near the end of the listing. This is fine for our purposes, because the method for fitting the Bayesian model will be entirely different from the method for fitting the frequentist model. We will likewise have a blm.fit() function that does all the fitting work, given data in a friendly format, and a user-facing blm() function that makes the whole process easy for the user. Since blm() performs much of the same work as lm(), it could look fairly similar.
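Here’s a rough sketch of what that wrapper could look like. It reuses the same model.frame()/model.matrix() machinery as lm() to turn a formula, a data frame, and a subset argument into a response vector and a design matrix, then hands them off to a hypothetical blm.fit() (not defined here) that would do the actual posterior computation; the prior argument is likewise a placeholder of my own.

# Sketch of a blm() wrapper; blm.fit() and the 'prior' argument are hypothetical.
blm <- function(formula, data, subset, prior = NULL, ...) {
  cl <- match.call()
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("formula", "data", "subset"), names(mf), 0L)
  mf <- mf[c(1L, m)]
  mf$drop.unused.levels <- TRUE
  mf[[1L]] <- quote(stats::model.frame)
  mf <- eval(mf, parent.frame())          # build the model frame, honoring 'subset'
  mt <- attr(mf, "terms")
  y <- model.response(mf, "numeric")      # response vector
  x <- model.matrix(mt, mf)               # design matrix
  z <- blm.fit(x, y, prior = prior, ...)  # hypothetical fitting routine, analogous to lm.fit()
  z$call <- cl
  z$terms <- mt
  class(z) <- "blm"
  z
}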

Continue reading “Building a Bayesian counterpart to R’s lm() function”

NYT: What 2,000 Calories looks like

New York Times: What 2,000 Calories looks like.

Takeaways:

  1. Make your own meals,
  2. Skip the sugary drinks,
  3. Comparing apples to oranges is horrifically unfair: Potbelly’s sandwich was a “big Italian sandwich” (looks like a 12″) with mayo weighing in at 1,088 kcal, and Subway’s was a 6″ cold-cut combo without mayo for 375 kcal. But Subway’s 6″ Italian BMT starts at 410 kcal; double that for a 12″ and add a tablespoon of mayo (110 kcal) for 930 kcal!

Pocock et al (2002) on subgroup analysis, twelve years later

One of the most frequently cited papers in the subgroup analysis literature is “Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems” (PDF), published in Statistics in Medicine by Pocock, Assmann, Enos, and Kasten, which has amassed 481 citations since its publication in 2002. The purpose of the paper, as described in the abstract, is to examine the uses of baseline data in clinical trial reporting in major medical journals, discuss major issues with those uses, and offer recommendations for future practice. I’ll review the points made in the paper, with a focus on the subgroup analysis discussion, discuss the ways in which these points are cited in the subsequent literature, and offer my own opinions on the matter.

Pocock et al on subgroup analysis

The definition of subgroup analysis used by Pocock et al is the use of baseline data to determine whether any treatment effects on patient outcomes are consistent across all types of patients or depend on one or more baseline variables. The authors report on fifty trials published in the British Medical Journal (BMJ), the Journal of the American Medical Association (JAMA), the Lancet, and the New England Journal of Medicine (NEJM). They note that 35 of the 50 trials (70%) contained subgroup analyses, and discuss four major points about performing such analyses.

  1. Most trials only have sufficient power to detect the main effect. That is, trials are powered with the intention of detecting an average (or overall) difference in expected outcomes between treatment and control. Therefore, any heterogeneity in treatment effect is likely to go undetected unless the difference in treatment effect is very large or the subgroups under consideration have large subsample sizes.
  2. When subgroup analyses are not prespecified there is a plethora of analyses that could have been performed but are unreported. In the presence of multiple baseline covariates and levels, “data dredging” (performing multiple analyses and only reporting those with the most interesting outcomes) must be guarded against. However, in most of the trials reviewed it was impossible for Pocock et al to determine what the predefined subgroup analyses were, or if any existed at all.
  3. Statistical tests for interaction are the most appropriate methods for making subgroup inferences, but are often not used. Pocock et al argue that interaction tests directly examine the strength of evidence for the treatment effect varying between subgroups, and are thus the most useful approach for evaluating subgroup analyses. They further argue that the low statistical power of interaction tests imparts great value to the procedure, because it makes them the most effective statistical tool for inhibiting false or premature subgroup findings.
  4. The appropriate inclusion of subgroup analyses in study interpretations and conclusions is unclear. The authors argue that while investigators have a responsibility to draw a conclusion about how generalizable the treatment effect is, they must also be careful to avoid exaggerated subgroup claims. The authors list three classes of subgroup claims: all-or-nothing interactions, in which the treatment effect exists only in particular subgroups; quantitative interactions, in which the treatment effect is always present but varies in magnitude between subgroups; and qualitative interactions, in which the treatment effect is in opposite directions in different subgroups and which are thought to be rare and highly implausible.

Continuing the fourth point, the authors offer the following suggestion (emphasis added):

In general, once the statistical strength of evidence for interaction is documented correctly, one relies on the wise judgement of the triallists (and also journal referees and editors) in deciding what emphasis any subgroup finding should receive. Both our survey and other experiences lead us to the view that at present subgroup analyses are overinterpreted by authors (and probably readers as well) and that much greater caution needs to be exercised when drawing conclusions on subgroups. Biological plausibility, the number of subgroup analyses performed, their prespecification and the trial’s size all need to be considered alongside the statistical strength of evidence when weighing up the all too likely case that any particular subgroup finding, no matter how intriguing, is prone to be an exaggeration of the truth. In this regard Bayesian strategies for subgroup analysis, including tests of qualitative interactions, are an interesting development.

Citations, developments, and comments

Most citations of Pocock et al in methods papers (especially those that aren’t just bemoaning the current state of practice) refer to one of the following:

  1. Interaction tests are the most appropriate test for treatment effect heterogeneity,
  2. Interaction tests are underpowered,
  3. Biological plausibility should be considered when interpreting subgroup results,
  4. Qualitative interactions are rare and highly implausible.

Item 2 is often used as an argument against interaction tests for detecting treatment effect heterogeneity, contradicting item 1, which is of course not referenced in the same papers. Now, Pocock et al do mention, in their third point above, that lack of power is often used as an argument against interaction tests, but they immediately go on to say that they view it as an asset rather than a liability, in support of item 1. Authors are by all means free to argue that the low power of interaction tests is a liability, but citing Pocock et al in support of that claim is misleading.

In the context of linear models, I agree with Pocock et al that interaction tests are the most appropriate tests for detecting treatment effect heterogeneity, especially within the Bayesian paradigm, where the priors on interaction terms may be tweaked to control the prior probability of specific treatment effect heterogeneities. However, there have been interesting developments in other models where the meaning of an “interaction test” may be somewhat different (cf. classification and regression trees, Gaussian processes). I think it could serve us well to view the argument of Pocock et al as supporting methods which, as they put it, “directly examine the strength of evidence for the treatment difference [effect] varying between subgroups.” What they argue against in particular is presenting p-values for treatment effects in each subgroup. Such a practice can be misleading because, when the overall treatment effect is statistically significant, it’s likely that some subgroup tests will show a significant treatment effect while others do not, depending on chance and the smaller subgroup sizes. Testing subgroup-specific p-values instead examines the evidence that each particular subgroup has a treatment effect, under the assumption(!) that the treatment effects in different subgroups are independent.
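To make the contrast concrete, here’s a small simulated example in R comparing the two practices: a test of the interaction terms, which directly asks whether the treatment effect differs across subgroups, versus separate per-subgroup tests of whether each subgroup’s effect is nonzero. The data and variable names are made up for illustration.

set.seed(1)
# Simulated trial with a common treatment effect of 0.3 in every subgroup.
dat <- data.frame(trt = rep(0:1, each = 150),
                  subgroup = factor(rep(c("A", "B", "C"), times = 100)))
dat$y <- 0.3 * dat$trt + rnorm(nrow(dat))

# Interaction test: does the treatment effect vary by subgroup?
fit_main <- lm(y ~ trt + subgroup, data = dat)
fit_int <- lm(y ~ trt * subgroup, data = dat)
anova(fit_main, fit_int)   # F-test for the interaction terms

# Subgroup-specific tests: with only 100 patients per subgroup, some may look
# "significant" and others not, even though the true effect is identical everywhere.
by(dat, dat$subgroup, function(d) summary(lm(y ~ trt, data = d))$coefficients["trt", ])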

The other two common references concern the role “plausibility” plays in subgroup analyses, and are used in support of certain prior specifications in Bayesian methods. Including plausibility as a consideration in prior specification formalizes knowledge from the field of study into the statistical method, which acts both as a way to make the method more powerful in detecting true heterogeneities and as a safeguard against spurious subgroup results. Tuning priors for biological plausibility may mean favoring interactions on some covariates, such as age, sex, race, and previously identified genetic markers, over others, such as genetic markers that are collected for exploratory analyses but haven’t been implicated in previous studies. Other prior specifications explicitly make qualitative interactions unlikely or even impossible.

One issue I take with Pocock et al is the apparently unqualified statement that qualitative interactions, those in which the treatment effect changes sign between subgroups, are unlikely and highly implausible. I believe that the context Pocock et al were referring to is the case in which a treatment is being compared to a placebo or no-treatment control. In this case, it does seem unlikely that a treatment would have opposite effects on different subgroups. However, many trials pit experimental treatments not against a placebo but against another treatment that is considered the standard of care. In that case, a quantitative interaction in the experimental treatment’s effect versus a placebo could very easily produce a qualitative interaction in the experimental treatment versus the standard-of-care control. Additionally, a quantitative interaction in the treatment effect versus either the standard of care or a placebo can lead to a qualitative interaction with respect to a utility function that incorporates both primary response and safety endpoints.
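To illustrate with made-up numbers: suppose that, relative to placebo, the experimental treatment improves the response by 10 points in one subgroup and by 2 points in another (a quantitative interaction, since the effect is beneficial in both), while the standard of care improves the response by 4 points in everyone. Relative to the standard of care, the experimental treatment is then 6 points better in the first subgroup but 2 points worse in the second, which is a qualitative interaction.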

On a different note, I’d like to return to my earlier statement that testing subgroup-specific p-values examines the evidence that each particular subgroup has a treatment effect, under the assumption that the treatment effects in different subgroups are independent. While the independence assumption is (obviously? usually? hopefully?) nonsense, it raises a question I find quite interesting: if dependence of these subgroup-specific treatment effects is properly accounted for, might we in many situations be more interested in the evidence for treatment effects in each subgroup than whether or not they’re different? In fact, this is a direction I’m taking with my thesis and its constituent papers.

Itasca: a Beamer theme

I think most of the default Beamer themes look terrible. Clashing colors, inconsistencies between flat and “3D” graphics (I’m looking at you, ball bullet points), and general visual clutter bring back memories of the early 90’s internet. In my opinion, the least offensive themes are the simplest – black text on a white background, perhaps with a single blue accent color for titles, bullet points, et cetera.

Themes like this one are quite common in biostatistics presentations. Sometimes better bullet points are used. From the Beamer Theme Matrix.

The theme pictured above has a few things I like and a few I don’t.

The good:

  1. Mostly black text on white background – easy to read.
  2. Blue highlights make it easy to pick out important bits – titles and bullet points.
  3. Section titles provide context for each slide.
  4. Current page number makes it easy to keep track of progress through the talk.

The bad:

  5. Too many colors get distracting.
  6. Clunky bullet points.
  7. Unnecessary information abounds – presentation title, subsection titles, author, date, total page number.
  8. Presentation controls.

I’ll get to an explanation of why I find points 5-8 bad, but first, here’s a slide from my own attempt at a theme:

Itasca theme – example slide.

This slide actually looks a lot like this blog with the theme I’m currently using. Consistency, right? The full example presentation is available on GitHub.

Continue reading “Itasca: a Beamer theme”

Familywise error and false discovery rates

When performing a hypothesis test, the most important objective is to control the two basic error rates:

  1. Type I error rate: the probability that the test rejects the null hypothesis H0 when in fact H0 is true, and
  2. Type II error rate: the probability that the test fails to reject H0 when H0 is false.

The usual procedure is to fix a maximum acceptable type I error rate (called the level) of the test, and then attempt to minimize the type II error rate (i.e., maximize power, 1 – type II error rate) under that constraint. However, when multiple hypothesis tests are performed, the probability of making at least one type I error may increase dramatically.

Familywise error rate

The familywise error rate (FWER) of a procedure for testing multiple hypotheses refers to the probability of making at least one type I error. If n tests are performed, and each has an individual type I error rate of α, then the FWER for the set of tests is at least α and at most min(1, nα). These bounds are known as the Bonferroni inequalities, and lead to the Bonferroni adjustment, a method of controlling the familywise error rate. Standard methods for controlling the FWER include the aforementioned Bonferroni adjustment for arbitrary collections of tests and Working-Hotelling-Scheffé simultaneous confidence bands for linear regression.
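As a quick illustration of both the inflation and the Bonferroni fix, here’s a small simulation in R (the numbers of tests and simulations are arbitrary choices of mine):

set.seed(1)
n_tests <- 20; alpha <- 0.05; n_sims <- 10000

# p-values for n_tests independent tests of true null hypotheses, repeated n_sims times.
p <- matrix(runif(n_sims * n_tests), nrow = n_sims)

# Unadjusted FWER: probability of at least one rejection; roughly 1 - 0.95^20 = 0.64.
mean(apply(p, 1, function(pv) any(pv < alpha)))

# Bonferroni-adjusted FWER: reject only when p < alpha / n_tests; roughly 0.05 or below.
mean(apply(p, 1, function(pv) any(pv < alpha / n_tests)))

# Equivalent built-in route: any(p.adjust(pv, method = "bonferroni") < alpha)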

False discovery rate

Given a set of tests which have rejected their null hypotheses, the false discovery rate (FDR) is the expected proportion of those rejections that have been made incorrectly, that is, the expected proportion of rejected null hypotheses which are in fact true. The false discovery rate was introduced in Benjamini & Hochberg (1995), which also gave the following procedure for controlling the FDR for a set of independent hypotheses (a short R sketch follows the list):

  1. Let H1, …, Hm be tested null hypotheses with corresponding p-values P1, …, Pm.
  2. Order the p-values in increasing order and denote them by P(1), …, P(m).
  3. For a given false discovery rate α, find the largest k such that P(k) ≤ (k/m)α.
  4. Reject all H(i) for 1 ≤ i ≤ k.
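Here’s a direct transcription of those steps into R, checked against the built-in p.adjust(); the helper function and example p-values are my own, meant only to make the procedure concrete.

# Benjamini-Hochberg rejections for a vector of p-values at FDR level alpha.
bh_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)                                            # step 2: sort the p-values
  k <- max(c(0, which(p[ord] <= (seq_len(m) / m) * alpha)))  # step 3: largest k with P(k) <= (k/m)*alpha
  rejected <- rep(FALSE, m)
  if (k > 0) rejected[ord[seq_len(k)]] <- TRUE               # step 4: reject H(1), ..., H(k)
  rejected
}

p <- c(0.001, 0.008, 0.039, 0.041, 0.09, 0.20, 0.65)   # illustrative p-values
bh_reject(p, alpha = 0.05)
p.adjust(p, method = "BH") <= 0.05                     # should give the same rejections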

Comparison

The primary difference between controlling the familywise error rate and the false discovery rate is that the FWER is stricter and attempts to prevent any false rejections while the FDR is more permissive in allowing some false rejections as long as the proportion of rejections that are false is controlled. This leads to procedures that control the FDR being more powerful than those which control the FWER at the expense of more frequent type I error – the classic type I error / power trade-off.

Which procedure to use depends, as usual, on the goals of the analysis. In a regulatory setting it’s desirable to prevent ineffective treatments from being approved, so the usual approach is to control the FWER within a confirmatory study. In contrast, exploratory studies in genetics can perform thousands of tests with the goal of narrowing down the set of hypotheses, so controlling the FDR is a much more useful approach.

Remembering the meaning behind the numbers

Rebecca Steorts gave a talk at the University of Minnesota this afternoon on record linkage and de-duplication with an application to the ongoing Syrian Civil War.

I often deal with survival plots like the following:

(A typical survival plot comparing chemotherapy treatment groups.)

While a plot like this represents deaths, I’m able to see them in the abstract in part because I can consider the deaths unavoidable. On top of that, the area between the curves communicates hope for the future in the form of improved treatments. However, Rebecca presented us with figures like the following:

By GraysonWiki & Futuretrillionaire [CC-BY-SA-3.0], via Wikimedia Commons
Though not graphic, these images viscerally affected me. Each bar represents hundreds to thousands of deaths in a week that were not only unnecessary, but undeniably our (humanity’s) own fault. Today, the death toll is around 200,000.

I got into biostatistics to help to treat and prevent disease, and I felt helpless sitting in the auditorium hearing about casualties of war. Now, writing this, I realize that while my work likely won’t contribute to the prevention of war casualties, I have an opportunity to do a small part by voting carefully in the midterm election next week.

Subgroup analysis in clinical trials: importance and challenges

The term “subgroup analysis” can mean vastly different things to different people. In clinical trials, however, the central idea is that not everyone responds to treatment in the same way, and we can understand this varied response by dividing the population into subgroups and examining the responses within subgroups and how they differ between subgroups.

Why subgroup analysis is important

Doctor: This vaccine has been shown to decrease the incidence of cervical cancer in the general population.

Patient: But I’m a man! I don’t have a cervix!

The above is admittedly an absurd example, but it illustrates the point of subgroup analysis: the average treatment effect may not accurately reflect that treatment’s benefit to a specific patient. Perhaps more realistically, we could imagine that a medication meant to control hypertension might work better on patients of a certain race, or that an Alzheimer’s disease treatment might only be effective on carriers of a particular allele.

Practitioners take the characteristics of a specific patient into account when choosing a course of treatment. Perhaps most commonly used are easily observable traits such as age, weight, sex, and race, and factors related to medical history such as previous or concurrent ailments and family history. However, the primary information reported in randomized clinical trials (RCTs) is the overall, or population-averaged, effect. While the overall effect is useful on the aggregate scale on which epidemiologists and policymakers operate, concluding that the overall treatment effect reflects the treatment’s performance for a specific individual relies on the spurious assumption that the treatment has the same effect on everyone (or that the particular patient happens to be precisely average in every relevant trait). Therefore, in order for practitioners to make better decisions, they must be provided with information about how the treatment effect varies across certain groups.

Not only may the treatment effect vary across groups, but so may side effects. For example, it’s common practice to require special safety evidence for young children and pregnant women.

Why subgroup analysis is hard

The difficulties in subgroup analysis stem primarily from the division of one large sample into many smaller samples. The smaller sample size in each subgroup leads to less reliable estimates and less power to detect non-zero treatment effects within each subgroup. Additionally, because there are many subgroups and therefore many hypotheses being tested, the issue of multiple testing arises: controlling the type I error rate for each test at a level α can lead to a much higher probability of making at least one type I error. The usual basic corrections for multiplicity, such as the Bonferroni adjustment, tend to be conservative because they do not take into account the dependence between test statistics, which can be quite strong in the subgroup analysis setting. This further reduces the power to detect treatment effects within subgroups.

Another obstacle in subgroup analysis is the difficulty of interpreting results. Because subgroup-specific tests have so little power, failure to reject the null hypothesis is not strong evidence that the null hypothesis is true. Additionally, because the subgroup-specific estimates of the treatment effect are highly variable, it may not be wise for a practitioner to put credence in such an estimate instead of the overall estimate of the treatment effect. In fact, it is often the case that the overall estimate is a more reliable estimate of the treatment effect for a subgroup than the estimate derived from patients in that subgroup only. One remedy to this dilemma is to take advantage of shrinkage: a weighted average of the overall and subgroup-specific estimates can often provide a better estimate of the true subgroup-specific treatment effect than either of its components.
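As a toy illustration of the shrinkage idea, the sketch below forms a precision-weighted average of the subgroup-specific and overall estimates; the weighting scheme and the numbers are simplistic choices of mine for illustration, not a recommendation from any particular method.

# Precision-weighted shrinkage of a subgroup estimate toward the overall estimate.
shrink_subgroup <- function(est_sub, se_sub, est_overall, se_overall) {
  w <- (1 / se_sub^2) / (1 / se_sub^2 + 1 / se_overall^2)  # weight on the subgroup estimate
  w * est_sub + (1 - w) * est_overall
}

# A noisy subgroup estimate (2.5, SE 1.2) gets pulled most of the way toward
# the much more precise overall estimate (1.0, SE 0.3).
shrink_subgroup(est_sub = 2.5, se_sub = 1.2, est_overall = 1.0, se_overall = 0.3)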

Ignoring these difficulties – multiplicity, lack of power, unreliability of estimates – leads to the most common mistakes in performing and interpreting subgroup analyses, and learning to address them is essential to producing reliable results.

Plot that! Public trust in physicians and care satisfaction among industrialized countries

The New England Journal of Medicine published a perspective piece yesterday entitled Public Trust in Physicians – U.S. Medicine in International Perspective. The gist of the article was that surveys have indicated a sharp decline in Americans’ trust in physicians over the last fifty years. One passage that stood out to me was the following:

Indeed, the United States is unique among the surveyed countries in that it ranks near the bottom in the public’s trust in the country’s physicians but near the top in patients’ satisfaction with their own medical treatment.

The article includes a table that ranks twenty-nine industrialized countries by the percentage of people who say that their doctors can be trusted and again by the percentage of people who say that they were satisfied with the treatment they received during their last visit. The table is reproduced at the end of this post.

Although the table contains a great deal of numerical information, I think data like this are more easily digestible in a graphical format. Even just taking the countries and the percentage estimates, we can produce the following plot.
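A plot along these lines takes only a few lines of R once the table is typed into a data frame; the data frame below is a placeholder with made-up values, and the real numbers would come from the NEJM table reproduced at the end of the post.

library(ggplot2)

# Placeholder data frame; the full version would have all 29 countries and the
# two percentages from the NEJM table.
phys <- data.frame(country = c("Country A", "Country B", "Country C"),
                   trust = c(83, 58, 45),          # made-up percentages
                   satisfaction = c(61, 56, 16))   # made-up percentages

ggplot(phys, aes(x = satisfaction, y = trust, label = country)) +
  geom_point() +
  geom_text(vjust = -0.7, size = 3) +
  labs(x = "Satisfied with treatment at last physician visit (%)",
       y = "Agree that doctors can be trusted (%)")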

Trust in Physicians vs Care Satisfaction

Now we can more easily see the trend between trust and satisfaction, as well as the outliers. We see that trust in physicians tends to increase with care satisfaction, as expected. Additionally, even though Americans trust their physicians less than citizens of most other countries considered here, their satisfaction with the care they receive is among the highest. We can also immediately see the reverse pattern: the Taiwanese have a level of trust in their physicians on par with the main cluster of well-performing countries, but report very low satisfaction with the care they receive.

Examining the extremes of the distribution, we see the Netherlands, Switzerland, and Denmark performing exceptionally well in both trust and satisfaction, and Russia, Poland, and Bulgaria lagging behind in both. Finally, note the ranges of these scores: trust ranges from nearly 90% down to just below 50%, while satisfaction begins near 0% and breaks 60% in only two countries. Even among the best-performing countries here, satisfaction with care seems fairly low.

Continue reading “Plot that! Public trust in physicians and care satisfaction among industrialized countries”

The IMDB user ratings of the mathematical thriller Travelling Salesman are unusually polarized

Today I came across the Quora question If I just proved that P = NP, how do I start taking over the world? One of the answers mentioned the 2012 film Travelling Salesman, a cerebral thriller which explores some possible ramifications of being able to solve all NP problems in polynomial time. The trailer is below.

I had seen the trailer back when the movie was first released but never ended up seeing the movie. If I recall correctly I expected it to be corny, inaccurate, or both. After watching the trailer again I headed to its IMDB page to see how it was received. I wasn’t surprised to see a “weighted average” rating of 5.6 (IMDB uses a weighting system to prevent vote stuffing). However, the user reviews were overwhelmingly positive. Curious, I checked the histogram of user ratings for the movie.

(Histogram of IMDB user ratings for Travelling Salesman.)

Usually, the ratings of movies that are neither exceptionally highly nor exceptionally poorly rated have distributions that are roughly a truncated bell shape with small bumps at “1” and “10”, the minimum and maximum scores. A pair of examples are shown below.

(User rating histograms for two typical films.)

It’s generally accepted that people who have strong opinions (e.g. “1” or “10”) are more likely to rate and review, but the unusual relative size of the bumps in the distribution for Travelling Salesman makes me wonder if there’s something else going on, perhaps a somewhat unique audience that has been particularly polarized by some aspect of the film. A less interesting explanation is that the groups of people who rate exceptionally high or low do so early, and as more people rate the movie, the relative size of the bumps decreases. Either way, I’m intrigued enough to watch the film sometime soon. Tonight, though, my fiancée and I have the second episode of a two-part Deep Space 9 sequence whose first episode turned out to be a cliff-hanger.

Update (October 26, 2014): We watched the movie last night, and I’m feeling more confident about the polarized-audience hypothesis. While the mathematics was a central plot point, the movie turned out to be more about power and responsibility in the immediate aftermath of a proof that P = NP. On top of that, the movie consists almost entirely of an argument in a conference room. So, I suspect that the audience who enjoy philosophical discussions of power and politics would have enjoyed the film, while the audience expecting either (a) math or (b) an actual depiction of consequences would be disappointed. Absurdly, I was most disappointed that the details of the proof itself were never discussed.