Strong Inference

The First Release of PyMC3

2017-01-10T00:00:00-06:00

On Monday morning the PyMC dev team pushed the first release of PyMC3, the culmination of over 5 years of collaborative work. We are very pleased to be able to provide a stable version of the package to the Python scientific computing community. For those of you unfamiliar with the history and progression of this project, PyMC3 is a complete re-design and re-write of the PyMC code base, which was primarily the product of the vision and work of John Salvatier. While PyMC 2.3 is still actively maintained and used (I continue to work with it in a number of projects myself), this new incarnation allows us to provide newer methods for Bayesian computation to a degree that would have been impossible previously.

While PyMC2 relied on Fortran extensions (via f2py) for most of the computational heavy-lifting, PyMC3 leverages Theano, a library from the LISA lab for array-based expression evaluation, to perform its computation. What this provides, above all else, is fast automatic differentiation, which is at the heart of the gradient-based sampling and optimization methods currently providing inference for probabilistic programming. While the addition of Theano adds a level of complexity to the development of PyMC, fundamentally altering how the underlying computation is performed, we have worked hard to maintain the elegant simplicity of the original PyMC model specification syntax. Since the beginning (over 13 years ago now!), we have tried to provide a simple, black-box interface to model-building, in the sense that the user need only concern herself with the modeling problem at hand, rather than with the underlying computer science.

As a point of comparison, here is what a simple hierarchical model (taken from Gelman et al.'s book) looked like under PyMC 2.3:

# Priors
alpha = Normal('alpha', 0, 0.01)
beta = Normal('beta', 0, 0.01)

# Transformed variables
theta = Lambda('theta', lambda a=alpha, b=beta, d=dose: invlogit(a + b * d))

# Data likelihood
deaths = Binomial('deaths', n=n, p=theta, value=array([0,1,3,5]), observed=True) 

# Instantiate a sampler, and run
M = MCMC(locals())
M.sample(10000, burn=5000)

and here is the same model in PyMC3:

with Model() as bioassay_model:

    alpha = Normal('alpha', 0, sd=100)
    beta = Normal('beta', 0, sd=100)

    theta = invlogit(alpha + beta*dose)

    deaths = Binomial('deaths', n=n, p=theta, observed=array([0,1,3,5]))

    trace = sample(2000)

If anything, the model specification has simplified, for the majority of models.

Though the version 2 and version 3 models are superficially similar (by design), there are very different things happening underneath when sample is called in either case. By default, the PyMC3 model will use a form of gradient-based MCMC sampling, a self-tuning form of Hamiltonian Monte Carlo, called NUTS. Gradient-based methods serve to drastically improve the efficiency of MCMC, without the need for running long chains and dropping large portions of the chains due to lack of convergence. Rather than conditionally sampling each model parameter in turn, the NUTS algorithm walks in k-space (where k is the number of model parameters), simultaneously updating all the parameters as it leap-frogs through the parameter space. Models of moderate complexity and size that would normally require 50,000 to 100,000 iterations now typically require only 2000-3000.
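To make the leap-frog idea concrete, here is a minimal sketch of a single Hamiltonian trajectory. It is not PyMC3's NUTS implementation: the target (a standard normal in k dimensions), the step size and the number of steps are all assumptions chosen for illustration, and `leapfrog` is a hypothetical helper. The point is that every coordinate is moved together using gradient information, and that the total energy stays nearly constant along the way, which is why such proposals tend to be accepted.

```python
import random


def grad_neg_log_density(q):
    """Gradient of -log p(q) for a standard normal target, which is just q."""
    return list(q)


def energy(q, p):
    """Potential plus kinetic energy for the standard normal target."""
    return 0.5 * sum(x * x for x in q) + 0.5 * sum(m * m for m in p)


def leapfrog(q, p, step_size, n_steps):
    """Simulate Hamiltonian dynamics, updating all coordinates at once."""
    q, p = list(q), list(p)
    # Initial half step for the momentum
    grad = grad_neg_log_density(q)
    p = [m - 0.5 * step_size * g for m, g in zip(p, grad)]
    for i in range(n_steps):
        # Full step for the position, using the current momentum
        q = [x + step_size * m for x, m in zip(q, p)]
        grad = grad_neg_log_density(q)
        # Full momentum step, except for a half step at the very end
        scale = 1.0 if i < n_steps - 1 else 0.5
        p = [m - scale * step_size * g for m, g in zip(p, grad)]
    return q, p


rng = random.Random(42)
k = 5  # number of model parameters (illustrative)
q0 = [rng.gauss(0.0, 1.0) for _ in range(k)]  # starting position
p0 = [rng.gauss(0.0, 1.0) for _ in range(k)]  # randomly drawn momentum
q1, p1 = leapfrog(q0, p0, step_size=0.1, n_steps=20)

# A small energy error means the end point is a good Metropolis proposal
energy_error = abs(energy(q1, p1) - energy(q0, p0))
print(energy_error)
```

NUTS adds machinery on top of this (choosing the trajectory length automatically and tuning the step size), which the sketch leaves out.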

When we run the PyMC3 version of the model above, we see this:

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -6.2597: 100%|████████████████████████████████████████| 200000/200000 [00:11<00:00, 16873.12it/s]
Finished [100%]: Average ELBO = -6.27
100%|██████████████████████████████████████████████████████████████████████| 2000/2000 [00:02<00:00, 928.24it/s]

Unless specified otherwise, PyMC3 will assign the NUTS sampler to all the variables of the model. This happens here because our model contains only continuous random variables; NUTS will not work with discrete variables because it is impossible to obtain gradient information from them. Discrete variables are assigned the Metropolis sampling algorithm (step method, in PyMC parlance). The next thing that happens is that the variables' initial values are assigned using Automatic Differentiation Variational Inference (ADVI). This is an approximate Bayesian inference algorithm that we have added to PyMC (more on that later). Though it can be used for inference in its own right, here we are using it merely to find good starting values for NUTS (in practice, this is important for getting NUTS to run well). It's an excessive step for small models like this, but it is the default behavior, designed to try and guarantee a good MCMC run.

Another nice innovation includes some new plotting functions for visualizing the posterior distributions obtained with the various estimation methods. Let's look at the regression parameters from our fitted model:

plot_posterior(trace, varnames=['beta', 'alpha'])

plot_posterior generates histograms of the posterior distributions that are annotated with summary statistics of interest, in the style of John Kruschke's book. This is just one of several options for visualizing output.

The addition of variational inference (VI) methods in version 3.0 is a transformative change to the sorts of problems you can tackle with PyMC3. I showed it being used to initialize a model that was ultimately fit using MCMC, but variational inference can be used as a tool for obtaining statistical inference in its own right. Just as MCMC approximates a complex posterior by drawing dependent samples from its posterior distribution, variational inference performs an approximation by replacing the true posterior with a more tractable form, then iteratively changes the approximation so that it resembles the posterior distribution as closely as it can, in terms of the information distance between the two distributions. Where MCMC uses sampling, VI uses optimization to estimate the posterior distribution. The benefit to you in doing this is that Bayesian models informed by very large datasets can be fit in a reasonable amount of time (MCMC notoriously scales poorly with data size); the drawback is that you only get an approximation to the posterior, and that approximation can be unacceptably poor for some applications. Nevertheless, improvements to variational inference methods continue to roll in, and some have the potential to drastically improve the quality of the approximation. The key advance that allowed PyMC3 to implement variational methods was the development of automated algorithms for specifying a variational approximation generally, across a wide variety of models. In particular, Alp Kucukelbir and colleagues' introduction of Automatic Differentiation Variational Inference (ADVI) two years ago made VI relatively easy to apply to arbitrary models (again, assuming the model variables are continuous). Here it is, in action, fitting the same model we used NUTS to estimate before:

with bioassay_model:
    advi_fit = advi(n=10000)

Average ELBO = -6.2765: 100%|████████████████████████████████████████████████| 100000/100000 [00:05<00:00, 17072.45it/s]
Finished [100%]: Average ELBO = -6.2835

ADVI returns the means and standard deviations of the approximating distribution after it has converged to the best approximation. These values can be used to sample from the distribution:

with bioassay_model:
    trace = sample_vp(advi_fit, 10000)
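Conceptually, drawing from a mean-field approximation of this kind amounts to sampling each parameter from its own independent normal distribution. Here is a minimal sketch of that idea in plain Python. The means and standard deviations are made-up placeholders rather than output from the fit above, and `sample_mean_field` is a hypothetical helper, not a PyMC3 function; it also ignores the variable transformations that ADVI applies to constrained parameters.

```python
import random


def sample_mean_field(means, sds, draws, seed=0):
    """Draw independent normal samples for each named parameter."""
    rng = random.Random(seed)
    return {
        name: [rng.gauss(means[name], sds[name]) for _ in range(draws)]
        for name in means
    }


# Placeholder values standing in for a fitted approximation (assumed)
means = {"alpha": 0.8, "beta": 7.0}
sds = {"alpha": 1.0, "beta": 4.0}

approx_trace = sample_mean_field(means, sds, draws=10000)
alpha_mean = sum(approx_trace["alpha"]) / len(approx_trace["alpha"])
print(alpha_mean)
```

Because each parameter is drawn independently, an approximation of this form cannot represent posterior correlation between alpha and beta, which is one reason it can be poor for some models.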

As we push past the PyMC3 3.0 release, we have a number of innovations either under development or in planning. For example, in order to improve the quality of approximations using variational inference, we are looking at implementing methods that transform the approximating density to allow it to represent more complicated distributions, such as the application of normalizing flows to ADVI; this work is being led by Taku Yoshioka. Thomas Wiecki is currently working on adding Stein Variational Gradient Descent to the suite of VI algorithms, which should allow much larger datasets to be fit to PyMC models. To more easily accommodate the number of different VI algorithms that are being developed, Maxim Kochurov is leading the development of a flexible base class for variational methods that will unify their interfaces. Work is also underway to allow PyMC3 to take advantage of computation on GPUs, something that Theano allows us to do, but requires some engineering to allow it to work generally. These are just a few notable enhancements, along with all of the incremental but steady improvement throughout the code base.

When I began the PyMC project as a postdoctoral fellow back in 2003, it was intended only as a set of functions and classes for personal use, to simplify the business of building and iterating through sets of models. At the time, the world of Bayesian computation was dominated by WinBUGS, a truly revolutionary piece of software that made hierarchical modeling and MCMC available to applied statisticians and other scientists who would otherwise have been unable to consider these approaches. All the same, the BUGS language was not ideal for all problems and workflows, so if you needed something else you were forced to write your own software. We live in a very different scientific computing world today; for example, there are, as of this writing, no fewer than six libraries for building Gaussian process models in Python! The ecosystem for probabilistic programming and Bayesian analysis is rich today, and becoming richer every month, it seems.

I'd like to take the opportunity now to thank the ever-changing and -growing PyMC development team for all of their hard work over the years. I've been truly awe-stricken by the level of talent and degree of commitment that the project has attracted over the years. Some contributors added value to the project for very short intervals, perhaps in order to facilitate the completion of their own work, and others have stuck around through multiple releases, not only implementing exciting new functionality, but also taking on more mundane chores like squashing bugs and refactoring old code. Of course, every bit helps. Thanks again.

Finally, I'd like to extend an invitation to all who are interested (or just curious) to come on board and contribute. Now is an exciting time to be a part of the team, with novel methodological innovations in Bayesian modeling arriving at such a rapid pace, and with data science coming into its own as a field. We welcome contributions to all aspects of the project: code development, issue resolution, documentation writing. Even simply trying out PyMC3 on your own problem and reporting what does and doesn't work is a great way to get involved. It doesn't take much to get started!

Calculating Bayes factors with PyMC

2014-11-30T00:00:00-06:00

Statisticians are sometimes interested in comparing two (or more) models, with respect to their relative support by a particular dataset. This may be in order to select the best model to use for inference, or to weight models so that they can be averaged for use in multimodel inference.

The Bayes factor is a good choice when comparing two arbitrary models, and the parameters of those models have been estimated. Bayes factors are simply ratios of marginal likelihoods for competing models:

$$ \text{BF}_{i,j} = \frac{L(Y \mid M_i)}{L(Y \mid M_j)} = \frac{\int L(Y \mid M_i,\theta_i)\,p(\theta_i \mid M_i)\,d\theta_i}{\int L(Y \mid M_j,\theta_j)\,p(\theta_j \mid M_j)\,d\theta_j} $$

While passingly similar to likelihood ratios, Bayes factors are calculated using likelihoods that have been integrated with respect to the unknown parameters. In contrast, likelihood ratios are calculated based on the maximum likelihood values of the parameters. This is an important difference, which makes Bayes factors a more effective means of comparing models, since they take into account parametric uncertainty; likelihood ratios ignore this uncertainty. In addition, unlike likelihood ratios, the two models need not be nested. In other words, one model does not have to be a special case of the other.

Bayes factors are called Bayes factors because they are used in a Bayesian context by updating prior odds with information from data.

Posterior odds = Bayes factor x Prior odds

Hence, they represent the evidence in the data for changing the prior odds of one model over another. It is this interpretation as a measure of evidence that makes the Bayes factor a compelling choice for model selection.
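As a small worked example of this updating rule, the sketch below converts prior model probabilities to odds, applies a Bayes factor, and converts the result back to a probability. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def posterior_odds(bayes_factor, prior_odds):
    """Posterior odds = Bayes factor x prior odds."""
    return bayes_factor * prior_odds


def odds_to_probability(odds):
    """Convert odds in favor of a model into a probability."""
    return odds / (1.0 + odds)


# Hypothetical inputs: model 1 gets prior probability 0.2, model 2 gets 0.8
prior_odds = 0.2 / 0.8
bayes_factor = 8.0  # evidence in the data favoring model 1 (assumed)

odds = posterior_odds(bayes_factor, prior_odds)
prob = odds_to_probability(odds)
print(odds, prob)
```

Here a model that started at 1-to-4 against ends up at roughly 2-to-1 in favor, so the data have shifted the odds by the factor of 8.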

One of the obstacles to the wider use of Bayes factors is the difficulty associated with calculating them. While likelihood ratios can be obtained simply by the use of MLEs for all model parameters, Bayes factors require the integration over all unknown model parameters. Hence, for most interesting models Markov chain Monte Carlo (MCMC) is the easiest way to obtain Bayes factors.

Here's a quick tutorial on how to obtain Bayes factors from PyMC. I'm going to use a simple example taken from Chapter 7 of Link and Barker (2010). Consider a short vector of data, consisting of 5 integers:

Y = array([0,1,2,3,8])

We wish to determine which of two functional forms best models this dataset. The first is a geometric model:

$$ f(x|p) = (1-p)^x p $$

and the second a Poisson model:

$$ f(x|\mu) = \frac{\mu^x e^{-\mu}}{x!} $$

Both describe the distribution of non-negative integer data, but differ in that the variance of Poisson data is equal to the mean, while the geometric model describes variance greater than the mean. For this dataset, the sample variance would suggest that the geometric model is favored, but the sample is too small to say so definitively.
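The comparison of mean and variance is quick to check with the standard library. For these five values the sample mean is 2.8 and the sample variance is 9.7, so the data are overdispersed relative to what a Poisson model would imply.

```python
import statistics

Y = [0, 1, 2, 3, 8]

sample_mean = statistics.mean(Y)
sample_var = statistics.variance(Y)  # unbiased estimate, divides by n - 1

# A ratio well above 1 indicates more spread than a Poisson model implies
dispersion = sample_var / sample_mean
print(sample_mean, sample_var, dispersion)
```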

In order to calculate Bayes factors, we require both the prior and posterior odds:

Bayes factor = Posterior odds / Prior odds

The Bayes factor does not depend on the value of the prior model weights, but the estimate will be most precise when the posterior odds are close to even. For our purposes, we will give 0.1 probability to the geometric model, and 0.9 to the Poisson model:

pi = (0.1, 0.9)

Next, we need to specify a latent variable, which identifies the true model (we don't believe either model is "true", but we hope one is better than the other). This is easily done using a Bernoulli random variable, that identifies one model or the other, according to their relative weight.

true_model = Bernoulli('true_model', p=pi[1], value=0)

Here, we use the specified prior weights as the Bernoulli probabilities, and the variable has been arbitrarily initialized to zero (the geometric model).

Next, we need prior distributions for the parameters of the two models. For the Poisson model, the expected value is given a uniform prior on the interval [0,1000]:

mu = Uniform('mu', lower=0, upper=1000, value=4)

This stochastic node can be used for the geometric model as well, though it needs to be transformed for use with that distribution:

p = Lambda('p', lambda mu=mu: 1/(1+mu))

Finally, the data are incorporated by specifying the appropriate likelihood. We require a mixture of geometric and Poisson likelihoods, depending on which value true_model takes. While BUGS requires an obscure trick to implement such a mixture, PyMC allows for the specification of arbitrary stochastic nodes:

@observed
def Ylike(value=Y, mu=mu, p=p, M=true_model):
    """Either Poisson or geometric, depending on M"""
    if M:
        return poisson_like(value, mu)
    return geometric_like(value+1, p)

Notice that the function returns the geometric likelihood when M=0, or the Poisson model otherwise. Now, all that remains is to run the model, and extract the posterior quantities to calculate the Bayes factor.

Though we may be interested in the posterior estimate of the mean, all that we care about from a model selection standpoint is the estimate of true_model. At every iteration, the value of this parameter takes the value of zero for the geometric model and one for the Poisson. Hence, the mean (or median) will be an estimate of the probability of the Poisson model:

In [11]: M.true_model.stats()['mean']

Out[11]: 0.39654545454545453

So, the posterior probability that the Poisson model is true is about 0.4, leaving 0.6 for the geometric model. The Bayes factor in favor of the geometric model is simply:

In [18]: p_pois = M.true_model.stats()['mean']

In [19]: ((1-p_pois)/p_pois) / (0.1/0.9)

Out[19]: 13.696011004126548

This value can be interpreted as strong evidence in favor of the geometric model.

If you want to run the model for yourself, you can download the code here.
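Because this example has a single shared parameter, the two marginal likelihoods can also be approximated by brute-force numerical integration, which gives a rough cross-check on the MCMC estimate. The sketch below is not part of the PyMC model: it writes both likelihoods directly from the formulas above (with p = 1/(1+mu)), uses the same uniform prior on [0, 1000], and integrates with a simple midpoint rule whose grid size is an arbitrary choice.

```python
import math

Y = [0, 1, 2, 3, 8]
UPPER = 1000.0  # upper limit of the uniform prior on mu


def poisson_lik(mu):
    """Joint Poisson likelihood of Y for a given mean mu."""
    return math.prod(mu ** y * math.exp(-mu) / math.factorial(y) for y in Y)


def geometric_lik(mu):
    """Joint geometric likelihood of Y, with p = 1 / (1 + mu)."""
    p = 1.0 / (1.0 + mu)
    return math.prod((1.0 - p) ** y * p for y in Y)


def marginal_likelihood(lik, n=100000):
    """Midpoint-rule integral of lik(mu) * prior(mu) over [0, UPPER]."""
    h = UPPER / n
    total = sum(lik((i + 0.5) * h) for i in range(n))
    return total * h / UPPER  # the uniform prior density is 1 / UPPER


bf_geometric = marginal_likelihood(geometric_lik) / marginal_likelihood(poisson_lik)
print(bf_geometric)
```

The result should come out close to 14, in line with the MCMC-based estimate; the small difference reflects Monte Carlo error in the sampled model indicator.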

Burn-in, and Other MCMC Folklore

2014-08-09T00:00:00-05:00

I have been slowly working my way through The Handbook of Markov Chain Monte Carlo, a compiled volume edited by Steve Brooks et al. that I picked up at last week's Joint Statistical Meetings. The first chapter is a primer on MCMC by Charles Geyer, in which he summarizes the key concepts of the theory and application of MCMC. In a particularly provocative passage, Geyer rips several of the traditional practices in setting up, running and diagnosing MCMC runs, including multi-chain runs, burn-in and sample-based diagnostics. Though they are applied regularly, these steps are simply heuristics that are applied to either aid in reaching or identifying the equilibrium distribution of the Markov chain. There are no guarantees on the reliability of any of them.

In particular, he questions the utility of burn-in:

Burn-in is only one method, and not a particularly good method, for finding a good starting point.

I can't disagree with this, though I have always viewed MCMC sampling (for most models that I have dealt with) as being cheap enough that there is little cost to simply throwing away thousands of samples. I have often thrown away as many as the first 90 percent of my samples! However, as Geyer notes, there are better ways of getting your chain into a decent region of its support without throwing anything away.

One method is to use an approximation method on your model before applying MCMC. For example, the maximum a posteriori (MAP) estimate can be obtained using numerical optimization, then used as the initial values for an MCMC run. It turns out to be pretty easy to do in PyMC. For example, using the built-in bioassay example:

In [3]: from pymc.examples import gelman_bioassay

In [4]: from pymc import MAP, MCMC

In [5]: M = MAP(gelman_bioassay)

In [6]: M.fit()

This yields MAP estimates for all the parameters in the model, which are less likely to be true modes as the complexity of the model increases, but are a pretty good bet to be a decent starting point for MCMC.

In [7]: M.alpha.value
Out[7]: array(0.8465802225061101)

All that remains is to move these estimates into an MCMC sampler. While one could manually plug the values of each node into the model specification, it's easiest just to extract the variables from the MAP estimator, and use them to instantiate an MCMC object:

In [8]: M.variables
Out[8]: 
set([<pymc.PyMCObjects.Stochastic 'alpha' at 0x10f78e810>,
     <pymc.PyMCObjects.Stochastic 'beta' at 0x10f78e910>,
     <pymc.PyMCObjects.Deterministic 'theta' at 0x10f78e9d0>,
     <pymc.distributions.Binomial 'deaths' at 0x10f78ea50>,
     <pymc.CommonDeterministics.Lambda 'LD50' at 0x10f78ec10>])

In [9]: MC = MCMC(M.variables)

In [10]: MC.sample(1000)
Sampling: 100% [0000000000000000000000000000000000000000000000] Iterations: 1000

Notice that I did not pass a burn argument to sample, so it defaults to zero. As is evident from the graphical output of the posteriors, this results in what appears to be a homogeneous chain, which is hopefully already at its equilibrium distribution.
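The effect of the starting point is easy to see with a toy sampler. The sketch below is a hand-rolled random-walk Metropolis algorithm on a standard normal target, not PyMC code; the target, step size, seed and starting values are all assumptions for illustration. One chain starts at the mode (standing in for a MAP estimate) and the other starts far out in the tail.

```python
import math
import random


def log_target(x):
    """Unnormalized log density of a standard normal (mode at zero)."""
    return -0.5 * x * x


def metropolis(start, n_draws, seed, step=1.0):
    """Random-walk Metropolis sampler that keeps every draw (no burn-in)."""
    rng = random.Random(seed)
    x = start
    draws = []
    for _ in range(n_draws):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target ratio)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        draws.append(x)
    return draws


from_mode = metropolis(start=0.0, n_draws=200, seed=1)
from_afar = metropolis(start=50.0, n_draws=200, seed=1)

mean_from_mode = sum(from_mode) / len(from_mode)
mean_from_afar = sum(from_afar) / len(from_afar)
print(mean_from_mode, mean_from_afar)
```

The chain started in the tail spends many of its early iterations walking toward the bulk of the distribution, so its average is pulled well away from zero unless those draws are discarded. The chain started at the mode has no such transient to throw away.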

What the MCMC practitioner fears is using a chain for inference that has not yet converged to its target distribution. Unfortunately, diagnostics cannot reliably alert you to this, nor does starting a model in several chains from disparate starting values guarantee this. There is also no magical threshold to distinguish convergence from pre-convergence regions in an MCMC trace. Geyer insists that only running chains for a very, very long time will inspire confidence:

Your humble author has a dictum that the least one can do is make an overnight run. ... If you do not make runs like that, you are simply not serious about MCMC.

Implementing Dirichlet processes for Bayesian semi-parametric models

2014-03-07T00:00:00-06:00

Semi-parametric methods have been preferred for a long time in survival analysis, for example, where the baseline hazard function is expressed non-parametrically to avoid assumptions regarding its form. Meanwhile, the use of non-parametric methods in Bayesian statistics is increasing. However, there are few resources to guide scientists in implementing such models using available software. Here, I will run through a quick implementation of a particular class of non-parametric Bayesian models, using PyMC.

Use of the term "non-parametric" in the context of Bayesian analysis is something of a misnomer. This is because the first and fundamental step in Bayesian modeling is to specify a full probability model for the problem at hand. It is rather difficult to explicitly state a full probability model without the use of probability functions, which are parametric. It turns out that Bayesian non-parametric models are not really non-parametric, but rather, are infinitely parametric.

A useful non-parametric approach for modeling random effects is the Dirichlet process. A Dirichlet process (DP), just like Poisson processes, Gaussian processes, and other processes, is a stochastic process. This just means that it comprises an indexed set of random variables. The DP can be conveniently thought of as a probability distribution of probability distributions, where the set of distributions it describes is infinite. Thus, an observation under a DP is described by a probability distribution that itself is a random draw from some other distribution. The DP (let's call it $G$) is described by two quantities, a baseline distribution $G_0$ that defines the center of the DP, and a concentration parameter $\alpha$. If you wish, $G_0$ can be regarded as an a priori "best guess" at the functional form of the random variable, and $\alpha$ as a measure of our confidence in our guess. So, as $\alpha$ grows large, the DP resembles the functional form given by $G_0$.

To see how we sample from a Dirichlet process, it is helpful to consider the constructive definition of the DP. There are several representations of this, which include the Blackwell-MacQueen urn scheme, the stick-breaking process and the Chinese restaurant process. For our purposes, I will consider the stick-breaking representation of the DP. This involves breaking the support of a particular variable into $k$ disjoint segments. The first break occurs at some point $x_0$, determined stochastically; the first piece of the notional "stick" is taken as the first group in the process, while the second piece is, in turn, broken at some selected point $x_1$ along its length. Here too, one piece is assigned to be the second group, while the other is subjected to the next break, and so on, until $k$ groups are created. Associated with each piece is a probability that is proportional to its length; these $k$ probabilities will have a Dirichlet distribution -- hence, the name of the process. Notice that $k$ can be infinite, making $G$ an infinite mixture.

We require two random samples to generate a DP. First, take a draw of values from the baseline distribution:

$$ \theta_1, \theta_2, \ldots \sim G_0 $$

then, a set of draws $v_1, v_2, \ldots$ from a $\text{Beta}(1,\alpha)$ distribution. These beta random variates are used to assign probabilities to the $\theta_i$ values, according to the stick-breaking analogy. So, the probability of $\theta_1$ corresponds to the first "break", and is just $p_1 = v_1$. The next value corresponds to the second break, which is a proportion of the remainder from the first break, or $p_2 = (1-v_1)v_2$. So, in general:

$$ p_i = v_i \prod_{j=1}^{i-1} (1 - v_j) $$

These probabilities correspond to the set of draws from the baseline distribution, where each of the latter are point masses of probability. So, the DP density function is:

$$ g(x) = \sum_{i=1}^{\infty} p_i I(x=\theta_i) $$

where $I$ is the indicator function. So, you can see that the Dirichlet process is discrete, despite the fact that its values may be non-integer. This can be generalized to a mixture of continuous distributions, which is called a DP mixture, but I will focus here on the DP alone.
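The stick-breaking recipe translates almost line for line into code. Here is a minimal sketch in plain Python that builds a truncated set of weights and point masses and then draws from the resulting discrete distribution. The concentration parameter, the truncation level and the standard normal baseline are assumptions picked for illustration, and `stick_breaking` is a hypothetical helper.

```python
import random


def stick_breaking(alpha, n_atoms, rng):
    """Truncated stick-breaking weights p_i = v_i * prod_{j<i} (1 - v_j)."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)  # v_i ~ Beta(1, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining


rng = random.Random(2014)
alpha = 2.0    # concentration parameter (assumed)
n_atoms = 50   # truncation level (assumed)

weights, leftover = stick_breaking(alpha, n_atoms, rng)

# Point masses drawn from the baseline distribution, here G_0 = N(0, 1)
atoms = [rng.gauss(0.0, 1.0) for _ in range(n_atoms)]

# Draws from the realized DP only ever land on the point masses
draws = rng.choices(atoms, weights=weights, k=1000)
print(len(set(draws)), leftover)
```

Even though the atoms are real-valued, the 1000 draws take only a limited number of distinct values, which is the discreteness described above.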

Example: Estimating household radon levels

As an example of implementing Dirichlet processes for random effects, I'm going to use the radon measurement and remediation example from Gelman and Hill (2006). This problem uses measurements of radon (a carcinogenic, radioactive gas) from households in 85 counties in Minnesota to estimate the distribution of the substance across the state. This dataset has a natural hierarchical structure, with individual measurements nested within households, and households in turn nested within counties. Here, we are certainly interested in modeling the variation in counties, but do not have covariates measured at that level. Since we are more interested in the variation among counties, rather than the particular levels for each, a random effects model is appropriate. Whit Armstrong was kind enough to code several parametrizations of this model in PyMC, so I will use his code as a basis for implementing a non-parametric random effect for radon levels among counties.

In the original example from Gelman and Hill, measurements are modeled as being normally distributed, with a mean that is a hierarchical function of both a county-level random effect and a fixed effect that accounted for whether houses had a basement (this is thought to increase radon levels).

$$ y_i \sim N(\alpha_{j[i]} + \beta x_i, \sigma_y^2) $$

So, in essence, each county has its own intercept, but shares a slope among all counties. This can easily be generalized to both random slopes and intercepts, but I'm going to keep things simple, in order to focus on implementing a single random effect.

The constraint that is applied to the intercepts in Gelman and Hill's original model is that they have a common distribution (Gaussian) that describes how they vary from the state-wide mean.

$$ \alpha_j \sim N(\mu_{\alpha}, \sigma_{\alpha}^2) $$

This comprises a so-called "partial pooling" model, whereby counties are neither constrained to have identical means (full pooling) nor are assumed to have completely independent means (no pooling); in most applications, the truth is somewhere between these two extremes. Though this is a very flexible approach to accounting for county-level variance, one might be worried about imposing such a restrictive (thin-tailed) distribution like the normal on this variance. If there are counties that have extremely low or high levels (for whatever reason), this model will fit poorly. To allay such worries, we can hedge our bets by selecting a more forgiving functional form, such as Student's t or Cauchy, but these still impose parametric restrictions (e.g. symmetry about the mean) that we may be uncomfortable making. So, in the interest of even greater flexibility, we will replace the normal county random effect with a non-parametric alternative, using a Dirichlet process.

One of the difficulties in implementing DP computationally is how to handle an infinite mixture. The easiest way to tackle this is by using a truncated Dirichlet process to approximate the full process. This can be done by choosing a size $N$ that is sufficiently large that it will exceed the number of point masses required. By doing this, we are assuming

$$ \sum_{i=1}^{\infty} p_i I(x=\theta_i) \approx \sum_{i=1}^{N} p_i I(x=\theta_i) $$

Ohlssen et al. 2007 provide a rule of thumb for choosing $N$ such that the sum of the first $N-1$ point masses is greater than 0.99:

$$ N \approx 5\alpha + 2 $$

To be conservative, we will choose an even larger value (100), which we will call N_dp. The truncation makes implementation of DP in PyMC (or JAGS/BUGS) relatively simple.
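One way to get a feel for the truncation is to compute the expected probability mass left over after a given number of breaks. Since each $v_j \sim \text{Beta}(1,\alpha)$ is independent with $E[1 - v_j] = \alpha/(1+\alpha)$, the expected leftover after $n$ breaks is $(\alpha/(1+\alpha))^n$. The sketch below evaluates this for a hypothetical concentration of $\alpha = 10$, comparing the rule of thumb with a truncation at 100.

```python
def expected_leftover_mass(alpha, n_breaks):
    """Expected mass not covered by the first n_breaks point masses.

    Uses E[1 - v] = alpha / (1 + alpha) for v ~ Beta(1, alpha) and the
    independence of the breaks.
    """
    return (alpha / (1.0 + alpha)) ** n_breaks


def rule_of_thumb(alpha):
    """Truncation level N suggested by the 5 * alpha + 2 rule."""
    return 5 * alpha + 2


alpha = 10.0  # hypothetical concentration parameter
n_rule = rule_of_thumb(alpha)

# Mass expected to fall outside the first N - 1 point masses
leftover_rule = expected_leftover_mass(alpha, n_rule - 1)
leftover_100 = expected_leftover_mass(alpha, 100 - 1)
print(n_rule, leftover_rule, leftover_100)
```

For this value of $\alpha$ the rule of thumb leaves less than one percent of the mass unaccounted for, and truncating at 100 leaves far less, at the cost of carrying more unused point masses in the sampler.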

We first must specify the baseline distribution and the concentration parameter. As we have no prior information to inform a choice for $\alpha$, we will specify a uniform prior for it, with reasonable bounds:

alpha = pymc.Uniform('alpha', lower=0.5, upper=10)

Though the upper bound may seem small for a prior that purports to be uninformative, recall that for large values of $\alpha$, the DP will converge to the baseline distribution, suggesting that a continuous distribution would be more appropriate.

Since we are extending a normal random effects model, I will choose a normal baseline distribution, with vague hyperpriors:

mu_0 = pymc.Normal('mu_0', mu=0, tau=0.01, value=0)
sig_0 = pymc.Uniform('sig_0', lower=0, upper=100, value=1)
tau_0 = sig_0 ** -2

theta = pymc.Normal('theta', mu=mu_0, tau=tau_0, size=N_dp)

Notice that I have specified a uniform prior on the standard deviation, rather than the more common gamma-distributed precision; for hierarchical models this is good practice. So, now that we have N_dp point masses, all that remains is to generate corresponding probabilities. Following the recipe above:

v = pymc.Beta('v', alpha=1, beta=alpha, size=N_dp)
@pymc.deterministic
def p(v=v):
    """ Calculate Dirichlet probabilities """

    # Probabilities from betas
    value = [u*np.prod(1-v[:i]) for i,u in enumerate(v)]
    # Enforce sum to unity constraint
    value[-1] = 1-sum(value[:-1])

    return value

This is where you really appreciate Python's list comprehension idiom. In fact, were it not for the fact that we wanted to ensure that the array of probabilities sums to one, p could have been specified in a single line.

The final step involves using the Dirichlet probabilities to generate indices to the appropriate point masses. This is realized using a categorical mass function:

z = pymc.Categorical('z', p, size=len(set(counties)))

These indices, in turn, are used to index the random effects, which are used as random intercepts for the model:

a = pymc.Lambda('a', lambda z=z, theta=theta: theta[z])

Substitution of the above code into Gelman and Hill's original model produces reasonable results. The expected value of $\alpha$ is approximately 5, as shown by the posterior output below:

Here is a random sample taken from the DP:

But is the model better? One metric for model comparison is the deviance information criterion (DIC), which appears to strongly favor the DP random effect (smaller values are better):

In [11]: M.dic
Out[11]: 2138.7806225675804

In [12]: M_dp.dic
Out[12]: 1993.0894265799602

If you are interested in viewing the model code in its entirety, I have uploaded it to my fork of Whit's code.

Automatic Missing Data Imputation with PyMC

2013-08-18T00:00:00-05:00

A distinct advantage of using Bayesian inference is in its universal application of probability models for providing inference. As such, all components of a Bayesian model are specified using probability distributions for either describing a sampling model (in the case of observed data) or characterizing the uncertainty of an unknown quantity. This means that missing data are treated the same as parameters, and so imputation proceeds very much like estimation. When using Markov chain Monte Carlo (MCMC) to fit Bayesian models it usually requires only a few extra lines of code to impute missing values, based on the sampling distribution of the missing data, and associated (usually unknown) parameters. Using PyMC built from the latest development code, missing data imputation can be done automatically.

Types of Missing Data

The appropriate treatment of missing data depends strongly on how the data came to be missing from the dataset. These mechanisms can be broadly classified into three groups, according to how much information and effort is required to deal with them adequately.

Missing completely at random (MCAR)

If data are MCAR, then the probability that any given datum is missing is equal over the whole dataset. In other words, each datum that is present had the same probability of being missing as each datum that is absent. This implies that ignoring the missing data will not bias inference.

Missing at random (MAR)

MAR allows for data to be missing according to a random process, but is more general than MCAR in that units need not have equal probabilities of being missing. The constraint here is that missingness may only depend on information that is fully observed. For example, the reporting of income on surveys may vary according to some measured factor, such as age, race or sex. We can thus account for heterogeneity in the probability of reporting income by controlling for the measured covariate in whatever model is used for inference.

Missing not at random (MNAR)

When the probability of missing data varies according to information that is not available, this is classified as MNAR. This can either be because suitable covariates for explaining missingness have not been recorded (or are otherwise unavailable) or the probability of being missing depends on the value of the missing datum itself. Extending the previous example, if the probability of reporting income varied according to income itself, this is missing not at random.

In each of these situations, the missing data may be imputed using a sampling model, though in the case of missing not at random, it may be difficult to validate the assumptions required to specify such a model. For the purposes of quickly demonstrating automatic imputation in PyMC, I will illustrate using data that are MCAR.
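The difference between the three mechanisms is easiest to see by simulating them. The following is only a sketch under assumptions of my own: the survey, the distributions and the missingness probabilities are all hypothetical, chosen to echo the income example above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical survey: age is fully observed, income may go unreported
age = rng.uniform(20, 70, size=n)
income = rng.lognormal(mean=10 + 0.01 * age, sigma=0.5)

# MCAR: every value has the same 20% chance of being missing
mcar = rng.random(n) < 0.2

# MAR: missingness depends only on the fully observed covariate (age)
p_mar = 1 / (1 + np.exp(-(age - 45) / 10))
mar = rng.random(n) < p_mar

# MNAR: missingness depends on the value that goes missing (income)
log_income = np.log(income)
p_mnar = 1 / (1 + np.exp(-(log_income - log_income.mean())))
mnar = rng.random(n) < p_mnar

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    reported_mean = income[~mask].mean()
    print(f"{name}: mean reported income = {reported_mean:.0f} "
          f"(true mean = {income.mean():.0f})")
```

Under MCAR the mean of the reported incomes should land close to the true mean, whereas under MNAR it is pulled downward because high earners are less likely to report. In the MAR case the raw mean is also shifted, but here the shift can in principle be corrected by conditioning on age.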

Implementing imputation in PyMC

One of the recurring examples in the PyMC documentation is the coal mining disasters dataset from Jarrett 1979. This is a simple longitudinal dataset consisting of counts of coal mining disasters in the U.K. between 1851 and 1962. The objective of the analysis is to identify a switch point in the rate of disasters, from a relatively high rate early in the time series to a lower one later on. Hence, we are interested in estimating two rates, in addition to the year after which the rate changed.

In order to illustrate imputation, I have randomly replaced the data for two years with a missing data placeholder value, -999:

import numpy as np

# Annual disaster counts; -999 marks the two years set to missing
disasters_array = np.array([4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
                            3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
                            2, 2, 3, 4, 2, 1, 3, -999, 2, 1, 1, 1, 1, 3, 0, 0,
                            1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
                            0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
                            3, 3, 1, -999, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
                            0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])

Here, the np prefix indicates that the array function comes from the NumPy module. PyMC is able to recognize the presence of missing values when we use NumPy's MaskedArray class to contain our data. The masked array is instantiated via the masked_array function, using the original data array and a boolean mask as arguments:

masked_values = np.ma.masked_array(disasters_array,
                                   mask=disasters_array == -999)

Of course, my use of -999 to indicate missing data was entirely arbitrary, so feel free to use any appropriate value, so long as it can be identified and masked (obviously, small positive integers would not have been appropriate here). Let's have a look at the masked array:

masked_array(data = [4 5 4 0 1 4 3 4 0 6 3 3 4 0 2 6 3 3 5 4 5 3 1 4 
    4 1 5 5 3 4 2 5 2 2 3 4 2 1 3 -- 2 1 1 1 1 3 0 0 1 0 1 1 0 0 3 1 
    0 3 2 2 0 1 1 1 0 1 0 1 0 0 0 2 1 0 0 0 1 1 0 2 3 3 1 -- 2 1 1 1 
    1 2 4 2 0 0 1 4 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1], 
    mask = [False False False False False False False False False 
    False False False False False False False False False False False
    False False False False False False False False False False False
    False False False False False False False False True False False 
    False False False False False False False False False False False
    False False False False False False False False False False False
    False False False False False False False False False False False
    False False False False False False False False True False False 
    False False False False False False False False False False False
    False False False False False False False False False False False 
    False False False],
    fill_value = 999999)

Notice that the placeholder values have disappeared from the data, and the array has a mask attribute that identifies the indices for the missing values.
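If you want to check which entries ended up masked, the mask can be queried directly. Here is a small sketch using a short stand-in series (the counts array is made up for brevity) rather than the full dataset:

```python
import numpy as np

# Short stand-in for the disasters series, with -999 as the placeholder
counts = np.array([4, 5, -999, 0, 1, -999, 3])
masked = np.ma.masked_array(counts, mask=counts == -999)

n_missing = int(masked.mask.sum())            # number of masked entries
missing_idx = np.flatnonzero(masked.mask)     # positions of masked entries
observed_mean = masked.mean()                 # masked entries are skipped

print(n_missing, missing_idx, observed_mean)
```

NumPy also provides np.ma.masked_equal(counts, -999) as a shorthand for masking a single sentinel value.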

Beyond the construction of a masked array, there is nothing else that needs to be done to accommodate missing values in a PyMC model.

First, we need to specify prior distributions for the unknown parameters, which I call switch (the switch point), early (the early mean) and late (the late mean). An appropriate non-informative prior for the switch point is a discrete uniform random variable over the range of years represented by the data. Since the rates must be positive, I use identical weakly-informative exponential distributions:

from pymc import DiscreteUniform, Exponential, Poisson, deterministic

# Switchpoint
switch = DiscreteUniform('switch', lower=0, upper=110)
# Early mean
early = Exponential('early', beta=1)
# Late mean
late = Exponential('late', beta=1)

The only tricky part of the model is assigning the appropriate rate parameter to each observation. Though the two rates and the switch point are stochastic, in the sense that we have used probability models to describe our uncertainty in their true values, the membership of each observation to either the early or late rate is a deterministic function of the stochastics. Thus, we set up a deterministic node that assigns a rate to each observation depending on the location of the switch point at the current iteration of the MCMC algorithm:

@deterministic
def rates(s=switch, e=early, l=late):
    """Allocate appropriate mean to time series"""
    out = np.empty(len(disasters_array))
    # Early mean prior to switchpoint
    out[:s] = e
    # Late mean following switchpoint
    out[s:] = l
    return out
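The allocation logic is ordinary array slicing, so it can be tried outside of PyMC. In this sketch the function name allocate_rates and the argument values are my own, picked only to make the behaviour visible:

```python
import numpy as np

def allocate_rates(switch, early, late, n_years):
    """Early mean before the switch point, late mean from it onward."""
    out = np.empty(n_years)
    out[:switch] = early
    out[switch:] = late
    return out

# Switch after the third year of a six-year series
print(allocate_rates(switch=3, early=3.0, late=1.0, n_years=6))
```

Each time the sampler proposes a new value for the switch point, the node is re-evaluated and the boundary between the two rates moves accordingly.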

Finally, the data likelihood comprises the annual counts of disasters being modeled as Poisson random variables, conditional on the parameters assigned in the rates node above. The masked array is specified as the value of the stochastic node, and flagged as data via the observed argument.

disasters = Poisson('disasters', mu=rates, value=masked_values, observed=True)

If we run the model, then query the disasters node for posterior statistics, we can obtain a summary of the estimated number of disasters in both of the missing years.

In [9]: DisasterModel.disasters.stats()
Out[9]: 
{'95% HPD interval': array([[ 0.,  6.],
       [ 0.,  3.]]),
 'mc error': array([ 0.11645149,  0.03479713]),
 'mean': array([ 2.2246,  0.91  ]),
 'n': 5000,
 'quantiles': {2.5: array([ 0.,  0.]),
               25: array([ 1.,  0.]),
               50: array([ 2.,  1.]),
               75: array([ 3.,  1.]),
               97.5: array([ 7.,  3.])},
 'standard deviation': array([ 1.88206133,  0.92536479])}
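Conceptually, each imputed value is a draw from the sampling distribution, given the parameter values at that iteration. The sketch below mimics that step outside of PyMC. Note that the rate draws are simulated stand-ins (a gamma distribution that I chose arbitrarily), not output from the model above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for 5000 posterior draws of the Poisson rate in one missing year
# (hypothetical; a real analysis would take these from the MCMC trace)
rate_draws = rng.gamma(shape=30.0, scale=0.1, size=5000)

# One Poisson draw per rate draw yields the imputed counts
imputed = rng.poisson(rate_draws)

print(imputed.mean())
print(np.percentile(imputed, [2.5, 50, 97.5]))
```

Because the uncertainty in the rate is carried through to the Poisson draws, the spread of the imputed values is wider than it would be for a single fixed rate.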

Clearly, this is a rather trivial example, but it serves to illustrate how easy it can be to deal with missing values in PyMC. Though not applicable here, it would be similarly easy to handle MAR data, by constructing a data likelihood whose parameters are functions of one or more covariates.

Automatic imputation is a new feature in PyMC, and is currently available only in the development codebase. It will hopefully appear in the feature set of a future release.