One of the unavoidable issues with writing books about software is that for even moderately well-maintained packages, the release version has changed by the time the book is published; this is the case for Idris' book, as NumPy has reached version 1.6 (and many users work from the current codebase on GitHub). Don't let this deter you, however, as the major functionality changes little from version to version, particularly for point releases.
The book begins with thorough, visual installation instructions for the three major platforms (sorry, OS/2 users!). Though the NumPy website includes decent install instructions, this is a welcome chapter, particularly for new users, because it is well-organized and highly visual. Depending on your setup, the instructions for building from source may not be sufficient, at least on OS X, where some configuration is sometimes necessary depending on which version of Xcode (and hence, compilers) is installed.
In addition to installation guidance, the author reveals several avenues for third-party help with NumPy, which is useful since one inevitably outstrips any book's ability to answer all one's questions. It's a good choice of recommended resources, too: mailing lists, IRC, and critically, Stack Overflow.
The catch-phrase of the Packt Beginner's Guide series is "Learn by doing: less theory, more results". True to its word, this particular guide is extremely hands-on, composed mostly of step-by-step "Time for Action" recipes for performing particular atomic tasks using NumPy within Python. For the most part, this philosophy works to the user's benefit by providing a ton of useful guides to the core functionality of NumPy. However, in places the lack of context hurts. For example, what is a Bessel function, and why would I need it? How do I interpret the plot I just generated? In my view, the author could have provided some background without fear of the book being mistaken for a textbook.
Though the book is technically a beginner's guide, it is stated up front that knowledge of Python is assumed. This is evident early: the reader is taken from the installation guide kiddie pool directly to the deep end of vector operations. The early chapters illustrate how NumPy works closely with other scientific programming packages, most notably Matplotlib and iPython. This includes a very condensed iPython tutorial, which is a boon because everyone doing scientific computing with Python should be using iPython as the default shell.
The early chapters do a solid job of covering NumPy fundamentals, and by fundamentals we are, of course, talking about `ndarrays`. The introduction to array data types jumps right into creating custom types, a topic that is truly useful for those of us applying NumPy to real data. Overall, the range of array methods provided is deep relative to other resources I have encountered.
One of the nice things about the book is that you are almost guaranteed to learn something new, almost irrespective of your level of expertise. For as long as I've been using NumPy, I have tended to use a particular subset of its functionality, related to the sort of work I do. This book is more comprehensive than that. Who knew you could truncate the values of an array with `ndarray.clip()`? Not me. Some of the included workflows should be really useful for a lot of users, such as inserting arbitrary values into a sorted array such that it remains sorted.
A big take-home lesson from the book that is likely to impress the novice user is the manifold advantages of working with arrays, rather than looping over lists, be it by using ufuncs or vectorizing scalar functions. It's not only a performance boon, but from a development standpoint the resulting code is much easier to read and to maintain.
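To make these points concrete, here is a small sketch of my own (not taken from the book) that contrasts a list loop with a ufunc, truncates an array with `clip`, and inserts values into a sorted array so that it stays sorted. The array contents and the `sign_label` helper are made up for illustration.

```python
import numpy as np

values = np.array([3.2, -1.5, 7.8, 0.4, 12.1])

# Looping over a list: works, but verbose and slow for large inputs
squared_loop = [v ** 2 for v in values.tolist()]

# The same operation as a ufunc applied to the whole array at once
squared_vec = np.square(values)

# Truncate every element to the interval [0, 10]
clipped = values.clip(0, 10)

# np.vectorize wraps a scalar function so it accepts arrays
# (a convenience for readability; it still loops internally)
def sign_label(v):
    """Hypothetical scalar function: 1 for positive values, -1 otherwise."""
    return 1 if v > 0 else -1

labels = np.vectorize(sign_label)(values)

# Insert new values into a sorted array so that it remains sorted
sorted_arr = np.array([1, 3, 5, 7])
new_vals = np.array([4, 6])
positions = np.searchsorted(sorted_arr, new_vals)
merged = np.insert(sorted_arr, positions, new_vals)
```

Note that `np.vectorize` is mainly a readability aid; the real speed gains come from true ufuncs such as `np.square`.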
My only pedantic criticism of the book is the use of some rather awkward English in places: "Get into Terms with Commonly Used Functions" doesn't sound quite right, and "Have a go hero" is an odd little phrase used to encourage hands-on programming that seems to have come straight from "Go Diego Go". This, along with a few function usage mistakes, would suggest that another round of edits was in order.
Idris makes heavy use of financial applications for his examples. This is fine, I suppose, but it may be heavy going for those from other fields, given finance's tendency to use specialized terminology. A broader range of motivating examples would have both shown the breadth of NumPy's potential application and helped the book appeal to a wider audience.
Though full coverage of a package so jammed with functionality as NumPy is impossible, this guide covers a reasonable cross section: users are introduced to calculating and drawing waveforms, fitting polynomial functions, matrix functions and manipulation, statistical distributions, sorting algorithms and (you guessed it) financial functions.
It was very nice (particularly given the aforementioned constraints) to see a testing chapter, though it is just an overview of the key testing functions. This is a good example of where a little background would have made a big difference: new users are rarely able to appreciate the importance of testing, particularly unit testing, so a little motivation is required to make the concept stick, and to ensure that new NumPy users put it into practice.
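For readers who have not seen NumPy's testing helpers, the following is a minimal sketch of what a unit test might look like. The `standardize` function is a hypothetical example of my own, not something from the book; only `assert_allclose` and `assert_array_equal` come from `numpy.testing`.

```python
import numpy as np
from numpy.testing import assert_allclose, assert_array_equal

def standardize(x):
    """Hypothetical function under test: center at zero, scale to unit SD."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def test_standardize():
    z = standardize([2.0, 4.0, 6.0, 8.0])
    # Floating-point results are compared with a tolerance, not with ==
    assert_allclose(z.mean(), 0.0, atol=1e-12)
    assert_allclose(z.std(), 1.0)
    # Exact comparison is appropriate for shapes and integer data
    assert_array_equal(z.shape, (4,))

test_standardize()
```

The motivation is simple: if a later change to `standardize` breaks one of these properties, the test fails immediately instead of silently corrupting an analysis.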
Despite my criticisms, I would recommend The NumPy 1.5 Beginner's Guide without hesitation to users unfamiliar with NumPy (provided they already have some Python chops). NumPy is the most important Python package for mathematics, statistics and other numerical computation, and there is currently no better novice introduction in print. It will be nice to have my own copy close at hand to remind me of how much power is contained within the NumPy code base.
Disclaimer -- I was given a complimentary electronic copy of the book from the publisher in exchange for writing a review.
I have been slowly working my way through The Handbook of Markov Chain Monte Carlo, a compiled volume edited by Steve Brooks et al. that I picked up at last week's Joint Statistical Meetings. The first chapter is a primer on MCMC by Charles Geyer, in which he summarizes the key concepts of the theory and application of MCMC. In a particularly provocative passage, Geyer rips several of the traditional practices in setting up, running and diagnosing MCMC runs, including multi-chain runs, burn-in and sample-based diagnostics. Though they are applied regularly, these steps are simply heuristics that are applied to either aid in reaching or identifying the equilibrium distribution of the Markov chain. There are no guarantees on the reliability of any of them.
In particular, he questions the utility of burn-in:
Burn-in is only one method, and not a particularly good method, for finding a good starting point.
I can't disagree with this, though I have always viewed MCMC sampling (for most models that I have dealt with) as being cheap enough that there is little cost to simply throwing away thousands of them. I have often thrown away as many as the first 90 percent of my samples! However, as Geyer notes, there are better ways of getting your chain into a decent region of its support without throwing anything away.
One method is to use an approximation method on your model before applying MCMC. For example, the maximum a posteriori (MAP) estimate can be obtained using numerical optimization, then used as the initial values for an MCMC run. It turns out to be pretty easy to do in PyMC. For example, using the built-in bioassay example:
:::python
In [3]: from pymc.examples import gelman_bioassay
In [4]: from pymc import MAP, MCMC
In [5]: M = MAP(gelman_bioassay)
In [6]: M.fit()
This yields MAP estimates for all the parameters in the model, which are less likely to be true modes as the complexity of the model increases, but are a pretty good bet to be a decent starting point for MCMC.
:::python
In [7]: M.alpha.value
Out[7]: array(0.8465802225061101)
All that remains is to move these estimates into an MCMC sampler. While one could manually plug the values of each node into the model specification, it's easiest just to extract the variables from the MAP estimator, and use them to instantiate an `MCMC` object:
:::python
In [8]: M.variables
Out[8]:
set([<pymc.PyMCObjects.Stochastic 'alpha' at 0x10f78e810>,
<pymc.PyMCObjects.Stochastic 'beta' at 0x10f78e910>,
<pymc.PyMCObjects.Deterministic 'theta' at 0x10f78e9d0>,
<pymc.distributions.Binomial 'deaths' at 0x10f78ea50>,
<pymc.CommonDeterministics.Lambda 'LD50' at 0x10f78ec10>])
In [9]: MC = MCMC(M.variables)
In [10]: MC.sample(1000)
Sampling: 100% [0000000000000000000000000000000000000000000000] Iterations: 1000
Notice that I did not pass a `burn` argument to the `sample` call, so the burn-in defaults to zero. As is evident from the graphical output of the posteriors, this results in what appears to be a homogeneous chain, which is hopefully already at its equilibrium distribution.
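For readers without PyMC at hand, the same idea (find the posterior mode first, then start the sampler there with no burn-in) can be sketched in plain NumPy. This is an illustration under my own assumptions, not PyMC's implementation: the dose and death counts are the classic bioassay values from Gelman et al., the priors on both coefficients are assumed flat, the mode is found by a damped Newton iteration, and the sampler is a simple random-walk Metropolis with hand-picked step sizes.

```python
import numpy as np

# Classic bioassay data (assumed to match the PyMC example)
dose = np.array([-0.86, -0.30, -0.05, 0.73])
n = np.array([5, 5, 5, 5])
deaths = np.array([0, 1, 3, 5])
X = np.column_stack([np.ones_like(dose), dose])

def log_post(theta):
    """Unnormalized log-posterior: binomial-logit likelihood, flat priors (assumed)."""
    eta = X @ theta
    return np.sum(deaths * eta - n * np.logaddexp(0.0, eta))

def find_map(theta, n_iter=50):
    """Locate the posterior mode with a damped Newton iteration."""
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (deaths - n * prob)
        hess = -(X.T * (n * prob * (1.0 - prob))) @ X
        step = np.linalg.solve(hess, -grad)
        scale = 1.0
        # Halve the step while it would decrease the log-posterior
        while log_post(theta + scale * step) < log_post(theta) and scale > 1e-8:
            scale *= 0.5
        theta = theta + scale * step
    return theta

def metropolis(start, n_samples, step_sd, rng):
    """Random-walk Metropolis that keeps every sample (no burn-in)."""
    samples = np.empty((n_samples, start.size))
    current, current_lp = start.copy(), log_post(start)
    accepted = 0
    for i in range(n_samples):
        proposal = current + rng.normal(scale=step_sd)
        proposal_lp = log_post(proposal)
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            accepted += 1
        samples[i] = current
    return samples, accepted / n_samples

rng = np.random.default_rng(42)
theta_map = find_map(np.zeros(2))            # (alpha, beta) at the mode
trace, accept_rate = metropolis(theta_map, 2000, np.array([0.5, 2.5]), rng)
```

Because the chain starts at the mode, the first samples already sit in a high-probability region, which is the whole point of skipping burn-in here.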
What the MCMC practitioner fears is using a chain for inference that has not yet converged to its target distribution. Unfortunately, diagnostics cannot reliably alert you to this, nor does starting a model in several chains from disparate starting values guarantee convergence. There is also no magical threshold to distinguish convergence from pre-convergence regions in an MCMC trace. Geyer insists that only running chains for a very, very long time will inspire confidence:
Your humble author has a dictum that the least one can do is make an overnight run. ... If you do not make runs like that, you are simply not serious about MCMC.
One of the highlights of SciPy 2011 was a demonstration of the upcoming release of iPython 0.11 (which is likely to become version 1.0, eventually). Fernando Perez calls iPython a "tool for manipulating namespaces", but this dry, technical definition belies the productivity-boosting power of this project. I have been using iPython for a few years as a replacement for the standard shell because it affords users easier access to debugging tools, command history and the underlying file system. Version 0.11 adds to this a variety of enhancements that further improve its utility and usability. A full account of its many features can be found in the iPython documentation, but I want to highlight a couple of the recent innovations that users might not be aware of.
It's probably impossible to buy a new computer in 2011 that does not have multiple cores, and yet most everyday scientific computing does not take advantage of parallel processing. Some have the impression that because of the Global Interpreter Lock (GIL), Python is not a useful parallel processing platform. In fact, though the GIL prevents the threads of a single process from executing Python code in parallel, Python allows the GIL to be side-stepped through its `multiprocessing` module. Moreover, there is a large suite of Python packages that provide multiprocessing for Python in various forms, from simply taking advantage of multiple cores on a single computer to computing on clusters, grids and clouds. iPython not only provides Python users with multiprocessing capability, but also allows for monitoring, debugging and interacting with multiple processes.
iPython facilitates multiprocessing through the use of "engines", which are instances of Python that are started as needed by iPython. These engines, in turn, are governed by a controller process that allows them to be addressed by the user, either directly or through a `LoadBalanced` interface that will automatically assign computation to available engines in a smart way.
Using parallel computing via iPython requires one additional step, which instantiates the engines that will be used for multiprocessing tasks. Before starting iPython, run `ipcluster` from the terminal, as follows:
ipcluster start --n=4
where the `n` argument specifies the number of engines to start. On my Intel Core i5 machine, I have 4 cores available to me, so I chose to start an engine on each. Now, iPython has access to all of these engines, via a `Client` object:
In [1]: from IPython.parallel import Client
In [2]: c = Client()
In [3]: c.ids
Out[3]: [0, 1, 2, 3]
We can now multiprocess, using all four engines. As a simple example, consider running a set of linear regressions in parallel. The following code simulates some data that we can analyze:
In [5]: %paste
import numpy as np
# generate fake data
nsample = 100
x = np.linspace(0,10, 100)
X = np.c_[np.ones(len(x)), x, x**2]
beta = np.array([1, 0.1, 10])
y = np.dot(X, beta) + np.random.normal(size=nsample)
## -- End pasted text --
We now have a response variable that was generated as a quadratic function of a predictor variable, with some Gaussian noise added. Let's use the scikits.statsmodels package to estimate some alternative models, in parallel.
In [6]: import scikits.statsmodels.api as sm
In [7]: model_set = [sm.OLS(y, X[:,0]), sm.OLS(y, X[:,:-1]),
sm.OLS(y, X), sm.OLS(y, np.c_[X, x**3])]
This series of models is presented in order of increasing complexity: an intercept-only model, a simple linear model, a quadratic polynomial model (the generating model) and a cubic polynomial model. We can now fit the models in parallel, and calculate AIC values (an information criterion) so that we may discover which performs best.
In [8]: view = c.load_balanced_view()
In [9]: rc = view.map(lambda m: m.fit().aic, model_set)
In [10]: rc[:]
Out[10]:
[1427.6221389537438,
1153.7084062595984,
319.70035715389247,
321.67688530260961]
Needless to say, it would be excessive to fit a handful of ordinary least squares regression models in parallel, but this shows how simple it is to run computing tasks in parallel using iPython. But, this barely scratches the surface, so I strongly encourage you to read the parallel computing section of the iPython docs.
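For comparison, here is a serial sketch of the same model comparison using only NumPy, which may help clarify what each engine is computing. The `ols_aic` helper is my own, and it assumes one common AIC convention (Gaussian log-likelihood at the maximum likelihood variance estimate, penalty of two per regression coefficient); I have not checked that it matches the statsmodels version used above, and the values will differ anyway because the noise is random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from a quadratic model, as in the example above
nsample = 100
x = np.linspace(0, 10, nsample)
X = np.c_[np.ones(nsample), x, x**2]
beta = np.array([1, 0.1, 10])
y = X @ beta + rng.normal(size=nsample)

def ols_aic(y, design):
    """Fit OLS by least squares and return an AIC value (assumed convention)."""
    design = np.atleast_2d(design.T).T        # promote a 1-D column to (n, 1)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    n, k = design.shape
    # Gaussian log-likelihood evaluated at the ML estimate of the variance
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(resid @ resid / n) + 1)
    return -2 * llf + 2 * k

designs = [X[:, 0], X[:, :-1], X, np.c_[X, x**3]]

# The built-in map plays the role of view.map, minus the parallelism
aics = list(map(lambda d: ols_aic(y, d), designs))
```

The pattern is the same in both cases: a function is mapped over a list of independent tasks, which is exactly the situation in which a load-balanced view pays off when each task is expensive.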
Another slick enhancement in iPython 0.11 is the introduction of a Qt-based console. This is facilitated by a ZeroMQ layer that allows iPython to run in two processes. If you are willing and able to install Qt and the Qt bindings for Python (either PyQt or PySide), which can be a non-trivial task for Mac users, you can take advantage of the new iPython Qt console. Launching is a matter of providing a `qtconsole` argument:
ipython qtconsole
This spawns a Python window that contains the iPython console that you would have expected in the terminal. An immediate advantage of the Qt console is the automatic availability of tooltips, which pop up whenever the opening parenthesis of a class or function is typed:
Not impressed yet? How about the ability to load remote python scripts, via iPython magic? Here is a PyMC model, loaded directly from a GitHub gist:
Any file with a .py extension can be imported this way, local or remote. Not only does this bring the script into the console, but the code can be edited prior to execution. This is really, really nice. How many of us find snippets of code from various places on the internet, which we copy and modify to apply to our own work? This task just got a little easier.
Perhaps the nicest new feature is the ability to embed matplotlib figures directly into the console. The easiest way to enable this is to include a `--pylab=inline` argument when starting iPython. This will place figures generated by matplotlib into the console, just below the function call, rather than in a separate window:
It reminds me a little of Mathematica. Then, what is really nice is that the entire contents of the console can be exported to XML or HTML, via a contextual menu option:
This saves a document, with figures and all, that can be displayed in a web browser:
One of the improvements afforded by the two-process model implemented in iPython 0.11 is that the kernel can interact with multiple consoles. When a console is launched, you will notice a message from the kernel that provides information for additional remote connections:
[503] 11:21:41 fonnescj: $ ipython qtconsole --pylab=inline
[IPKernelApp] To connect another client to this kernel, use:
[IPKernelApp] --existing --shell=54639 --iopub=54640 --stdin=54641 --hb=54642
By default, the kernel only listens for local connections, but adding external interfaces is as simple as providing the IP address to iPython on startup. But for now, I can add an additional console to the kernel by pasting the flags provided to the `ipython` call that starts the second console. This will connect the new console to the already-running kernel. For example, in the screen shot below, I defined a variable `b=10` in the bottom console, then attached a new console to the kernel; you can see that `b` (along with all the other objects in the namespace) is available to the second console.
Of course, there is much, much more to tell, but I will leave it at that for now, and encourage you to grab a build or clone the source code from the iPython GitHub repo, and use the online documentation to guide you through the trove of goodies that iPython has to offer. While they received no shortage of accolades at SciPy 2011 for their amazing work, I'd like to add my thanks to the iPython development team for their ongoing commitment to improving the usability of Python!
Semi-parametric methods have been preferred for a long time in survival analysis, for example, where the baseline hazard function is expressed non-parametrically to avoid assumptions regarding its form. Meanwhile, the use of non-parametric methods in Bayesian statistics is increasing. However, there are few resources to guide scientists in implementing such models using available software. Here, I will run through a quick implementation of a particular class of non-parametric Bayesian models, using PyMC.
Use of the term "non-parametric" in the context of Bayesian analysis is something of a misnomer. This is because the first and fundamental step in Bayesian modeling is to specify a full probability model for the problem at hand. It is rather difficult to explicitly state a full probability model without the use of probability functions, which are parametric. It turns out that Bayesian non-parametric models are not really non-parametric, but rather, are infinitely parametric.
A useful non-parametric approach for modeling random effects is the Dirichlet process. A Dirichlet process (DP), just like Poisson processes, Gaussian processes, and other processes, is a stochastic process. This just means that it comprises an indexed set of random variables. The DP can be conveniently thought of as a probability distribution of probability distributions, where the set of distributions it describes is infinite. Thus, an observation under a DP is described by a probability distribution that itself is a random draw from some other distribution. The DP (let's call it $G$) is described by two quantities, a baseline distribution $G_0$ that defines the center of the DP, and a concentration parameter $\alpha$. If you wish, $G_0$ can be regarded as an a priori "best guess" at the functional form of the random variable, and $\alpha$ as a measure of our confidence in our guess. So, as $\alpha$ grows large, the DP resembles the functional form given by $G_0$.
To see how we sample from a Dirichlet process, it is helpful to consider the constructive definition of the DP. There are several representations of this, which include the Blackwell-MacQueen urn scheme, the stick-breaking process and the Chinese restaurant process. For our purposes, I will consider the stick-breaking representation of the DP. This involves breaking the support of a particular variable into $k$ disjoint segments. The first break occurs at some point $x_0$, determined stochastically; the first piece of the notional "stick" is taken as the first group in the process, while the second piece is, in turn, broken at some selected point $x_1$ along its length. Here too, one piece is assigned to be the second group, while the other is subjected to the next break, and so on, until $k$ groups are created. Associated with each piece is a probability that is proportional to its length; these $k$ probabilities will have a Dirichlet distribution -- hence, the name of the process. Notice that $k$ can be infinite, making $G$ an infinite mixture.
We require two random samples to generate a DP. First, take a draw of values from the baseline distribution:
$$ \theta_1, \theta_2, \ldots \sim G_0 $$
then, a set of draws $v_1, v_2, \ldots$ from a $\text{Beta}(1,\alpha)$ distribution. These beta random variates are used to assign probabilities to the $\theta_i$ values, according to the stick-breaking analogy. So, the probability of $\theta_1$ corresponds to the first "break", and is just $p_1 = v_1$. The next value corresponds to the second break, which is a proportion of the remainder from the first break, or $p_2 = (1-v_1)v_2$. So, in general:
$$ p_i = v_i \prod_{j=1}^{i-1} (1 - v_j) $$
These probabilities correspond to the set of draws from the baseline distribution, where each of the latter are point masses of probability. So, the DP density function is:
$$ g(x) = \sum_{i=1}^{\infty} p_i I(x=\theta_i) $$
where $I$ is the indicator function. So, you can see that the Dirichlet process is discrete, despite the fact that its values may be non-integer. This can be generalized to a mixture of continuous distributions, which is called a DP mixture, but I will focus here on the DP alone.
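The recipe above translates almost directly into NumPy. The sketch below draws one realization of a truncated DP; the standard normal baseline $G_0$, the value of $\alpha$, and the number of atoms are arbitrary choices of mine for illustration, and the function name is hypothetical.

```python
import numpy as np

def sample_truncated_dp(alpha, n_atoms, rng):
    """Draw atoms and weights of a truncated DP with a N(0, 1) baseline (assumed)."""
    theta = rng.normal(size=n_atoms)           # theta_i ~ G_0
    v = rng.beta(1.0, alpha, size=n_atoms)     # v_i ~ Beta(1, alpha)
    # Length of stick remaining before each break: prod_{j<i} (1 - v_j)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    p = v * remaining                          # p_i = v_i * prod_{j<i} (1 - v_j)
    return theta, p

rng = np.random.default_rng(1)
theta, p = sample_truncated_dp(alpha=2.0, n_atoms=50, rng=rng)

# Probability mass not captured by the first 50 atoms
leftover = 1.0 - p.sum()
```

With a small $\alpha$ the first few weights dominate and `leftover` is tiny; increasing `alpha` spreads the mass over many more atoms.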
Example: Estimating household radon levels
As an example of implementing Dirichlet processes for random effects, I'm going to use the radon measurement and remediation example from Gelman and Hill (2006). This problem uses measurements of radon (a carcinogenic, radioactive gas) from households in 85 counties in Minnesota to estimate the distribution of the substance across the state. This dataset has a natural hierarchical structure, with individual measurements nested within households, and households in turn nested within counties. Here, we are certainly interested in modeling the variation in counties, but do not have covariates measured at that level. Since we are more interested in the variation among counties, rather than the particular levels for each, a random effects model is appropriate. Whit Armstrong was kind enough to code several parametrizations of this model in PyMC, so I will use his code as a basis for implementing a non-parametric random effect for radon levels among counties.
In the original example from Gelman and Hill, measurements are modeled as being normally distributed, with a mean that is a hierarchical function of both a county-level random effect and a fixed effect that accounted for whether houses had a basement (this is thought to increase radon levels).
$$ y_i \sim N(\alpha_{j[i]} + \beta x_i, \sigma_y^2) $$
So, in essence, each county has its own intercept, but a single slope is shared among all counties. This can easily be generalized to both random slopes and intercepts, but I'm going to keep things simple, in order to focus on implementing a single random effect.
The constraint that is applied to the intercepts in Gelman and Hill's original model is that they have a common distribution (Gaussian) that describes how they vary from the state-wide mean.
$$ \alpha_j \sim N(\mu_{\alpha}, \sigma_{\alpha}^2) $$
This comprises a so-called "partial pooling" model, whereby counties are neither constrained to have identical means (full pooling) nor are assumed to have completely independent means (no pooling); in most applications, the truth is somewhere between these two extremes. Though this is a very flexible approach to accounting for county-level variance, one might be worried about imposing such a restrictive (thin-tailed) distribution like the normal on this variance. If there are counties that have extremely low or high levels (for whatever reason), this model will fit poorly. To allay such worries, we can hedge our bets by selecting a more forgiving functional form, such as Student's t or Cauchy, but these still impose parametric restrictions (e.g. symmetry about the mean) that we may be uncomfortable making. So, in the interest of even greater flexibility, we will replace the normal county random effect with a non-parametric alternative, using a Dirichlet process.
One of the difficulties in implementing DP computationally is how to handle an infinite mixture. The easiest way to tackle this is by using a truncated Dirichlet process to approximate the full process. This can be done by choosing a size $N$ that is sufficiently large that it will exceed the number of point masses required. By doing this, we are assuming
$$ \sum_{i=1}^{\infty} p_i I(x=\theta_i) \approx \sum_{i=1}^{N} p_i I(x=\theta_i) $$
Ohlssen et al. 2007 provide a rule of thumb for choosing $N$ such that the sum of the first $N-1$ point masses is greater than 0.99:
$$ N \approx 5\alpha + 2 $$
To be conservative, we will choose an even larger value (100), which we will call `N_dp`. The truncation makes implementation of DP in PyMC (or JAGS/BUGS) relatively simple.
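As a rough sanity check on the choice of 100 (my own calculation, not one taken from Ohlssen et al.), note that the $v_i$ are independent with $E[1 - v_i] = \alpha / (1 + \alpha)$, so the expected mass left over after $m$ breaks is $(\alpha / (1 + \alpha))^m$. The helper name below is hypothetical.

```python
def expected_leftover_mass(alpha, n_breaks):
    """Expected stick length remaining after n_breaks Beta(1, alpha) breaks."""
    return (alpha / (1.0 + alpha)) ** n_breaks

N_dp = 100

# Evaluate over a range of concentration values; the last weight absorbs
# whatever remains after the first N_dp - 1 breaks
leftover_by_alpha = {
    alpha: expected_leftover_mass(alpha, N_dp - 1)
    for alpha in (0.5, 5.0, 10.0)
}
```

Even at $\alpha = 10$, the expected leftover mass is well below one percent, so the truncation should cost very little.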
We first must specify the baseline distribution and the concentration parameter. As we have no prior information to inform a choice for $\alpha$, we will specify a uniform prior for it, with reasonable bounds:
alpha = pymc.Uniform('alpha', lower=0.5, upper=10)
Though the upper bound may seem small for a prior that purports to be uninformative, recall that for large values of $\alpha$, the DP will converge to the baseline distribution, suggesting that a continuous distribution would be more appropriate.
Since we are extending a normal random effects model, I will choose a normal baseline distribution, with vague hyperpriors:
mu_0 = pymc.Normal('mu_0', mu=0, tau=0.01, value=0)
sig_0 = pymc.Uniform('sig_0', lower=0, upper=100, value=1)
tau_0 = sig_0 ** -2
theta = pymc.Normal('theta', mu=mu_0, tau=tau_0, size=N_dp)
Notice that I have specified a uniform prior on the standard deviation, rather than the more common gamma-distributed precision; for hierarchical models this is good practice. So, now that we have `N_dp` point masses, all that remains is to generate corresponding probabilities. Following the recipe above:
v = pymc.Beta('v', alpha=1, beta=alpha, size=N_dp)
@pymc.deterministic
def p(v=v):
    """ Calculate Dirichlet probabilities """
    # Probabilities from betas
    value = [u*np.prod(1-v[:i]) for i,u in enumerate(v)]
    # Enforce sum to unity constraint
    value[-1] = 1-sum(value[:-1])
    return value
This is where you really appreciate Python's list comprehension idiom. In fact, were it not for the fact that we wanted to ensure that the array of probabilities sums to one, `p` could have been specified in a single line.
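As an aside, the same weights can be computed without a Python-level loop by using a cumulative product. This is an alternative sketch of my own rather than the code used in the model; `stick_weights` is a hypothetical name, and the comparison against the list comprehension is included only to show that the two agree.

```python
import numpy as np

def stick_weights(v):
    """Vectorized stick-breaking weights, with the last weight taking the remainder."""
    v = np.asarray(v, dtype=float)
    # prod_{j<i} (1 - v_j), with an empty product of 1 for the first element
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    p = v * remaining
    p[-1] = 1.0 - p[:-1].sum()    # enforce the sum-to-one constraint
    return p

# Compare with the list comprehension on some made-up beta draws
v_demo = np.random.default_rng(3).beta(1.0, 2.0, size=10)
loop_version = [u * np.prod(1 - v_demo[:i]) for i, u in enumerate(v_demo)]
loop_version[-1] = 1 - sum(loop_version[:-1])
```

For `N_dp = 100` the difference in speed is negligible, but the cumulative product avoids recomputing each partial product from scratch.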
The final step involves using the Dirichlet probabilities to generate indices to the appropriate point masses. This is realized using a categorical mass function:
z = pymc.Categorical('z', p, size=len(set(counties)))
These indices, in turn, are used to index the random effects, which are used as random intercepts for the model:
a = pymc.Lambda('a', lambda z=z, theta=theta: theta[z])
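To see what this indexing step does, here is a toy illustration in plain NumPy, separate from the model itself. The four point masses and their probabilities are made-up values; the point is only that many counties end up sharing the same intercept, which is how the DP induces clustering.

```python
import numpy as np

rng = np.random.default_rng(7)

theta_demo = np.array([-1.2, 0.3, 0.9, 2.5])   # point masses (made-up values)
p_demo = np.array([0.5, 0.3, 0.15, 0.05])      # their probabilities
n_counties = 85

# One categorical index per county, then look up the matching point mass
z_demo = rng.choice(theta_demo.size, size=n_counties, p=p_demo)
a_demo = theta_demo[z_demo]

# Number of distinct intercepts actually in use
n_clusters = np.unique(z_demo).size
```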
Substitution of the above code into Gelman and Hill's original model produces reasonable results. The expected value of $\alpha$ is approximately 5, as shown by the posterior output below:
Here is a random sample taken from the DP:
But is the model better? One metric for model comparison is the deviance information criterion (DIC), which appears to strongly favor the DP random effect (smaller values are better):
In [11]: M.dic
Out[11]: 2138.7806225675804
In [12]: M_dp.dic
Out[12]: 1993.0894265799602
If you are interested in viewing the model code in its entirety, I have uploaded it to my fork of Whit's code.
Statisticians are sometimes interested in comparing two (or more) models, with respect to their relative support by a particular dataset. This may be in order to select the best model to use for inference, or to weight models so that they can be averaged for use in multimodel inference.
The Bayes factor is a good choice for comparing two arbitrary models once the parameters of those models have been estimated. Bayes factors are simply ratios of marginal likelihoods for competing models:
$$ \text{BF}_{i,j} = \frac{L(Y \mid M_i)}{L(Y \mid M_j)} = \frac{\int L(Y \mid M_i,\theta_i)\,p(\theta_i \mid M_i)\,d\theta_i}{\int L(Y \mid M_j,\theta_j)\,p(\theta_j \mid M_j)\,d\theta_j} $$
While passingly similar to likelihood ratios, Bayes factors are calculated using likelihoods that have been integrated with respect to the unknown parameters. In contrast, likelihood ratios are calculated based on the maximum likelihood values of the parameters. This is an important difference, which makes Bayes factors a more effective means of comparing models, since they take into account parametric uncertainty; likelihood ratios ignore this uncertainty. In addition, unlike likelihood ratios, Bayes factors do not require the two models to be nested. In other words, one model does not have to be a special case of the other.
Bayes factors are called Bayes factors because they are used in a Bayesian context by updating prior odds with information from data.
Posterior odds = Bayes factor x Prior odds
Hence, they represent the evidence in the data for changing the prior odds of one model over another. It is this interpretation as a measure of evidence that makes the Bayes factor a compelling choice for model selection.
One of the obstacles to the wider use of Bayes factors is the difficulty associated with calculating them. While likelihood ratios can be obtained simply by the use of MLEs for all model parameters, Bayes factors require the integration over all unknown model parameters. Hence, for most interesting models Markov chain Monte Carlo (MCMC) is the easiest way to obtain Bayes factors.
Here's a quick tutorial on how to obtain Bayes factors from PyMC. I'm going to use a simple example taken from Chapter 7 of Link and Barker (2010). Consider a short vector of data, consisting of 5 integers:
:::python
Y = array([0,1,2,3,8])
We wish to determine which of two functional forms best models this dataset. The first is a geometric model:
$$ f(x|p) = (1-p)^x p $$
and the second a Poisson model:
$$ f(x|\mu) = \frac{\mu^x e^{-\mu}}{x!} $$
Both describe the distribution of non-negative integer data, but differ in that the variance of Poisson data is equal to the mean, while the geometric model describes variance greater than the mean. For this dataset, the sample variance would suggest that the geometric model is favored, but the sample is too small to say so definitively.
In order to calculate Bayes factors, we require both the prior and posterior odds:
Bayes factor = Posterior odds / Prior odds
The Bayes factor does not depend on the value of the prior model weights, but the estimate will be most precise when the posterior odds are close to one (that is, when the two models have similar posterior probability). For our purposes, we will give 0.1 probability to the geometric model, and 0.9 to the Poisson model:
:::python
pi = (0.1, 0.9)
Next, we need to specify a latent variable, which identifies the true model (we don't believe either model is "true", but we hope one is better than the other). This is easily done using a categorical random variable, that identifies one model or the other, according to their relative weight.
:::python
true_model = Categorical("true_model", p=pi, value=0)
Here, we use the specified prior weights as the categorical probabilities, and the variable has been arbitrarily initialized to zero (the geometric model).
Next, we need prior distributions for the parameters of the two models. For the Poisson model, the expected value is given a uniform prior on the interval [0,1000]:
:::python
mu = Uniform('mu', lower=0, upper=1000, value=4)
This stochastic node can be used for the geometric model as well, though it needs to be transformed for use with that distribution:
:::python
p = Lambda('p', lambda mu=mu: 1/(1+mu))
Finally, the data are incorporated by specifying the appropriate likelihood. We require a mixture of geometric and Poisson likelihoods, depending on which value true_model takes. While BUGS requires an obscure trick to implement such a mixture, PyMC allows for the specification of arbitrary stochastic nodes:
:::python
@observed
def Ylike(value=Y, mu=mu, p=p, M=true_model):
    """Either Poisson or geometric, depending on M"""
    if M == 0:
        # Geometric model; counts are shifted by one to match its support
        return geometric_like(value + 1, p)
    # Poisson model
    return poisson_like(value, mu)
Notice that the function returns the geometric likelihood when M=0, or the Poisson likelihood otherwise. Now, all that remains is to run the model, and extract the posterior quantities to calculate the Bayes factor.
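Written out in plain Python, the switching logic looks like the sketch below. The two log-likelihood functions are my own stand-ins for PyMC's geometric_like and poisson_like (which also return log-likelihoods), and the counts are made up:

```python
from math import lgamma, log


def poisson_loglike(values, mu):
    """Poisson log-likelihood, summed over the observations."""
    return sum(x * log(mu) - mu - lgamma(x + 1) for x in values)


def geometric_loglike(values, p):
    """Geometric log-likelihood for trial counts x = 1, 2, ..."""
    return sum(log(p) + (x - 1) * log(1.0 - p) for x in values)


def mixture_loglike(counts, mu, model):
    """Geometric if model == 0, Poisson otherwise."""
    if model == 0:
        p = 1.0 / (1.0 + mu)
        # Shift the zero-based counts onto the 1-based geometric support
        return geometric_loglike([y + 1 for y in counts], p)
    return poisson_loglike(counts, mu)


counts = [0, 3, 1, 7, 2]  # hypothetical counts, not the dataset used here
print(mixture_loglike(counts, 4.0, 0), mixture_loglike(counts, 4.0, 1))
```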
Though we may be interested in the posterior estimate of the mean, all that we care about from a model selection standpoint is the estimate of true_model. At every iteration, this parameter takes the value zero for the geometric model and one for the Poisson. Hence, its posterior mean will be an estimate of the probability of the Poisson model:
:::python
In [11]: M.true_model.stats()['mean']
Out[11]: 0.39654545454545453
So, the posterior probability that the Poisson model is true is about 0.4, leaving 0.6 for the geometric model. The Bayes factor in favor of the geometric model is simply:
:::python
In [18]: p_pois = M.true_model.stats()['mean']
In [19]: ((1-p_pois)/p_pois) / (0.1/0.9)
Out[19]: 13.696011004126548
This value can be interpreted as strong evidence in favor of the geometric model.
If you want to run the model for yourself, you can download the code here.
A distinct advantage of Bayesian inference is its uniform use of probability models for inference. All components of a Bayesian model are specified using probability distributions, either describing a sampling model (in the case of observed data) or characterizing the uncertainty of an unknown quantity. This means that missing data are treated the same as parameters, and so imputation proceeds very much like estimation. When using Markov chain Monte Carlo (MCMC) to fit Bayesian models, it usually requires only a few extra lines of code to impute missing values, based on the sampling distribution of the missing data and the associated (usually unknown) parameters. Using PyMC built from the latest development code, missing data imputation can be done automatically.
The appropriate treatment of missing data depends strongly on how the data came to be missing from the dataset. These mechanisms can be broadly classified into three groups, according to how much information and effort is required to deal with them adequately.
If data are missing completely at random (MCAR), then the probability that any given datum is missing is equal over the whole dataset. In other words, each datum that is present had the same probability of being missing as each datum that is absent. This implies that ignoring the missing data will not bias inference.
Missing at random (MAR) allows for data to be missing according to a random process, but is more general than MCAR in that not all units have equal probabilities of being missing. The constraint here is that missingness may only depend on information that is fully observed. For example, the reporting of income on surveys may vary according to some measured factor, such as age, race or sex. We can thus account for heterogeneity in the probability of reporting income by controlling for the measured covariate in whatever model is used for inference.
When the probability of missing data varies according to information that is not available, the data are classified as missing not at random (MNAR). This can either be because suitable covariates for explaining missingness have not been recorded (or are otherwise unavailable) or because the probability of being missing depends on the value of the missing datum itself. Extending the previous example, if the probability of reporting income varied according to income itself, the income data would be missing not at random.
In each of these situations, the missing data may be imputed using a sampling model, though in the case of missing not at random, it may be difficult to validate the assumptions required to specify such a model. For the purposes of quickly demonstrating automatic imputation in PyMC, I will illustrate using data that are MCAR.
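To make the three mechanisms concrete, here is a small simulation in plain Python, following the income example above. Everything in it is hypothetical: the income distribution, the age effect and the missingness probabilities are made up purely for illustration.

```python
import random

random.seed(42)
n = 50000

# Hypothetical survey: the older group earns more on average
older = [random.random() < 0.5 for _ in range(n)]
income = [random.lognormvariate(0.5 if o else 0.0, 0.5) for o in older]


def mean(xs):
    return sum(xs) / len(xs)


def kept(p_missing):
    """Indices that survive when datum i goes missing with probability p_missing[i]."""
    return [i for i, p in enumerate(p_missing) if random.random() >= p]


# MCAR: every datum has the same chance of being missing
mcar = kept([0.3] * n)

# MAR: missingness depends only on the fully observed age group
mar = kept([0.6 if o else 0.1 for o in older])

# MNAR: missingness depends on the (unreported) income itself
mnar = kept([0.8 if x > 1.5 else 0.1 for x in income])

true_mean = mean(income)
mcar_mean = mean([income[i] for i in mcar])
mar_mean = mean([income[i] for i in mar])
mnar_mean = mean([income[i] for i in mnar])

# Under MAR, conditioning on the covariate removes the bias
true_older = mean([x for x, o in zip(income, older) if o])
mar_older = mean([income[i] for i in mar if older[i]])

print("all data:", true_mean, " MCAR:", mcar_mean)
print("MAR:", mar_mean, " MNAR:", mnar_mean)
print("older group, all data:", true_older, " MAR:", mar_older)
```

In this sketch the MCAR mean should sit close to the full-data mean, while the raw MAR and MNAR means are pulled downward. Within the older group, the MAR mean should recover the full-data value, which is the sense in which controlling for the measured covariate accounts for the missingness.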
One of the recurring examples in the PyMC documentation is the coal mining disasters dataset from Jarrett 1979. This is a simple longitudinal dataset consisting of counts of coal mining disasters in the U.K. between 1851 and 1962. The objective of the analysis is to identify a switch point in the rate of disasters, from a relatively high rate early in the time series to a lower one later on. Hence, we are interested in estimating two rates, in addition to the year after which the rate changed.
In order to illustrate imputation, I have randomly replaced the data for two years with a missing data placeholder value, -999:
import numpy as np

# Annual disaster counts; -999 marks the two years set to missing
disasters_array = np.array([ 4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
                             3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
                             2, 2, 3, 4, 2, 1, 3, -999, 2, 1, 1, 1, 1, 3, 0, 0,
                             1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
                             0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
                             3, 3, 1, -999, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
                             0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
Here, the np prefix indicates that the array function comes from the NumPy module. PyMC is able to recognize the presence of missing values when we use NumPy's MaskedArray class to contain our data. The masked array is instantiated via the masked_array function, using the original data array and a boolean mask as arguments:
masked_values = np.ma.masked_array(disasters_array,
mask=disasters_array==-999)
Of course, my use of -999 to indicate missing data was entirely arbitrary, so feel free to use any appropriate value, so long as it can be identified and masked (obviously, small positive integers would not have been appropriate here). Let's have a look at the masked array:
masked_array(data = [4 5 4 0 1 4 3 4 0 6 3 3 4 0 2 6 3 3 5 4 5 3 1 4
4 1 5 5 3 4 2 5 2 2 3 4 2 1 3 -- 2 1 1 1 1 3 0 0 1 0 1 1 0 0 3 1
0 3 2 2 0 1 1 1 0 1 0 1 0 0 0 2 1 0 0 0 1 1 0 2 3 3 1 -- 2 1 1 1
1 2 4 2 0 0 1 4 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1],
mask = [False False False False False False False False False
False False False False False False False False False False False
False False False False False False False False False False False
False False False False False False False False True False False
False False False False False False False False False False False
False False False False False False False False False False False
False False False False False False False False False False False
False False False False False False False False True False False
False False False False False False False False False False False
False False False False False False False False False False False
False False False],
fill_value = 999999)
Notice that the placeholder values have disappeared from the data, and the array has a mask attribute that identifies the indices for the missing values.
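For readers new to masked arrays, here is a short sketch of how the mask can be queried. It uses a small made-up series with the same -999 convention rather than the full disasters data:

```python
import numpy as np

# Small made-up series with the same -999 placeholder convention
counts = np.array([4, 5, -999, 0, 1, -999, 3])
masked = np.ma.masked_array(counts, mask=counts == -999)

print(np.flatnonzero(masked.mask))  # positions of the missing values
print(masked.count())               # number of non-missing entries
print(masked.compressed())          # observed values only
print(masked.mean())                # summaries skip the masked entries
```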
Beyond the construction of a masked array, there is nothing else that needs to be done to accommodate missing values in a PyMC model.
First, we need to specify prior distributions for the unknown parameters, which I call switch (the switch point), early (the early mean) and late (the late mean). An appropriate non-informative prior for the switch point is a discrete uniform random variable over the range of years represented by the data. Since the rates must be positive, I use identical weakly-informative exponential distributions:
# Switchpoint
switch = DiscreteUniform('switch', lower=0, upper=110)
# Early mean
early = Exponential('early', beta=1)
# Late mean
late = Exponential('late', beta=1)
The only tricky part of the model is assigning the appropriate rate parameter to each observation. Though the two rates and the switch point are stochastic, in the sense that we have used probability models to describe our uncertainty in their true values, the membership of each observation to either the early or late rate is a deterministic function of the stochastics. Thus, we set up a deterministic node that assigns a rate to each observation depending on the location of the switch point at the current iteration of the MCMC algorithm:
@deterministic
def rates(s=switch, e=early, l=late):
    """Allocate appropriate mean to time series"""
    out = np.empty(len(disasters_array))
    # Early mean prior to switchpoint
    out[:s] = e
    # Late mean following switchpoint
    out[s:] = l
    return out
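Stripped of the PyMC decorator, the allocation is ordinary NumPy slicing. The sketch below uses a hypothetical function name and made-up parameter values to show what the node returns for one setting of its parents:

```python
import numpy as np


def allocate_rates(switch, early, late, n_years):
    """Early rate before the switch point, late rate from it onward."""
    out = np.empty(n_years)
    out[:switch] = early
    out[switch:] = late
    return out


# Made-up values: six years, with the switch at index 2
print(allocate_rates(2, 3.0, 1.0, 6))
```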
Finally, the data likelihood comprises the annual counts of disasters being modeled as Poisson random variables, conditional on the parameters assigned in the rates node above. The masked array is specified as the value of the stochastic node, and flagged as data via the observed argument.
disasters = Poisson('disasters', mu=rates, value=masked_values, observed=True)
If we run the model, then query the disasters node for posterior statistics, we can obtain a summary of the estimated number of disasters in both of the missing years.
In [9]: DisasterModel.disasters.stats()
Out[9]:
{'95% HPD interval': array([[ 0., 6.],
[ 0., 3.]]),
'mc error': array([ 0.11645149, 0.03479713]),
'mean': array([ 2.2246, 0.91 ]),
'n': 5000,
'quantiles': {2.5: array([ 0., 0.]),
25: array([ 1., 0.]),
50: array([ 2., 1.]),
75: array([ 3., 1.]),
97.5: array([ 7., 3.])},
'standard deviation': array([ 1.88206133, 0.92536479])}
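Conceptually, each imputed value is a draw from the Poisson sampling distribution, given the rate that applies to the missing year at that iteration. The sketch below is not PyMC's internal code. It only mimics the idea with NumPy, using made-up gamma draws as stand-ins for the MCMC samples of the two rates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for MCMC draws of the rate in each missing year
rate_draws = np.column_stack([
    rng.gamma(shape=30.0, scale=0.1, size=5000),  # a year before the switch
    rng.gamma(shape=9.0, scale=0.1, size=5000),   # a year after the switch
])

# One imputed count per draw, sampled from the Poisson sampling distribution
imputed = rng.poisson(rate_draws)

print(imputed.mean(axis=0))
print(np.percentile(imputed, [2.5, 50, 97.5], axis=0))
```

Because the uncertainty in the rate is carried through to the counts, the spread of the imputed values is wider than that of a Poisson distribution with a single fixed rate.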
Clearly, this is a rather trivial example, but it serves to illustrate how easy it can be to deal with missing values in PyMC. Though not applicable here, it would be similarly easy to handle MAR data, by constructing a data likelihood whose parameter(s) is a function of one or more covariates.
Automatic imputation is a new feature in PyMC, and is currently available only in the development codebase. It will hopefully appear in the feature set of a future release.