# The First Release of PyMC3

On Monday morning the PyMC dev team pushed the first release of PyMC3, the culmination of over 5 years of collaborative work. We are very pleased to be able to provide a stable version of the package to the Python scientific computing community. For those of you unfamiliar with the history and progression of this project, PyMC3 is a complete re-design and re-write of the PyMC code base, which was primarily the product of the vision and work of John Salvatier. While PyMC 2.3 is still actively maintained and used (I continue to work with it in a number of project myself), this new incarnation allows us to be able to provide newer methods for Bayesian computation to a degree that would have been impossible impossible previously.

While PyMC2 relied on Fortran extensions (via `f2py`) for most of the computational heavy-lifting, PyMC3 leverages Theano, a library from the LISA lab for array-based expression evaluation, to perform its computation. What this provides, above all else, is fast automatic differentiation, which is at the heart of the gradient-based sampling and optimization methods currently providing inference for probabilistic programming. While the addition of Theano adds a level of complexity to the development of PyMC, fundamentally altering how the underlying computation is performed, we have worked hard to maintain the elegant simplicity of the original PyMC model specification syntax. Since the beginning (over 13 years ago now!), we have tried to provide a simple, black-box interface to model-building, in the sense that the user need only concern herself with the modeling problem at hand, rather than with the underlying computer science.

As a point of comparison, here is what a simple hierarchical model (taken from Gelman et al.'s book) looked like under PyMC 2.3:

```# Priors
alpha = Normal('alpha', 0, 0.01)
beta = Normal('beta', 0, 0.01)

# Transformed variables
theta = Lambda('theta', lambda a=alpha, b=beta, d=dose: invlogit(a + b * d))

# Data likelihood
deaths = Binomial('deaths', n=n, p=theta, value=array([0,1,3,5]), observed=True)

# Instantiate a sampler, and run
M = MCMC(locals())
M.sample(10000, burn=5000)
```

and here is the same model in PyMC3:

```with Model() as bioassay_model:

alpha = Normal('alpha', 0, sd=100)
beta = Normal('beta', 0, sd=100)

theta = invlogit(alpha + beta*dose)

deaths = Binomial('deaths', n=n, p=theta, observed=array([0,1,3,5]))

trace = sample(2000)
```

If anything, the model specification has simplified, for the majority of models.

Though the version 2 and version 3 models are superficially similar (by design), there are very different things happening underneath when `sample`is called in either case. By default, the PyMC3 model will use a form of gradient-based MCMC sampling, a self-tuning form of Hamiltonian Monte Carlo, called NUTS. Gradient based methods serve to drastically improve the efficiency of MCMC, without the need for running long chains and dropping large portions of the chains due to lack of convergence. Rather than conditionally sampling each model parameter in turn, the NUTS algorithm walks in k-space (where k is the number of model parameters), simultaneously updating all the parameters as it leap-frogs through the parameter space. Models of moderate complexity and size that would normally require 50,000 to 100,000 iterations now typically require only 2000-3000.

When we run the PyMC3 version of the model above, we see this:

```Auto-assigning NUTS sampler...
Average ELBO = -6.2597: 100%|████████████████████████████████████████| 200000/200000 [00:11<00:00, 16873.12it/s]
Finished [100%]: Average ELBO = -6.27
100%|██████████████████████████████████████████████████████████████████████| 2000/2000 [00:02<00:00, 928.24it/s]
```

Unless specified otherwise, PyMC3 will assign the NUTS sampler to all the variables of the model. This happens here because our model contains only continuous random variables; NUTS will not work with discrete variables because it is impossible to obtain gradient information from them. Discrete variables are assigned the `Metropolis`sampling algorithm (step method, in PyMC parlance). The next thing that happens is that the variables' initial values are assigned using Automatic Differentiation Variational Inference (ADVI). This is an approximate Bayesian inference algorithm that we have added to PyMC — more on that later. Though it can be used for inference in its own right, here we are using it merely to find good starting values for NUTS (in practice, this is important for getting NUTS to run well). Its an excessive step for small models like this, but it is the default behavior, designed to try and guarantee a good MCMC run.

Another nice innovation includes some new plotting functions for visualizing the posterior distributions obtained with the various estimation methods. Let's look at the regression parameters from our fitted model:

```plot_posterior(trace, varnames=['beta', 'alpha'])
```

`plot_posterior` generates histograms of the posterior distribution that is annotated with summary statistics of interest, in the style of John Kruschke's book. This is just one of several options for visualizing output.

The addition of variational inference (VI) methods in version 3.0 is a transformative change to the sorts of problems you can tackle with PyMC3. I showed it being used to intialize a model that was ultimately fit using MCMC, but variational inference can be used as a tool for obtaining statistical inference in its own right. Just as MCMC approximates a complex posterior by drawing dependent samples from its posterior distribution, variational inference performs an approximation by replacing the true posterior with a more tractable form, then iteratively changes the approximation so that it resembles the posterior distribution as closely as it can, in terms of the information distance between the two distributions. Where MCMC uses sampling, VI uses optimization to estimate the posterior distribution. The benefit to you in doing this is that Bayesian models informed by very large datasets can be fit in a reasonable amount of time (MCMC notoriously scales poorly with data size); the drawback is that you only get an approximation to the posterior, and that appoximation can be unacceptably poor for some applications. Nevertheless, improvements to variational inference methods continue to roll in, and some have the potential to drastically improve the quality of the approximation. The key advance that allowed PyMC3 to implement variational methods was the development of automated algorithms for specifying a variational approximation generally, across a wide variety of models. In particular, Alp Kucukelbir and colleagues' introduction of Automatic Differentiation Variational Inference (ADVI) two years ago made VI relatively easy to apply to arbitrary models (again, assuming the model variables are continuous). Here it is, in action, fitting the same model we used NUTS to estimate before:

```with model:
```
```Average ELBO = -6.2765: 100%|████████████████████████████████████████████████| 100000/100000 [00:05<00:00, 17072.45it/s]
Finished [100%]: Average ELBO = -6.2835
```

ADVI returns the means and standard deviations of the approximating distribution after it has converged to the best approximation. These values can be used to sample from the disribtution:

```with model: