linmix package

Module contents

A hierarchical Bayesian approach to linear regression with error in both X and Y.

class linmix.LinMix(x, y, xsig=None, ysig=None, xycov=None, delta=None, K=3, nchains=4, parallelize=True, seed=None)

A class to perform linear regression of y on x when there are measurement errors in both variables. The regression assumes:

eta = alpha + beta * xi + epsilon

x = xi + xerr

y = eta + yerr

Here, alpha and beta are the regression coefficients, epsilon is the intrinsic random scatter about the regression, xerr is the measurement error in x, and yerr is the measurement error in y. epsilon is assumed to be normally-distributed with mean zero and variance sigsqr. xerr and yerr are assumed to be normally-distributed with means equal to zero, variances xsig`^2 and `ysig`^2, respectively, and covariance `xycov. The distribution of xi is modelled as a mixture of normals, with group proportions pi, means mu, and variances tausqr.

  • x (array_like) – The observed independent variable.
  • y (array_like) – The observed dependent variable.
  • xsig (array_like) – 1-sigma measurement errors in x.
  • ysig (array_like) – 1-sigma measurement errors in y.
  • xycov (array_like) – Covariance between the measurement errors in x and y.
  • delta (array_like) – Array indicating whether a data point is censored (i.e., not detected), or not. If delta[i] == 1, then the ith source is detected. If delta[i] == 0, then the ith source is not detected and y[i] will be interpreted as an upper limit. Note that if there are censored data points, then the maximum-likelihood estimate (alpha, beta, sigsqr) is not valid. By default, all data points are assumed to be detected.
  • K (int) – The number of Gaussians to use in the mixture model for the distribution of xi.
  • nchains (int) – The number of Monte Carlo Markov Chains to instantiate.
  • parallelize (bool) – Use a separate thread for each chain. Only makes sense for nchains > 1.
  • seed (int) – Random seed. If None, then get seed from np.random.randint().

int – The number of instantiated MCMCs.


numpy recarray – The concatenated MCMCs themselves. Actually, only the concatenation of the last half of each chain is stored here after convergence is reached. The recarray has the following columns:

  • alpha(float): The regression intercept.
  • beta(float): The regression slope.
  • sigsqr(float): The regression intrinsic scatter.
  • pi(array_like): The mixture model component fractions.
  • mu(array_like): The mixture model component means.
  • tausqr(array_like): The mixture model component variances.
  • mu0(float): The hyperparameter describing the prior variance of the distribution
    of mixture means.
  • usqr(float): The hyperparameter describing the prior variance of the distribution
    of mixture variances.
  • wsqr(float): The hyperparameter describing the typical scale for the prior on
    usqr and tausqr.
  • ximean(float): The mean of the distribution for the independent latent variable
  • xisig(float): The standard deviation of the distribution for the independent
    latent variable xi.
  • corr(float): The linear correlation coefficient between the latent dependent and
    independent variables xi and eta.
run_mcmc(miniter=5000, maxiter=100000, silent=False)

Run the Markov Chain Monte Carlo for the LinMix object.

Bayesian inference is employed, and a Markov chain containing random draws from the posterior is developed. Convergence of the MCMC to the posterior is monitored using the potential scale reduction factor (RHAT, Gelman et al. 2004). In general, when RHAT < 1.1 then approximate convergence is reached. After convergence is reached, the second halves of all chains are concatenated and stored in the .chain attribute as a numpy recarray.

  • miniter (int) – The minimum number of iterations to use.
  • maxiter (int) – The maximum number of iterations to use.
  • silent (bool) – If true, then suppress updates during sampling.