Python API

High-Level API

High-level Python API for pypoLCA.

class pypolca.api.LCAResult(raw_results, formula=None, data=None, num_choices=None, y_mat=None)

Bases: object

Python-friendly wrapper around the C++ Results struct.

Parameters:
  • raw_results (Any)

  • formula (str | None)

  • data (DataFrame | None)

  • num_choices (list[int] | None)

  • y_mat (ndarray | None)

property probs: list[ndarray]

Class-conditional response probabilities.

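Returns: a list of J arrays, one per manifest variable. Entry j has shape (R, K_j), where R is the number of classes and K_j is the number of categories of variable j.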

property coeff: ndarray

Covariate coefficients.

Returns: ndarray of shape (S, R-1). Column r-1 = coefficients for class r (r >= 2). None if no covariates.
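The (S, R-1) layout is what a multinomial logit with class 1 as the reference class would produce, which is how R's poLCA links covariates to class membership. The sketch below assumes that parameterization carries over, and that the first row of the coefficient matrix is the intercept; `class_priors` is a hypothetical helper written for illustration, not part of pypolca.

```python
import math

def class_priors(x_row, coeff):
    """Map one covariate row to R class-membership probabilities.

    x_row: S values, assumed to start with 1.0 for the intercept.
    coeff: S rows by R-1 columns; class 1 is the reference class.
    """
    n_other = len(coeff[0])
    # Linear predictor for classes 2..R; the reference class is fixed at 0.
    eta = [sum(x * coeff[s][c] for s, x in enumerate(x_row)) for c in range(n_other)]
    denom = 1.0 + sum(math.exp(e) for e in eta)
    return [1.0 / denom] + [math.exp(e) / denom for e in eta]

# Illustrative values only: S = 2 (intercept + one covariate), R = 2 classes.
coeff = [[math.log(2.0)],   # intercept
         [0.0]]             # slope, zero so the covariate has no effect
priors = class_priors([1.0, 2.5], coeff)
```

With an intercept of log(2) and a zero slope, class 2 is twice as likely as class 1 whatever the covariate value.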

property npar: int
property probs_se: list[ndarray]
property P_se: ndarray
property coeff_se: ndarray

Standard errors of covariate coefficients.

Returns: ndarray of shape (S, R-1) matching .coeff layout. Empty array if no covariates.

property coeff_V: ndarray
property loglik: float
property iterations: int
property converged: bool
property posterior: ndarray

Training-data posterior class membership probabilities.

property prior: ndarray
property predclass: ndarray
predict_posterior(newdata, newx=None)

Compute posterior class membership probabilities for new data.

Parameters:
  • newdata (DataFrame)

  • newx (DataFrame | None)

Return type:

ndarray
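For a latent class model, the posterior for one respondent follows from Bayes' rule: each class share is multiplied by the class-conditional probability of every observed response, and the products are normalised. The sketch below shows that calculation for a single row, assuming responses are coded 1..K_j and the intercept-only case; with covariates, the shares would presumably be replaced by covariate-dependent priors. `posterior_for_row` is a hypothetical name, not a pypolca function.

```python
import math

def posterior_for_row(row, shares, probs):
    """Posterior class probabilities for one complete response row."""
    joint = [
        shares[r] * math.prod(probs[j][r][y - 1] for j, y in enumerate(row))
        for r in range(len(shares))
    ]
    total = sum(joint)
    return [v / total for v in joint]

# Illustrative parameters: 2 classes, 2 binary items.
shares = [0.5, 0.5]
probs = [
    [[0.9, 0.1], [0.2, 0.8]],   # item 1: rows are classes, columns are categories
    [[0.8, 0.2], [0.3, 0.7]],   # item 2
]
post = posterior_for_row([1, 1], shares, probs)
```

Here the joint weights are 0.36 for class 1 and 0.03 for class 2, so the row is assigned to class 1 with probability 0.36 / 0.39.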

property params: Any

Raw fitted parameters (vecprobs, beta).

property P: ndarray

Class population shares (marginal prior probabilities).

property aic: float
property bic: float
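Neither criterion is defined on this page. R's poLCA computes AIC as -2·loglik + 2·npar and BIC as -2·loglik + npar·ln(N); the sketch below assumes pypolca follows the same definitions. The input numbers are made up for illustration and are not results from any dataset.

```python
import math

def aic(loglik, npar):
    """Akaike information criterion."""
    return -2.0 * loglik + 2.0 * npar

def bic(loglik, npar, n_obs):
    """Bayesian information criterion; n_obs is the number of cases used."""
    return -2.0 * loglik + npar * math.log(n_obs)

# Made-up inputs, for illustration only.
example_aic = aic(-300.0, 15)
example_bic = bic(-300.0, 15, 118)
```

Lower values indicate a better trade-off between fit and complexity, so both are typically compared across fits with different nclass.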
property Gsq: float

Likelihood ratio deviance (G-squared) vs saturated model.
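As a rough illustration, the usual likelihood-ratio statistic is G² = 2 Σ O·ln(O/E), summed over response patterns with a positive observed count. The sketch below assumes that standard definition; `g_squared` is a hypothetical helper and the counts are invented.

```python
import math

def g_squared(observed, expected):
    """Likelihood-ratio deviance; cells with O = 0 contribute nothing."""
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

# Invented counts for three response patterns.
observed = [10, 30, 0]
expected = [20.0, 15.0, 5.0]
gsq = g_squared(observed, expected)
perfect = g_squared(observed, [10.0, 30.0, 1.0])
```

A model that reproduces every observed count exactly gives G² = 0, which is why the statistic is read as distance from the saturated model.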

property Chisq: float

Pearson chi-square goodness-of-fit.

Includes the correction term (N - sum(expected)) for unobserved response patterns, where O = 0 and E > 0, matching the behavior of R's poLCA.
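The correction works because an unobserved pattern contributes (0 - E)²/E = E to the Pearson sum, and the expected counts of all unobserved patterns add up to N minus the expected counts of the observed ones. A minimal sketch of that arithmetic, with a hypothetical helper and invented counts:

```python
def pearson_chisq(observed, expected, n):
    """Pearson chi-square over observed patterns plus the unobserved-cell term."""
    core = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Every unobserved pattern adds its own expected count to the statistic.
    correction = n - sum(expected)
    return core + correction

# Two observed patterns; the remaining expected mass (1.0) sits on unobserved ones.
chisq = pearson_chisq([6, 4], [5.0, 4.0], n=10)
```

In this example the observed cells contribute 0.2 and the unobserved cells contribute 1.0.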

property predcell: tuple[ndarray, ndarray, ndarray]

Returns (observed, expected, patterns) for each unique complete response pattern.
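To make the three arrays concrete, the sketch below tabulates unique patterns and computes their model-implied counts as N · Σ_r P_r Π_j probs. It assumes complete rows coded 1..K_j and sorts the patterns lexicographically; the ordering pypolca actually uses is not stated here, and `pattern_table` is a hypothetical helper.

```python
import math
from collections import Counter

def pattern_table(rows, shares, probs):
    """Return (observed, expected, patterns) for each unique response pattern."""
    counts = Counter(tuple(r) for r in rows)
    patterns = sorted(counts)          # ordering is an assumption of this sketch
    n = len(rows)
    observed = [counts[p] for p in patterns]
    expected = [
        n * sum(
            shares[r] * math.prod(probs[j][r][y - 1] for j, y in enumerate(p))
            for r in range(len(shares))
        )
        for p in patterns
    ]
    return observed, expected, patterns

rows = [(1, 1), (1, 1), (2, 2), (1, 2)]
shares = [0.5, 0.5]
probs = [
    [[0.9, 0.1], [0.2, 0.8]],   # item 1: rows are classes, columns are categories
    [[0.8, 0.2], [0.3, 0.7]],   # item 2
]
observed, expected, patterns = pattern_table(rows, shares, probs)
```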

property resid_df: int

Residual degrees of freedom for goodness-of-fit tests (intercept-only models).
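In R's poLCA the residual degrees of freedom are min(N, Π K_j - 1) minus the number of estimated parameters, and an intercept-only model has R·Σ(K_j - 1) + (R - 1) parameters. The sketch below assumes pypolca uses the same convention; both helper names are hypothetical.

```python
import math

def npar_intercept_only(num_choices, nclass):
    """Free response probabilities per class plus the free class shares."""
    return nclass * sum(k - 1 for k in num_choices) + (nclass - 1)

def residual_df(n_obs, num_choices, nclass):
    """min(N, number of free cells) minus the number of parameters."""
    cells = math.prod(num_choices) - 1
    return min(n_obs, cells) - npar_intercept_only(num_choices, nclass)

# Seven binary items, 118 cases, two classes (the shape of the carcinoma data).
npar = npar_intercept_only([2] * 7, 2)
df = residual_df(118, [2] * 7, 2)
```

With 118 cases and 127 free cells, the sample size is the binding limit, giving 118 - 15 = 103.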

property Nobs: int

Number of fully observed cases (no missing values in any manifest variable).
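A small sketch of the counting rule, using None to stand in for a missing value (a DataFrame would normally hold NaN or null instead); `count_complete` is a hypothetical helper.

```python
def count_complete(rows):
    """Count rows in which every manifest variable is observed."""
    return sum(all(v is not None for v in row) for row in rows)

rows = [(1, 2, 1), (2, None, 1), (1, 1, 2)]
nobs = count_complete(rows)
```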

pypolca.api.fit(formula, data, nclass=2, maxiter=1000, tol=1e-10, verbose=False, na_rm=True, probs_start=None, beta_start=None, nrep=1, seed=None, max_restarts=100, calc_se=True)

Fit a latent class model.

Parameters:
  • formula (str) – Patsy-style formula, e.g. "cbind(Y1, Y2, Y3) ~ 1" or "Y1 + Y2 ~ X1 + X2". Left-hand side gives manifest variables; right-hand side gives covariates.

  • data (pd.DataFrame) – Data frame containing all variables.

  • nclass (int) – Number of latent classes.

  • maxiter (int) – Maximum EM iterations.

  • tol (float) – Log-likelihood convergence tolerance.

  • verbose (bool) – Print iteration progress.

  • na_rm (bool) – Drop rows with any missing values.

  • probs_start (np.ndarray, optional) – Starting values for class-conditional response probabilities.

  • beta_start (np.ndarray, optional) – Starting values for covariate coefficients.

  • nrep (int) – Number of replications with different random starting values (like R’s nrep).

  • seed (int, optional) – Random seed for the first replication. If None, a random seed is drawn.

  • max_restarts (int) – Maximum restarts per replication when a likelihood drop occurs (R retries indefinitely; this is a safety cap).

  • calc_se (bool) – Whether to compute standard errors (default True).

Returns:

Fitted model result object.

Return type:

LCAResult
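To show how maxiter, tol, and seed interact, here is a toy EM loop for an intercept-only latent class model in plain Python. It is a sketch under stated assumptions, not pypolca's implementation (which wraps C++ code): responses are coded 1..K_j, there are no missing values and no covariates, and restarts after a likelihood drop are not modelled. `em_lca` and the toy data are invented for illustration.

```python
import math
import random

def em_lca(y, num_choices, nclass=2, maxiter=1000, tol=1e-10, seed=0):
    """Toy EM; returns (shares, probs, loglik, iterations, converged)."""
    rng = random.Random(seed)
    n = len(y)
    shares = [1.0 / nclass] * nclass
    # probs[j][r][k]: random starting values, normalised over categories k.
    probs = []
    for k_j in num_choices:
        item = []
        for _ in range(nclass):
            raw = [rng.random() + 0.1 for _ in range(k_j)]
            item.append([v / sum(raw) for v in raw])
        probs.append(item)

    prev = -math.inf
    for it in range(1, maxiter + 1):
        # E-step: posterior memberships and the log-likelihood of current values.
        post, loglik = [], 0.0
        for row in y:
            joint = [
                shares[r] * math.prod(probs[j][r][v - 1] for j, v in enumerate(row))
                for r in range(nclass)
            ]
            total = sum(joint)
            loglik += math.log(total)
            post.append([v / total for v in joint])
        # Stop once the log-likelihood gain falls below tol.
        if loglik - prev < tol:
            return shares, probs, loglik, it, True
        prev = loglik
        # M-step: posterior-weighted class shares and response proportions.
        weight = [sum(p[r] for p in post) for r in range(nclass)]
        shares = [w / n for w in weight]
        for j, k_j in enumerate(num_choices):
            for r in range(nclass):
                probs[j][r] = [
                    sum(p[r] for p, row in zip(post, y) if row[j] == c + 1) / weight[r]
                    for c in range(k_j)
                ]
    # maxiter reached: prev is the log-likelihood before the final M-step.
    return shares, probs, prev, maxiter, False

# Invented data: two clear response clusters on three binary items.
y = [(1, 1, 1)] * 20 + [(2, 2, 2)] * 20 + [(1, 1, 2)] * 5 + [(2, 2, 1)] * 5
shares, probs, loglik, iterations, converged = em_lca(y, [2, 2, 2])
```

Because EM only finds a local maximum, repeating the loop with different seeds and keeping the fit with the highest log-likelihood is the usual remedy, which is the role nrep plays in fit().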

Utilities

Utility functions for formula parsing and data preparation.

pypolca.utils.build_design_matrix(formula, data, na_rm=True)

Parse a simple formula and build design matrices.

Supports:

  • "Y1 + Y2 + Y3 ~ 1" -> intercept only (no covariates)

  • "Y1 + Y2 ~ X1 + X2" -> covariates

  • "cbind(Y1, Y2, Y3) ~ 1" -> R-style cbind on LHS

Returns:

  • y (np.ndarray, shape (N, J))

  • x (np.ndarray, shape (N, S))

  • num_choices (list of int) – Number of categories for each manifest variable.

Parameters:
  • formula (str)

  • data (DataFrame)

  • na_rm (bool)

Return type:

tuple[ndarray, ndarray, list[int]]
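The three supported formula shapes can be split with a few string operations. The sketch below is one possible way to do it, not the parser pypolca ships; `split_formula` and `count_choices` are hypothetical helpers, and taking each column's maximum as its number of categories follows R's poLCA convention, which is assumed here.

```python
import re

def split_formula(formula):
    """Split 'lhs ~ rhs' into (manifest names, covariate names)."""
    lhs, rhs = (side.strip() for side in formula.split("~", 1))
    # Accept either R-style cbind(Y1, Y2) or Y1 + Y2 on the left-hand side.
    match = re.fullmatch(r"cbind\((.*)\)", lhs)
    if match:
        manifest = [name.strip() for name in match.group(1).split(",")]
    else:
        manifest = [name.strip() for name in lhs.split("+")]
    # A right-hand side of "1" means intercept only.
    covariates = [] if rhs == "1" else [name.strip() for name in rhs.split("+")]
    return manifest, covariates

def count_choices(y_rows):
    """Number of categories per manifest variable, taken as the column maximum."""
    return [max(column) for column in zip(*y_rows)]

manifest, covariates = split_formula("cbind(Y1, Y2, Y3) ~ 1")
num_choices = count_choices([(1, 2, 3), (2, 1, 1), (1, 2, 2)])
```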

Datasets

Built-in datasets from R’s poLCA package.

All datasets are re-exported from R’s poLCA (GPL-2.0-or-later, compatible with this package) as CSV files. Use load_dataset() with a Dataset enum member to load one as a Polars DataFrame.

Usage:

from pypolca.data import load_dataset, Dataset

df = load_dataset(Dataset.CARCINOMA)
# or by name:
df = load_dataset("carcinoma")

from pypolca import fit
result = fit("cbind(A,B,C,D,E,F,G) ~ 1", df, nclass=2)

pypolca.data._dataset.load_dataset(name)

Load a built-in dataset as a Polars DataFrame.

Parameters:

name (Dataset or str) – Dataset to load, e.g. Dataset.CARCINOMA or "carcinoma".

Return type:

pl.DataFrame

Raises:

ValueError – If name is not a valid dataset.

Examples

>>> from pypolca.data import load_dataset, Dataset
>>> df = load_dataset(Dataset.CARCINOMA)
>>> df.shape
(118, 7)

pypolca.data._dataset.get_dataset_info(name)

Return metadata for a dataset (description, columns, source, example).

Parameters:

name (Dataset or str) – Dataset name.

Returns:

Keys: description (str), columns (dict), source (str), example_formula (str), nclass_example (int).

Return type:

dict

class pypolca.data._dataset.Dataset(*values)

Built-in datasets available for loading.

CARCINOMA

Dichotomous ratings by seven pathologists of 118 slides for the presence or absence of carcinoma in the uterine cervix. Columns: A–G (1=no, 2=yes). Source: Agresti (2002), Table 13.1.

Type:

str

CHEATING

319 undergraduate students surveyed on chronic cheating behavior. Columns: LIEEXAM, LIEPAPER, FRAUD, COPYEXAM (1=no, 2=yes), GPA (1–5).

Type:

str

ELECTION

2000 American National Election Study survey, 1,785 respondents. 12 trait ratings (MORALG–INTELB, 1–4) for Gore and Bush, plus VOTE3, AGE, EDUC, GENDER, PARTY covariates.

Type:

str

GSS82

1,202 white respondents to the 1982 General Social Survey. Columns: PURPOSE (1–3), ACCURACY (1–2), UNDERSTA (1–3), COOPERAT (1–3). Source: McCutcheon (1987), Table 3.1.

Type:

str

VALUES

216 respondents on four dichotomous items measuring universalistic vs. particularistic values. Columns: A–D (1=universalistic, 2=particularistic).

Type:

str

CARCINOMA = 'carcinoma'
CHEATING = 'cheating'
ELECTION = 'election'
GSS82 = 'gss82'
VALUES = 'values'
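Since each member's value is a lowercase string, a name-or-member argument can be resolved with the standard enum value lookup, which raises ValueError for unknown names. The sketch below uses a local stand-in enum that mirrors the values listed above; whether the real class mixes in str, and whether load_dataset() normalises case, are assumptions not confirmed here. `resolve` is a hypothetical helper.

```python
from enum import Enum

class Dataset(str, Enum):
    """Local stand-in mirroring the documented member values."""
    CARCINOMA = "carcinoma"
    CHEATING = "cheating"
    ELECTION = "election"
    GSS82 = "gss82"
    VALUES = "values"

def resolve(name):
    """Return the enum member for either a member or its string value."""
    # Enum lookup by value raises ValueError when the value is unknown.
    return name if isinstance(name, Dataset) else Dataset(name)

member = resolve("carcinoma")
same = resolve(Dataset.GSS82)
```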