Python API

High-Level API

High-level Python API for pypoLCA.

class pypolca.api.LCAResult(raw_results, formula=None, data=None, num_choices=None, y_mat=None)

Bases: object

Python-friendly wrapper around the C++ Results struct.

Parameters:

raw_results (Any)
formula (str | None)
data (DataFrame | None)
num_choices (list[int] | None)
y_mat (ndarray | None)

property probs: list[ndarray]

Class-conditional response probabilities.

Returns: list[J] of ndarray of shape (R, K_j).

property coeff: ndarray

Covariate coefficients.

Returns: ndarray of shape (S, R-1). Column r-1 holds the coefficients for class r (r >= 2); class 1 is the reference class. None if no covariates.
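To make the (S, R-1) layout concrete, the sketch below converts one covariate row into class prior probabilities with a multinomial-logit link, taking class 1 as the reference (the convention R's poLCA uses). The function name class_priors and all numbers are hypothetical; this is an illustration of the layout, not pypolca's implementation.

```python
import math

def class_priors(x_row, coeff):
    """Prior class probabilities for one observation.

    x_row : list of S covariate values (intercept term included).
    coeff : S rows by R-1 columns; column r-1 holds class r's coefficients.
    Class 1 is the reference class, so its linear predictor is fixed at 0.
    """
    n_other = len(coeff[0])  # R - 1 non-reference classes
    # Linear predictor for each non-reference class.
    eta = [sum(x_row[s] * coeff[s][c] for s in range(len(x_row)))
           for c in range(n_other)]
    # Softmax in which the reference class contributes exp(0) = 1.
    denom = 1.0 + sum(math.exp(e) for e in eta)
    return [1.0 / denom] + [math.exp(e) / denom for e in eta]

# Hypothetical values: S = 2 (intercept plus one covariate), R = 3 classes.
coeff = [[0.5, -1.0],
         [0.2, 0.3]]
priors = class_priors([1.0, 2.0], coeff)
```

The returned list has R entries that sum to 1, with the reference class first.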

property npar: int

property probs_se: list[ndarray]

property P_se: ndarray

property coeff_se: ndarray

Standard errors of covariate coefficients.

Returns: ndarray of shape (S, R-1) matching .coeff layout. Empty array if no covariates.

property coeff_V: ndarray

property loglik: float

property iterations: int

property converged: bool

property posterior: ndarray

Training-data posterior class membership probabilities.

property prior: ndarray

property predclass: ndarray

predict_posterior(newdata, newx=None)

Compute posterior class membership probabilities for new data.

Parameters:

newdata (DataFrame)
newx (DataFrame | None)

Return type:

ndarray
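The posterior for a single response row follows from Bayes' rule: multiply each class share by the class-conditional probability of every observed response, then normalize. The sketch below is a pure-Python illustration under stated assumptions: responses are coded 1..K_j (as in the built-in datasets), there are no missing values, and the class shares do not depend on covariates. The function name posterior_for_row and the numbers are hypothetical, not pypolca code.

```python
def posterior_for_row(responses, probs, prior):
    """Posterior class membership probabilities for one response row.

    responses : list of J category codes, 1-based.
    probs     : list of J tables, each R rows by K_j columns.
    prior     : list of R class shares.
    """
    joint = []
    for r in range(len(prior)):
        lik = prior[r]
        for j, y in enumerate(responses):
            lik *= probs[j][r][y - 1]  # shift the 1-based code to a 0-based index
        joint.append(lik)
    total = sum(joint)
    return [v / total for v in joint]

# Hypothetical two-class model with two dichotomous items.
probs = [
    [[0.9, 0.1], [0.2, 0.8]],  # item 1: rows are classes, columns are categories
    [[0.7, 0.3], [0.1, 0.9]],  # item 2
]
prior = [0.6, 0.4]
post = posterior_for_row([2, 2], probs, prior)
```

With covariates, the prior would vary by row, so it would be computed per observation before this step.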

property params: Any

Raw fitted parameters (vecprobs, beta).

property P: ndarray

Class population shares (marginal prior probabilities).

property aic: float

property bic: float
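For reference, the standard definitions of the two information criteria are sketched below in terms of loglik, npar, and the sample size. Which sample size pypolca uses for BIC is not stated here, so n_obs is an assumption, and the input numbers are made up.

```python
import math

def aic(loglik, npar):
    # Akaike information criterion: -2 log L + 2 * (number of parameters).
    return -2.0 * loglik + 2.0 * npar

def bic(loglik, npar, n_obs):
    # Bayesian information criterion: -2 log L + (number of parameters) * ln(N).
    return -2.0 * loglik + npar * math.log(n_obs)

# Hypothetical fit summary.
loglik, npar, n_obs = -300.0, 15, 118
aic_value = aic(loglik, npar)
bic_value = bic(loglik, npar, n_obs)
```

Lower values indicate a better trade-off between fit and complexity; BIC penalizes parameters more heavily than AIC whenever ln(N) exceeds 2.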

property Gsq: float

Likelihood ratio deviance (G-squared) vs saturated model.
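The usual form of the likelihood ratio statistic is G² = 2 Σ O · ln(O / E), summed over observed response patterns. The sketch below applies it to hypothetical observed and expected counts; the helper name g_squared is illustrative and not part of pypolca.

```python
import math

def g_squared(observed, expected):
    # Patterns with O = 0 contribute nothing, since 0 * ln(0) is taken as 0.
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

# Hypothetical counts for three response patterns.
observed = [30, 50, 20]
expected = [25.0, 55.0, 20.0]
gsq = g_squared(observed, expected)
```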

property Chisq: float

Pearson chi-square goodness-of-fit.

Includes the correction term (N - sum(expected)) for unobserved response patterns where O=0 and E>0, matching the behavior of R's poLCA.
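The correction term has a simple origin: a pattern that was never observed has O = 0, so its Pearson contribution (0 - E)²/E reduces to E, and because expected counts over all patterns sum to N, the unobserved patterns together contribute N minus the expected counts of the observed ones. A minimal sketch with hypothetical counts (pearson_chisq is an illustrative name, not a pypolca function):

```python
def pearson_chisq(observed, expected, n_total):
    """Pearson chi-square over observed patterns plus the unobserved-pattern term.

    observed, expected : counts for the response patterns that occur in the data.
    n_total            : total number of cases N.
    """
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Expected mass left over for patterns that never occurred in the data.
    return stat + (n_total - sum(expected))

# Hypothetical counts: 100 cases, of which 96 are expected in the observed patterns.
observed = [30, 50, 20]
expected = [28.0, 49.0, 19.0]
chisq = pearson_chisq(observed, expected, n_total=100)
```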

property predcell: tuple[ndarray, ndarray, ndarray]

Returns (observed, expected, patterns) for each unique complete response pattern.

property resid_df: int

Residual degrees of freedom for GOF tests (intercept-only models).
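In R's poLCA, an intercept-only model has R · Σ(K_j - 1) + (R - 1) free parameters, and the residual degrees of freedom are min(N, Π K_j - 1) minus that count. The sketch below assumes pypolca follows the same definitions, which this reference does not confirm; the function names are illustrative.

```python
import math

def npar_intercept_only(num_choices, nclass):
    # (K_j - 1) free response probabilities per item and class, plus R - 1 class shares.
    return nclass * sum(k - 1 for k in num_choices) + (nclass - 1)

def resid_df(n_obs, num_choices, nclass):
    # Number of distinct response patterns minus one, capped at the sample size.
    cells = math.prod(num_choices) - 1
    return min(n_obs, cells) - npar_intercept_only(num_choices, nclass)

# Carcinoma data: 118 slides, seven dichotomous items, two classes.
npar = npar_intercept_only([2] * 7, nclass=2)
df = resid_df(118, [2] * 7, nclass=2)
```

For the carcinoma example this gives 15 parameters and 103 residual degrees of freedom.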

property Nobs: int

Number of fully observed cases (no missing in any manifest variable).

pypolca.api.fit(formula, data, nclass=2, maxiter=1000, tol=1e-10, verbose=False, na_rm=True, probs_start=None, beta_start=None, nrep=1, seed=None, max_restarts=100, calc_se=True)

Fit a latent class model.

Parameters:

formula (str) – Patsy-style formula, e.g. "cbind(Y1, Y2, Y3) ~ 1" or "Y1 + Y2 ~ X1 + X2". The left-hand side gives the manifest variables; the right-hand side gives the covariates.
data (pd.DataFrame) – Data frame containing all variables.
nclass (int) – Number of latent classes.
maxiter (int) – Maximum EM iterations.
tol (float) – Log-likelihood convergence tolerance.
verbose (bool) – Print iteration progress.
na_rm (bool) – Drop rows with any missing values.
probs_start (np.ndarray, optional) – Starting values for class-conditional response probabilities.
beta_start (np.ndarray, optional) – Starting values for covariate coefficients.
nrep (int) – Number of replications with different random starting values (like R’s nrep).
seed (int, optional) – Random seed for the first replication. If None, a random seed is drawn.
max_restarts (int) – Maximum restarts per replication when a likelihood drop occurs (R retries indefinitely; this is a safety cap).
calc_se (bool) – Whether to compute standard errors (default True).

Returns:

Fitted model result object.

Return type:

LCAResult

Utilities

Utility functions for formula parsing and data preparation.

pypolca.utils.build_design_matrix(formula, data, na_rm=True)

Parse a simple formula and build design matrices.

Supports:

"Y1 + Y2 + Y3 ~ 1" -> intercept only (no covariates)
"Y1 + Y2 ~ X1 + X2" -> covariates
"cbind(Y1, Y2, Y3) ~ 1" -> R-style cbind on LHS

Returns:

y (np.ndarray, shape (N, J))
x (np.ndarray, shape (N, S))
num_choices (list of int) – Number of categories for each manifest variable.

Parameters:

formula (str)
data (DataFrame)
na_rm (bool)

Return type:

tuple[ndarray, ndarray, list[int]]
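The three formula forms listed for build_design_matrix can be told apart with very little string handling. The sketch below is a hypothetical stand-in that only extracts variable names; it is not pypolca's parser and does not build any matrices.

```python
import re

def split_formula(formula):
    """Return (manifest_names, covariate_names) for the three supported forms."""
    lhs, rhs = (part.strip() for part in formula.split("~", 1))
    # Unwrap an R-style cbind(...) on the left-hand side.
    match = re.fullmatch(r"cbind\((.*)\)", lhs)
    if match:
        manifest = [name.strip() for name in match.group(1).split(",")]
    else:
        manifest = [name.strip() for name in lhs.split("+")]
    # A right-hand side of "1" means intercept only, so there are no covariates.
    covariates = [] if rhs == "1" else [name.strip() for name in rhs.split("+")]
    return manifest, covariates

plain = split_formula("Y1 + Y2 + Y3 ~ 1")
with_covariates = split_formula("Y1 + Y2 ~ X1 + X2")
cbind_style = split_formula("cbind(Y1, Y2, Y3) ~ 1")
```

Both left-hand-side spellings yield the same list of manifest variables, so later steps need not care which one the caller used.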

Datasets

Built-in datasets from R’s poLCA package.

All datasets are re-exported from R’s poLCA (GPL-2.0-or-later, compatible with this package) as CSV files. Use load_dataset() with a Dataset enum member to load one as a Polars DataFrame.

Usage:

from pypolca.data import load_dataset, Dataset

df = load_dataset(Dataset.CARCINOMA)
# or by name:
df = load_dataset("carcinoma")

from pypolca import fit
result = fit("cbind(A,B,C,D,E,F,G) ~ 1", df, nclass=2)

pypolca.data._dataset.load_dataset(name)

Load a built-in dataset as a Polars DataFrame.

Parameters:

name (Dataset or str) – Dataset to load, e.g. Dataset.CARCINOMA or "carcinoma".

Return type:

pl.DataFrame

Raises:

ValueError – If name is not a valid dataset.

Examples

>>> from pypolca.data import load_dataset, Dataset
>>> df = load_dataset(Dataset.CARCINOMA)
>>> df.shape
(118, 7)

pypolca.data._dataset.get_dataset_info(name)

Return metadata for a dataset (description, columns, source, example).

Parameters:

name (Dataset or str) – Dataset name.

Returns:

Keys: description (str), columns (dict), source (str), example_formula (str), nclass_example (int).

Return type:

dict

class pypolca.data._dataset.Dataset(*values)

Built-in datasets available for loading.

CARCINOMA

Dichotomous ratings by seven pathologists of 118 slides for the presence or absence of carcinoma in the uterine cervix. Columns: A–G (1=no, 2=yes). Source: Agresti (2002), Table 13.1.

Type: str

CHEATING

319 undergraduate students surveyed on chronic cheating behavior. Columns: LIEEXAM, LIEPAPER, FRAUD, COPYEXAM (1=no, 2=yes), GPA (1–5).

Type: str

ELECTION

2000 American National Election Study survey, 1,785 respondents. 12 trait ratings (MORALG–INTELB, 1–4) for Gore and Bush, plus VOTE3, AGE, EDUC, GENDER, PARTY covariates.

Type: str

GSS82

1,202 white respondents to the 1982 General Social Survey. Columns: PURPOSE (1–3), ACCURACY (1–2), UNDERSTA (1–3), COOPERAT (1–3). Source: McCutcheon (1987), Table 3.1.

Type: str

VALUES

216 respondents on four dichotomous items measuring universalistic vs. particularistic values. Columns: A–D (1=universalistic, 2=particularistic).

Type: str

CARCINOMA = 'carcinoma'

CHEATING = 'cheating'

ELECTION = 'election'

GSS82 = 'gss82'

VALUES = 'values'