Statsmodels

Language: Python

Data Science

Statsmodels was created by Skipper Seabold and Josef Perktold to provide a Python package for classical statistics and econometrics. It complements libraries like NumPy, SciPy, and Pandas, enabling researchers and analysts to fit statistical models, perform hypothesis testing, and explore data efficiently.

Statsmodels is a Python library that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.

Installation

pip: pip install statsmodels
conda: conda install -c conda-forge statsmodels

Usage

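Statsmodels supports regression models (linear, logistic, etc.), time series analysis, generalized linear models, and more. Fitted models report detailed statistical output, including coefficient estimates, standard errors, p-values, confidence intervals, and goodness-of-fit measures, which makes the library well suited to rigorous analysis and reporting.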

Linear Regression

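import numpy as np
import statsmodels.api as sm
# Predictor and response as 1-D arrays
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
X = sm.add_constant(x)  # prepend a column of ones for the intercept
results = sm.OLS(y, X).fit()  # fit() returns a results object
print(results.summary())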

Performs ordinary least squares (OLS) linear regression and prints a detailed statistical summary.

Logistic Regression

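import numpy as np
import statsmodels.api as sm
# The classes overlap in x, so the outcome is not perfectly separated
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 1, 0, 1, 1])
X = sm.add_constant(X)  # add intercept
results = sm.Logit(y, X).fit()
print(results.summary())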

Fits a logistic regression model for binary outcome data and outputs statistical details.

Time Series ARIMA Model

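import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# Short illustrative series; real analyses need far more observations
data = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 12.0])
model = ARIMA(data, order=(1, 1, 0))  # (p, d, q): AR(1) on the first difference
results = model.fit()
print(results.summary())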

Fits an ARIMA model to a time series and outputs coefficients and diagnostics.

Generalized Linear Models (GLM)

import statsmodels.api as sm
import numpy as np
X = np.array([1,2,3,4,5])
y = np.array([2,3,5,4,6])
X = sm.add_constant(X)
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.summary())

Fits a Poisson regression model using GLM and prints the summary.

Hypothesis Testing

from statsmodels.stats.weightstats import ztest
import numpy as np
data1 = np.array([1,2,3,4,5])
data2 = np.array([2,3,4,5,6])
z_stat, p_val = ztest(data1, data2)
print(f'Z-statistic: {z_stat}, p-value: {p_val}')

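Performs a two-sample z-test of the null hypothesis that the two samples have equal means. The z-test relies on a large-sample normal approximation, so with samples as small as these a t-test is usually the safer choice.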
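For small samples the t distribution gives a more conservative p-value than the normal approximation. A minimal sketch comparing the two tests on the same data, using `ttest_ind` from the same module:

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind, ztest

data1 = np.array([1, 2, 3, 4, 5])
data2 = np.array([2, 3, 4, 5, 6])

z_stat, p_z = ztest(data1, data2)
t_stat, p_t, dof = ttest_ind(data1, data2)  # pooled-variance t-test by default

print(f"z = {z_stat:.3f}, p = {p_z:.4f}")
print(f"t = {t_stat:.3f}, p = {p_t:.4f}, df = {dof:.0f}")
```

Both tests produce the same statistic here because they use the same pooled standard error; only the reference distribution, and therefore the p-value, differs.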

Error Handling

LinAlgError: Singular matrix: Check for multicollinearity or duplicate columns in the predictor matrix.
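ValueError: endog and exog matrices are different sizes: Ensure that the response (`y`) and predictor (`X`) arrays have the same number of rows. A related ValueError about indices not being aligned appears when pandas inputs have mismatched indexes.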
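PerfectSeparationError: Occurs in logistic regression when a predictor, or a combination of predictors, perfectly predicts the outcome; recent statsmodels releases may emit a PerfectSeparationWarning instead of raising. Consider removing the offending variable or fitting with regularization (for example `fit_regularized`).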

Best Practices

Always inspect `model.summary()` for statistical diagnostics and model fit.

Preprocess data (standardize, handle missing values) before modeling.

Choose an appropriate model type based on the distribution of the response variable and the research question.

Use statistical tests to validate model assumptions (normality of residuals, homoscedasticity, etc.).

Combine with Pandas for data manipulation and cleaning prior to modeling.
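Several of the practices above can be combined in one short workflow. The sketch below assumes simulated data in a pandas DataFrame, fits through the formula interface (which drops rows with missing values by default), and then runs two residual diagnostics. The column names, seed, and coefficients are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import jarque_bera

# Simulated data: linear signal plus Gaussian noise, with one missing response
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.arange(50, dtype=float)})
df["y"] = 1.5 + 0.8 * df["x"] + rng.normal(scale=2.0, size=len(df))
df.loc[3, "y"] = np.nan

# The formula interface adds the intercept and skips the incomplete row
results = smf.ols("y ~ x", data=df).fit()
print(f"observations used: {int(results.nobs)}")

# Breusch-Pagan: null hypothesis of constant residual variance
bp_lm, bp_pvalue, _, _ = het_breuschpagan(results.resid, results.model.exog)

# Jarque-Bera: null hypothesis of normally distributed residuals
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(results.resid)

print(f"Breusch-Pagan p-value: {bp_pvalue:.3f}")
print(f"Jarque-Bera p-value: {jb_pvalue:.3f}")
```

Small p-values in either test would suggest revisiting the model, for example by transforming the response or using robust standard errors.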