Language: Python
Data Science
Statsmodels was created by Skipper Seabold and Josef Perktold to provide a Python package for classical statistics and econometrics. It complements libraries like NumPy, SciPy, and Pandas, enabling researchers and analysts to fit statistical models, perform hypothesis testing, and explore data efficiently.
Statsmodels is a Python library that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.
pip install statsmodels
conda install -c conda-forge statsmodels
Statsmodels supports regression models (linear, logistic, etc.), time series analysis, generalized linear models, and more. It reports detailed statistical output, such as coefficient estimates, standard errors, confidence intervals, and fit statistics, which makes it well suited to rigorous analysis and reporting.
import statsmodels.api as sm
import numpy as np
X = np.array([1,2,3,4,5])
y = np.array([2,4,5,4,5])
X = sm.add_constant(X) # add intercept
model = sm.OLS(y, X).fit()
print(model.summary())
Performs ordinary least squares (OLS) linear regression and prints a detailed statistical summary.
import statsmodels.api as sm
import numpy as np
X = np.array([[1],[2],[3],[4],[5]])
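y = np.array([0,0,1,0,1]) # classes overlap in X, so the fit is well defined
X = sm.add_constant(X)
model = sm.Logit(y, X).fit()
print(model.summary())
Fits a logistic regression model for binary outcome data and outputs statistical details. The outcomes deliberately overlap in X: if the classes are perfectly separated (for example, y = [0,0,0,1,1] here), no finite maximum-likelihood estimate exists, and statsmodels flags perfect separation as an error or a warning, depending on the version.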
import statsmodels.api as sm
import pandas as pd
data = pd.Series([1,2,3,4,5,6,7,8,9,10])
model = sm.tsa.ARIMA(data, order=(1,1,0))
results = model.fit()
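print(results.summary())
Fits an ARIMA model to a time series and outputs coefficients and diagnostics. The series here is a perfectly linear toy example, so the estimated coefficients are not meaningful; substitute real data in practice.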
import statsmodels.api as sm
import numpy as np
X = np.array([1,2,3,4,5])
y = np.array([2,3,5,4,6])
X = sm.add_constant(X)
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.summary())
Fits a Poisson regression model using GLM and prints the summary.
from statsmodels.stats.weightstats import ztest
import numpy as np
data1 = np.array([1,2,3,4,5])
data2 = np.array([2,3,4,5,6])
z_stat, p_val = ztest(data1, data2)
print(f'Z-statistic: {z_stat}, p-value: {p_val}')
Performs a Z-test to compare two samples.
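By default `ztest` runs a two-sided test of equal means. The sketch below shows the same comparison alongside a one-sided variant using the `alternative` parameter; the choice of `'smaller'` is only for illustration.

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

data1 = np.array([1, 2, 3, 4, 5])
data2 = np.array([2, 3, 4, 5, 6])

# Two-sided test: H1 is mean(data1) != mean(data2)
z_two, p_two = ztest(data1, data2)

# One-sided test: H1 is mean(data1) < mean(data2)
z_one, p_one = ztest(data1, data2, alternative='smaller')

print(f"two-sided: z={z_two:.3f}, p={p_two:.4f}")  # z=-1.000, p=0.3173
print(f"one-sided: z={z_one:.3f}, p={p_one:.4f}")  # z=-1.000, p=0.1587
```

With only five observations per group the normal approximation is rough, so a t-test is often the more defensible choice for samples this small.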
Always inspect `model.summary()` for statistical diagnostics and model fit.
Preprocess data (standardize, handle missing values) before modeling.
Choose an appropriate model type based on the distribution of the data and the research question.
Use statistical tests to check model assumptions (normality of residuals, constant error variance, etc.).
Combine with Pandas for data manipulation and cleaning prior to modeling.