Scikit-learn

Language: Python

ML/AI

Scikit-learn was started by David Cournapeau in 2007 as a Google Summer of Code project and has since grown into a widely used open-source machine learning library maintained by a large community. It is built on NumPy and SciPy (Matplotlib is needed only for its plotting utilities) and is used in both academia and industry.

Scikit-learn is a Python library for machine learning that provides simple and efficient tools for data mining, analysis, and modeling. It includes algorithms for classification, regression, clustering, dimensionality reduction, and model evaluation.

Installation

pip: pip install scikit-learn
conda: conda install scikit-learn

Usage

Scikit-learn provides consistent APIs for different machine learning algorithms. You can fit models to data, predict outcomes, evaluate performance, and preprocess datasets with transformers and pipelines.
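
The shared interface described above can be sketched with a transformer: `fit` learns statistics from the data, and `transform` applies them. This is a minimal illustration; the small arrays are made-up values, not data from this document.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three training rows and one new row.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

scaler = StandardScaler()
# fit_transform learns the per-column mean and standard deviation, then scales.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the statistics learned from the training data.
X_new_scaled = scaler.transform(X_new)

print(scaler.mean_)      # per-column means learned during fit
print(X_new_scaled)
```

Predictors follow the same pattern, with `predict` and `score` taking the place of `transform`.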

Training a simple classifier

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
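# Hold out 20% of the samples for testing; random_state (an arbitrary seed) makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# score() returns the mean accuracy on the held-out test set.
print(model.score(X_test, y_test))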

Loads the Iris dataset, splits it into training and testing sets, trains a logistic regression model, and evaluates accuracy.
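
A single accuracy number hides per-class behavior. The sketch below repeats the same workflow and adds predicted labels, class probabilities, and a per-class report. The fixed seed and the stratified split are illustrative assumptions, not part of the example above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Stratify so each class keeps the same proportion in both splits (assumed choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)          # hard class labels
y_proba = model.predict_proba(X_test)   # one probability per class, rows sum to 1

# Precision, recall, and F1 for each of the three Iris classes.
print(classification_report(y_test, y_pred))
```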

K-Means clustering

from sklearn.cluster import KMeans
import numpy as np
# Six 2-D points forming two well-separated groups.
points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)

Performs K-Means clustering on a small dataset and prints cluster labels.
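
After fitting, the model also exposes the learned centroids and can assign new points to the nearest cluster. A minimal sketch on the same six points; the two query points are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# One centroid per cluster; which cluster gets label 0 or 1 is arbitrary.
print(kmeans.cluster_centers_)

# Assign unseen points (hypothetical values) to the nearest centroid.
new_labels = kmeans.predict(np.array([[0, 0], [12, 3]]))
print(new_labels)

# Sum of squared distances from each point to its centroid.
print(kmeans.inertia_)
```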

Pipeline with preprocessing and classifier

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Reuses the train/test split from the classifier example above.
pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))

Chains feature scaling and an SVM classifier into a single estimator, so the scaler is fit only on the training data and the same transformation is applied automatically at prediction time.
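
Because a pipeline behaves like one estimator, its steps can be tuned together. The sketch below searches over SVC settings with `GridSearchCV`, addressing step parameters as `<step name>__<parameter>`. The grid values and the seed are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# "svc__C" targets the C parameter of the step named "svc" (illustrative grid).
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

# Each candidate is evaluated with 5-fold cross-validation on the training data;
# the scaler is refit inside every fold, so no test-fold statistics leak in.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the refit best pipeline
```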

Cross-validation

from sklearn.model_selection import cross_val_score
# Reuses model and the full Iris X, y from the classifier example above.
scores = cross_val_score(model, X, y, cv=5)
print(scores)

Performs 5-fold cross-validation and prints one accuracy score per fold.
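
Cross-validation combines naturally with pipelines: passing the whole pipeline means preprocessing is refit on each training fold. A minimal sketch that also summarizes the fold scores; the choice of `make_pipeline` with a scaler is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is fit separately inside each training fold.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scores = cross_val_score(clf, X, y, cv=5)

# Report the mean and spread across folds rather than a single number.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```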

Feature selection

from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new[:5])

Keeps the two features with the highest ANOVA F-scores against the class labels.
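
The transformed array no longer carries column names, so it helps to ask the selector which features it kept. A minimal sketch using `get_support()` and the dataset's own feature names:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(f_classif, k=2).fit(X, y)

# Boolean mask over the original columns: True means the feature was kept.
mask = selector.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]

print(selector.scores_)  # F-score of every original feature
print(selected)          # names of the two retained features
```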

Error Handling

ConvergenceWarning: The solver stopped before converging. Increase `max_iter` or scale the features when fitting models such as LogisticRegression.
ValueError: Input contains NaN: The data has missing values. Impute them with `SimpleImputer`, or drop the affected rows or columns before fitting.
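
Both fixes can be sketched in one pipeline: impute missing values first, then scale, then fit. The toy array below is made up for illustration, and mean imputation is just one of the strategies `SimpleImputer` supports.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two missing entries.
X_missing = np.array(
    [[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [8.0, 9.0], [1.5, 2.5], [7.5, 8.5]]
)
y_toy = np.array([0, 0, 1, 1, 0, 1])

# On its own, the imputer replaces each NaN with its column mean.
X_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)
print(X_filled)

# In a pipeline, imputation and scaling run before the classifier sees the data,
# which avoids the NaN error and helps the solver converge.
clf = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_missing, y_toy)
print(clf.predict(X_missing))
```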

Best Practices

Scale or normalize features when using distance-based algorithms such as k-nearest neighbors, K-Means, and SVMs.

Use pipelines to combine preprocessing and modeling steps.

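Split data into training, validation, and test sets so that overfitting can be detected: tune on the validation set and report final performance on a test set the model has never seen.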

Leverage cross-validation for robust performance evaluation.

Understand the assumptions of each algorithm before applying it.
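
Several of the practices above can be sketched together: a three-way split, scaling inside a pipeline, and a distance-based model tuned on the validation set. The 60/20/20 proportions, the seed, and the candidate values of `n_neighbors` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# First hold out 20% as the test set, then split the rest 75/25
# to get 60% train, 20% validation, 20% test overall.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0, stratify=y_tmp
)

# Pick n_neighbors using the validation set only.
best_k, best_score = None, -1.0
for k in (1, 3, 5, 7):
    candidate = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Refit on train + validation, then touch the test set exactly once.
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_tmp, y_tmp)
print(best_k, final_model.score(X_test, y_test))
```

For small datasets, cross-validation on the training portion is often preferred over a single fixed validation split.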