Language: Python
ML/AI
Scikit-learn was initially developed by David Cournapeau in 2007 and has since grown into a widely used open-source machine learning library maintained by a large community. It is built on top of NumPy, SciPy, and Matplotlib, and is adopted in both academia and industry for machine learning tasks.
Scikit-learn is a Python library for machine learning that provides simple and efficient tools for data mining, analysis, and modeling. It includes algorithms for classification, regression, clustering, dimensionality reduction, and model evaluation.
pip install scikit-learn
conda install scikit-learn
Scikit-learn provides consistent APIs across its machine learning algorithms. You can fit models to data, predict outcomes, evaluate performance, and preprocess datasets with transformers and pipelines.
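As a minimal sketch of that shared interface (KNeighborsClassifier is used here purely as an illustration), the same construct, fit, predict, and score calls apply whichever estimator you pick:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
# Any supervised estimator follows the same construct -> fit -> predict/score pattern,
# so swapping algorithms only changes the class you instantiate.
X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)   # any other classifier could be dropped in here
clf.fit(X, y)
print(clf.predict(X[:3]), clf.score(X, y))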
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
Loads the Iris dataset, splits it into training and testing sets, trains a logistic regression model, and evaluates accuracy on the held-out test set.
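If you want explicit predictions or per-class probabilities rather than just the accuracy score, the fitted model exposes predict and predict_proba; a short follow-up sketch using the objects defined above:
y_pred = model.predict(X_test)           # predicted class labels for the test set
y_proba = model.predict_proba(X_test)    # per-class probabilities for each test sample
print(y_pred[:5])
print(y_proba[:5].round(3))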
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1,2],[1,4],[1,0],[10,2],[10,4],[10,0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)
Performs K-Means clustering on a small dataset and prints cluster labels.
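The fitted KMeans object can also report the learned centroids and assign new points to the nearest cluster; a small sketch reusing the kmeans object above:
print(kmeans.cluster_centers_)              # coordinates of the two learned centroids
print(kmeans.predict([[0, 0], [12, 3]]))    # assign new points to their nearest cluster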
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
Combines preprocessing and classification into a single pipeline for clean and reproducible ML workflows.
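Pipelines also plug into hyperparameter search; step parameters are addressed as '<step name>__<parameter>'. A hedged sketch assuming the pipeline above and an illustrative grid over the SVC's C and kernel:
from sklearn.model_selection import GridSearchCV
param_grid = {'svc__C': [0.1, 1, 10], 'svc__kernel': ['linear', 'rbf']}
search = GridSearchCV(pipeline, param_grid, cv=5)   # exhaustive search with 5-fold cross-validation
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)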
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(scores)
Performs 5-fold cross-validation to evaluate model performance.
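The returned array holds one score per fold, so its mean and standard deviation summarize performance; a different metric can be requested through the scoring parameter. A sketch assuming the model and data above:
print(scores.mean(), scores.std())    # average accuracy across folds and its spread
f1_scores = cross_val_score(model, X, y, cv=5, scoring='f1_macro')   # macro-averaged F1 instead of accuracy
print(f1_scores.mean())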
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new[:5])
Selects the two features with the highest ANOVA F-scores, i.e. the strongest univariate relationship with the target.
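To see which columns survived the selection, the fitted selector exposes a boolean mask and the per-feature scores; a short sketch using the selector fitted above:
print(selector.get_support())    # boolean mask marking the selected features
print(selector.scores_)          # ANOVA F-score for every original feature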
Scale or normalize features when using distance-based algorithms.
Use pipelines to combine preprocessing and modeling steps.
Split data into training, validation, and test sets so model selection does not overfit to the test data (see the split sketch after this list).
Leverage cross-validation for robust performance evaluation.
Understand the assumptions of each algorithm before applying it.
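A minimal sketch of the three-way split mentioned above, using two chained train_test_split calls; the 60/20/20 proportions are only an example:
from sklearn.model_selection import train_test_split
# First carve off 20% as the final test set, then split the remainder into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)   # 0.25 of 80% = 20%
print(len(X_train), len(X_val), len(X_test))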