Language: Python
ML/AI
LightGBM was developed by Microsoft in 2016 as part of the Distributed Machine Learning Toolkit (DMTK). It is optimized for speed and memory usage and has become a popular choice for Kaggle competitions and large-scale machine learning tasks due to its accuracy and efficiency.
LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms. It is designed for efficiency and scalability, supporting large datasets and GPU acceleration.
pip install lightgbm
conda install -c conda-forge lightgbm
LightGBM provides APIs for training gradient-boosted decision tree models for classification, regression, and ranking tasks. It supports categorical features natively, early stopping, custom evaluation metrics, and efficient handling of large datasets.
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
Trains a LightGBM classifier on the Iris dataset and evaluates accuracy.
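For class probabilities rather than hard labels, the scikit-learn-style predict_proba method applies directly to the fitted model above (a minimal follow-up to the example):
# Probability estimates, one column per class
y_proba = model.predict_proba(X_test)
print(y_proba[:3])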
import lightgbm as lgb
# Note: load_boston was removed from scikit-learn 1.2; the California housing dataset is a drop-in alternative
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = lgb.LGBMRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('MSE:', mean_squared_error(y_test, y_pred))
Trains a LightGBM regressor on the California housing dataset and computes mean squared error.
import lightgbm as lgb
import numpy as np
X = np.random.rand(100,5)
y = np.random.randint(0,2,100)
dtrain = lgb.Dataset(X, label=y)
params = {'objective': 'binary', 'metric': 'binary_logloss'}
bst = lgb.train(params, dtrain, num_boost_round=20)
Uses LightGBM's Dataset class for efficient training.
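As a short usage sketch following on from the native API example above: under the binary objective, the trained Booster's predict method returns probabilities, which can be thresholded into labels.
# Booster.predict yields probabilities under objective='binary'
probs = bst.predict(X)
labels = (probs > 0.5).astype(int)
print(labels[:10])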
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = lgb.LGBMClassifier()
# early_stopping_rounds was removed from fit() in LightGBM 4.0; use the early_stopping callback instead
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], callbacks=[lgb.early_stopping(stopping_rounds=5)])
Stops training early if the validation metric does not improve for a number of rounds.
import matplotlib.pyplot as plt
lgb.plot_importance(model)
plt.show()
Visualizes feature importance after training the model.
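By default, plot_importance ranks features by split count; gain-based importance often ranks them differently. A small variation on the call above:
# Rank features by total gain contributed across all splits
lgb.plot_importance(model, importance_type='gain')
plt.show()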
from sklearn.metrics import f1_score
def f1_score_metric(y_true, y_pred):
    # A custom metric returns (name, value, is_higher_better); the 0.5 threshold assumes a binary task
    y_pred_labels = (y_pred > 0.5).astype(int)
    return 'f1', f1_score(y_true, y_pred_labels), True
# The scikit-learn API takes custom metrics via eval_metric; feval belongs to the native lgb.train API
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric=f1_score_metric)
Demonstrates using a custom evaluation metric during training.
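The API overview also mentions ranking, which none of the examples above cover. A minimal sketch of LGBMRanker on synthetic data; the group sizes, which must sum to the number of rows, are illustrative and tell LightGBM which rows belong to the same query:
import lightgbm as lgb
import numpy as np
X = np.random.rand(100, 5)
y = np.random.randint(0, 4, 100)  # graded relevance labels per document
ranker = lgb.LGBMRanker(objective='lambdarank')
# Three queries with 25, 25, and 50 candidate documents each
ranker.fit(X, y, group=[25, 25, 50])
scores = ranker.predict(X)
Trains a LightGBM ranker on synthetic query groups and scores the documents.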
Cast categorical features to the pandas `category` dtype so LightGBM can handle them natively (see the sketch after this list).
Tune hyperparameters like `num_leaves`, `max_depth`, `learning_rate`, and `n_estimators`.
Use early stopping to avoid overfitting.
Use GPU acceleration for large datasets when possible.
Visualize feature importance to understand model decisions.
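A minimal sketch of the categorical-feature tip above, assuming a pandas DataFrame with illustrative column names; in the scikit-learn API, LightGBM detects `category` dtype columns automatically:
import lightgbm as lgb
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'city': pd.Series(np.random.choice(['NY', 'SF', 'LA'], 200), dtype='category'),  # hypothetical feature
    'value': np.random.rand(200),
})
y = np.random.randint(0, 2, 200)
model = lgb.LGBMClassifier()
model.fit(df, y)  # 'city' is treated as categorical without one-hot encoding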