Language: Python
ML/AI
XGBoost was developed by Tianqi Chen in 2014 as a fast, efficient implementation of gradient boosting. It became popular after featuring in many winning Kaggle competition entries, owing to its speed, accuracy, and built-in regularization, which helps prevent overfitting.
XGBoost (Extreme Gradient Boosting) is a high-performance, scalable, and flexible library for gradient boosting. It is widely used for supervised learning tasks such as regression, classification, and ranking.
pip install xgboost
conda install -c conda-forge xgboost
XGBoost provides APIs to train gradient boosted decision trees. It supports sparse data, parallel processing, and GPU acceleration. Models can be trained with the scikit-learn-style `XGBClassifier` and `XGBRegressor` estimators, or with the native `xgb.train` API operating on a `DMatrix`.
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = xgb.XGBClassifier(eval_metric='mlogloss')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
Trains an XGBoost classifier on the Iris dataset and evaluates accuracy.
import xgboost as xgb
# load_boston was removed in scikit-learn 1.2; the bundled diabetes dataset is used here
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = xgb.XGBRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('MSE:', mean_squared_error(y_test, y_pred))
Trains an XGBoost regressor on the diabetes dataset bundled with scikit-learn and computes the mean squared error.
import xgboost as xgb
import numpy as np
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
dtrain = xgb.DMatrix(X, label=y)
params = {'max_depth': 3, 'eta': 0.1, 'objective': 'binary:logistic'}
bst = xgb.train(params, dtrain, num_boost_round=10)
Uses XGBoost’s DMatrix for efficient memory usage and training performance.
# Assumes xgb, X_train, and y_train from the Iris classification example
from sklearn.model_selection import GridSearchCV
params = {'max_depth': [3, 5], 'n_estimators': [50, 100]}
grid = GridSearchCV(estimator=xgb.XGBClassifier(eval_metric='mlogloss'), param_grid=params, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
Performs a grid search with 3-fold cross-validation to find the best hyperparameters for the XGBoost classifier.
import matplotlib.pyplot as plt
xgb.plot_importance(model)
plt.show()
Visualizes feature importance after training a model.
model = xgb.XGBClassifier(eval_metric='mlogloss', early_stopping_rounds=5)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)])
Stops training early if the validation metric does not improve for 5 consecutive rounds. Since XGBoost 1.6, `early_stopping_rounds` is set on the estimator; passing it to `fit()` is deprecated and was removed in later releases.
Use `DMatrix` for large datasets to improve training efficiency.
Tune hyperparameters such as `max_depth`, `learning_rate`, and `n_estimators` for optimal performance.
Use early stopping to prevent overfitting.
Leverage GPU acceleration if available for large datasets.
Visualize feature importance to understand model behavior.