XGBoost Algorithm in Machine Learning with example- Working Code

Table of Contents

XGBoost Algorithm in Machine Learning with example- Working Code

Financial Distress Prediction with XGBoost Algorithm, SHAP, and LIME: A Python Guide

Predicting financial distress is a critical task for businesses, investors, and analysts. Leveraging machine learning, especially advanced algorithms like XGBoost, can significantly enhance the accuracy and interpretability of such predictions. In this article, we explore a robust Python pipeline for financial distress prediction, complete with model training, evaluation, and explainability using SHAP and LIME.

Overview of the Pipeline

The provided Python script is designed to train and evaluate models that predict financial distress events using tabular data. It supports both logistic regression and XGBoost classifiers, ensuring a strong baseline and a powerful gradient boosting model. The script also integrates SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) for model interpretability, making it ideal for regulated industries where explainability is crucial.

Key Features of Xgboost Algorithm Pipeline implementation

Flexible Data Input:** The script accepts a CSV file containing entity-period rows, with customizable columns for entity ID, time, and target variable.
Temporal Train-Test Split:** It uses a temporal split to ensure that the model is evaluated on future data, mimicking real-world forecasting scenarios.
Class Imbalance Handling:** The XGBoost model automatically adjusts for class imbalance using the `scale_pos_weight` parameter.
Comprehensive Evaluation:** The script generates ROC, Precision-Recall, and calibration curves, along with confusion matrices and classification reports.
Model Explainability:** SHAP and LIME are used to provide both global and local explanations for model predictions.

How the Code Works

1. Argument Parsing: The script uses `argparse` to accept command-line arguments for data path, entity column, time column, target column, and test size.
2. Data Loading and Preprocessing:** Data is loaded and preprocessed using custom utility functions. Numeric features are selected, and preprocessing pipelines are built for consistent data transformation.
3. Model Training:** Both logistic regression and XGBoost models are trained. XGBoost is configured with early stopping and hyperparameters optimized for tabular data.
4. Evaluation:** The models are evaluated on the test set, and various performance metrics and plots are saved to the `artifacts` directory.
5. Explainability:** SHAP is used to generate global feature importance and dependence plots, while LIME provides local explanations for individual predictions.
6. Artifact Saving:** All models, preprocessors, metrics, and explanation plots are saved for future reference or deployment.

Why Use XGBoost for Financial Distress Prediction?

XGBoost is renowned for its performance on structured data and its ability to handle missing values, outliers, and class imbalance. Its tree-based approach captures complex nonlinear relationships, making it ideal for financial datasets where such patterns are common.

Enhancing Trust with SHAP and LIME

Model explainability is essential in finance. SHAP provides a global view of feature importance and how each feature impacts predictions, while LIME offers case-by-case explanations. This dual approach ensures that stakeholders can trust and understand the model’s decisions.

To use this script, simply run:

python train.py –data your_data.csv –entity_col company_id –time_col date –target distress

Ensure you have the required Python packages installed: `xgboost`, `scikit-learn`, `shap`, `lime`, `matplotlib`, and `joblib`.

This Python pipeline offers a comprehensive solution for financial distress prediction, combining the predictive power of XGBoost with the transparency of SHAP and LIME. By following this approach, organizations can make data-driven decisions with confidence and clarity.

Code


import os, argparse, warnings, json
warnings.filterwarnings("ignore", category=UserWarning)
import numpy as np, pandas as pd, joblib, matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss, roc_curve, precision_recall_curve, confusion_matrix, classification_report
from sklearn.calibration import calibration_curve
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
import shap
from lime.lime_tabular import LimeTabularExplainer


from utils import build_preprocessor, temporal_split, save_json, load_data


def pick_feature_columns(df, entity_col, time_col, target):
    ignore = {entity_col, time_col, target}
    # numeric columns only
    feats = [c for c in df.columns if c not in ignore and pd.api.types.is_numeric_dtype(df[c])]
    return feats


def evaluate_and_plot(y_true, y_prob, outdir, tag="test"):
    os.makedirs(outdir, exist_ok=True)
    # Metrics
    roc = roc_auc_score(y_true, y_prob)
    pr  = average_precision_score(y_true, y_prob)
    brier = brier_score_loss(y_true, y_prob)


    # ROC curve
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    plt.figure()
    plt.plot(fpr, tpr, label=f"ROC AUC={roc:.3f}")
    plt.plot([0,1], [0,1], linestyle="--")
    plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate"); plt.title(f"ROC Curve ({tag})")
    plt.legend(loc="lower right")
    plt.tight_layout(); plt.savefig(os.path.join(outdir, f"roc_curve_{tag}.png")); plt.close()


    # PR curve
    prec, rec, _ = precision_recall_curve(y_true, y_prob)
    plt.figure()
    plt.plot(rec, prec, label=f"AP={pr:.3f}")
    plt.xlabel("Recall"); plt.ylabel("Precision"); plt.title(f"Precision‑Recall Curve ({tag})")
    plt.legend(loc="lower left")
    plt.tight_layout(); plt.savefig(os.path.join(outdir, f"pr_curve_{tag}.png")); plt.close()


    # Calibration curve
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10, strategy="quantile")
    plt.figure()
    plt.plot(prob_pred, prob_true, marker="o")
    plt.plot([0,1],[0,1], linestyle="--")
    plt.xlabel("Mean predicted probability"); plt.ylabel("Fraction of positives")
    plt.title(f"Calibration plot ({tag})")
    plt.tight_layout(); plt.savefig(os.path.join(outdir, f"calibration_curve_{tag}.png")); plt.close()


    # Confusion matrix at threshold chosen by top decile
    thresh = np.quantile(y_prob, 0.9)
    y_pred = (y_prob >= thresh).astype(int)
    cm = confusion_matrix(y_true, y_pred)
    plt.figure()
    plt.imshow(cm, interpolation="nearest")
    plt.title(f"Confusion Matrix ({tag}) @TopDecile")
    plt.colorbar()
    tick_marks = np.arange(2)
    plt.xticks(tick_marks, ["No Distress","Distress"])
    plt.yticks(tick_marks, ["No Distress","Distress"])
    for i in range(2):
        for j in range(2):
            plt.text(j, i, cm[i, j], ha="center", va="center")
    plt.ylabel("True label"); plt.xlabel("Predicted label")
    plt.tight_layout(); plt.savefig(os.path.join(outdir, f"confusion_matrix_{tag}.png")); plt.close()


    # Classification report @0.5
    y_pred_05 = (y_prob >= 0.5).astype(int)
    report = classification_report(y_true, y_pred_05, output_dict=True)
    return {"roc_auc": roc, "average_precision": pr, "brier": brier, "threshold_top_decile": float(thresh), "report_at_0.5": report}


def main(args):
    artifacts_dir = os.path.join("artifacts")
    os.makedirs(artifacts_dir, exist_ok=True)


    df = load_data(args.data, args.entity_col, args.time_col, args.target)


    # Feature columns
    feature_cols = pick_feature_columns(df, args.entity_col, args.time_col, args.target)


    # Temporal split (holdout is the latest chunk of time)
    train_df, test_df, cutoff = temporal_split(df, args.time_col, test_size=args.test_size)
    print(f"[INFO] Temporal cutoff = {cutoff}  | Train rows={len(train_df)}  Test rows={len(test_df)}")


    X_train_raw = train_df[feature_cols].copy()
    y_train = train_df[args.target].astype(int).values


    X_test_raw = test_df[feature_cols].copy()
    y_test = test_df[args.target].astype(int).values


    # Preprocessor
    preprocessor = build_preprocessor(feature_cols)


    # Fit preprocessor on TRAIN only
    X_train = preprocessor.fit_transform(X_train_raw)
    X_test  = preprocessor.transform(X_test_raw)


    # Class imbalance (for XGB, use scale_pos_weight)
    pos = y_train.sum()
    neg = len(y_train) - pos
    spw = max((neg / max(pos, 1)), 1.0)


    # Models: Logistic Regression baseline & XGBoost (primary)
    logreg = LogisticRegression(max_iter=2000, class_weight="balanced", n_jobs=None)
    logreg.fit(X_train, y_train)
    y_test_prob_lr = logreg.predict_proba(X_test)[:,1]


    xgb = XGBClassifier(
        n_estimators=500,
        max_depth=4,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_lambda=2.0,
        min_child_weight=1.0,
        objective="binary:logistic",
        eval_metric="aucpr",
        tree_method="hist",
        scale_pos_weight=spw,
        random_state=42
    )
    # Keep a validation split from train for early stopping
    X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)
#    xgb.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False, early_stopping_rounds=50)
    xgb.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],  verbose=False)


    y_test_prob_xgb = xgb.predict_proba(X_test)[:,1]


    # Evaluate
    metrics_lr  = evaluate_and_plot(y_test, y_test_prob_lr, artifacts_dir, tag="test_lr")
    metrics_xgb = evaluate_and_plot(y_test, y_test_prob_xgb, artifacts_dir, tag="test_xgb")


    # Save artifacts
    joblib.dump(preprocessor, os.path.join(artifacts_dir, "preprocessor.joblib"))
    joblib.dump(logreg, os.path.join(artifacts_dir, "model_logreg.pkl"))
    joblib.dump(xgb, os.path.join(artifacts_dir, "model.pkl"))
    xgb.save_model(os.path.join(artifacts_dir, "model_xgb.json"))


    with open(os.path.join(artifacts_dir, "metrics.json"), "w") as f:
        json.dump({"logreg": metrics_lr, "xgb": metrics_xgb, "cutoff": str(cutoff), "features": feature_cols}, f, indent=2)


    # -------------------------
    # Global SHAP (TreeExplainer on XGB)
    # -------------------------
    print("[INFO] Computing SHAP values on a sample...")
    X_bg = X_train[:200]  # background for SHAP plots
    explainer = shap.TreeExplainer(xgb)
    # sample to speed up
    sample_idx = np.random.RandomState(42).choice(len(X_test), size=min(200, len(X_test)), replace=False)
    X_sample = X_test[sample_idx]
    shap_values = explainer.shap_values(X_sample)


    # Plot summary
    shap.summary_plot(shap_values, X_sample, feature_names=feature_cols, show=False)
    plt.tight_layout()
    plt.savefig(os.path.join(artifacts_dir, "shap_summary.png"), dpi=144, bbox_inches="tight")
    plt.close()


    # Dependence for the most important feature
    # Compute mean |shap|
    mean_abs = np.mean(np.abs(shap_values), axis=0)
    top_idx = int(np.argmax(mean_abs))
    top_feat = feature_cols[top_idx]
    shap.dependence_plot(top_idx, shap_values, X_sample, feature_names=feature_cols, show=False)
    plt.tight_layout()
    plt.savefig(os.path.join(artifacts_dir, f"shap_dependence_{top_feat}.png"), dpi=144, bbox_inches="tight")
    plt.close()


    # -------------------------
    # LIME local explanations (few examples)
    # -------------------------
    print("[INFO] Generating LIME explanations for a few test rows...")
    lime_explainer = LimeTabularExplainer(
        training_data=X_train,
        feature_names=feature_cols,
        class_names=["NoDistress","Distress"],
        mode="classification",
        discretize_continuous=True,
        sample_around_instance=True,
        random_state=42
    )
    for i, idx in enumerate(sample_idx[:3]):
        exp = lime_explainer.explain_instance(
            X_sample[i],
            xgb.predict_proba,
            num_features=min(10, len(feature_cols))
        )
        out_html = os.path.join(artifacts_dir, f"lime_local_explanation_{i}.html")
        exp.save_to_file(out_html)


    print("[DONE] Artifacts saved in ./artifacts")


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--data", type=str, required=True, help="Path to CSV with entity-period rows")
    p.add_argument("--entity_col", type=str, default="company_id")
    p.add_argument("--time_col", type=str, default="date")
    p.add_argument("--target", type=str, default="distress")
    p.add_argument("--test_size", type=float, default=0.2)
    args = p.parse_args()
    main(args)

XGBoost Algorithm in Machine Learning with example- Working Code