![PyPI](https://img.shields.io/pypi/v/forust?color=gr&style=for-the-badge) ![Crates.io](https://img.shields.io/crates/v/forust-ml?color=gr&style=for-the-badge)

# Forust

*A lightweight gradient boosting package*

Forust is a lightweight package for building gradient boosted decision tree ensembles. All of the algorithm code is written in Rust, with a Python wrapper. The Rust crate can be used directly; however, most examples shown here are for the Python wrapper. For a self-contained Rust example, see here. It implements the same algorithm as the XGBoost package, and in many cases will give nearly identical results.

I developed this package for a few reasons: mainly to better understand the XGBoost algorithm, to have a fun project to work on in Rust, and to be able to experiment with adding new features to the algorithm in a smaller, simpler codebase.

All of the Rust code for the package can be found in the `src` directory, while all of the Python wrapper code is in the `py-forust` directory.

## Installation

The package can be installed directly from PyPI.

```shell
pip install forust
```

To use in a Rust project, add the following to your Cargo.toml file.

```toml
forust-ml = "0.2.13"
```

## Usage

The `GradientBooster` class is currently the only public-facing class in the package, and can be used to train gradient boosted decision tree ensembles with multiple objective functions.

It can be initialized with a number of arguments; `objective_type` and `monotone_constraints` both appear in the examples below.
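As a minimal configuration sketch (not an exhaustive argument list), only parameters referenced elsewhere in this README are shown here; treating `early_stopping_rounds` as a constructor argument is an assumption.

```python
from forust import GradientBooster

# A configuration sketch; only parameters mentioned elsewhere in this
# README are shown, and early_stopping_rounds' placement is an assumption.
model = GradientBooster(
    objective_type="LogLoss",          # "LogLoss" for 0/1 targets, "SquaredLoss" for regression
    monotone_constraints={"age": -1},  # optional per-feature monotonicity (see below)
    early_stopping_rounds=5,           # used together with evaluation_data in fit
)
```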

### Training and Predicting

Once the booster has been initialized, it can be fit on a provided dataset and target field. After fitting, the model can be used to predict on new data. In the case of this example, the predictions are the log odds of a given record being 1.

```python
# Small example dataset
from seaborn import load_dataset

df = load_dataset("titanic")
X = df.select_dtypes("number").drop(columns=["survived"])
y = df["survived"]

# Initialize a booster with defaults.
from forust import GradientBooster
model = GradientBooster(objective_type="LogLoss")
model.fit(X, y)

# Predict on data
model.predict(X.head())
# array([-1.94919663,  2.25863229,  0.32963671,  2.48732194, -3.00371813])

# predict contributions
model.predict_contributions(X.head())
# array([[-0.63014213,  0.33880048, -0.16520798, -0.07798772, -0.85083578,
#         -1.07720813],
#        [ 1.05406709,  0.08825999,  0.21662544, -0.12083538,  0.35209258,
#         -1.07720813],
```

The `fit` method accepts the following arguments.

- `X` (FrameLike): Either a pandas DataFrame, or a 2 dimensional numpy array, with numeric data.
- `y` (ArrayLike): Either a pandas Series, or a 1 dimensional numpy array. If "LogLoss" was the objective type specified, then this should only contain 1 or 0 values, where 1 is the positive class being predicted. If "SquaredLoss" is the objective type, then any continuous variable can be provided.
- `sample_weight` (Optional[ArrayLike], optional): Instance weights to use when training the model. If None is passed, a weight of 1 will be used for every record. Defaults to None.
- `evaluation_data` (tuple[FrameLike, ArrayLike, ArrayLike] | tuple[FrameLike, ArrayLike], optional): An optional list of tuples, where each tuple should contain a dataset, an equal length target array, and optionally an equal length sample weight array. If this is provided, metric values will be calculated at each iteration of training. If `early_stopping_rounds` is supplied, the first entry of this list will be used to determine if performance has improved over the last set of iterations; if no improvement is seen in `early_stopping_rounds` iterations, training will be cut short. These arguments are combined in the sketch after this list.
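A minimal sketch of these arguments together, reusing `X` and `y` from the earlier example; as above, treating `early_stopping_rounds` as a constructor argument is an assumption.

```python
import numpy as np

# Hold out the last rows of the earlier example as evaluation data.
X_train, X_valid = X.iloc[:600], X.iloc[600:]
y_train, y_valid = y.iloc[:600], y.iloc[600:]

model = GradientBooster(objective_type="LogLoss", early_stopping_rounds=5)
model.fit(
    X_train,
    y_train,
    sample_weight=np.ones(y_train.shape[0]),  # uniform weights, same as the default
    evaluation_data=[(X_valid, y_valid)],     # metrics computed at every iteration
)
```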

The `predict` method accepts the following arguments.

- `X` (FrameLike): Either a pandas DataFrame, or a 2 dimensional numpy array, with numeric data.
- `parallel` (Optional[bool], optional): Optionally specify if the predict function should run in parallel on multiple threads. If None is passed, the parallel attribute of the booster will be used. Defaults to None.
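For example, single-threaded prediction can be forced for one call, regardless of how the booster itself was configured.

```python
# Override the booster-level parallel attribute for this call only.
preds = model.predict(X.head(), parallel=False)
```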

The `predict_contributions` method will predict with the fitted booster on new data, returning the feature contribution matrix. The last column is the bias term; for the "average" and "weight" methods the rows sum to the model's prediction, as sketched after this list.

- `X` (FrameLike): Either a pandas DataFrame, or a 2 dimensional numpy array, with numeric data.
- `method` (str, optional): Method to calculate the contributions. The options are:
  - "average": If this option is specified, the average internal node values are calculated; this is equivalent to the `approx_contribs` parameter in XGBoost.
  - "weight": This method will use the internal leaf weights to calculate the contributions. This is the same as what is described by Saabas here.
  - "branch-difference": This method will calculate contributions by subtracting the weight of the node the record will travel down by the weight of the other non-missing branch. This method does not have the property where the summed contributions equal the final prediction of the model.
  - "midpoint-difference": This method will calculate contributions by subtracting the weight of the node the record will travel down by the mid-point between the right and left node, weighted by the cover of each node. This method does not have the property where the summed contributions equal the final prediction of the model.
- `parallel` (Optional[bool], optional): Optionally specify if the predict function should run in parallel on multiple threads. If None is passed, the parallel attribute of the booster will be used. Defaults to None.
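A quick sketch of that property, reusing the fitted model from the earlier example:

```python
import numpy as np

contribs = model.predict_contributions(X.head(), method="weight")
preds = model.predict(X.head())

# For the "weight" (and "average") methods, each row of contributions,
# including the bias term in the last column, sums to the raw prediction.
assert np.allclose(contribs.sum(axis=1), preds)
```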

The maximum iteration used when predicting can be set with the `set_prediction_iteration` method. If `early_stopping_rounds` has been set, this will default to the best iteration; otherwise all of the trees will be used. It accepts a single value.

- `iteration` (int): Iteration number to use. This will use all trees up to and including this index.
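For instance, prediction can be restricted to the first five trees.

```python
# Use trees 0 through 4 (inclusive) for subsequent predictions.
model.set_prediction_iteration(4)
preds_first_five = model.predict(X.head())
```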

If early stopping was used, the evaluation history can be retrieved with the `get_evaluation_history` method.

```python
model = GradientBooster(objective_type="LogLoss")
model.fit(X, y, evaluation_data=[(X, y)])

model.get_evaluation_history()[0:3]

# array([[588.9158873 ],
#        [532.01055803],
#        [496.76933646]])
```

### Inspecting the Model

Once the booster has been fit, each individual tree structure can be retrieved in text form using the `text_dump` method. This method returns a list with the same length as the number of trees in the model.

```python
model.text_dump()[0]
# 0:[0 < 3] yes=1,no=2,missing=2,gain=91.50833,cover=209.388307
#       1:[4 < 13.7917] yes=3,no=4,missing=4,gain=28.185467,cover=94.00148
#             3:[1 < 18] yes=7,no=8,missing=8,gain=1.4576768,cover=22.090348
#                   7:[1 < 17] yes=15,no=16,missing=16,gain=0.691266,cover=0.705011
#                         15:leaf=-0.15120,cover=0.23500
#                         16:leaf=0.154097,cover=0.470007
```

The `json_dump` method performs the same action, but returns the model as a JSON representation rather than a text string.
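A small sketch of working with that output; whether it returns one JSON string per tree (mirroring `text_dump`) is an assumption to verify against the package's documentation.

```python
import json

# Assumption: json_dump returns a list with one JSON string per tree.
trees = [json.loads(tree) for tree in model.json_dump()]
print(len(trees))  # number of trees in the fitted model
```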

To see an estimate of how a given feature is used in the model, the `partial_dependence` method is provided. This method calculates the partial dependence values of a feature. For each unique value of the feature, this gives the estimate of the predicted value for that feature, with the effects of all other features averaged out. This information gives an estimate of how a given feature impacts the model.

The `partial_dependence` method's parameters include `X`, `feature`, and `samples`, all of which appear in the example below.

This information can be plotted to visualize how a feature is used in the model, like so.

```python
from seaborn import lineplot
import matplotlib.pyplot as plt

pd_values = model.partial_dependence(X=X, feature="age", samples=None)

fig = lineplot(x=pd_values[:, 0], y=pd_values[:, 1])
plt.title("Partial Dependence Plot")
plt.xlabel("Age")
plt.ylabel("Log Odds")
```

We can see how this is impacted if a model is created where a specific constraint is applied to the feature, using the `monotone_constraints` parameter.

```python
model = GradientBooster(
    objective_type="LogLoss",
    monotone_constraints={"age": -1},
)
model.fit(X, y)

pd_values = model.partial_dependence(X=X, feature="age")
fig = lineplot(
    x=pd_values[:, 0],
    y=pd_values[:, 1],
)
plt.title("Partial Dependence Plot with Monotonicity")
plt.xlabel("Age")
plt.ylabel("Log Odds")
```

### Saving the model

To save and subsequently load a trained booster, the `save_booster` and `load_booster` methods can be used. Each accepts a path: `save_booster` writes the model to it, and `load_booster` reads it back. The model is saved and loaded as a JSON object.

```python
trained_model.save_booster("model_path.json")

# To load a model from a json path.
loaded_model = GradientBooster.load_booster("model_path.json")
```
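As a quick round-trip sanity check (a sketch; `trained_model` here stands for the fitted booster from the examples above):

```python
import numpy as np

# The loaded booster should reproduce the original model's predictions.
assert np.allclose(
    trained_model.predict(X.head()),
    loaded_model.predict(X.head()),
)
```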