Forust

A lightweight gradient boosting package

Forust is a lightweight package for building gradient boosted decision tree ensembles. All of the algorithm code is written in Rust, with a Python wrapper. The Rust package can be used directly, however most examples shown here are for the Python wrapper. It implements the same algorithm as the XGBoost package, and in many cases will give nearly identical results.

I developed this package for a few reasons: mainly to better understand the XGBoost algorithm, to have a fun project to work on in Rust, and to be able to experiment with adding new features to the algorithm in a smaller, simpler codebase.

All of the Rust code for the package can be found in the src directory, while all of the Python wrapper code is in the py-forust directory.

Installation

The package can be installed directly from pypi.

```shell
pip install forust
```

Usage

The GradientBooster class is currently the only public facing class in the package, and can be used to train gradient boosted decision tree ensembles with multiple objective functions.

It can be initialized with a number of arguments controlling the objective and how the trees are grown; a sketch of a typical initialization follows.
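As a minimal sketch (every parameter name below other than `objective_type` is an assumption used to illustrate the shape of the call; see the class docstring for the authoritative set of arguments and their defaults):

```python
from forust import GradientBooster

# Assumed hyperparameter names, shown only to illustrate initialization.
model = GradientBooster(
    objective_type="LogLoss",  # loss function to optimize
    iterations=100,            # number of boosting rounds (assumed name)
    learning_rate=0.3,         # shrinkage applied to each tree (assumed name)
    max_depth=5,               # maximum depth of each tree (assumed name)
)
```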

Training and Predicting

Once the booster has been initialized, it can be fit on a provided dataset and target field. After fitting, the model can be used to predict on a dataset. In the case of this example, the predictions are the log odds of a given record being 1.

```python
# Small example dataset
from seaborn import load_dataset

df = load_dataset("titanic")
X = df.select_dtypes("number").drop(columns=["survived"])
y = df["survived"]

# Initialize a booster with defaults.
from forust import GradientBooster

model = GradientBooster(objective_type="LogLoss")
model.fit(X, y)

# Predict on data
model.predict(X.head())
# array([-1.94919663,  2.25863229,  0.32963671,  2.48732194, -3.00371813])
```

The fit method accepts the following arguments.

- `X` (FrameLike): Either a pandas DataFrame, or a 2 dimensional numpy array, with numeric data.
- `y` (ArrayLike): Either a pandas Series, or a 1 dimensional numpy array.
- `sample_weight` (Optional[ArrayLike], optional): Instance weights to use when training the model. If `None` is passed, a weight of 1 will be used for every record. Defaults to `None`.
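For example, a minimal sketch of passing instance weights, reusing the titanic frame from above with a hypothetical scheme that counts first class passengers twice as heavily:

```python
import numpy as np

# Hypothetical weighting scheme, purely for illustration.
weights = np.where(X["pclass"] == 1, 2.0, 1.0)

model.fit(X, y, sample_weight=weights)
```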

The predict method accepts the following arguments.

- `X` (FrameLike): Either a pandas DataFrame, or a 2 dimensional numpy array, with numeric data.
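Since a 2 dimensional numpy array is accepted as well, the same predictions can be produced from the raw values, as in this minimal sketch:

```python
# Equivalent to model.predict(X.head()), but passing a numpy array.
preds = model.predict(X.head().to_numpy())
```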

Inspecting the Model

Once the booster has been fit, each individual tree structure can be retrieved in text form, using the text_dump method. This method returns a list, the same length as the number of trees in the model.

```python
model.text_dump()[0]
# 0:[0 < 3] yes=1,no=2,missing=2,gain=91.50833,cover=209.388307
#       1:[4 < 13.7917] yes=3,no=4,missing=4,gain=28.185467,cover=94.00148
#             3:[1 < 18] yes=7,no=8,missing=8,gain=1.4576768,cover=22.090348
#                   7:[1 < 17] yes=15,no=16,missing=16,gain=0.691266,cover=0.705011
#                         15:leaf=-0.15120,cover=0.23500
#                         16:leaf=0.154097,cover=0.470007
```

The json_dump method performs the same action, but returns the model as a JSON representation rather than a text string.
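For instance, a minimal sketch, assuming json_dump returns the model serialized as a JSON string that can be parsed with the standard library:

```python
import json

# Assumption: json_dump returns a JSON string representing the model.
model_json = json.loads(model.json_dump())
```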

Saving the model

To save and subsequently load a trained booster, the save_booster and load_booster methods can be used. Each accepts a path, which is used to write the model to, or read it from. The model is saved and loaded as a JSON object.

```python
trained_model.save_booster("model_path.json")

# To load a model from a json path.
loaded_model = GradientBooster.load_booster("model_path.json")
```

TODOs

This is still a work in progress.

- [ ] Early stopping rounds
  * We should be able to accept a validation dataset, and use it to determine when to stop training.
- [ ] Monotonicity support
  * Right now features are used in the model without any constraints.
- [x] Ability to save a model.
  * The way the underlying trees are structured, they lend themselves to being saved as JSON objects.
- [ ] Clean up the CI/CD pipeline.