enh: Post-Improving Boosting Models #1315

Open
aPovidlo opened this issue Jul 31, 2024 · 0 comments
Labels
architecture ((re)design of existing or new framework subsystem) · composer (Related to GP-composition algorithm) · enhancement (New feature or request)

Comments

aPovidlo (Collaborator) commented Jul 31, 2024

Summary

TL;DR:

  • Update the initial assumptions to include boosting models;
  • Add new evolutionary mutations tied to boosting models;
  • Allow boosting models to consume data containing NaNs;
  • Allow boosting models to train on GPU.

Motivation

The motivation for refactoring the boosting models (#1155, #1209, #1264) was to:

  • Update the implementation and split it into a separate strategy class.
  • Allow boosting models to use categorical features without encoding.
  • Create a basis for implementing fitting with bagging (Bagging method implementation to FEDOT #1005), as used in other popular AutoML frameworks.

Results of testing on OpenML are available here. During development, further ideas for improvement arose.

Guide-level explanation

1. Updating initial assumptions with boosting models

Add more pipelines containing boosting models to the initial assumptions, and update the presets to use boosting models. Note that boosting models offer several strategies that can be used across different pipelines and presets; see the boosting frameworks' documentation for details on each strategy.
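As a rough illustration of this idea, candidate initial assumptions could be kept as a small registry of pipeline specs that seed the evolutionary population. The structure and operation names below are only a sketch (they mirror FEDOT-style identifiers but are not FEDOT's actual Pipeline API):

```python
import random

# Hypothetical registry of initial assumptions pairing preprocessing
# steps with boosting models (illustrative, not FEDOT's real Pipeline class).
BOOSTING_INITIAL_ASSUMPTIONS = [
    ["scaling", "xgboost"],
    ["scaling", "lgbm"],
    ["catboost"],            # CatBoost can consume categories natively
    ["fast_ica", "xgboost"],
]

def sample_initial_assumption(rng=random):
    """Pick one candidate pipeline to seed the evolutionary population."""
    return list(rng.choice(BOOSTING_INITIAL_ASSUMPTIONS))
```

In a real integration, each inner list would be built into a pipeline object and evaluated like any other individual in the population.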

2. Evolutionary Mutations for Boosting Models

Following the parameter updates, it became possible to add new mutations for pipelines with boosting models.

Boosting strategy mutation:

  • Switch to another strategy method

Using category mutation:

  • Toggle enable_categorical (default: True)

Change early_stopping_rounds:

  • Increase or decrease early_stopping_rounds to change the fitting time in the population

Improving metric of XGBoost model mutation:

  • Decrease max_depth
  • Increase min_child_weight, gamma, lambda

Improving robustness to noise of XGBoost model mutation:

  • Decrease subsample, colsample_bytree, colsample_bylevel, colsample_bynode by some step.
  • Switch to the dart booster.

Improving metric of LightGBM model mutation:

  • Decrease learning_rate
  • Increase max_bin, num_iterations, num_leaves
  • Switch to the dart booster.

Improving robustness against overfitting of LightGBM model mutation:

  • Decrease max_bin and num_leaves
  • Increase or decrease min_data_in_leaf and min_sum_hessian_in_leaf
  • Use bagging_fraction and bagging_freq
  • Use feature_fraction
  • Use regularization methods: lambda_l1, lambda_l2, min_gain_to_split and extra_trees
  • Decrease max_depth

Improving robustness against overfitting of CatBoost model mutation:

  • Increase or decrease l2_leaf_reg, colsample_bylevel, subsample
  • Decrease max_depth
  • Increase or decrease iterations
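The mutations above boil down to directed tweaks of hyperparameter dicts. A minimal sketch of two of them, using the parameter names listed above (the function names and dict-based interface are hypothetical, not FEDOT's actual mutation API):

```python
import random

def mutate_xgboost_for_metric(params, rng=random):
    """Decrease max_depth; increase min_child_weight, gamma, lambda."""
    new = dict(params)
    new["max_depth"] = max(1, new.get("max_depth", 6) - 1)
    new["min_child_weight"] = new.get("min_child_weight", 1) + rng.randint(1, 3)
    new["gamma"] = new.get("gamma", 0.0) + 0.1
    new["lambda"] = new.get("lambda", 1.0) * 1.5
    return new

def mutate_lgbm_against_overfitting(params, rng=random):
    """Decrease max_bin and num_leaves; enable bagging and feature sampling."""
    new = dict(params)
    new["num_leaves"] = max(2, new.get("num_leaves", 31) // 2)
    new["max_bin"] = max(16, new.get("max_bin", 255) // 2)
    new["bagging_fraction"] = 0.8
    new["bagging_freq"] = 5
    new["feature_fraction"] = 0.8
    return new
```

The step sizes here are arbitrary placeholders; in practice each mutation would draw its step from a tuned range, as other FEDOT mutations do.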

3. Allow boosting models to consume data containing NaNs

One of the advantages of boosting methods is their native handling of NaNs in the data. Implementing this feature in the current version requires refactoring the preprocessing step that fills in missing values.
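One way the refactored preprocessing could decide whether imputation is needed is sketched below; the model set and helper are assumptions for illustration, not the actual FEDOT preprocessor:

```python
# Hypothetical sketch: skip NaN imputation when every operation in the
# pipeline can consume NaNs natively, as the major boosting libraries can.
NAN_TOLERANT_MODELS = {"xgboost", "lgbm", "catboost"}

def needs_imputation(pipeline_operations):
    """Return True if any operation in the pipeline cannot handle NaNs."""
    return any(op not in NAN_TOLERANT_MODELS for op in pipeline_operations)
```

A single-node CatBoost pipeline would then bypass filling entirely, while a pipeline with a scaler in front of the booster would still be imputed.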

4. Allow boosting models to train on GPU

A GPU can be used to accelerate the fitting of boosting models, so it would be great to add this option.
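Each library already exposes a GPU switch through its hyperparameters, so the feature could amount to injecting the right flags. The parameter names below are the libraries' real ones; the helper itself is a hypothetical sketch of how FEDOT could apply them:

```python
# Library-specific GPU flags (real parameter names; the helper is a sketch).
GPU_PARAMS = {
    "xgboost": {"tree_method": "hist", "device": "cuda"},  # XGBoost >= 2.0
    "lgbm": {"device_type": "gpu"},
    "catboost": {"task_type": "GPU"},
}

def with_gpu(model_name, params):
    """Merge the library-specific GPU flags into a hyperparameter dict."""
    return {**params, **GPU_PARAMS.get(model_name, {})}
```

Models without a known GPU switch are returned unchanged, so the helper can be applied uniformly across a pipeline.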

Unresolved Questions

Is it possible to continue #1005? The main idea was to develop a method used in other AutoML frameworks. The main unsolved problems were that FEDOT's pipeline-generation approach differs from other frameworks', that training the base models in parallel on bootstrapped samples is hard to parallelize, and that embedding such an approach into the composition process is quite time- and resource-intensive. However, it guarantees more stable and accurate models. This approach could instead be applied after composing: for example, if a boosting model appears in the final pipeline, try retraining it with this method.
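The post-composition idea can be sketched as a small bagging wrapper: refit the found boosting model on several bootstrap resamples and average the predictions. Everything here is an assumption for illustration; `fit_fn` stands in for the real training routine and `data` for labelled rows:

```python
import random
from statistics import mean

def bagged_predictions(data, fit_fn, n_estimators=5, rng=random):
    """Fit `fit_fn` on bootstrap resamples of `data`; return per-row mean predictions.

    `data` is a list of (features, target) pairs; `fit_fn(sample)` must
    return a callable model mapping features to a prediction.
    """
    preds = []
    for _ in range(n_estimators):
        sample = [rng.choice(data) for _ in data]   # bootstrap resample
        model = fit_fn(sample)
        preds.append([model(x) for x, _ in data])
    return [mean(col) for col in zip(*preds)]
```

Since this runs once on the final pipeline rather than inside the evolutionary loop, the extra fitting cost stays bounded.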

P.S.

I also note that adding weighted models, as well as models from the k-nearest-neighbors family, as meta-models in an ensemble would help diversify pipelines for classification and regression.

Also, note that the current method for detecting categorical features is imperfect.
