AutoML
The AutoMLx Python package automatically creates, optimizes, and explains machine learning pipelines and models. The AutoML pipeline provides a tuned ML pipeline that finds the best model for a given training dataset and prediction task. AutoML has a simple pipeline-level Python API that quickly jump-starts the data science process with an accurate, tuned model. AutoML supports the following tasks:
Supervised classification or regression on tabular datasets, where the target is a binary or multi-class value (classification) or a real-valued column (regression).
Supervised classification for Image and Text datasets.
Unsupervised anomaly detection, where the target or the labels are not provided.
Univariate and multivariate time series forecasting.
The AutoML pipeline consists of five major stages: preprocessing, algorithm selection, adaptive sampling, feature selection, and model tuning.
These pieces are readily combined into a simple AutoML pipeline which automatically optimizes the whole pipeline with limited user input/interaction.
Pipeline
- Pipeline ( task = 'classification' , dataset_format = 'pandas' , score_metric = None , random_state = 7 , n_algos_tuned = 1 , model_list = None , preprocessing = True , search_space = None , max_tuning_trials = None , search_strategy = 'HyperGD' , ** kwargs )
-
Create AutoMLPipeline based on task and dataset type
- Parameters :
-
-
task ( str , default='classification' ) – Machine learning task, supported: classification, regression, anomaly_detection, forecasting
-
dataset_format ( str , default='pandas' ) – Determine the type of input/output dataset. Defaults to pandas
-
score_metric ( str , callable , tuple , list or None , default=None ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
-
If None: the metric is determined automatically from the task. Default score metrics: classification (binary): neg_log_loss, classification (multiclass): neg_log_loss, regression: neg_mean_squared_error, forecasting: neg_sym_mean_abs_percent_error, anomaly_detection: unsupervised_unify95
-
If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.
-
If a callable: score function (or loss function) with signature score_func(model, X, y).
-
If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.
-
If a string: automatically infers the scoring metric from the string:
unsupervised – unsupervised_unify95, unsupervised_unify95_log_loss
continuous_forecast – neg_sym_mean_abs_percent_error, neg_root_mean_squared_percent_error, neg_mean_abs_scaled_error, neg_root_mean_squared_error, neg_mean_squared_error, neg_max_absolute_error, neg_mean_absolute_error, neg_max_abs_error, neg_mean_abs_error
binary – neg_log_loss, roc_auc, accuracy, f1, precision, recall, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_macro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
multiclass – neg_log_loss, accuracy, f1_micro, f1_macro, f1_weighted, f1_samples, recall_macro, recall_micro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
- More information on scoring metrics can be found here :
-
Classification metrics. Note: scoring variations like recall_macro are equivalent to sklearn.metrics.recall_score(..., average="macro")
continuous – neg_mean_squared_error, r2, neg_mean_absolute_error, neg_mean_squared_log_error, neg_median_absolute_error
- More information on scoring metrics can be found here :
-
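The callable and tuple forms of score_metric can be sketched as follows. This is only an illustration of the documented signature score_func(model, X, y): the MajorityClassModel class, the macro_recall function, and the metric name "my_macro_recall" are hypothetical and not part of AutoMLx.

```python
from collections import defaultdict


class MajorityClassModel:
    """Hypothetical stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        counts = defaultdict(int)
        for label in y:
            counts[label] += 1
        self.majority_ = max(counts, key=counts.get)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


def macro_recall(model, X, y):
    """Score function following the documented signature score_func(model, X, y)."""
    predictions = model.predict(X)
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of samples per true class
    for truth, pred in zip(y, predictions):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    # Macro averaging: unweighted mean of the per-class recalls.
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Tuple form (str, callable): the string names the metric.
named_metric = ("my_macro_recall", macro_recall)

X = [[0], [1], [2], [3]]
y = ["a", "a", "a", "b"]
model = MajorityClassModel().fit(X, y)
score = macro_recall(model, X, y)  # recall is 1.0 for "a" and 0.0 for "b"
```

Either macro_recall or named_metric could then be supplied as score_metric, alone or as the first entry of a list when it should be the metric the pipeline optimizes.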
-
random_state ( int , default=7 ) – Random seed used by AutoML.
-
n_algos_tuned ( int , default=1 ) –
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
-
To disable algorithm selection, set n_algos_tuned = len(model_list).
-
model_list ( List [ Model | str | Any ] or None , default=None ) –
Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyper-parameter configuration spaces defined in search_space. Custom models for regression and classification must implement the scikit-learn-style fit and predict methods. Classification models also must support predict_proba. Anomaly detection models must follow the pyod interface. (by default, all supported built-in models for a given task are used) Supported built-in models per task:
classification – CatBoostClassifier, DecisionTreeClassifier, ExtraTreesClassifier, GaussianNB, KNeighborsClassifier, LGBMClassifier, LogisticRegression, RandomForestClassifier, SVC, TorchMLPClassifier, XGBClassifier
regression – AdaBoostRegressor, DecisionTreeRegressor, ExtraTreesRegressor, KNeighborsRegressor, LGBMRegressor, LinearRegression, LinearSVR, RandomForestRegressor, SVR, TorchMLPRegressor, XGBRegressor
anomaly_detection – ClusteringLocalFactorOD, HistogramOD, IsolationForestOD, KNearestNeighborsOD, MinCovOD, OneClassSVMOD, PrincipalCompOD, AutoEncoder
forecasting – NaiveForecaster, ThetaForecaster, ExpSmoothForecaster, ETSForecaster, STLwESForecaster, STLwARIMAForecaster, SARIMAXForecaster, VARMAXForecaster, DynFactorForecaster
-
preprocessing ( bool , default=True ) –
Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.
-
If True, the auto-preprocessor runs on the dataset to normalize the data. Categorical features are label encoded and numeric features are normalized to a mean of 0 and a variance of 1 using sklearn.preprocessing.StandardScaler. Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by the mean for numeric features and the mode for categorical features.
-
If False, the user must cleanse (and normalize if desired) the dataset before passing it to AutoML. NaNs in the dataset are not allowed and will produce a ValueError. AutoML leaves it to the underlying algorithm implementations to handle strings (encoding strings is recommended).
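For a single numeric feature, the rules described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the AutoMLx implementation: the preprocess_numeric helper is hypothetical, missing values are represented as None, and imputing before scaling is an assumption.

```python
from statistics import fmean, pstdev

MISSING_LIMIT = 0.20  # features with more than 20 percent missing values are ignored


def preprocess_numeric(column):
    """Return a standardized copy of a numeric column, or None if it is dropped."""
    present = [v for v in column if v is not None]
    missing_share = 1 - len(present) / len(column)
    if missing_share > MISSING_LIMIT:
        return None  # too many missing values: the feature is ignored
    mean = fmean(present)
    # Impute the remaining missing values with the mean of the observed values.
    filled = [mean if v is None else v for v in column]
    # Population standard deviation, as used by StandardScaler.
    std = pstdev(filled)
    if std == 0:
        return [0.0 for _ in filled]
    return [(v - mean) / std for v in filled]


kept = preprocess_numeric([1.0, None, 3.0, 2.0, 2.0, 2.0])  # 1 of 6 values missing
dropped = preprocess_numeric([1.0, None, None, 4.0])        # half the values missing
```

With preprocessing=False, equivalent cleaning has to be done by the caller before the data reaches AutoML.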
-
search_space ( dict or None , default=None ) –
This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str) with search space as the key value. Key values must have two parameters: (1) ‘range’ which is a list containing the range and (2) ‘type’ which is one of ‘continuous’, ‘discrete’, ‘categorical’. For example, if the user wishes to provide a custom tune search space for LogisticRegression:
search_space = { 'LogisticRegression' : { 'C': { 'range': [0.03125, 512], 'type': 'continuous' }, 'solver': { 'range': ['newton-cg', 'lbfgs', 'liblinear', 'sag'], 'type': 'categorical' }, 'class_weight': { 'range': [None, 'balanced'], 'type': 'categorical' } } }
-
To disable Model Tune for all models, set search_space = {}.
-
If a key value is an empty dictionary, then Model Tune is disabled for that key.
-
If None, the default search space defined inside AutoML is used.
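The search_space layout described above can be checked before it is passed to the pipeline. The sketch below is illustrative only: check_search_space is a hypothetical helper, not an AutoMLx function, and it checks only the documented 'range' and 'type' entries.

```python
ALLOWED_TYPES = {"continuous", "discrete", "categorical"}


def check_search_space(search_space):
    """Return a list of layout problems; an empty list means none were found."""
    problems = []
    for algo, params in search_space.items():
        # An empty dict for an algorithm disables tuning for it: nothing to check.
        for name, spec in params.items():
            if "range" not in spec or "type" not in spec:
                problems.append(f"{algo}.{name}: needs both 'range' and 'type'")
                continue
            if not isinstance(spec["range"], list):
                problems.append(f"{algo}.{name}: 'range' must be a list")
            if spec["type"] not in ALLOWED_TYPES:
                problems.append(f"{algo}.{name}: unknown type {spec['type']!r}")
    return problems


search_space = {
    "LogisticRegression": {
        "C": {"range": [0.03125, 512], "type": "continuous"},
        "solver": {"range": ["newton-cg", "lbfgs", "liblinear", "sag"], "type": "categorical"},
    },
    "RandomForestClassifier": {},  # empty dict: Model Tune disabled for this algorithm
}

problems = check_search_space(search_space)
```

An empty problems list only means the dictionary has the documented shape; whether each hyperparameter name is accepted by the underlying model is not checked here.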
-
max_tuning_trials ( int , dict or None , default=None ) –
The maximum number of hyperparameter optimization (HPO) trials; this limit may be exceeded slightly.
-
If None: AutoML automatically determines when enough HPO trials have been completed.
-
If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials trials are performed in total.
-
If a dict: the limit is specified per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Missing values in the dictionary default to None.
-
search_strategy ( str , default='HyperGD' ) – The search strategy used in Model Tune. Valid search_strategy values: HyperGD, BruteForceSampler, CmaEsSampler, GridSampler, IntersectionSearchSpace, MOTPESampler, NSGAIISampler, NSGAIIISampler, PartialFixedSampler, QMCSampler, RandomSampler, TPESampler, intersection_search_space, nsgaii
-
kwargs ( Any ) –
Optional arguments. The arguments available for each task are listed in the corresponding configure method:
-
automlx.express.classifier.AutoClassifier.configure for ‘classification’
-
automlx.express.regressor.AutoRegressor.configure for ‘regression’
-
automlx.express.anomaly_detector.AutoAnomalyDetector.configure for ‘anomaly_detection’
-
automlx.express.forecaster.AutoForecaster.configure for ‘forecasting’
-
-
- Raises :
-
AutoMLxValueError – If the given task is not supported or the provided dataset format is not supported.
- Returns :
-
An AutoMLPipeline for the given task:
-
automlx.express.classifier.AutoClassifier for ‘classification’
-
automlx.express.regressor.AutoRegressor for ‘regression’
-
automlx.express.anomaly_detector.AutoAnomalyDetector for ‘anomaly_detection’
-
automlx.express.forecaster.Forecaster for ‘forecasting’
-
- Return type :
-
AutoMLPipeline
AutoClassifier
- class AutoClassifier
-
Classifier AutoMLPipeline
- classes_
-
Holds the label for each class (for task=classification only; otherwise it is set to None).
- Type :
-
List[Any]
- selected_features_names_
-
Names of the engineered features selected by the AutoML pipeline.
- Type :
-
List[ str ]
- selected_features_names_raw_
-
Names of the original features selected by the AutoML pipeline. If preprocessing is disabled, then this corresponds to selected_features_names_; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.
- Type :
-
List[ str ]
- ranked_models_
-
List of model names ranked in order of their quality from the last fit call.
- Type :
-
List[ str ]
- selected_model_params_
-
Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.
- Type :
- selected_rows_
-
List of indices in the original train dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. With cross-validation (CV), this attribute is a list of lists, one per fold. For example, without CV this attribute looks like [0, 1, 5], indicating that indices 0, 1, and 5 were selected during adaptive sampling. With CV=3, it looks like [ [0, 1], [0, 5], [1, 5] ], indicating that indices 0 and 1 were selected in the first fold, 0 and 5 in the second fold, and 1 and 5 in the third fold.
- Type :
- selected_valid_rows_
-
List of indices in the original validation dataset (if CV==None) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None, given that Adaptive Sampling does not sample the validation set when CV is enabled.
- Type :
- pipelines_
-
Sorted list of pipelines (length equal to n_algos_tuned), with the 0th element being the best model.
- Type :
- completed_trials_summary_
-
All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.
- Type :
- completed_trials_detailed_
-
A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.
- Type :
- n_jobs_
-
Parallelism internally used by AutoML. Calculated as inter_model_parallelism * intra_model_parallelism.
- Type :
- feature_importances_
-
Importance of each feature in the dataset for the selected model
- Type :
-
numpy.ndarray of shape (n_features,)
- threshold_tuning_score_
-
The validation score of the pipelines after applying threshold tuning. The scoring metric used to select this threshold can be found in threshold_tuning_scorer_ . It is None when the task is not classification or threshold_tuning is False.
- threshold_tuning_scorer_
-
The scoring metric used to select threshold during threshold tuning. It is None when the task is not classification or threshold_tuning is False.
- Type :
-
Metric
- configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , adaptive_sampling = None , min_features = None , optimization = None , preprocessing = None , search_space = None , min_class_instances = None , max_tuning_trials = None , search_strategy = None , threshold_tuning = None )
-
Configure the AutoClassifier.
If an argument is set to None, its value is not changed: the previously configured value is kept, or the default value is used if none was set.
- Parameters :
-
-
score_metric ( str , callable , tuple , list or None , default=None ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
-
If None: the metric is determined automatically from the task. Default score metrics: binary: neg_log_loss, multiclass: neg_log_loss
-
If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.
-
If a callable: score function (or loss function) with signature score_func(model, X, y).
-
If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.
-
If a string: automatically infers the scoring metric from the string:
binary – neg_log_loss, roc_auc, accuracy, f1, precision, recall, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_macro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
multiclass – neg_log_loss, accuracy, f1_micro, f1_macro, f1_weighted, f1_samples, recall_macro, recall_micro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
- More information on scoring metrics can be found here :
-
Classification metrics. Note: scoring variations like recall_macro are equivalent to sklearn.metrics.recall_score(..., average="macro")
-
-
random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set):
7
-
n_algos_tuned ( int or None , default=None ) –
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
-
To disable algorithm selection, set n_algos_tuned = len(model_list).
Default value (if not previously set): 1
-
-
model_list ( List [ str | Any ] or None , default=None ) –
Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyper-parameter configuration spaces defined in search_space. Custom models for classification must implement the scikit-learn-style fit, predict, and predict_proba methods. (by default, all supported built-in models for a given task are used) Supported built-in models per task:
classification – CatBoostClassifier, DecisionTreeClassifier, ExtraTreesClassifier, GaussianNB, KNeighborsClassifier, LGBMClassifier, LogisticRegression, RandomForestClassifier, SVC, TorchMLPClassifier, XGBClassifier
-
adaptive_sampling ( bool or None , default=None ) – Set to False to disable class balancing and adaptive sampling done in AutoML. Disabling this might significantly increase runtime. Default value (if not previously set):
True
-
min_features ( int , float , list or None , default=None ) –
Minimum number of features to keep. Acceptable values:
-
If int, 0 < min_features <= n_features
-
If float, 0 < min_features <= 1.0
-
If list, names of features to keep; for example, ['a', 'b'] means keep features ‘a’ and ‘b’
-
To disable feature selection, set min_features = 1.0
Default value (if not previously set): 1
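The three accepted forms of min_features differ mainly in how the value is interpreted, which the following sketch makes explicit. The check_min_features helper is hypothetical and only mirrors the ranges listed above; it is not how AutoMLx validates the argument.

```python
def check_min_features(min_features, feature_names):
    """Return True if min_features falls within the documented acceptable values."""
    n_features = len(feature_names)
    if isinstance(min_features, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("min_features must be an int, float, or list")
    if isinstance(min_features, int):
        # Absolute count of features to keep.
        return 0 < min_features <= n_features
    if isinstance(min_features, float):
        # Value in (0, 1.0]; 1.0 disables feature selection.
        return 0 < min_features <= 1.0
    if isinstance(min_features, list):
        # Explicit names of features to keep.
        return all(name in feature_names for name in min_features)
    raise TypeError("min_features must be an int, float, or list")


names = ["a", "b", "c"]
int_ok = check_min_features(1, names)        # integer default: keep at least one feature
float_ok = check_min_features(1.0, names)    # float 1.0: feature selection disabled
list_ok = check_min_features(["a", "b"], names)
```

Note the difference between the integer 1 (the default) and the float 1.0: the former is a count, the latter disables feature selection.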
-
-
optimization ( int or None , default=None ) –
Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.
-
Level 0: Optimized for reproducibility (controls most randomness)
-
Level 3: Optimized for speed and accuracy
-
Level 10: Optimized for speed
Default value (if not previously set): 3
-
-
preprocessing ( bool or None , default=None ) –
Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.
-
If True, the auto-preprocessor runs on the dataset to normalize the data. Categorical features are label encoded and numeric features are normalized to a mean of 0 and a variance of 1 using sklearn.preprocessing.StandardScaler. Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by the mean for numeric features and the mode for categorical features.
-
If False, the user must cleanse (and normalize if desired) the dataset before passing it to AutoML. NaNs in the dataset are not allowed and will produce a ValueError. AutoML leaves it to the underlying algorithm implementations to handle strings (encoding strings is recommended).
Default value (if not previously set): True
-
-
search_space ( dict or None , default=None ) –
This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str) with search space as the key value. Key values must have two parameters: (1) ‘range’ which is a list containing the range and (2) ‘type’ which is one of ‘continuous’, ‘discrete’, ‘categorical’. For example, if the user wishes to provide a custom tune search space for LogisticRegression:
search_space = { 'LogisticRegression' : { 'C': { 'range': [0.03125, 512], 'type': 'continuous' }, 'solver': { 'range': ['newton-cg', 'lbfgs', 'liblinear', 'sag'], 'type': 'categorical' }, 'class_weight': { 'range': [None, 'balanced'], 'type': 'categorical' } } }
-
To disable Model Tune for all models, set search_space = {}.
-
If a key value is an empty dictionary, then Model Tune is disabled for that key.
-
If None, the default search space defined inside AutoML is used.
-
min_class_instances ( int or None , default=None ) – The minimum number of instances every class must have when doing classification. If any class has fewer than this number of instances, training is stopped. This argument may take any value of 2 or higher. Default value (if not previously set): 5
-
max_tuning_trials ( int , dict or None , default=None ) –
The maximum number of hyperparameter optimization (HPO) trials; this limit may be exceeded slightly.
-
If None: AutoML automatically determines when enough HPO trials have been completed.
-
If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials trials are performed in total.
-
If a dict: the limit is specified per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Missing values in the dictionary default to None.
Default value (if not previously set): None
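The arithmetic behind the total trial budget can be sketched as below. The total_trial_cap helper is hypothetical and not part of AutoMLx; it treats None as "no fixed cap" and ignores the fact that the limit may be exceeded slightly.

```python
def total_trial_cap(tuned_algos, max_tuning_trials):
    """Upper bound on HPO trials across the tuned algorithms, or None if unbounded."""
    if max_tuning_trials is None:
        return None  # AutoML decides when enough trials have been completed
    if isinstance(max_tuning_trials, int):
        # The same per-algorithm limit applies to every tuned algorithm.
        return len(tuned_algos) * max_tuning_trials
    # Dict form: algorithms missing from the dictionary default to None.
    caps = [max_tuning_trials.get(algo) for algo in tuned_algos]
    if any(cap is None for cap in caps):
        return None
    return sum(caps)


algos = ["LogisticRegression", "RandomForestClassifier"]
cap_int = total_trial_cap(algos, 50)
cap_dict = total_trial_cap(algos, {"LogisticRegression": 100, "RandomForestClassifier": 200})
cap_partial = total_trial_cap(algos, {"LogisticRegression": 100})
```

Here cap_int follows the 2 * max_tuning_trials rule for two tuned algorithms, while cap_partial is None because one algorithm has no explicit limit.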
-
search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: HyperGD, BruteForceSampler, CmaEsSampler, GridSampler, IntersectionSearchSpace, MOTPESampler, NSGAIISampler, NSGAIIISampler, PartialFixedSampler, QMCSampler, RandomSampler, TPESampler, intersection_search_space, nsgaii Default value (if not previously set):
'HyperGD'
-
threshold_tuning ( bool or None , default=None ) –
Determines whether or not AutoML optimizes the prediction threshold. Threshold tuning is only used in classification tasks. However, unlike classic threshold tuning, AutoML uses a novel technique that increases or decreases the model’s prediction probabilities for a given class, thereby keeping the prediction threshold fixed at 0.5 for binary classification and allowing the method to generalize to multi-class classification problems.
-
If True, the prediction threshold is optimized based on the provided score metric. Threshold tuning allows users to post-process classification model predictions to optimize for their custom metric. Threshold tuning is not exported to ONNX models, so the ONNX model quality may be lower than that of the original model.
-
If False, threshold tuning is not applied.
Default value (if not previously set): False
-
-
- train ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )
-
Automatically identifies the most relevant model and hyperparameters for the given set of features (X) and target (y). Does not conduct the final model fit. If the latter is desired, use fit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
cv ( int , str or None , default='auto' ) –
Determines the cross-validation split. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
‘auto’: uses 5 folds if the number of instances is below 1M; otherwise cross-validation is disabled
-
integer: specifies the number of folds in a (Stratified)KFold
-
iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
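The rules for the string, integer, and None forms of cv can be summarized in a small sketch. The resolve_cv helper is hypothetical, the minimum of 2 folds is an assumption, and the iterable form is not covered.

```python
def resolve_cv(cv, n_rows, is_classifier):
    """Return (n_folds, splitter_name) for the documented str/int/None inputs."""
    if cv == "auto":
        # Documented rule: 5 folds below one million instances, otherwise no folds.
        cv = 5 if n_rows < 1_000_000 else None
    if cv is None:
        # No folds: X_valid and y_valid are used for validation instead.
        return None, None
    if isinstance(cv, bool) or not isinstance(cv, int) or cv < 2:
        # Assumption: at least 2 folds are needed for a meaningful split.
        raise ValueError("cv must be 'auto', None, or an integer >= 2")
    # Classifiers with a binary or multiclass target use stratified folds.
    splitter = "StratifiedKFold" if is_classifier else "KFold"
    return cv, splitter


small = resolve_cv("auto", n_rows=10_000, is_classifier=True)
large = resolve_cv("auto", n_rows=2_000_000, is_classifier=True)
regression = resolve_cv(3, n_rows=500, is_classifier=False)
```

When the result is (None, None), a validation set passed through X_valid and y_valid is what the description above says is used for validation.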
-
col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, ‘timedelta’, ‘image’. For text classification, it has to be set to ‘text’. For image classification, features with a col_type of ‘image’ should be columns containing images in PIL format. If not None, it manually specifies the type of every dataset feature.
-
time_budget ( Dict [ str , float ] , float , default=-1 ) –
- If float:
-
Time budget in seconds.
- If Dict[str, float]:
-
Time budget for each step in seconds. Step names are: ModelSelection, ModelTune, FeatureSelection, AdaptiveSampling, ThresholdTuning
-
A value of -1 means an unconstrained time budget: best effort mode is enabled and optimization continues until convergence.
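The accepted shapes of time_budget can be illustrated with a short sketch. The describe_time_budget helper is hypothetical; only the step names are taken from the description above.

```python
STEP_NAMES = (
    "ModelSelection",
    "ModelTune",
    "FeatureSelection",
    "AdaptiveSampling",
    "ThresholdTuning",
)


def describe_time_budget(time_budget):
    """Normalize a time_budget value into a per-step dict, seconds, or 'unconstrained'."""
    if isinstance(time_budget, dict):
        unknown = sorted(set(time_budget) - set(STEP_NAMES))
        if unknown:
            raise ValueError(f"unknown step names: {unknown}")
        # Per-step budgets, in seconds.
        return {step: float(seconds) for step, seconds in time_budget.items()}
    if time_budget == -1:
        # Best effort mode: optimization continues until convergence.
        return "unconstrained"
    # Single overall budget, in seconds.
    return float(time_budget)


overall = describe_time_budget(600)
per_step = describe_time_budget({"ModelSelection": 60, "ModelTune": 300})
default = describe_time_budget(-1)
```

A dictionary lets one stage, such as ModelTune, receive most of the budget while the others stay short.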
-
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- fit ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )
-
Automatically identifies the most relevant features, model, and hyperparameters for the given training data (X) and target (y). The final model fit is conducted on the full dataset.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target. Note that y is required for forecasting task.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
cv ( int , str or None , default='auto' ) –
Determines the cross-validation split. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
‘auto’: uses 5 folds if the number of instances is below 1M; otherwise cross-validation is disabled
-
integer: specifies the number of folds in a (Stratified)KFold
-
iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
-
col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, ‘timedelta’, ‘image’. For text classification, it has to be set to ‘text’. For image classification, features with a col_type of ‘image’ should be columns containing images in PIL format. If not None, it manually specifies the type of every dataset feature.
-
time_budget ( Dict [ str , float ] , float , default=-1 ) –
- If float:
-
Time budget in seconds.
- If Dict[str, float]:
-
Time budget for each step in seconds. Step names are: ModelSelection, ModelTune, FeatureSelection, AdaptiveSampling, ThresholdTuning
-
A value of -1 means an unconstrained time budget: best effort mode is enabled and optimization continues until convergence.
-
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- refit ( self , X , y , X_valid = None , y_valid = None )
-
Refits a previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, and Model Tune are reused. fit must have been called before calling this method. If a validation set is provided, it is concatenated with the training set before the refit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- predict_proba ( self , X )
-
Probability estimates.
- Parameters :
-
X ( pandas.DataFrame ) – Prediction dataset features
- Raises :
-
AutoMLxRuntimeError – If there are no predictions after calling the model on the given dataset
- Returns :
-
y_pred – The predicted probabilities.
- Return type :
-
numpy.ndarray of shape = (n_samples, n_classes)
AutoRegressor
- class AutoRegressor
-
Regressor AutoMLPipeline
- selected_features_names_
-
Names of the engineered features selected by the AutoML pipeline.
- Type :
-
List[ str ]
- selected_features_names_raw_
-
Names of the original features selected by the AutoML pipeline. If preprocessing is disabled, then this corresponds to selected_features_names_; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.
- Type :
-
List[ str ]
- ranked_models_
-
List of model names ranked in order of their quality from the last fit call.
- Type :
-
List[ str ]
- selected_model_params_
-
Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.
- Type :
- selected_rows_
-
List of indices in the original train dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. With cross-validation (CV), this attribute is a list of lists, one per fold. For example, without CV this attribute looks like [0, 1, 5], indicating that indices 0, 1, and 5 were selected during adaptive sampling. With CV=3, it looks like [ [0, 1], [0, 5], [1, 5] ], indicating that indices 0 and 1 were selected in the first fold, 0 and 5 in the second fold, and 1 and 5 in the third fold.
- Type :
- selected_valid_rows_
-
List of indices in the original validation dataset (if CV==None) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None, given that Adaptive Sampling does not sample the validation set when CV is enabled.
- Type :
- pipelines_
-
Sorted list of pipelines (length equal to n_algos_tuned), with the 0th element being the best model.
- Type :
- completed_trials_summary_
-
All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.
- Type :
- completed_trials_detailed_
-
A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.
- Type :
- n_jobs_
-
Parallelism internally used by AutoML. Calculated as inter_model_parallelism * intra_model_parallelism.
- Type :
- feature_importances_
-
Importance of each feature in the dataset for the selected model
- Type :
-
numpy.ndarray of shape (n_features,)
- configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , adaptive_sampling = None , min_features = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None )
-
Configure the AutoRegressor.
If an argument is set to None, its value is not changed: the previously configured value is kept, or the default value is used if none was set.
- Parameters :
-
-
score_metric ( str , callable , tuple , list or None , default=None ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
-
If None: the metric is determined automatically from the task. Default score metric: neg_mean_squared_error
-
If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.
-
If a callable: score function (or loss function) with signature score_func(model, X, y).
-
If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.
-
If a string: automatically infers the scoring metric from the string:
continuous – neg_mean_squared_error, r2, neg_mean_absolute_error, neg_mean_squared_log_error, neg_median_absolute_error
- More information on scoring metrics can be found here :
-
-
random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set):
7
-
n_algos_tuned ( int or None , default=None ) –
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
-
To disable algorithm selection, set n_algos_tuned = len(model_list).
Default value (if not previously set): 1
-
-
model_list ( List [ str | Any ] or None , default=None ) –
Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyper-parameter configuration spaces defined in search_space. Custom models for regression must implement the scikit-learn-style fit and predict methods. (by default, all supported built-in models for a given task are used) Supported built-in models per task:
regression – AdaBoostRegressor, DecisionTreeRegressor, ExtraTreesRegressor, KNeighborsRegressor, LGBMRegressor, LinearRegression, LinearSVR, RandomForestRegressor, SVR, TorchMLPRegressor, XGBRegressor
-
adaptive_sampling ( bool or None , default=None ) – Set to False to disable class balancing and adaptive sampling done in AutoML. Disabling this might significantly increase runtime. Default value (if not previously set):
True
-
min_features ( int , float , list or None , default=None ) –
Minimum number of features to keep. Acceptable values:
-
If int, 0 < min_features <= n_features
-
If float, 0 < min_features <= 1.0
-
If list, names of features to keep; for example, ['a', 'b'] means keep features ‘a’ and ‘b’
-
To disable feature selection, set min_features = 1.0
Default value (if not previously set): 1
-
-
optimization ( int or None , default=None ) –
Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.
-
Level 0: Optimized for reproducibility (controls most randomness)
-
Level 3: Optimized for speed and accuracy
-
Level 10: Optimized for speed
Default value (if not previously set): 3
-
-
preprocessing ( bool or None , default=None ) –
Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.
-
If True, auto-preprocessor runs on dataset to normalize data. Categorical features are label encoded and numeric features are normalized to mean of 0 and variance of 1 using sklearn.preprocessing.StandardScaler. Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by mean for numeric features and mode for categorical features.
-
If False, user must cleanse (and normalize if desired) dataset before passing data to AutoML. The use of NaNs in the dataset is not allowed and will produce a ValueError. AutoML will leave it to the underlying algorithm implementations to handle strings (it is recommended to encode strings).
Default value (if not previously set): True
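The numeric part of the preprocessing described above can be approximated with the standard library alone. This is a sketch under stated assumptions, not the AutoMLx preprocessor: standardize_numeric is a hypothetical helper, missing values are represented as None, and leaving zero-variance columns unscaled is a choice made here.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Optional


def standardize_numeric(
    columns: Dict[str, List[Optional[float]]],
    max_missing: float = 0.20,
) -> Dict[str, List[float]]:
    """Drop sparse columns, mean-impute the rest, scale to mean 0 / variance 1.

    Hypothetical helper for illustration only.
    """
    result: Dict[str, List[float]] = {}
    for name, values in columns.items():
        if not values:
            continue
        present = [v for v in values if v is not None]
        n_missing = len(values) - len(present)
        # Features with more than 20 percent missing values are ignored.
        if n_missing > max_missing * len(values):
            continue
        mean = fmean(present)
        # Remaining missing values are imputed by the column mean.
        filled = [mean if v is None else v for v in values]
        # Population standard deviation, so the scaled column has variance 1.
        std = pstdev(filled)
        scale = std if std > 0 else 1.0  # assumption: constant columns stay at 0
        result[name] = [(v - mean) / scale for v in filled]
    return result


data = {
    "a": [1.0, 3.0, None, 1.0, 3.0, 2.0],  # 1 of 6 missing: kept
    "b": [None, None, 1.0, 2.0],           # 2 of 4 missing: dropped
}
scaled = standardize_numeric(data)
print(sorted(scaled))
```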
-
-
search_space ( dict or None , default=None ) –
This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str) with the search space as the value. Each hyperparameter entry must have two keys: (1) ‘range’, a list containing the range, and (2) ‘type’, one of ‘continuous’, ‘discrete’, ‘categorical’. For example, to provide a custom tuning search space for AdaBoostRegressor:
search_space = {
    'AdaBoostRegressor': {
        'learning_rate': {'range': [0.05, 1], 'type': 'continuous'},
        'n_estimators': {'range': [10, 50], 'type': 'discrete'},
    }
}
-
To disable Model Tune for all models, set search_space = {}.
-
If a key value is an empty dictionary, then Model Tune is disabled for that key.
-
If None, the default search space defined inside AutoML is used.
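A search space in this shape can be sanity-checked before it is handed to the pipeline. The following is a minimal sketch: check_search_space is a hypothetical helper, not an AutoMLx function, and it only verifies the structure described above (the 'range' and 'type' keys and the three allowed type names).

```python
from typing import Any, Dict

ALLOWED_TYPES = {"continuous", "discrete", "categorical"}


def check_search_space(search_space: Dict[str, Dict[str, Dict[str, Any]]]) -> None:
    """Raise ValueError if an entry does not follow the documented layout.

    Hypothetical helper for illustration only.
    """
    for algorithm, hyperparameters in search_space.items():
        # An empty dict is valid: it disables tuning for that algorithm.
        for name, spec in hyperparameters.items():
            missing = {"range", "type"} - set(spec)
            if missing:
                raise ValueError(f"{algorithm}.{name} is missing {sorted(missing)}")
            if spec["type"] not in ALLOWED_TYPES:
                raise ValueError(f"{algorithm}.{name} has unknown type {spec['type']!r}")
            if not isinstance(spec["range"], list) or not spec["range"]:
                raise ValueError(f"{algorithm}.{name} needs a non-empty list as range")


search_space = {
    "AdaBoostRegressor": {
        "learning_rate": {"range": [0.05, 1], "type": "continuous"},
        "n_estimators": {"range": [10, 50], "type": "discrete"},
    },
    "SVR": {},  # empty dict: tuning disabled for this model only
}
check_search_space(search_space)
print("search space is well-formed")
```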
-
max_tuning_trials ( int , dict or None , default=None ) –
The maximum number of HPO trials; may be exceeded slightly.
-
If None: AutoML automatically determines when enough HPO trials have been completed.
-
If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials are performed in total.
-
If a dict: by passing a dictionary you can specify this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Missing values in the dictionary default to None.
Default value (if not previously set): None
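The arithmetic behind max_tuning_trials can be written out as a short function. This is an illustrative sketch only: max_total_trials is a hypothetical helper, and the number it returns is the nominal cap, which the text above says may be exceeded slightly.

```python
from typing import Dict, List, Optional, Union


def max_total_trials(
    max_tuning_trials: Union[int, Dict[str, int], None],
    tuned_algorithms: List[str],
) -> Optional[int]:
    """Return the nominal cap on total HPO trials, or None if there is no cap.

    Hypothetical helper for illustration only.
    """
    if max_tuning_trials is None:
        return None  # AutoML decides when enough trials have been run
    if isinstance(max_tuning_trials, int):
        # The integer applies to each tuned algorithm separately.
        return max_tuning_trials * len(tuned_algorithms)
    total = 0
    for algorithm in tuned_algorithms:
        limit = max_tuning_trials.get(algorithm)
        if limit is None:
            return None  # a missing entry defaults to None, i.e. no fixed cap
        total += limit
    return total


limits = {"LogisticRegression": 100, "RandomForestClassifier": 200}
print(max_total_trials(50, ["LogisticRegression", "RandomForestClassifier"]))
print(max_total_trials(limits, ["LogisticRegression", "RandomForestClassifier"]))
```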
-
search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: HyperGD, BruteForceSampler, CmaEsSampler, GridSampler, IntersectionSearchSpace, MOTPESampler, NSGAIISampler, NSGAIIISampler, PartialFixedSampler, QMCSampler, RandomSampler, TPESampler, intersection_search_space, nsgaii. Default value (if not previously set): 'HyperGD'
-
- train ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )
-
Automatically identifies the most relevant model and hyperparameters for this given set of features (X) and target (y). Does not conduct final model fit. If the latter is desired, use fit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
cv ( int , str or None , default='auto' ) –
Determines the cross-validation split. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
’auto’: uses 5 folds if number of instances < 1M, disable cv-folds otherwise
-
integer: specifies the number of folds in a (Stratified)KFold,
-
iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
-
col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’. If not None, it manually specifies the type of every dataset feature.
-
time_budget ( Dict [ str , float ] , float or None , default=-1 ) –
-
If float: time budget in seconds.
-
If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune, FeatureSelection, AdaptiveSampling, ThresholdTuning.
-
-1 for unconstrained time budget: best-effort mode is enabled and optimization continues until convergence.
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- fit ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )
-
Automatically identifies the most relevant features, model and hyperparameters for the given training data (X) and target (y). Final model fit is conducted on the full dataset.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target. Note that y is required for forecasting task.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
cv ( int , str or None , default='auto' ) –
Determines the cross-validation split. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
’auto’: uses 5 folds if number of instances < 1M, disable cv-folds otherwise
-
integer: specifies the number of folds in a (Stratified)KFold,
-
iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
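The rule for cv='auto' can be written as a short function. This is a sketch of the documented rule only: resolve_cv_folds is a hypothetical helper, it reads "1M" as 1,000,000 rows, and it does not cover the iterable form of cv.

```python
from typing import Optional, Union


def resolve_cv_folds(cv: Union[int, str, None], n_rows: int) -> Optional[int]:
    """Return the number of CV folds, or None when a validation set is used.

    Hypothetical helper for illustration only.
    """
    if cv is None:
        return None  # X_valid and y_valid are used for validation
    if cv == "auto":
        # 5 folds below one million instances, no cross-validation otherwise.
        return 5 if n_rows < 1_000_000 else None
    if isinstance(cv, int) and cv >= 2:
        return cv
    raise ValueError("cv must be None, 'auto', or an integer >= 2 in this sketch")


print(resolve_cv_folds("auto", 50_000))
print(resolve_cv_folds("auto", 2_000_000))
print(resolve_cv_folds(3, 50_000))
```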
-
col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’. If not None, it manually specifies the type of every dataset feature.
-
time_budget ( Dict [ str , float ] , float or None , default=-1 ) –
-
If float: time budget in seconds.
-
If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune, FeatureSelection, AdaptiveSampling, ThresholdTuning.
-
-1 for unconstrained time budget: best-effort mode is enabled and optimization continues until convergence.
-
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- refit ( self , X , y , X_valid = None , y_valid = None )
-
Refit a previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, and Model Tune are re-used. fit must have been called before calling this method. If a validation set is provided, it will be concatenated with the training set before doing the refit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
AutoAnomalyDetector
- class AutoAnomalyDetector
-
Anomaly Detection AutoMLPipeline
- classes_
-
Holds the label for each class (for task=classification only, otherwise it is set to None).
- Type :
-
List[Any]
- selected_features_names_
-
Names of the engineered features selected by the AutoML pipeline.
- Type :
-
List[ str ]
- selected_features_names_raw_
-
Names of the original features selected by the AutoML pipeline. If preprocessing is disabled, then this corresponds to selected_features_names_; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.
- Type :
-
List[ str ]
- ranked_models_
-
List of model names ranked in order of their quality from the last fit call.
- Type :
-
List[ str ]
- selected_model_params_
-
Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.
- Type :
- selected_rows_
-
List of indices in the original train dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. In the case of CV, this attribute will result in a list of lists corresponding to indices selected in each fold. For example, in the case of no CV, this attribute looks like: [0, 1, 5], indicating indices 0, 1, and 5 have been selected during adaptive sampling. In the case of CV=3, this attribute looks like: [ [0, 1], [0, 5], [1, 5] ], indicating indices 0, 1 were selected from the first fold, 0, 5 were selected in the 2nd fold, and 1, 5 were selected in the 3rd fold.
- Type :
- selected_valid_rows_
-
List of indices in the original validation dataset (if CV==None) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None given that Adaptive Sampling does not sample the validation set when CV is enabled.
- Type :
- pipelines_
-
Sorted list of pipelines (length equal to n_algos_tuned), with 0th element being the best model.
- Type :
- completed_trials_summary_
-
All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.
- Type :
- completed_trials_detailed_
-
A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.
- Type :
- n_jobs_
-
Parallelism internally used by AutoML. Calculated as inter_model_parallelism*intra_model_parallelism.
- Type :
- feature_importances_
-
Importance of each feature in the dataset for the selected model
- Type :
-
numpy.ndarray of shape (n_features,)
- configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None )
-
Configure the AutoAnomalyDetector.
If an argument is set to None, its current value is left unchanged; if it has not been set before, the default value is used.
- Parameters :
-
-
score_metric ( str , callable , tuple , list or None , default=None ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
-
If None: it will be determined automatically depending on the task. Default score metrics: unsupervised_unify95
-
If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.
-
If a callable: score function (or loss function) with signature score_func(model, X, y).
-
If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.
-
If a string: automatically infers the scoring metric from the string:
unsupervised – unsupervised_unify95, unsupervised_unify95_log_loss
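The callable and tuple forms of score_metric can be illustrated with a toy scorer. This is a sketch under assumptions: ThresholdModel and accuracy_score_func are hypothetical names, the model is assumed to expose a predict method, and higher values are assumed to be better (suggested by the neg_ prefix of the built-in loss metrics, but not stated above).

```python
from typing import Callable, List, Sequence, Tuple


class ThresholdModel:
    """Toy stand-in for a fitted model (hypothetical): flags values above a threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, X: Sequence[float]) -> List[int]:
        return [1 if x > self.threshold else 0 for x in X]


def accuracy_score_func(model, X: Sequence[float], y: Sequence[int]) -> float:
    """Score function with the documented signature score_func(model, X, y)."""
    predictions = model.predict(X)
    matches = sum(int(p == t) for p, t in zip(predictions, y))
    return matches / len(y)


# Tuple form: (name of the metric, callable with the signature above).
named_metric: Tuple[str, Callable] = ("toy_accuracy", accuracy_score_func)

model = ThresholdModel(threshold=2.5)
print(named_metric[0], named_metric[1](model, [1.0, 2.0, 3.0, 4.0], [0, 1, 1, 1]))
```

When several metrics are passed as a list, the first entry is the one the pipeline optimizes, so a tuple like named_metric would go first if it should drive model selection.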
-
random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set):
7
-
n_algos_tuned ( int or None , default=None ) –
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
-
To disable algorithm selection, set n_algos_tuned = len(model_list).
Default value (if not previously set): 1
-
-
model_list ( List [ str | Any ] or None , default=None ) –
Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyperparameter configuration spaces defined in search_space. Custom models for anomaly detection must follow the pyod interface. By default, all supported built-in models for the given task are used. Supported built-in models per task:
anomaly_detection – ClusteringLocalFactorOD, HistogramOD, IsolationForestOD, KNearestNeighborsOD, MinCovOD, OneClassSVMOD, PrincipalCompOD, AutoEncoder
-
optimization ( int or None , default=None ) –
Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.
-
Level 0: Optimized for reproducibility (controls most randomness)
-
Level 3: Optimized for speed and accuracy
-
Level 10: Optimized for speed
Default value (if not previously set): 3
-
-
preprocessing ( bool or None , default=None ) –
Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.
-
If True, auto-preprocessor runs on dataset to normalize data. Categorical features are label encoded and numeric features are normalized to mean of 0 and variance of 1 using sklearn.preprocessing.StandardScaler. Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by mean for numeric features and mode for categorical features.
-
If False, user must cleanse (and normalize if desired) dataset before passing data to AutoML. The use of NaNs in the dataset is not allowed and will produce a ValueError. AutoML will leave it to the underlying algorithm implementations to handle strings (it is recommended to encode strings).
Default value (if not previously set): True
-
-
search_space ( dict or None , default=None ) –
This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str) with the search space as the value. Each hyperparameter entry must have two keys: (1) ‘range’, a list containing the range, and (2) ‘type’, one of ‘continuous’, ‘discrete’, ‘categorical’. For example, to provide a custom tuning search space for IsolationForestOD:
search_space = {
    'IsolationForestOD': {
        'n_estimators': {'range': [10, 50], 'type': 'discrete'},
        'max_features': {'range': [0.5, 0.7], 'type': 'continuous'},
        'max_samples': {'range': [5, 10], 'type': 'discrete'},
    }
}
-
To disable Model Tune for all models, set search_space = {}.
-
If a key value is an empty dictionary, then Model Tune is disabled for that key.
-
If None, the default search space defined inside AutoML is used.
-
max_tuning_trials ( int , dict or None , default=None ) –
The maximum number of HPO trials; may be exceeded slightly.
-
If None: AutoML automatically determines when enough HPO trials have been completed.
-
If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials are performed in total.
-
If a dict: by passing a dictionary you can specify this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Missing values in the dictionary default to None.
Default value (if not previously set): None
-
search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: HyperGD, BruteForceSampler, CmaEsSampler, GridSampler, IntersectionSearchSpace, MOTPESampler, NSGAIISampler, NSGAIIISampler, PartialFixedSampler, QMCSampler, RandomSampler, TPESampler, intersection_search_space, nsgaii. Default value (if not previously set): 'HyperGD'
-
- train ( self , X , X_valid = None , y_valid = None , col_types = None , time_budget = - 1 , contamination = None )
-
Automatically identifies the most relevant model and hyperparameters for this given set of features (X). Does not conduct final model fit. If the latter is desired, use fit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, ‘timedelta’. If not None, it manually specifies the type of every dataset feature.
-
time_budget ( Dict [ str , float ] , float or None , default=-1 ) –
-
If float: time budget in seconds.
-
If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune.
-
-1 for unconstrained time budget: best-effort mode is enabled and optimization continues until convergence.
-
-
contamination ( float or None , default=None ) – Fraction of training dataset corresponding to anomalies (between 0.0 and 0.5). Should only be set for supervised anomaly detection (y_valid is required). Should be set to None for unsupervised anomaly detection (when using the unsupervised metrics).
-
- Raises :
-
AutoMLxValueError – If contamination has been provided for unsupervised AD
- Returns :
-
self
- Return type :
-
AutoMLPipeline
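The contamination rules stated for train can be expressed as a pre-flight check. This is a minimal sketch: check_contamination is a hypothetical helper that raises a plain ValueError rather than AutoMLxValueError, and treating both bounds of the 0.0 to 0.5 range as inclusive is an assumption.

```python
from typing import Optional, Sequence


def check_contamination(
    contamination: Optional[float],
    y_valid: Optional[Sequence[int]],
) -> None:
    """Validate contamination against the documented constraints.

    Hypothetical helper for illustration only.
    """
    if contamination is None:
        return  # unsupervised anomaly detection: nothing to check
    if y_valid is None:
        # contamination is only meaningful for supervised anomaly detection
        raise ValueError("contamination requires y_valid (supervised AD)")
    if not 0.0 <= contamination <= 0.5:  # assumption: inclusive bounds
        raise ValueError("contamination must be between 0.0 and 0.5")


check_contamination(None, None)          # unsupervised: accepted
check_contamination(0.1, [0, 0, 1, 0])   # supervised: accepted
print("contamination settings are consistent")
```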
- fit ( self , X , X_valid = None , y_valid = None , col_types = None , time_budget = - 1 , contamination = None )
-
Automatically identifies the most relevant features, model and hyperparameters for the given training data (X). Final model fit is conducted on the full dataset.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’.
-
time_budget ( Dict [ str , float ] , float , default=-1 ) –
-
If float: time budget in seconds.
-
If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune.
-
-1 for unconstrained time budget: best-effort mode is enabled and optimization continues until convergence.
-
-
contamination ( float or None , default=None ) – Fraction of training dataset corresponding to anomalies (between 0.0 and 0.5). Should only be set for supervised anomaly detection (y_valid is required). Should be set to None for unsupervised anomaly detection (when using the unsupervised metrics).
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- refit ( self , X , X_valid = None , y_valid = None )
-
Refit a previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, and Model Tune are re-used. fit must have been called before calling this method. If a validation set is provided, it will be concatenated with the training set before doing the refit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- predict_proba ( self , X )
-
Probability estimates.
- Parameters :
-
X ( pandas.DataFrame ) – Prediction dataset features
- Raises :
-
-
AutoMLxNotFittedError – If the pipeline is not fitted yet.
-
AutoMLxRuntimeError – If there are no predictions after calling the model over the given dataset.
-
- Returns :
-
y_pred – The predicted probabilities.
- Return type :
-
numpy.ndarray of shape = (n_samples, n_classes)
AutoForecaster
- class AutoForecaster
-
Forecasting AutoMLPipeline
- ranked_models_
-
List of model names ranked in order of their quality from the last fit call.
- Type :
-
List[ str ]
- selected_model_params_
-
Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.
- Type :
- pipelines_
-
Sorted list of pipelines (length equal to n_algos_tuned), with 0th element being the best model.
- Type :
- completed_trials_summary_
-
All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.
- Type :
- completed_trials_detailed_
-
A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.
- Type :
- configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None , time_series_period = None )
-
Configure the AutoForecaster.
If an argument is set to None, its current value is left unchanged; if it has not been set before, the default value is used.
- Parameters :
-
-
score_metric ( str , callable , tuple , list or None , default=None ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
-
If None: it will be determined automatically depending on the task. Default score metrics: neg_sym_mean_abs_percent_error
-
If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.
-
If a callable: score function (or loss function) with signature score_func(model, X, y).
-
If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.
-
If a string: automatically infers the scoring metric from the string:
continuous_forecast – neg_sym_mean_abs_percent_error, neg_root_mean_squared_percent_error, neg_mean_abs_scaled_error, neg_root_mean_squared_error, neg_mean_squared_error, neg_max_absolute_error, neg_mean_absolute_error, neg_max_abs_error, neg_mean_abs_error
-
-
random_state ( int , or None , default=None ) – Random seed used by AutoML. Suggested default:
7
-
n_algos_tuned ( int , or None , default=None ) –
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
-
To disable algorithm selection, set n_algos_tuned = len(model_list).
Suggested default: 1
-
-
model_list ( List [ str ] , or None , default=None ) –
Models that will be evaluated by the Pipeline. Users can specify built-in models by name (by default, all supported built-in models for a given task are used).
-
All models except VARMAX and DynFactor models are applicable when there is a single time series in y.
-
If you have multiple time series in y that you want to predict as a system, then the multivariate forecasting models VARMAX and DynFactor may be utilized.
-
When you have features or exogenous regressors that you know in advance for your forecast period, pass them into X.
Supported built-in models per task:
forecasting – NaiveForecaster, ThetaForecaster, ExpSmoothForecaster, ETSForecaster, STLwESForecaster, STLwARIMAForecaster, SARIMAXForecaster, VARMAXForecaster, DynFactorForecaster
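The guidance on univariate versus multivariate models can be turned into a helper that proposes a model_list. This is a sketch, not AutoMLx logic: candidate_forecasters is a hypothetical name, and restricting the multivariate case to VARMAXForecaster and DynFactorForecaster is an assumption drawn from the notes above rather than a documented rule.

```python
from typing import List

# Built-in forecasting models listed above, split by the kind of y they target.
UNIVARIATE_MODELS = [
    "NaiveForecaster",
    "ThetaForecaster",
    "ExpSmoothForecaster",
    "ETSForecaster",
    "STLwESForecaster",
    "STLwARIMAForecaster",
    "SARIMAXForecaster",
]
MULTIVARIATE_MODELS = ["VARMAXForecaster", "DynFactorForecaster"]


def candidate_forecasters(n_series: int) -> List[str]:
    """Suggest a model_list from the number of time series (columns) in y.

    Hypothetical helper for illustration only.
    """
    if n_series < 1:
        raise ValueError("y must contain at least one time series")
    if n_series == 1:
        return list(UNIVARIATE_MODELS)
    # Assumption: several series forecast as one system use the multivariate models.
    return list(MULTIVARIATE_MODELS)


print(candidate_forecasters(1))
print(candidate_forecasters(3))
```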
-
-
optimization ( int , or None , default=None ) –
Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.
-
Level 0: Optimized for reproducibility (controls most randomness)
-
Level 3: Optimized for speed and accuracy
-
Level 10: Optimized for speed
Suggested default:
3
-
-
preprocessing ( bool , or None , default=None ) – Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users. Most of the preprocessing can not be turned off for the forecasting task. Suggested default:
True
-
search_space ( dict , or None , default=None ) –
This parameter defines the search space for model tuning. This parameter is required for custom models. Dictionary keys are algorithm names (str) with the search space as the value. Each hyperparameter entry must have two keys: (1) ‘range’, a list containing the range, and (2) ‘type’, one of ‘continuous’, ‘discrete’, ‘categorical’. For example, to provide a custom tuning search space for ETSForecaster:
search_space = {
    'ETSForecaster': {
        'error': {'range': ['add', 'mul'], 'type': 'categorical'},
        'damped_trend': {'range': [True, False], 'type': 'categorical'},
    }
}
-
To disable model tuning for all models, set search_space = {}.
-
If a key value is an empty dictionary, then model tuning is disabled for that key.
-
If None, the default search space defined inside AutoML is used.
-
max_tuning_trials ( int , dict or None , default=None ) –
The maximum number of HPO trials; may be exceeded slightly.
-
If None: AutoML automatically determines when enough HPO trials have been completed.
-
If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials are performed in total.
-
If a dict: by passing a dictionary you can specify this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Missing values in the dictionary default to None.
Default value (if not previously set): None
-
search_strategy ( str ) – The search strategy used in model tuning. Valid search_strategy values: HyperGD, BruteForceSampler, CmaEsSampler, GridSampler, IntersectionSearchSpace, MOTPESampler, NSGAIISampler, NSGAIIISampler, PartialFixedSampler, QMCSampler, RandomSampler, TPESampler, intersection_search_space, nsgaii. Suggested default: 'HyperGD'
-
time_series_period ( int or None , default=None ) – The seasonality period to impose on the time series, regardless of whether it is detected in the data. If None, AutoML infers the seasonality by inspecting the training data; set this parameter to override that inference manually.
-
- fit ( self , y , X = None , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )
-
Automatically identifies the most relevant features, model and hyperparameters for the given training data (X) and target (y). Final model fit is conducted on the full dataset.
- Parameters :
-
-
y ( pandas.DataFrame ) – Training dataset target.
-
X ( pandas.DataFrame or None , default=None ) – A dataframe of explanatory variables that support the target time series in y. These must be known in advance for both the forecast period and the training period.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame or None , default=None ) – Validation dataset target
-
cv ( int , str or None , default='auto' ) –
Determines the cross-validation split. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
’auto’: uses 5 folds if number of instances < 1M, disable cv-folds otherwise
-
integer: specifies the number of folds in a (Stratified)KFold,
-
iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
-
col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’.
-
time_budget ( Dict [ str , float ] , float , default=-1 ) –
-
If float: time budget in seconds.
-
If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune.
-
-1 for unconstrained time budget: best-effort mode is enabled and optimization continues until convergence.
-
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- predict ( self , X )
-
Predict labels for features (X).
- Parameters :
-
X ( pandas.DataFrame ) – A dataframe of explanatory variables that support the target timeseries in y
- Raises :
-
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
-
AutoMLxRuntimeError – If there are no predictions after calling the model over the given dataset
-
AutoMLxRuntimeError – If result of time series numerical inverse transform is None
-
- Returns :
-
y_pred – A data frame containing the predicted values.
- Return type :
- forecast ( self , periods , alpha = 0.05 , X = None )
-
Forecast with the selected model.
If X is provided, it is a dataframe of explanatory variables that support the forecast for the requested number of periods, beginning from the last index in y. The index of X must continue from the index that was used in fit.
- Parameters :
-
-
periods ( int ) – The number of time steps to forecast from the end of the sample.
-
alpha ( float , default=0.05 ) – A significance level. To receive a prediction interval of 95% alpha must be set to 0.05.
-
X ( pandas.DataFrame , or None , default=None ) – A dataframe of explanatory variables that support the forecast for the requested number of periods. Columns must match the ones used in fit.
-
- Returns :
-
summary_frame – A dataframe with three columns listing prediction, ci_lower and ci_upper for the given confidence interval (ci) provided by level of alpha. Note: ci columns are excluded for models that don’t support intervals.
- Return type :
-
pandas.DataFrame
- Raises :
-
-
AutoMLxNotFittedError – If the pipeline is not fitted yet.
-
AutoMLxValueError – If explanatory variables are not provided, complete, or length of explanatory variables not equal to requested periods.
-
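The relationship between alpha, periods and X in forecast can be sketched as a validation step. This is illustrative only: check_forecast_request is a hypothetical helper, it raises a plain ValueError instead of AutoMLxValueError, and the needs_exog flag (whether the fitted model was trained with explanatory variables) is an assumption introduced for the example.

```python
from typing import Optional


def check_forecast_request(
    periods: int,
    alpha: float = 0.05,
    n_exog_rows: Optional[int] = None,
    needs_exog: bool = False,
) -> float:
    """Validate a forecast request and return the nominal interval coverage.

    Hypothetical helper for illustration only.
    """
    if periods < 1:
        raise ValueError("periods must be a positive integer")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if needs_exog and n_exog_rows is None:
        raise ValueError("explanatory variables X are required but were not provided")
    if n_exog_rows is not None and n_exog_rows != periods:
        raise ValueError("X must have exactly one row per forecast period")
    # A significance level alpha corresponds to a (1 - alpha) prediction interval.
    return 1.0 - alpha


coverage = check_forecast_request(periods=12, alpha=0.05)
print(f"{coverage:.0%} prediction interval over 12 periods")
```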
- score ( self , X , y )
-
Score of this pipeline for a given set of features (X) and labels (y). If inferred_score_metric has multiple score metrics, the first score metric will be calculated.
- Parameters :
-
-
X ( pd.DataFrame ) – Training dataset features
-
y ( pd.DataFrame , pd.Series ) – Training dataset target
-
- Raises :
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
- Returns :
-
score – Score of self.predict(X) with respect to y.
- Return type :
- transform ( self , X , y )
-
Apply automatic preprocessing to a given set of features (X) and labels (y).
- Parameters :
-
-
X ( pandas.DataFrame or None ) – Dataset features
-
y ( pandas.DataFrame , pandas.Series or None ) – Dataset timeseries
-
- Raises :
-
AutoMLxNotFittedError – If the pipeline is not fitted.
- Returns :
-
Transformed dataset features, transformed dataset timeseries
- Return type :
-
(pd.DataFrame or None, pd.DataFrame or pd.Series or None)
- plot_forecast ( self , summary_frame , show_y = True , show_pi = True , additional_frames = None )
-
Plot the forecasts.
- Parameters :
-
-
summary_frame ( pd.DataFrame ) – A dataframe containing columns mean, pi_lower (optional) and pi_upper (optional)
-
show_y ( bool , default=True ) – If True, plots training series y
-
show_pi ( bool , default=True ) – if True, plots Prediction Intervals (PI) when available
-
additional_frames ( dictionary of pd.DataFrame , optional ) – Plots the dataframes to the same axes, e.g., additional_frames = dict(label1=dataframe1, label2=dataframe2)
-
- Return type :
-
A plotly figure.
- Raises :
-
AutoMLxValueError – If summary dataframe column names are incorrect.