Quick Start 

This section provides a quick introduction to training and evaluating a machine learning model using Oracle AutoMLx. We first train a model, exploring the options that AutoMLx exposes for customizing the training procedure, and then evaluate the trained model.

Train a Model using AutoMLx 

Here we show how easy it is to use the AutoMLx train_model API to quickly and automatically train a model for a machine learning problem. We pass the data, the name of the target to predict, and the task to the train_model function. This function returns the best fully trained model that AutoMLx could find for the given dataset.

>>> from automlx import train_model
>>> model = train_model(
...     data="classification_train.csv",  # path to dataset file / pandas DataFrame object
...     target_to_predict="income_group",  # name of the target column in the dataset
...     task="classification",  # type of problem you are interested in solving
... )

           

That’s it! The model is fully trained and ready to be used to make predictions or to be deployed.

Inspect the Model’s Quality 

But how well can you expect your model to work? You can see how well your model performs on the data used for training, as well as an estimate of how well it should perform on new, unseen data. Both scores can be accessed using model.quality.

>>> model.quality
Evaluated on 2023-11-29            neg_log_loss
Measured quality on training data   -0.377920
Estimate of future quality          -0.243201
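
The neg_log_loss column reports a negated log loss, so values closer to zero indicate better-calibrated predictions. Assuming AutoMLx follows the usual convention for this metric name, the following plain-Python sketch shows how such a score can be computed from predicted class probabilities. It is not part of AutoMLx; the helper name and the toy probabilities are illustrative assumptions.

```python
import math


def neg_log_loss(true_labels, predicted_probabilities):
    """Return the negated mean log loss (closer to 0 is better).

    predicted_probabilities[i] maps each class label to the probability
    predicted for row i (hypothetical input format for this sketch).
    """
    eps = 1e-15  # clip probabilities so that log(0) never occurs
    total = 0.0
    for label, probs in zip(true_labels, predicted_probabilities):
        p = min(max(probs[label], eps), 1 - eps)
        total += math.log(p)  # log-probability assigned to the true class
    return total / len(true_labels)


# Toy data: three rows with made-up predicted probabilities.
labels = ["<=50k", ">50k", "<=50k"]
probabilities = [
    {"<=50k": 0.9, ">50k": 0.1},
    {"<=50k": 0.4, ">50k": 0.6},
    {"<=50k": 0.7, ">50k": 0.3},
]
score = neg_log_loss(labels, probabilities)
print(f"neg_log_loss = {score:.4f}")
```

A confident wrong prediction is penalized much more heavily than a hesitant one, which is why this metric rewards well-calibrated probabilities rather than just correct labels.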

           

Make Predictions using the Model 

We can also use the model to make predictions for new data. This will return a copy of your dataset that contains an additional column for predictions.

>>> data_with_prediction = model.predict("classification_train.csv")
>>> data_with_prediction.head(2)
    age     education    sex     income_group     prediction for income_group
0    42      diploma    female      <=50k                  <=50k
1    57     bachelors    male       <=50k                  <=50k

           

We can also save the dataset with predictions by passing a CSV file path to the output parameter.

>>> data_with_prediction = model.predict("classification_train.csv", output='data_with_prediction.csv')

           

Evaluate the Quality of a Model on a New Dataset 

We can evaluate the model’s quality on a new dataset using the evaluate_model_quality function. We just need to pass the model and the desired dataset to this function.

>>> from automlx import evaluate_model_quality
>>> score = evaluate_model_quality(model, "classification_test.csv")
>>> score
                        neg_log_loss
classification_test.csv  -0.450856

           

Save a model 

Once we are satisfied with the model, we can save it using the save method, by passing the desired path.

>>> model.save('model.amlx')

           

Load a model 

We can also load a saved model using the load_model function by providing the path to the model.

>>> from automlx import load_model
>>> loaded_model = load_model('model.amlx')

Advanced (scikit-learn-like) AutoMLx API 

This section provides a quick introduction to training a classifier using the more advanced, scikit-learn-like API from AutoMLx. This API offers all the same features as the train_model function, but allows for more advanced customization and configurability.

The dataset is the Iris dataset, a multi-class classification dataset. More details are available in the Iris dataset description. We demonstrate the preliminary steps required to train a model with the Oracle AutoMLx tool, and then explain the tuned model.

Load dataset 

We start by reading in the dataset from scikit-learn.

>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> df = pd.DataFrame(data['data'], columns=data['feature_names'])
>>> y = pd.Series(data['target'])

           

This toy dataset only contains numerical data. We now split the features (X) and the targets (y) into training (70%) and test (30%) sets. The training set will be used to create a machine learning model using AutoMLx, and the test set will be used to evaluate the model’s performance on unseen data.

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(df,
...                                                     y,
...                                                     train_size=0.7,
...                                                     random_state=0)
>>> X_train.shape, X_test.shape
((105, 4), (45, 4))

           

Set the AutoMLx engine 

AutoMLx offers the init() function, which initializes the parallel engine. By default, the AutoMLx pipeline uses the Dask parallel engine. One can also set the engine to local, which uses Python’s multiprocessing library for parallelism instead.

>>> import automlx
>>> from automlx import init
>>> init(engine='local')
[2023-01-12 05:48:31,814] [automlx.xengine] Local ProcessPool execution (n_jobs=36)

Train a model using AutoMLx 

The Oracle AutoMLx solution provides a pipeline that automatically finds a tuned model given a prediction task and a training dataset. In particular, it can find a tuned model for any supervised prediction task, e.g., classification or regression, where the target can be binary, categorical, or real-valued.

AutoMLx consists of five main modules:

Preprocessing: Clean, impute, engineer, and normalize features.
Algorithm Selection: Identify the right classification algorithm for a given dataset.
Adaptive Sampling: Select a subset of the data samples for the model to be trained on.
Feature Selection: Select a subset of the data features, based on the previously selected model.
Hyperparameter Tuning: Find the right model parameters that maximize score for the given dataset.

All these pieces are readily combined into a simple AutoMLx pipeline which automates the entire Machine Learning process with minimal user input/interaction.

The AutoMLx API is quite simple to work with. We create a Pipeline instance. Next, the training data is passed to the fit() function which executes the previously mentioned steps.

>>> est = automlx.Pipeline(task='classification')
>>> est.fit(X_train, y_train)
Pipeline()

           

The fitted pipeline (est) can then be used for prediction tasks. Here, we use the macro-averaged F1 score to evaluate the performance of this model on unseen data (X_test).

>>> from sklearn.metrics import f1_score
>>> y_pred = est.predict(X_test)
>>> score_default = f1_score(y_test, y_pred, average='macro')
>>> print(f'Score on test data : {score_default}')
Score on test data : 0.975983436853002
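
With average='macro', the F1 score is computed separately for each class and the per-class scores are then averaged without weighting, so every class counts equally regardless of its size. The plain-Python sketch below illustrates that calculation on made-up labels; the helper name macro_f1 is hypothetical and is shown only to make the metric concrete.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_per_class = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return sum(f1_per_class) / len(f1_per_class)


# Toy labels for a three-class problem with one misclassified row.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred)
print(f"Macro F1: {score:.4f}")
```

Because each class contributes equally, a single error in a small class lowers the macro score more than it would lower plain accuracy.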

           

The Pipeline can also fit regression, forecasting and anomaly detection models. Please check out the rest of the documentation for more details about advanced configuration parameters.

Explain a classifier 

For a variety of decision-making tasks, getting only a prediction as model output is not sufficient. A user may wish to know why the model outputs that prediction, or which data features are relevant for that prediction. For that purpose, the Oracle AutoMLx solution defines the MLExplainer object, which can compute a variety of model explanations for any AutoMLx-trained pipeline or scikit-learn-like model. MLExplainer takes as arguments the trained model, the training data and labels, and the task.

>>> explainer = automlx.MLExplainer(est,
...                                 X_train,
...                                 y_train,
...                                 task="classification")

           

Let’s explain the model’s performance (relative to the provided training labels) using Global Feature Importance. This technique measures how much the model’s performance would change if a given feature were dropped from the dataset, without retraining the model. This notion of feature importance considers each feature independently of all other features.

The explain_model() method computes such feature importances. It also provides a 95% confidence interval for each feature importance attribution.

>>> result_explain_model_default = explainer.explain_model()
>>> result_explain_model_default.to_dataframe()
             feature  attribution  upper_bound  lower_bound
0   petal width (cm)     0.350644     0.416850     0.284437
1  petal length (cm)     0.272190     0.309005     0.235374
2  sepal length (cm)     0.000000     0.000000     0.000000
3   sepal width (cm)     0.000000     0.000000     0.000000
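
One common way to estimate this kind of importance without retraining is permutation importance: shuffle a single feature column, re-score the model, and record how much the score drops. The sketch below is a minimal plain-Python illustration of that idea and is not AutoMLx’s implementation. The toy model, the random data, the helper names, and the normal-approximation interval over repeated shuffles are all assumptions made for the example.

```python
import random
import statistics


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Return (attribution, upper_bound, lower_bound) for each feature."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    results = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        mean = statistics.mean(drops)
        # Approximate 95% interval for the mean drop (normal approximation).
        half_width = 1.96 * statistics.stdev(drops) / len(drops) ** 0.5
        results.append((mean, mean + half_width, mean - half_width))
    return results


def toy_model(row):
    """Hypothetical classifier that only looks at feature 0."""
    return int(row[0] > 0.5)


# Synthetic data: the label depends on feature 0 only; feature 1 is noise.
data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

importances = permutation_importance(toy_model, rows, labels)
for name, (attribution, upper, lower) in zip(["feature_0", "feature_1"], importances):
    print(f"{name}: {attribution:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

A feature that the model never uses keeps an attribution of exactly zero under this scheme, which is one plausible reading of the zero rows for the sepal features in the table above.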

           

The Oracle AutoMLx solution offers advanced configuration options and allows one to change the effect of feature interactions and interaction evaluations. It also provides other model and prediction explanation techniques, such as:

Local feature importance, for example, using Kernel SHAP or an enhanced LIME;

Feature Dependence Explanations, such as partial dependence plots or accumulated local effects;

Interactive What-If explainers, which let users explore a model’s predictions; and

Counterfactual explanations, which show how to change a row to obtain a desired outcome.

Please check out the MLExplainer documentation for more details.