Quick Start
This section provides a quick introduction to training and evaluating a machine learning model with Oracle AutoMLx. We explore the options that Oracle AutoMLx exposes for customizing the training procedure, and then evaluate the model trained by AutoMLx.
Train a Model using AutoMLx
Here we show how easy it is to use the AutoMLx train_model API to quickly and automatically train a model for a machine learning problem. We pass the data, the name of the target column to predict, and the task to the train_model function. This function returns the best fully trained model that AutoMLx could find for the given dataset.
>>> from automlx import train_model
>>> model = train_model(
...     data="classification_train.csv",  # path to dataset file / pandas DataFrame object
...     target_to_predict="income_group",  # name of the target column in the dataset
...     task="classification",  # type of problem you are interested in solving
... )
That’s it! The model is fully trained and ready to be used to make predictions or to be deployed.
Inspect the Model’s Quality
But how well can you expect your model to work? You can see how well the model performs on the data used for training, as well as an estimate of how well it is expected to perform on new, unseen data. Both scores can be accessed using model.quality.
>>> model.quality
Evaluated on 2023-11-29 neg_log_loss
Measured quality on training data -0.377920
Estimate of future quality -0.243201
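The scores above are negative because the metric is a negated log loss, so values closer to zero indicate a better model. The following is a minimal standard-library sketch of that computation, assuming the metric follows the usual convention of scikit-learn's neg_log_loss scorer. The helper name and the probability values are hypothetical and are not part of AutoMLx.

```python
import math

def neg_log_loss(y_true, probabilities):
    """Return the negated mean log loss; values closer to 0 are better."""
    total = 0.0
    for label, row in zip(y_true, probabilities):
        # Clip the probability of the true class to avoid log(0).
        p = min(max(row[label], 1e-15), 1 - 1e-15)
        total += math.log(p)
    return total / len(y_true)

# Hypothetical predicted class probabilities for three rows and two classes.
y_true = [0, 1, 1]
confident = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
hesitant = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]

print(neg_log_loss(y_true, confident))
print(neg_log_loss(y_true, hesitant))
```

Both sets of predictions pick the correct class, but the more confident one scores closer to zero, which is why this metric rewards well-calibrated probabilities and not only correct labels.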
Make Predictions using the Model
We can also use the model to make predictions for new data. This will return a copy of your dataset that contains an additional column for predictions.
>>> data_with_prediction = model.predict("classification_train.csv")
>>> data_with_prediction.head(2)
age education sex income_group prediction for income_group
0 42 diploma female <=50k <=50k
1 57 bachelors male <=50k <=50k
We can also save the dataset with predictions by passing a CSV file path to the output parameter.
>>> data_with_prediction = model.predict("classification_train.csv", output='data_with_prediction.csv')
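Once predictions are saved, the file can be inspected with any CSV tool. The sketch below computes how often the prediction column agrees with the target column. It assumes the saved file is comma-separated with a header row and the same column names as the output shown above, which the source does not confirm. The in-memory file is a stand-in containing only the two rows printed above, and agreement_rate is a hypothetical helper.

```python
import csv
import io

# Stand-in for the saved file (assumed layout); the rows are the two shown above.
saved_csv = io.StringIO(
    "age,education,sex,income_group,prediction for income_group\n"
    "42,diploma,female,<=50k,<=50k\n"
    "57,bachelors,male,<=50k,<=50k\n"
)

def agreement_rate(handle, target, prediction):
    """Fraction of rows whose prediction column equals the target column."""
    rows = list(csv.DictReader(handle))
    matches = sum(1 for row in rows if row[target] == row[prediction])
    return matches / len(rows)

rate = agreement_rate(saved_csv, "income_group", "prediction for income_group")
print(rate)  # 1.0
```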
Evaluate the Quality of a Model on a New Dataset
We can evaluate the model’s quality on a new dataset using the evaluate_model_quality function. We just need to pass the model and the desired dataset to this function.
>>> from automlx import evaluate_model_quality
>>> score = evaluate_model_quality(model, "classification_test.csv")
>>> score
neg_log_loss
classification_test.csv -0.450856
Save a model
Once we are satisfied with the model, we can save it using the save method, by passing the desired path.
>>> model.save('model.amlx')
Load a model
We can also load a saved model using the load_model function by providing the path to the model.
>>> from automlx import load_model
>>> loaded_model = load_model('model.amlx')
Advanced (scikit-learn-like) AutoMLx API
This section provides a quick introduction to training a classifier using the more advanced, scikit-learn-like API from AutoMLx. This API offers all the same features as the train_model function, but allows for more advanced customization and configurability.
The dataset is a multi-class classification dataset. More details about the dataset can be found at Iris dataset. We first demonstrate the preliminary steps required to train a model with the Oracle AutoMLx tool, and then explain the tuned model.
Load dataset
We start by reading in the dataset from scikit-learn.
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> df = pd.DataFrame(data['data'], columns=data['feature_names'])
>>> y = pd.Series(data['target'])
This toy dataset only contains numerical data. We now split the features (X) and the target (y) into a training set (70%) and a test set (30%). The training set will be used to create a machine learning model using AutoMLx, and the test set will be used to evaluate the model’s performance on unseen data.
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(df,
...                                                     y,
...                                                     train_size=0.7,
...                                                     random_state=0)
>>> X_train.shape, X_test.shape
((105, 4), (45, 4))
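The shapes above follow directly from the 70/30 split of the 150 Iris rows. The sketch below shows the idea with the standard library only: shuffle the row indices with a fixed seed, then cut them at the training fraction. It is an illustration of the concept, not scikit-learn's implementation, so the selected rows will differ from those chosen by train_test_split. The function split_indices is a hypothetical helper.

```python
import random

def split_indices(n_rows, train_size, seed):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_train = round(train_size * n_rows)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(n_rows=150, train_size=0.7, seed=0)
print(len(train_idx), len(test_idx))  # 105 45
```

Every row lands in exactly one of the two parts, so the test set stays unseen during training.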
Set the AutoMLx engine
AutoMLx offers the init() function, which initializes the parallel engine. By default, the AutoMLx pipeline uses the dask parallel engine. One can also set the engine to local, which uses Python’s multiprocessing library for parallelism instead.
>>> import automlx
>>> from automlx import init
>>> init(engine='local')
[2023-01-12 05:48:31,814] [automlx.xengine] Local ProcessPool execution (n_jobs=36)
Train a model using AutoMLx
The Oracle AutoMLx solution provides a pipeline that automatically finds a tuned model given a prediction task and a training dataset. In particular, it can find a tuned model for any supervised prediction task, e.g. classification or regression, where the target can be binary, categorical, or real-valued.
AutoMLx consists of five main modules:
- Preprocessing: Clean, impute, engineer, and normalize features.
- Algorithm Selection: Identify the right classification algorithm for a given dataset.
- Adaptive Sampling: Select a subset of the data samples for the model to be trained on.
- Feature Selection: Select a subset of the data features, based on the previously selected model.
- Hyperparameter Tuning: Find the right model parameters that maximize score for the given dataset.
All these pieces are readily combined into a simple AutoMLx pipeline which automates the entire Machine Learning process with minimal user input/interaction.
The AutoMLx API is quite simple to work with. We create a Pipeline instance. Next, the training data is passed to the fit() function, which executes the previously mentioned steps.
>>> est = automlx.Pipeline(task='classification')
>>> est.fit(X_train, y_train)
Pipeline()
The fitted pipeline (est) can now be used for prediction tasks. Here, we use the macro-averaged F1 score to evaluate the performance of this model on unseen data (X_test).
>>> from sklearn.metrics import f1_score
>>> y_pred = est.predict(X_test)
>>> score_default = f1_score(y_test, y_pred, average='macro')
>>> print(f'Score on test data : {score_default}')
Score on test data : 0.975983436853002
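With average='macro', the F1 score is computed separately for each class and the per-class scores are then averaged with equal weight. The standard-library sketch below shows that calculation on a small set of hypothetical labels (not the Iris predictions). The function macro_f1 is an illustrative helper, not an AutoMLx or scikit-learn API.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treat an empty class as 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for a three-class problem with one mistake.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(macro_f1(y_true, y_pred))
```

Because every class counts equally, a mistake on a rare class lowers the macro average as much as a mistake on a common one.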
The Pipeline can also fit regression, forecasting, and anomaly detection models. Please check out the rest of the documentation for more details about advanced configuration parameters.
Explain a classifier
For a variety of decision-making tasks, getting only a prediction as model output is not sufficient. A user may wish to know why the model outputs that prediction, or which data features are relevant for that prediction. For that purpose, the Oracle AutoMLx solution defines the MLExplainer object, which can compute a variety of model explanations for any AutoMLx-trained pipeline or scikit-learn-like model. MLExplainer takes as arguments the trained model, the training data and labels, and the task.
>>> explainer = automlx.MLExplainer(est,
...                                 X_train,
...                                 y_train,
...                                 task="classification")
Let’s explain the model’s performance (relative to the provided train labels) using Global Feature Importance. This technique measures how much the model’s performance would change if a given feature were dropped from the dataset, without retraining the model. This notion of feature importance considers each feature independently from all other features.
The explain_model() method computes these feature importances. It also provides 95% confidence intervals for each feature importance attribution.
>>> result_explain_model_default = explainer.explain_model()
>>> result_explain_model_default.to_dataframe()
feature attribution upper_bound lower_bound
0 petal width (cm) 0.350644 0.416850 0.284437
1 petal length (cm) 0.272190 0.309005 0.235374
2 sepal length (cm) 0.000000 0.000000 0.000000
3 sepal width (cm) 0.000000 0.000000 0.000000
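One common way to estimate the effect of dropping a feature without retraining is permutation importance: shuffle one column, re-score the model, and record the drop in performance. The sketch below illustrates that idea with the standard library. It is an assumption that this resembles what explain_model() does internally, and the toy model, data, and function names are all hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the given label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical model that only looks at feature 0 and ignores feature 1.
def toy_model(row):
    return int(row[0] > 0.5)

data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)
```

A feature the model never reads receives an importance of exactly zero under this scheme, which is one plausible reading of the zero attributions for the two sepal features in the table above. Repeating the shuffle several times is also what makes it possible to attach an interval to each estimate.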
The Oracle AutoMLx solution offers advanced configuration options and allows one to change the effect of feature interactions and interaction evaluations. It also provides other model and prediction explanation techniques, such as:
- Local feature importance, for example, using Kernel SHAP or an enhanced LIME;
- Feature Dependence Explanations, such as partial dependence plots or accumulated local effects;
- Interactive What-IF explainers, which let users explore a model’s predictions; and
- Counterfactual explanations, which show how to change a row to obtain a desired outcome.
Please check out the MLExplainer documentation for more details.