{ "cells": [ { "cell_type": "markdown", "id": "322dd88a", "metadata": {}, "source": [ "***\n", "# Building an Image Classifier using AutoMLx\n", "
by the Oracle AutoMLx Team
\n", "\n", "***" ] }, { "cell_type": "markdown", "id": "6d8cbf4b", "metadata": {}, "source": [ "Image Classification Demo Notebook.\n", "\n", "Copyright © 2025, Oracle and/or its affiliates.\n", "\n", "Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/" ] }, { "cell_type": "markdown", "id": "faf204cc", "metadata": {}, "source": [ "## Overview of this Notebook\n", "\n", "In this notebook we will build an image classifier using the Oracle AutoMLx tool for the public PneumoniaMNIST dataset, which is part of the MedMNIST collection of datasets. The dataset is a binary classification dataset (normal vs. pneumonia chest X-ray images), and more details about the dataset can be found at https://medmnist.com/.\n", "We explore the various options provided by the Oracle AutoMLx tool, allowing the user to exercise control over the AutoML training process. We then evaluate the different models trained by AutoML.\n", "\n", "---\n", "## Prerequisites\n", "\n", " - Experience level: Novice (Python and Machine Learning)\n", " - Professional experience: Some industry experience\n", "---\n", "\n", "## Business Use\n", "\n", "Data analytics and modeling problems using Machine Learning (ML) are becoming popular and often rely on data science expertise to build accurate ML models. Such modeling tasks primarily involve the following steps:\n", "- Pick an appropriate model for the given dataset and prediction task at hand.\n", "- Tune the chosen model’s hyperparameters for the given dataset.\n", "\n", "Both of these steps are time-consuming and rely heavily on data science expertise. To make the problem harder, the best model and hyperparameter choices vary widely with the dataset and the prediction task. Hence, there is no one-size-fits-all solution to achieve reasonably good model performance. 
Using a simple Python API, AutoMLx can quickly jump-start the data science process with an accurately tuned model for a given prediction task.\n", "\n", "## Table of Contents\n", "\n", "- Setup\n", "- Load the PneumoniaMNIST dataset\n", "- AutoML\n", " - Setting the execution engine\n", " - Create an Instance of Oracle AutoMLx\n", " - Train a Model using AutoML\n", " - Analyze the AutoML optimization process \n", " - Algorithm Selection\n", " - Adaptive Sampling\n", " - Model Tuning\n", " - Confusion Matrix\n", " - Advanced AutoML Configuration \n", "- References" ] }, { "cell_type": "markdown", "id": "25354007", "metadata": {}, "source": [ "\n", "## Setup\n", "\n", "Basic setup for the Notebook." ] }, { "cell_type": "code", "execution_count": 1, "id": "24950533", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:15:17.928127Z", "iopub.status.busy": "2025-04-25T10:15:17.927799Z", "iopub.status.idle": "2025-04-25T10:15:18.665789Z", "shell.execute_reply": "2025-04-25T10:15:18.665103Z" } }, "outputs": [], "source": [ "\n", "%matplotlib inline\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "id": "7c105ab0", "metadata": {}, "source": [ "Load the required modules." 
] }, { "cell_type": "code", "execution_count": 2, "id": "7a3b26ba", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:15:18.667923Z", "iopub.status.busy": "2025-04-25T10:15:18.667629Z", "iopub.status.idle": "2025-04-25T10:15:21.742265Z", "shell.execute_reply": "2025-04-25T10:15:21.741594Z" }, "lines_to_next_cell": 0 }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import plotly.express as px\n", "from sklearn.metrics import balanced_accuracy_score, roc_auc_score\n", "from datasets import load_dataset\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Settings for plots\n", "plt.rcParams['figure.figsize'] = [4, 3]\n", "plt.rcParams['font.size'] = 15\n", "\n", "import automlx" ] }, { "cell_type": "markdown", "id": "168a2cd3", "metadata": {}, "source": [ "\n", "## Load the PneumoniaMNIST dataset\n", "We start by reading in the dataset from Hugging Face." ] }, { "cell_type": "code", "execution_count": 3, "id": "b4da04c9", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:15:21.744596Z", "iopub.status.busy": "2025-04-25T10:15:21.744172Z", "iopub.status.idle": "2025-04-25T10:15:23.306705Z", "shell.execute_reply": "2025-04-25T10:15:23.306165Z" } }, "outputs": [], "source": [ "dataset = load_dataset(\"albertvillanova/medmnist-v2\", \"pneumoniamnist\")" ] }, { "cell_type": "markdown", "id": "fedcf8c9", "metadata": {}, "source": [ "Let's look at a few of the values in the data." ] }, { "cell_type": "code", "execution_count": 4, "id": "1bababf0", "metadata": { "execution": { "iopub.execute_input": "2025-04-25T10:15:23.308953Z", "iopub.status.busy": "2025-04-25T10:15:23.308481Z", "iopub.status.idle": "2025-04-25T10:15:23.359988Z", "shell.execute_reply": "2025-04-25T10:15:23.359486Z" } }, "outputs": [ { "data": { "text/plain": [ "{'image': [\n", " |
---|
(1000, 1) |
None |
KFoldSplit(Shuffle=True, Seed=7, folds=5, stratify by=target) |
balanced_accuracy |
ResNet |
{'optimizer_class': 'Adam', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.0031630600334930504, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '18'} |
25.2.1 |
3.9.21 (main, Dec 11 2024, 16:24:11) [GCC 11.2.0] |
Step | # Samples | # Features | Algorithm | Hyperparameters | Score (balanced_accuracy) | All Metrics | Runtime (Seconds) | Memory Usage (GB) | Finished |
---|---|---|---|---|---|---|---|---|---|
Model Selection | {5: 800, 2: 800, 3: 800, 1: 800, 4: 800} | 1 | ResNet | {'optimizer_class': 'Adam', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.001, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '18'} | 0.9429 | {'balanced_accuracy': 0.9429372134190274} | 566.4878 | 0.8595 | Fri Apr 25 03:17:38 2025 |
Model Selection | {1: 800, 3: 800, 5: 800, 2: 800, 4: 800} | 1 | EfficientNet | {'optimizer_class': 'Adam', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.001, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': 'b0'} | 0.8153 | {'balanced_accuracy': 0.8153082390588295} | 734.1541 | 0.8685 | Fri Apr 25 03:18:12 2025 |
Model Tuning | {4: 800, 3: 800, 2: 800, 1: 800, 5: 800} | 1 | ResNet | {'optimizer_class': 'Adam', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.0031630600334930504, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '18'} | 0.9421 | {'balanced_accuracy': 0.9420862061013219} | 1679.7299 | 0.9625 | Fri Apr 25 03:27:58 2025 |
Model Tuning | {4: 800, 3: 800, 1: 800, 5: 800, 2: 800} | 1 | ResNet | {'optimizer_class': 'Adam', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.001, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '18'} | 0.9409 | {'balanced_accuracy': 0.9409403911576797} | 1654.4550 | 0.8447 | Fri Apr 25 03:24:15 2025 |
Model Tuning | {4: 800, 3: 800, 5: 800, 2: 800, 1: 800} | 1 | ResNet | {'optimizer_class': 'Adam', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.001000099, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '18'} | 0.94 | {'balanced_accuracy': 0.9399735188445627} | 1676.6548 | 0.8474 | Fri Apr 25 03:24:22 2025 |
Model Tuning | {1: 800, 2: 800, 3: 800, 4: 800, 5: 800} | 1 | ResNet | {'optimizer_class': 'Adam', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.0031631590334930504, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '18'} | 0.932 | {'balanced_accuracy': 0.9320400827675271} | 1606.3738 | 0.9188 | Fri Apr 25 03:28:28 2025 |
Model Tuning | {1: 800, 2: 800, 3: 800, 4: 800, 5: 800} | 1 | ResNet | {'optimizer_class': 'Adam', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.001, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '34'} | 0.9319 | {'balanced_accuracy': 0.9318589293978995} | 1835.8036 | 1.0745 | Fri Apr 25 03:31:47 2025 |
Model Tuning | {1: 800, 2: 800, 3: 800, 4: 800, 5: 800} | 1 | ResNet | {'optimizer_class': 'Adagrad', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.001, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '18'} | 0.9304 | {'balanced_accuracy': 0.9303550701472287} | 1599.9909 | 0.8875 | Fri Apr 25 03:28:47 2025 |
Model Tuning | {1: 800, 3: 800, 2: 800, 5: 800, 4: 800} | 1 | ResNet | {'optimizer_class': 'Adam', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.000100198, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '18'} | 0.9215 | {'balanced_accuracy': 0.9215387221291802} | 1473.4144 | 0.8585 | Fri Apr 25 03:23:27 2025 |
Model Tuning | {1: 800, 2: 800, 3: 800, 4: 800, 5: 800} | 1 | ResNet | {'optimizer_class': 'Adam', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.000100099, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '18'} | 0.921 | {'balanced_accuracy': 0.9210114051871254} | 1514.9727 | 0.9292 | Fri Apr 25 03:23:26 2025 |
Model Tuning | {1: 800, 3: 800, 2: 800, 4: 800, 5: 800} | 1 | ResNet | {'optimizer_class': 'RMSprop', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.001, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '18'} | 0.8965 | {'balanced_accuracy': 0.8965285352342509} | 1535.3766 | 0.8253 | Fri Apr 25 03:28:50 2025 |
Model Tuning | {1: 800, 2: 800, 3: 800, 4: 800, 5: 800} | 1 | ResNet | {'optimizer_class': 'SGD', 'shuffle_dataset_each_epoch': True, 'optimizer_params': {}, 'criterion_class': None, 'criterion_params': {}, 'scheduler_class': None, 'scheduler_params': {}, 'batch_size': 128, 'lr': 0.001, 'epochs': 18, 'input_transform': 'auto', 'tensorboard_dir': None, 'use_tqdm': None, 'prediction_batch_size': 128, 'prediction_input_transform': 'auto', 'shuffling_buffer_size': None, 'freeze.encoder': False, 'load.encoder': None, 'size': '18'} | 0.8746 | {'balanced_accuracy': 0.874649502424656} | 1438.3949 | 0.7970 | Fri Apr 25 03:29:03 2025 |