Large Model Support

Data Science Model Deployment and Model Catalog services now support large model deployments.

Large model artifacts can be stored in the Model Catalog service and used to create model deployments. The endpoint mapping feature lets you integrate inference containers, such as Text Generation Inference (TGI), even if they don't comply with the standard API contracts for the /predict and /health endpoints.
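
As an illustration, the endpoint mapping for a container that exposes non-standard routes is typically expressed through environment variables on the deployment's container configuration. The following minimal sketch assumes the MODEL_DEPLOY_PREDICT_ENDPOINT and MODEL_DEPLOY_HEALTH_ENDPOINT variable names used in the BYOC examples; confirm them against the current documentation before relying on them.

# Minimal sketch (Python): remap the service's /predict and /health contract
# onto a TGI container's native routes. The variable names are assumptions
# based on the BYOC examples.
environment_variables = {
    "MODEL_DEPLOY_PREDICT_ENDPOINT": "/generate",  # TGI text-generation route
    "MODEL_DEPLOY_HEALTH_ENDPOINT": "/health",     # TGI liveness route
}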

Creating a Model Deployment for Large Models

Model Deployment supports Bring Your Own Container (BYOC). Build and use a custom container as the runtime dependency when you create a model deployment. With custom containers, you can package system and language dependencies, install and configure inference servers, and set up different language runtimes, all within the defined interface boundaries of the model deployment resource that runs the containers. Because BYOC containers are portable between environments, you can migrate and deploy applications to OCI.
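
The following Python sketch outlines how a BYOC model deployment might be created with the OCI Python SDK. The class and field names reflect the SDK's data science models but should be treated as assumptions to verify against the current SDK reference; all OCIDs, the image path, the entrypoint, and the shape are placeholders.

import oci

# Minimal sketch: create a model deployment that runs a custom (BYOC) container.
# All OCIDs, the container image, the entrypoint, and the shape are placeholders.
config = oci.config.from_file()  # or use a resource/instance principal signer
ds_client = oci.data_science.DataScienceClient(config)

container_config = oci.data_science.models.OcirModelDeploymentEnvironmentConfigurationDetails(
    environment_configuration_type="OCIR_CONTAINER",
    image="<region>.ocir.io/<tenancy-namespace>/<repo>/tgi:latest",
    entrypoint=["text-generation-launcher"],
    server_port=8080,
    health_check_port=8080,
    environment_variables={
        # Endpoint mapping for a container that doesn't expose /predict and /health.
        "MODEL_DEPLOY_PREDICT_ENDPOINT": "/generate",
        "MODEL_DEPLOY_HEALTH_ENDPOINT": "/health",
    },
)

model_config = oci.data_science.models.ModelConfigurationDetails(
    model_id="<model-ocid>",
    instance_configuration=oci.data_science.models.InstanceConfiguration(
        instance_shape_name="VM.GPU.A10.2",  # placeholder GPU shape
    ),
    scaling_policy=oci.data_science.models.FixedSizeScalingPolicy(
        policy_type="FIXED_SIZE",
        instance_count=1,
    ),
    bandwidth_mbps=10,
)

deployment_details = oci.data_science.models.CreateModelDeploymentDetails(
    display_name="tgi-byoc-deployment",
    project_id="<project-ocid>",
    compartment_id="<compartment-ocid>",
    model_deployment_configuration_details=oci.data_science.models.SingleModelDeploymentConfigurationDetails(
        deployment_type="SINGLE_MODEL",
        model_configuration_details=model_config,
        environment_configuration_details=container_config,
    ),
)

response = ds_client.create_model_deployment(deployment_details)
print(response.data.id)

The same configuration can also be expressed through the Console or the ADS SDK; the SDK call simply makes the container image, ports, and endpoint mapping explicit.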

Changes to the Model Catalog

Create a model and save it to the model catalog by using the ADS SDK, the OCI Python SDK, or the Console. For more information, see Creating and Saving a Model to the Model Catalog and Large Model Artifacts. Large model cataloging uses the same export feature to save models in the model catalog, so the user experience is no different from the documented behavior.
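
For example, a large artifact can be staged through Object Storage during the save by passing a bucket URI, as in this minimal ADS sketch. The bucket, compartment, and project values are placeholders, and the exact parameters should be checked against the ADS documentation.

from ads.model.datascience_model import DataScienceModel

# Minimal sketch: save a large model artifact to the model catalog with ADS.
# The bucket_uri stages the artifact through Object Storage for export;
# all names and OCIDs below are placeholders.
model = (
    DataScienceModel()
    .with_compartment_id("<compartment-ocid>")
    .with_project_id("<project-ocid>")
    .with_display_name("my-large-llm")
    .with_artifact("./model_artifact")  # local directory with the model files
)
model.create(
    bucket_uri="oci://<bucket>@<namespace>/model-artifact-staging/",
    remove_existing_artifact=True,
)
print(model.id)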

Deployment of Large Models

Model Deployment is designed to support an array of machine learning inference frameworks, catering to the diverse needs of large model deployments. For Large Language Models (LLMs), OCI supports Text Generation Inference (TGI), NVIDIA Triton Inference Server, and vLLM, so you can select the framework that best fits the deployment requirements.

TGI's integration with OCI supports customized container use, enabling precise environment setups tailored to specific model behaviors and dependencies. For models that require intensive computational resources, especially AI and deep learning models, the NVIDIA Triton Inference Server offers a streamlined path on OCI. It helps manage GPU resources efficiently and supports a broad spectrum of machine learning frameworks such as TensorFlow, PyTorch, and ONNX. OCI's support for vLLM and NVIDIA Triton with TensorRT-LLM provides specialized optimizations for large language models. These frameworks gain performance through advanced optimization techniques, such as layer fusion and precision calibration, which are crucial for the heavy computational demands of large-scale language processing. By deploying these frameworks on OCI, you get high-throughput, low-latency inference, making them ideal for applications that require real-time language understanding and generation. More information on the deployment of each option follows:

Deploy Large Models Using Text Generation Inference (TGI)

For background information on deploying large models with TGI, see the Hugging Face website.

For the steps to deploy large models using TGI, see the documentation on GitHub.
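
As a quick sanity check after deployment, the model can be invoked with a signed HTTP request. The sketch below assumes TGI's /generate request shape and a placeholder deployment endpoint; adjust both to match the deployed container and the endpoint mapping in use.

import oci
import requests

# Minimal sketch: invoke a TGI-backed model deployment with a signed request.
# The endpoint OCID and region are placeholders; the payload follows TGI's
# /generate contract.
config = oci.config.from_file()
signer = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

endpoint = (
    "https://modeldeployment.<region>.oci.customer-oci.com/"
    "<model-deployment-ocid>/predict"
)
payload = {
    "inputs": "What is Oracle Cloud Infrastructure?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

response = requests.post(endpoint, json=payload, auth=signer)
print(response.json())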

Deploy Large Models Using NVIDIA's Triton Inference Server

Triton Inference Server is designed to streamline the deployment and management of large AI models, supporting several frameworks such as TensorFlow, PyTorch, and ONNX in a single, unified architecture. By using BYOC on Model Deployment, you can customize environments to optimize performance and resource usage according to specific project needs. This setup enhances the capabilities of Triton, making it ideal for deploying complex models efficiently and cost-effectively on OCI. Here is an example of deploying a Falcon TensorRT ensemble model with an NVIDIA Triton Inference Server using OCI Data Science Model Deployment's Bring Your Own Container support. The example is based on Triton's inflight_batcher_llm. The Falcon model TensorRT engine files need to be built using TensorRT-LLM/examples/falcon.

Follow the steps on GitHub to deploy large models using Triton TensorRT-LLM.
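
For context, once the ensemble is deployed, requests reach Triton through the deployment's /predict endpoint. The sketch below uses the resource principal signer available inside notebook sessions and jobs, and assumes the text_input/max_tokens request shape of the inflight_batcher_llm ensemble; the endpoint URL is a placeholder.

import oci
import requests

# Minimal sketch: query a Triton TensorRT-LLM ensemble behind a model
# deployment from a notebook session or job, using the resource principal
# signer. The field names follow the inflight_batcher_llm sample and can
# differ for other model repositories; the URL is a placeholder.
signer = oci.auth.signers.get_resource_principals_signer()

endpoint = (
    "https://modeldeployment.<region>.oci.customer-oci.com/"
    "<model-deployment-ocid>/predict"
)
payload = {
    "text_input": "Summarize the benefits of in-flight batching.",
    "max_tokens": 128,
    "bad_words": "",
    "stop_words": "",
}

response = requests.post(endpoint, json=payload, auth=signer)
print(response.json())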

Deploy Large Models Using the vLLM Inference Server

As AI applications increasingly rely on sophisticated language models, the need for efficient, high-performance inference servers has grown. vLLM is an open source library for fast LLM inference and serving. vLLM uses PagedAttention, an attention algorithm that efficiently manages attention keys and values in memory.
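
Before packaging a model into a serving container, vLLM can be exercised locally with its Python API, as in this minimal sketch. The model identifier is a placeholder, and the API shown is vLLM's offline-inference interface rather than anything specific to OCI.

from vllm import LLM, SamplingParams

# Minimal sketch: local offline inference with vLLM's Python API.
# The model identifier is a placeholder; use any model the hardware supports.
llm = LLM(model="<huggingface-model-id>")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(["What is PagedAttention?"], sampling_params)
for output in outputs:
    # Each RequestOutput carries the prompt and one or more completions.
    print(output.outputs[0].text)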

Follow the steps on GitHub for deploying large models using vLLM and for deploying Meta-Llama-3-8B-Instruct with the Oracle service-managed vLLM (0.3.0) container.