Large Model Support
The Data Science Model Deployment and Model Catalog services now support large model deployments. Large model artifacts can be stored in the Model Catalog service and used to create model deployments. The endpoint mapping feature lets you integrate inference containers, such as Text Generation Inference (TGI), even if they don't comply with the standard API contracts for the /predict and /health endpoints.
Creating a Model Deployment for Large Models
Model deployment supports Bring Your Own Container (BYOC). You build a custom container and use it as the runtime dependency when you create a model deployment. With custom containers, you can package system and language dependencies, install and configure inference servers, and set up different language runtimes, all within the defined interface boundaries of a model deployment resource that runs the containers. BYOC also means you can move containers between environments, so you can migrate and deploy applications to OCI.
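The following is a minimal sketch of creating a BYOC model deployment with the OCI Python SDK. The OCIDs, container image path, shape, and ports are placeholders, and exact model class names can vary between SDK versions, so treat this as an outline rather than a definitive implementation.

```python
import oci

# Authenticate with the default config file (~/.oci/config); adjust as needed.
config = oci.config.from_file()
ds_client = oci.data_science.DataScienceClient(config)

# Placeholder OCIDs and image path -- substitute your own values.
COMPARTMENT_ID = "ocid1.compartment.oc1..example"
PROJECT_ID = "ocid1.datascienceproject.oc1..example"
MODEL_ID = "ocid1.datasciencemodel.oc1..example"  # large model saved in the Model Catalog
IMAGE = "<region>.ocir.io/<tenancy-namespace>/tgi-serving:latest"  # custom inference container

# BYOC environment: point the deployment at the custom container image
# and map the container's serving and health-check ports.
env_config = oci.data_science.models.OcirModelDeploymentEnvironmentConfigurationDetails(
    environment_configuration_type="OCIR_CONTAINER",
    image=IMAGE,
    server_port=8080,
    health_check_port=8080,
)

model_config = oci.data_science.models.ModelConfigurationDetails(
    model_id=MODEL_ID,
    instance_configuration=oci.data_science.models.InstanceConfiguration(
        instance_shape_name="VM.GPU.A10.1",  # pick a GPU shape sized for the model
    ),
)

deployment_config = oci.data_science.models.SingleModelDeploymentConfigurationDetails(
    deployment_type="SINGLE_MODEL",
    model_configuration_details=model_config,
    environment_configuration_details=env_config,
)

details = oci.data_science.models.CreateModelDeploymentDetails(
    display_name="large-model-byoc-deployment",
    compartment_id=COMPARTMENT_ID,
    project_id=PROJECT_ID,
    model_deployment_configuration_details=deployment_config,
)

response = ds_client.create_model_deployment(details)
print(response.data.id)
```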
Changes to the Model Catalog
- We recommend that you create and save models to the model catalog programmatically, using ADS or the OCI Python SDK.
- You can use ADS to create large models. Large model artifacts can be up to 400 GB, as shown in the sketch after this list.
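Below is a minimal sketch of saving a large model artifact with ADS. The estimator, artifact directory, conda environment slug, and Object Storage bucket URI are placeholder assumptions; ADS uses the bucket_uri argument to stage artifacts that are too large to upload directly.

```python
import ads
from ads.model.generic_model import GenericModel

# Authenticate via resource principal inside a notebook session or job;
# switch to "api_key" when running locally.
ads.set_auth("resource_principal")

# `my_estimator` is a placeholder for the trained model object being cataloged.
model = GenericModel(estimator=my_estimator, artifact_dir="./large_model_artifact")
model.prepare(
    inference_conda_env="generalml_p311_cpu_x86_64_v1",  # example conda environment slug
    force_overwrite=True,
)

# For large artifacts, ADS stages the files through an Object Storage bucket
# (placeholder URI below) instead of uploading them directly to the catalog.
model.save(
    display_name="my-large-model",
    bucket_uri="oci://<bucket>@<namespace>/large-model-artifacts/",
    remove_existing_artifact=True,
)
```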
Deployment of Large Models
Model Deployment is designed to support an array of machine learning inference frameworks, catering to the diverse needs of large model deployments. Among these, OCI supports Text Generation Inference (TGI), NVIDIA Triton Inference Server, and vLLM for Large Language Models (LLMs). This support lets you select the framework that best fits your deployment requirements.

TGI's integration with OCI supports customized container use, enabling precise environment setups tailored to specific model behaviors and dependencies.

For models requiring intensive computational resources, especially AI and deep learning models, the NVIDIA Triton Inference Server offers a streamlined path on OCI. It helps with the efficient management of GPU resources and supports a broad spectrum of machine learning frameworks such as TensorFlow, PyTorch, and ONNX.

OCI's handling of vLLM and NVIDIA Triton TensorRT-LLM provides specialized optimizations for large language models. These frameworks benefit from advanced optimization techniques, such as layer fusion and precision calibration, which are crucial for handling the heavy computational demands of large-scale language processing tasks. By deploying these frameworks on OCI, you get high-throughput, low-latency inference, which makes them ideal for applications that require real-time language understanding and generation.

More information on the deployment of each option follows:
For background information on deploying large models with TGI, see the HuggingFace website.
For the steps to deploy large models using TGI, see the documentation on GitHub.
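As an illustration, the following sketch invokes a TGI-based model deployment through its /predict endpoint, signing the request with the OCI Python SDK. The endpoint URL and generation parameters are placeholders; the request body shape (inputs plus parameters) follows TGI's generate API.

```python
import oci
import requests

# Sign requests with the credentials from the default OCI config file.
config = oci.config.from_file()
auth = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

# Placeholder endpoint; copy the real URI from the model deployment details page.
endpoint = "https://modeldeployment.<region>.oci.customer-oci.com/<model-deployment-ocid>/predict"

# TGI-style generation request.
body = {
    "inputs": "What is Oracle Cloud Infrastructure?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}

response = requests.post(endpoint, json=body, auth=auth)
print(response.json())
```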
Triton Inference Server is designed to streamline the deployment and management of large AI models, supporting several frameworks such as TensorFlow, PyTorch, and ONNX in a single, unified architecture. By using BYOC on Model Deployment, you can customize environments to optimize performance and resource usage according to specific project needs. This setup enhances the capabilities of Triton, making it ideal for deploying complex models efficiently and cost-effectively on OCI. Here is an example of deploying a Falcon TensorRT ensemble model with the NVIDIA Triton Inference Server using OCI Data Science Model Deployment's Bring Your Own Container support. The example is based on Triton's inflight_batcher_llm. The Falcon model TensorRT engine files need to be built using TensorRT-LLM/examples/falcon.
Follow the steps on GitHub to deploy large models using Triton TensorRT-LLM.
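As a rough illustration of what a request to a Triton-backed deployment can look like, the sketch below posts a KServe v2-style inference payload to the deployment's /predict endpoint. The tensor names, shapes, datatypes, and endpoint URL are placeholder assumptions that depend on the ensemble's configuration; consult the GitHub steps for the exact contract.

```python
import oci
import requests

# Same request-signing setup as in the TGI example.
config = oci.config.from_file()
auth = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

endpoint = "https://modeldeployment.<region>.oci.customer-oci.com/<model-deployment-ocid>/predict"

# KServe v2-style inference request; tensor names, shapes, and datatypes
# depend on the ensemble's config.pbtxt, so treat these as placeholders.
body = {
    "inputs": [
        {
            "name": "text_input",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["Summarize the benefits of model deployment on OCI."],
        },
        {
            "name": "max_tokens",
            "shape": [1, 1],
            "datatype": "INT32",
            "data": [64],
        },
    ]
}

response = requests.post(endpoint, json=body, auth=auth)
print(response.json())
```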
As AI applications increasingly rely on sophisticated language models, the need for efficient, high-performance inference servers has grown. vLLM is an open source library for fast LLM inference and serving. vLLM uses PagedAttention, an attention algorithm that efficiently manages attention key and value memory.
Follow the steps for deploying large models using vLLM, and the steps for deploying Meta-Llama-3-8B-Instruct with the Oracle service-managed vLLM (0.3.0) container, on GitHub.
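As a final sketch, the following request sends a completion-style payload to a vLLM-backed deployment. The prompt field and sampling parameters follow a common vLLM generate-style request, but the exact body depends on the container version and model, so confirm the contract in the GitHub samples.

```python
import oci
import requests

# Sign requests with the credentials from the default OCI config file.
config = oci.config.from_file()
auth = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

endpoint = "https://modeldeployment.<region>.oci.customer-oci.com/<model-deployment-ocid>/predict"

# Placeholder completion-style request; field names depend on the vLLM
# container version and how the serving endpoint is configured.
body = {
    "prompt": "List three use cases for large language models.",
    "max_tokens": 128,
    "temperature": 0.8,
}

response = requests.post(endpoint, json=body, auth=auth)
print(response.json())
```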