Training Data Requirements in Generative AI
Understand the guidelines for creating training data for fine-tuning the pretrained models in OCI Generative AI.
Custom models accept only one training dataset file, in JSONL (JSON Lines) format.
The file must contain at least 32 prompt/completion pair examples. The dataset is
randomly split in an 80:20 ratio for training and validation. There's no maximum
number of sentences for the training file, but large datasets take longer to train.
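To make the split arithmetic concrete, the following is a minimal local sketch of a random 80:20 split applied to the smallest allowed dataset of 32 examples. It is only an illustration: the function name `split_examples`, the fixed seed, and the rounding rule are assumptions, and the service may shuffle and round differently.

```python
import random


def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and split it into (train, validation).

    Hypothetical helper for illustration only; it does not reproduce the
    service's internal split.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Assumed rounding rule: round the training share to the nearest integer.
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# The minimum dataset size: 32 prompt/completion pairs.
examples = [
    {"prompt": f"question {i}", "completion": f"answer {i}"} for i in range(32)
]
train, validation = split_examples(examples)
print(len(train), len(validation))  # prints "26 6" under the rounding rule above
```

With only 32 examples, roughly six end up in validation, so a larger dataset gives a more informative validation signal.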
About JSONL

A JSONL file contains a new JSON value or object on each line. The file isn't evaluated
as a whole, like a regular JSON file. Instead, each line is treated as if it is a separate
JSON file. This format is ideal for storing a set of inputs in JSON format.

The OCI Generative AI service accepts a JSONL file for fine-tuning custom models in the
following format:

JSONL Example

    {"prompt": "<first prompt>", "completion": "<expected completion given first prompt>"}
    {"prompt": "<second prompt>", "completion": "<expected completion given second prompt>"}
    ...
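One way to produce a file in this format is with Python's standard `json` module, as in the sketch below. The helper name `write_jsonl`, the sample pairs, and the output file name are hypothetical; only the `"prompt"` and `"completion"` keys come from the format described here.

```python
import json
import os
import tempfile


def write_jsonl(pairs, path):
    """Write (prompt, completion) pairs to `path`, one JSON object per line."""
    # newline="\n" keeps line endings as plain \n on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for prompt, completion in pairs:
            record = {"prompt": prompt, "completion": completion}
            # json.dumps escapes any newline inside a value, so each record
            # stays on a single physical line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder data; a real training file needs at least 32 pairs.
pairs = [
    ("Summarize: The meeting moved to Friday.", "Meeting is now on Friday."),
    ("Summarize: Sales rose 5% in the third quarter.", "Q3 sales grew 5%."),
]
output_path = os.path.join(tempfile.gettempdir(), "training_data.jsonl")
write_jsonl(pairs, output_path)
```

Serializing each record with `json.dumps` rather than building the line by hand avoids quoting and escaping mistakes in prompts that contain quotes or line breaks.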
Ensure that each JSONL dataset file that you create for Generative AI has the following
properties:
- The file is UTF-8 encoded.
- Each line item contains a valid JSON object.
- Each JSON object has two properties: "prompt" and "completion".
- Each JSON object is entered in a new line or followed by a newline character (\n).
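These properties can be checked locally before upload. The sketch below is one possible pre-flight check, not an official validator: the function name `validate_jsonl` is hypothetical, and it assumes that "two properties" means exactly the keys `"prompt"` and `"completion"`. It also applies the minimum of 32 examples.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}
MIN_EXAMPLES = 32


def validate_jsonl(path, min_examples=MIN_EXAMPLES):
    """Return a list of problems found in the file; an empty list means it passed."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"file is not valid UTF-8: {exc}"]

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a newline after the last record is expected

    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
        elif set(record) != REQUIRED_KEYS:
            # Assumption: exactly these two keys, no extras.
            problems.append(
                f"line {number}: expected exactly the keys 'prompt' and 'completion'"
            )

    if len(lines) < min_examples:
        problems.append(f"found {len(lines)} examples; at least {min_examples} required")
    return problems
```

Calling `validate_jsonl("training_data.jsonl")` on a hypothetical file returns the list of issues, which can be printed or used to stop the workflow before the upload step.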
After you create the JSONL file, add your dataset to an Object Storage bucket.