Asynchronous Document Translation

The OCI Language service Asynchronous Document Translation model translates text into a chosen language.

OCI Asynchronous Document Translation is a cloud-based service that translates documents in various formats at scale. Translation runs asynchronously within your own Object Storage locations and preserves the structure and format of the original documents. The service uses Oracle pretrained Machine Translation models to perform language translation and other language-related operations.

Asynchronous Document Translation translates various document types. Word, Excel, PowerPoint, and others can be translated while keeping their original formatting. Plain text, HTML, and JSON are supported, which is ideal for translating online content or integrating translation into global applications. Closed caption and subtitle formats are also supported, improving the accessibility of video content.

The service can also transform files into LLM-compatible JSON or CSV files, suitable for tasks such as training and fine-tuning ML models or building RAG indices.

Use Cases

Streamlined approach to overcoming language barriers
  • Translate user guides, blogs and knowledge base articles to reach a wider audience.
  • Improve internal communications and knowledge sharing across global teams.
  • Expand the reach of your sales and marketing campaigns by providing presentations and marketing assets in multiple languages.
  • Make your training content more inclusive to non-native speakers by adding subtitles to recorded video content.
  • Develop multi-lingual support for products and services, including expanding your machine learning models to be used with non-English input content.
Prep your multi-lingual Enterprise data for LLM processing
The effectiveness of foundation LLMs and AI models can be improved by using your Enterprise data. However, a large proportion of this data exists in various formats and languages, which can be a challenge. Some LLMs and AI models support only particular languages, and multi-lingual models might perform differently depending on the language.
  • Translate and transform your multi-lingual Enterprise content from various formats into JSON or CSV
    • Segment by sentence, chunk, or the file format's natural boundaries.
  • Use the JSON to build RAG indices, fine-tune custom models, or submit to AI pipelines for further analysis and processing, such as sentiment analysis or named entity recognition (NER).

Supported Document Types

Document Type      Extensions
Microsoft Office   .docx, .pptx, .xlsx
HTML               .html
JSON               .json
Text               .txt
CSV                Comma-separated values, .csv
TSV                Tab-separated values, .tsv
SRT                SubRip Subtitle file, .srt
WebVTT             Web Video Text Tracks Format, .vtt

Supported Languages

For a list of supported languages, see Supported Languages. Auto-detection of the dominant source language is supported when the source language parameter is set to auto.

Prerequisite

Async Job Policies setup is required to use the Asynchronous Document Translation service.

Size Limits and Restrictions

  • Maximum document size is 20 MB. Documents over this size are ignored.
  • All text formats (Text, HTML, CSV, TSV, SRT, WebVTT, JSON) must be encoded in UTF-8.
  • Maximum size of a single request is 5 GB. However, smaller requests are recommended for faster responses.
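
The limits above can be checked locally before files are uploaded. The following is a minimal sketch, not part of the service: the function names are hypothetical, and it assumes that 1 MB means 1024 * 1024 bytes and that the request size is the combined size of all files, neither of which is stated above.

```python
from pathlib import Path

# Limits taken from the list above. Treating 1 MB as 1024 * 1024 bytes is an assumption.
MAX_DOCUMENT_BYTES = 20 * 1024 * 1024
MAX_REQUEST_BYTES = 5 * 1024 ** 3
# Extensions of the text formats that must be encoded in UTF-8.
TEXT_EXTENSIONS = {".txt", ".html", ".csv", ".tsv", ".srt", ".vtt", ".json"}


def check_document(path: Path) -> list[str]:
    """Return the problems found for one file (empty list if none)."""
    problems = []
    if path.stat().st_size > MAX_DOCUMENT_BYTES:
        problems.append("larger than 20 MB")
    if path.suffix.lower() in TEXT_EXTENSIONS:
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append("not valid UTF-8")
    return problems


def check_request(paths: list[Path]) -> dict[str, list[str]]:
    """Check every file plus the combined size; return only entries with problems."""
    report = {str(p): check_document(p) for p in paths}
    total = sum(p.stat().st_size for p in paths)
    if total > MAX_REQUEST_BYTES:
        report["<request>"] = ["total size exceeds 5 GB"]
    return {name: issues for name, issues in report.items() if issues}
```

Calling check_request with the list of local files returns an empty dictionary when nothing needs attention.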

Controlling Asynchronous Document Translation Features

With Asynchronous Document Translation you can control and customize translation through advanced properties, either by using a glossary file or specific file properties.

A glossary is a list of user-supplied terms that Asynchronous Document Translation uses to control translation. By using a glossary, you can specify how to translate, or not translate, certain terminology.

The main use cases for glossaries include:

  • Ensuring context and domain-specific terminology is translated consistently throughout the content.
  • Excluding certain terms or words from translation. For example, brand or product names that you don't want to translate.

To optionally control what elements of a file are translated, use file type specific properties. For example, use columns to translate a CSV file or elements to translate a JSON file.

See the following advanced properties and descriptions:

Glossaries

You can specify custom terminology per job, so that certain words are translated differently. A glossary can be supplied as a comma-separated values (CSV) file with no header.

Sample value for advanced properties:

{"translation":{"glossary":{"type":"bucket","bucketDetails":{"bucketName":"source-bucket","namespace":"idngwwc5ajp5","prefix":"glossary_text.csv"}}}}

Sample glossary CSV file content 1 - Applied to all target languages:

India,India
Oracle,Oracle
Oracle Cloud Infrastructure,Oracle Cloud Infrastructure
Oracle NetSuite,Oracle NetSuite

Sample glossary CSV file content 2 - Language specific glossaries

en,nl,es
India,India,India
Oracle,Oracle,Oracle
Oracle Cloud Infrastructure,Oracle Cloud Infrastructure,Oracle Cloud Infrastructure
Oracle NetSuite,Oracle NetSuite,Oracle NetSuite

Best practices for forced glossaries

  • Keep the forced glossary minimal:
    • Only include terms which you want to control and which are unambiguous.
    • Only use terms that you know you never want to use an alternate meaning of, and you want it to only ever be translated in a single way.
    • Limit the list to proper names, such as brand names and product names.
  • Forced glossaries are case-sensitive:
    • If you need both capitalized and non-capitalized versions of a term to be included, you must include an entry for each version.
    • Similarly, the plural version of a term must be included as a separate entry in the glossary.
  • Don't include different translations for the same source phrase. MT results can't be guaranteed in such cases.

    Example:

    en,fr
    Oracle MT, Oracle MT
    Oracle MT, Système de traduction automatique de Oracle
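
A glossary file can be checked for such conflicting entries before it is uploaded. The following is a minimal sketch using only the Python standard library. The function name find_conflicts is hypothetical, and the check compares source terms exactly, in line with the case-sensitivity noted above.

```python
import csv
import io
from collections import defaultdict


def find_conflicts(glossary_csv: str) -> dict[str, list[list[str]]]:
    """Map each source term to its translation rows when it has more than one distinct row."""
    rows_by_source = defaultdict(list)
    # skipinitialspace drops the space that follows a comma, as in the example above.
    reader = csv.reader(io.StringIO(glossary_csv), skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        source, translations = row[0], row[1:]
        if translations not in rows_by_source[source]:
            rows_by_source[source].append(translations)
    return {s: t for s, t in rows_by_source.items() if len(t) > 1}


sample = """en,fr
Oracle MT, Oracle MT
Oracle MT, Système de traduction automatique de Oracle
"""
print(find_conflicts(sample))
```

For the sample, the result contains a single key, Oracle MT, with both translations listed. An empty result means that no source phrase has competing translations.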

CSV controls

You can specify the headers and columns to translate.

  • columnsToTranslate: Indexes (starting from 1) of the columns to translate.
  • hasHeaders: Specifies whether the CSV file has headers. If true, the first row remains untranslated.

Example:

{"translation":{"csv":{"columnsToTranslate":[2],"hasHeaders":false}}}
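
Building the property string in code avoids hand-written JSON errors and off-by-one mistakes with the 1-based column indexes. This is a minimal sketch with a hypothetical helper name, csv_properties. It only produces the JSON shown above and doesn't call the service.

```python
import json


def csv_properties(columns_to_translate: list[int], has_headers: bool) -> str:
    """Build the advanced-properties JSON string for CSV column selection."""
    if any(index < 1 for index in columns_to_translate):
        raise ValueError("column indexes start from 1")
    properties = {
        "translation": {
            "csv": {
                "columnsToTranslate": columns_to_translate,
                "hasHeaders": has_headers,
            }
        }
    }
    # Compact separators reproduce the single-line form used in the example.
    return json.dumps(properties, separators=(",", ":"))


print(csv_properties([2], False))
# {"translation":{"csv":{"columnsToTranslate":[2],"hasHeaders":false}}}
```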

JSON configuration

You can translate specific elements by setting pathsToTranslate to an array of valid JSON path expressions.

Example:

{"translation":{"json":{"filter":"path","pathsToTranslate":
["jsonData.title","jsonData.existingSkills","jsonData.structured.experience[*].role"]}}}
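
To see which values the path expressions in the example would pick out, the following sketch resolves them against a sample document. It is an illustration only: the select function and the sample data are hypothetical, it handles just the dotted names and [*] wildcard that appear in the example, and the service's own path handling might differ.

```python
def select(document, path: str) -> list:
    """Return the values matched by a dotted path, where [*] means every list item."""
    current = [document]
    for part in path.split("."):
        wildcard = part.endswith("[*]")
        key = part[:-3] if wildcard else part
        next_values = []
        for node in current:
            if isinstance(node, dict) and key in node:
                value = node[key]
                if wildcard and isinstance(value, list):
                    next_values.extend(value)  # fan out over the list items
                elif not wildcard:
                    next_values.append(value)
        current = next_values
    return current


# Hypothetical input shaped to fit the paths in the example above.
doc = {
    "jsonData": {
        "title": "Engineer",
        "existingSkills": "Java, SQL",
        "structured": {
            "experience": [
                {"role": "Developer", "years": 3},
                {"role": "Architect", "years": 5},
            ]
        },
    }
}

print(select(doc, "jsonData.title"))
# ['Engineer']
print(select(doc, "jsonData.structured.experience[*].role"))
# ['Developer', 'Architect']
```

In this sample, the years values aren't selected by any path, so they would be left untranslated.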

Custom segmentation with a delimiter

By default, each entry in JSON/CSV/TSV is translated at the sentence level. The custom delimiter can be used if the content doesn't consist of normal sentences. The delimiter is a valid regular expression that can be used to split a text.

Example:

To translate each line separately:

{"translation":{ "json": {"delimiters": "\\s*\\n+\\s*"} }}

{"translation":{ "csv": {"delimiters": "\\s*\\n+\\s*"} }}

{"translation":{ "tsv": {"delimiters": "\\s*\\n+\\s*"} }}
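
A delimiter expression can be tried out locally before a job is submitted. The sketch below uses Python's re module as a stand-in. That the service's regular expression dialect matches Python's is an assumption, and split_segments is a hypothetical helper.

```python
import json
import re

# The property value as it appears in the JSON payload (double backslashes).
properties = json.loads(r'{"translation": {"json": {"delimiters": "\\s*\\n+\\s*"}}}')
# After JSON unescaping, the pattern is \s*\n+\s*
delimiter = properties["translation"]["json"]["delimiters"]


def split_segments(text: str) -> list[str]:
    """Split text on the delimiter and drop empty segments."""
    return [segment for segment in re.split(delimiter, text) if segment]


print(split_segments("First line  \n\n  Second line\nThird line"))
# ['First line', 'Second line', 'Third line']
```

The pattern treats one or more line breaks, along with any surrounding whitespace, as a single boundary, so blank lines don't produce empty segments.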

HTML content processor

To treat text in JSON/CSV/TSV entries as HTML text, use the "contentProcessor" property.

Example:

{"translation":{ "json": {"contentProcessor": "html"} }}

{"translation":{ "csv": {"contentProcessor": "html"} }}

{"translation":{ "tsv": {"contentProcessor": "html"} }}

Excel: Optional translation of sheet names

By default, the sheet names are untranslated. Translating sheet names can break some macros or references. However, if the spreadsheets don't have a reference using sheet names or macros, the service can translate the sheet names by setting the translateSheetNames property to true.

Example:

{"translation":{"xlsx": {"translateSheetNames":true} }}

Extra translation controls for Office documents

By default, hidden text, comments, and document properties in an Office document are excluded from translation.

  • The translateHiddenText property can be set to translate hidden text in the documents.
  • The translateDocProperties property can be set to translate document properties in the documents.
  • The translateComments property can be set to translate comments in the documents.

Example:

{"translation":{"docx": {"translateHiddenText":true, "translateDocProperties":true, "translateComments": true},
"pptx": {"translateHiddenText":true, "translateDocProperties":true, "translateComments": true},
"xlsx": {"translateHiddenText":true, "translateDocProperties":true, "translateComments": true} }}

The default value of these properties is false. The properties can be set differently for each Office document type as necessary.

{"translation":{"docx": {"translateHiddenText":true}, "pptx": {"translateDocProperties":true}, "xlsx": 
{ "translateComments": true} }}
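
When the flags differ per document type, generating the property string from a small mapping keeps the property names consistent. This is a minimal sketch with a hypothetical helper name, office_properties. It only builds the JSON and relies on the documented default of false for any flag that is left out.

```python
import json

OFFICE_TYPES = ("docx", "pptx", "xlsx")
OFFICE_FLAGS = ("translateHiddenText", "translateDocProperties", "translateComments")


def office_properties(enabled: dict[str, set[str]]) -> str:
    """Build advanced properties that switch on the named flags for each Office type."""
    translation = {}
    for doc_type, flags in enabled.items():
        if doc_type not in OFFICE_TYPES:
            raise ValueError(f"unsupported Office type: {doc_type}")
        unknown = set(flags) - set(OFFICE_FLAGS)
        if unknown:
            raise ValueError(f"unknown properties: {sorted(unknown)}")
        # Flags that are left out keep their default value of false.
        translation[doc_type] = {flag: True for flag in OFFICE_FLAGS if flag in flags}
    return json.dumps({"translation": translation})


print(office_properties({
    "docx": {"translateHiddenText"},
    "pptx": {"translateDocProperties"},
    "xlsx": {"translateComments"},
}))
```

The printed string carries the same settings as the per-type example above.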

Translation controls for subtitle files

By default, OCI tries to build a sentence from many subtitle entries before translating the text. However, sometimes a subtitle entry must be independently translated or no proper sentences exist in the text.

If each of the subtitle entries needs to be translated individually, set the value to true. maxItemSize isn't effective in this case.

Output formats (File types)

With this feature, you can specify the preferred output file type for translated text. The translation service automatically detects the input file type based on the file you provide. By default, the same file type is used for the translated text.

Supported output file types include:

  • JSON
  • CSV
  • Native (default)

Example:

"properties" : {
   "commonOutputFormat" : "json"
}

Note

This property is applied to all files in the input source. If several files are provided, each file is translated according to the same output format.

Output formats (segmentation)

With this feature, you can specify segmentation options to control how text is divided during translation.

Supported segmentation options are:

  • Natural: No segmentation is done.
  • Sentence: Each paragraph is split into sentences.
  • Chunk-plain: Sentence-based segmentation is applied first, and then sentences are joined into chunks up to a specified size.
  • Chunk-natural: The same as chunk-plain, except natural boundaries are respected. No chunk contains sentences from two different paragraphs.

Example:

"properties" : {
"commonOutputFormat" : "csv:chunk-plain:2000"
}

Segmentation settings aren't allowed with native outputFormat.

Note

This property is applied to all files in the input source. If several files are provided, each file is translated according to the same output format and segmentation settings.
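
The difference between chunk-plain and chunk-natural can be sketched in a few lines. This is an illustration of the descriptions above, not the service's algorithm. It assumes that the size is counted in characters, that sentences are joined with a single space, and that a sentence longer than the limit becomes its own chunk. The function names are hypothetical.

```python
def chunk_plain(sentences: list[str], max_size: int) -> list[str]:
    """Join consecutive sentences into chunks of at most max_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_size:
            chunks.append(current)  # close the chunk before it grows past the limit
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_natural(paragraphs: list[list[str]], max_size: int) -> list[str]:
    """Like chunk_plain, but never mixes sentences from different paragraphs."""
    chunks = []
    for sentences in paragraphs:
        chunks.extend(chunk_plain(sentences, max_size))
    return chunks


paragraphs = [["One.", "Two."], ["Three."]]
flat = [sentence for paragraph in paragraphs for sentence in paragraph]
print(chunk_plain(flat, 20))
# ['One. Two. Three.']
print(chunk_natural(paragraphs, 20))
# ['One. Two.', 'Three.']
```

With the same size limit, chunk-natural produces more chunks here because the paragraph boundary forces a break that chunk-plain ignores.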

Running Asynchronous Document Translation

Run Asynchronous Document Translation using the OCI Language service.

  • Upload the document to a bucket. For more information, see Upload Dataset.
    1. Open the navigation menu and click Analytics & AI. Under AI Services, click Language.
    2. In the left-side navigation menu, click Jobs.
    3. Click Create Job, and then enter a name and compartment.
    4. Select the Pretrained language translation.
    5. Select the source language.
    6. Select the target languages.
    7. Click Next.
    8. Enter the Data type.
    9. Enter the bucket where the document is located.
    10. Enter the datafile name.
    11. Enter the text column name of the column that has the text to be processed.
    12. Enter the row ID column. This is the column that uniquely identifies the row.
    13. (Optional) Enter the columns to be copied to output.
    14. (Optional) Enter the Job output data.
    15. To review details, click Next.
    16. Click Create job.
  • Use the oci ai language batch-language-translation command and required parameters to translate one or more files:

    oci ai language batch-language-translation --documents [<list-of-documents>] ... [OPTIONS]

    For a complete list of flags and variable options for CLI commands, see the CLI Command Reference.

  • Run the CreateJob operation to translate one or more files.