Jobs
Troubleshoot your jobs and job runs.
Cannot create log object on behalf of the user Errors During Job Run Creation
If the job run creation fails and you see the following lifecycle details:
The specified log group is not found or not authorized. Cannot create log object on behalf of the user. Ensure the log group is valid and the user has appropriate permissions configured.
- Incorrect Log Group OCID: Ensure that the log group OCID specified in the job run create configuration is correct.
- Incorrect Permissions: The user creating the job run must have permissions to log groups and logging content. This ensures that the user has access to the specified log group and log object, and allows creating a new log object on behalf of the user when enableAutoLogCreation is enabled:
allow group <group-name> to manage log-groups in compartment <log-compartment-name>
allow group <group-name> to use log-content in compartment <log-compartment-name>
Common mistakes are:
- Only giving the user use permissions on log groups. The manage permission is required when enableAutoLogCreation is enabled.
- Allowing the wrong group. The group refers to the group that the creator of the job run is in.
If you're creating job runs using instance principals, write the preceding policies for the dynamic group instead, replacing group <group-name> with dynamic-group <instance-principal-dynamic-group-name>.
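Assuming the same verbs apply for instance principals, the preceding policies would then read:

```
allow dynamic-group <instance-principal-dynamic-group-name> to manage log-groups in compartment <log-compartment-name>
allow dynamic-group <instance-principal-dynamic-group-name> to use log-content in compartment <log-compartment-name>
```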
Bring Your Own Container Job Run Failure When Downloading the Image
If a bring your own container job run fails with errors when downloading the image, check the following:
- You could be missing the host in the path to the image. The correct format for the image path is <region-key>.ocir.io/<tenancy-namespace>/<repository-name>:<tag>. A common mistake is to omit the first part of the path (the host URL).
- The container image is in a different region than the job run. Data Science jobs don't support pulling images from OCIR cross-region, so ensure that the container image is in the same region as the job run.
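As a quick sanity check before creating the job run, the image path format can be validated with a simple shell pattern match (check_image_path is a hypothetical helper, not part of the Data Science service):

```shell
#!/bin/sh
# Hypothetical helper: verify that an OCIR image path includes the
# registry host (<region-key>.ocir.io), a tenancy namespace, a
# repository name, and a tag.
check_image_path() {
  case "$1" in
    *.ocir.io/*/*:*) echo "ok" ;;
    *) echo "invalid: expected <region-key>.ocir.io/<tenancy-namespace>/<repository-name>:<tag>" ;;
  esac
}

check_image_path "iad.ocir.io/mytenancy/myrepo:v1"  # prints: ok
check_image_path "mytenancy/myrepo:v1"              # host is missing, prints: invalid: ...
```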
Why Isn't Fast Launch an Option in the Console When Creating a Job
The fast launch option is only available in the regions where it's supported. Not all regions and realms support this feature. For example, it's generally not supported in Dedicated Region Cloud@Customer (DRCC) realms.
The same is true for the ListFastLaunchJobConfigs API endpoint. The API responds with the list of fast launch options, so in regions where fast launch isn't supported the response is an error or an empty list.
400 LimitExceeded Error
If this error occurs when creating a job or job run, you have reached the OCI service limits. Watch the increasing your Data Science service limits video to learn how to submit a request to increase your service limits.
There is currently no capacity for the specified shape Error
If this error occurs when creating a job run (as the lifecycle detail describes), there is no capacity to create the run. Retry later, try other regions, or use a different shape family.
401 NotAuthenticated Error When Making Requests to the Data Science API
This type of error is entirely unrelated to the Data Science service. Rather, it's an issue on the user side when creating and signing the requests.
If you're using a user principal to make the request, some common mistakes are:
- Having invalid API keys, see assigning keys.
- Making a request immediately after uploading a public key. The identity information needs time to propagate across the regions in a realm. Typically, propagation completes within 5 minutes, though occasionally more time might be required.
Job Run Logging Integration is Enabled Though Logs Aren't Generated
A job run was successfully created and reached the IN_PROGRESS state, but no logs appear in the log object. Typically, this occurs when policies are missing or incorrect; the job run must have permission to write to the job run log.
First, define a dynamic group for the job run resource:
all { resource.type='datasciencejobrun', resource.compartment.id='<job-run-compartment-ocid>' }
Then set this dynamic group access:
allow dynamic-group <job-runs-dynamic-group> to use log-content in compartment <log-compartment-name>
Some common mistakes are:
- An incorrect compartment is specified. Notice that the compartments in the preceding policies are different:
  - For the dynamic group definition, it's the compartment of the job run.
  - For the policy statement granting access to log content, it's the compartment of the log.
- Defining the dynamic group using compartment.id instead of resource.compartment.id.
- An incorrect resource type in the dynamic group definition. Likely, the dynamic group was defined for the notebook session resource and doesn't include the job run resource. The datasciencejobrun resource principal is used to write logs for job run logging integration, so it must be included in the dynamic group definition.
Job Run Logging Integration is Enabled Though the Logs Appear Truncated
Data Science jobs support integration with the OCI Logging service for automatic logging. If the logs appear truncated or incomplete, it's likely because of the following Logging service limits:
- Each entry must be less than 1MB.
- Any log data field can't be more than 10,000 characters.
If the data exceeds these limits, then the log entry is truncated during ingestion.
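As a sketch, very long lines can be pre-truncated inside the job before they reach the Logging service, so ingestion doesn't silently cut them (the 10,000-character limit is the field limit listed above; truncate_log_lines is a hypothetical helper):

```shell
#!/bin/sh
# Sketch: truncate each log line to the Logging service's
# 10,000-character field limit before emitting it.
truncate_log_lines() {
  awk 'BEGIN { max = 10000 } { print substr($0, 1, max) }'
}

printf 'a short line\n' | truncate_log_lines  # passes through unchanged
```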
Job Run Metrics Have No Data
If you don't see the job run metrics during or after job processing, you likely don't have the correct policies configured. Ensure that you have the following policy:
allow group <user-group-name> to read metrics in compartment <compartment-name>
The compartment is the compartment of the job run.
Job run artifact execution failed with exit code ___ Error
This means that the execution of your code failed with the indicated exit code. Enable logging integration, and ensure that you have enough log statements in the code to debug the issue.
Job Run Exit Code Isn't Indicated
Jobs indicate the exit code of a job run failure when it exits. This information is available in the job run's lifecycle detail field. This is supported for all job runs including bring your own container job runs.
If you're observing that the exit code you know the job run failed with isn't correctly indicated, likely the exit code isn't being propagated correctly.
Some common mistakes are:
- If you're using a shell script as an entry point that starts other files (for example, other Python files), then the shell script must capture the exit code from the internal file execution, and later exit the shell script with that captured code.
- Throwing exceptions might not be enough. The file run (or the container for bring your own container) must explicitly exit with an exit code. In Python, this is done using sys.exit(ERROR_CODE).
- Using an incorrect type for the exit code value. Typically, the incorrect type used is a string. Exit codes must be integers between 1 and 255, as described in Job with Exit Codes.
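The shell-script case can be sketched as follows; here sh -c 'exit 7' stands in for the real inner workload (for example, python main.py), and run_and_propagate is a hypothetical wrapper:

```shell
#!/bin/sh
# Sketch of an entry-point wrapper that propagates the inner workload's
# exit code. Any cleanup or logging after the workload overwrites $?,
# so the code must be captured immediately.
run_and_propagate() {
  "$@"                               # run the inner workload
  code=$?                            # capture its exit code right away
  echo "workload exited with $code"  # this echo resets $? to 0
  return "$code"                     # in a top-level script, use: exit "$code"
}

run_and_propagate sh -c 'exit 7'
echo "wrapper returned $?"  # prints: wrapper returned 7
```

Without the explicit exit at the end, the script would exit with the status of its last command (the cleanup or logging step), masking the real failure code.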
Job Run Invalid Entry Point
Specifying JOB_RUN_ENTRYPOINT as a file that doesn't exist, or a file that isn't at the specified location, results in this error:
Job run bootstrap failure: invalid job run entry point (JOB_RUN_ENTRYPOINT).