Using odcp Command Line Utility to Copy Data
Use the odcp command line utility to manage jobs that copy data between HDFS on your cluster and remote storage providers.
The odcp CLI can be used only in clusters that use Cloudera Distribution including Hadoop. odcp uses Spark to provide parallel transfer of one or more files: it takes the input file, splits it into chunks, and transfers the chunks in parallel to the destination. By default, the transferred chunks are then merged back into one output file.
odcp supports copying files when using the following:
- Apache Hadoop Distributed File System (HDFS)
- Apache WebHDFS and Secure WebHDFS (SWebHDFS)
- Amazon Simple Storage Service (S3)
- Oracle Cloud Infrastructure Object Storage
- Hypertext Transfer Protocol (HTTP) and HTTP Secure (HTTPS), used for sources only
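For example, a minimal copy between two HDFS locations looks like the following (the paths are placeholders, not part of your cluster):
odcp hdfs:///user/oracle/data/large-file.raw hdfs:///user/oracle/backup/large-file.raw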
Before You Begin
The following topics describe how to use the odcp command options to copy data between HDFS on your cluster and external storage providers. Before you begin, make sure you have:
- Access to all running storage services.
- All required credentials established, for example, for Oracle Cloud Infrastructure Object Storage instances.
See odcp Reference for the odcp syntax, parameters, and options.
Using bda-oss-admin with odcp
Use bda-oss-admin commands to configure the cluster for use with storage providers. This makes it easier and faster to use odcp with the storage provider.
Any user with access privileges to the cluster can run odcp.
To copy data between HDFS and a storage provider, for example Oracle Cloud Infrastructure Object Storage, you must have an account with the data store and access to it.
To copy data between HDFS and a storage provider:
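As a hedged sketch only (the oci:// URI scheme, bucket, and namespace below are illustrative assumptions; use the Object Storage path form configured for your cluster), a copy from HDFS to Object Storage after credentials are set up can look like this:
odcp hdfs:///user/oracle/exports/part-00000 oci://mybucket@mynamespace/exports/part-00000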
Copying Data on a Secure Cluster
Using odcp to copy data on a Kerberos-enabled cluster requires some additional steps.
In Oracle Big Data Service, a cluster is Kerberos-enabled when it's created with the Secure and Highly Available (HA) option selected.
If you want to execute a long-running job, or run odcp from an automated shell script or from a workflow service such as Apache Oozie, then you must pass to the odcp command a Kerberos principal and the full path to the principal's keytab file, as described below.
If you run an odcp job from the console, you don't have to generate a keytab file or specify the principal. You just have to have an active Kerberos token (created using the kinit command).
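For example, an interactive (console) run only needs a valid ticket; the principal, realm, and paths below are placeholders:
kinit odcp_user@EXAMPLE.COM
odcp hdfs:///user/odcp_user/data.raw hdfs:///user/odcp_user/data-copy.raw
For a scripted or long-running job, pass the principal and keytab explicitly. The option names shown here (--krb-principal and --krb-keytab) are assumptions; confirm them with odcp --help on your cluster:
odcp --krb-principal odcp_user@EXAMPLE.COM --krb-keytab /home/odcp_user/odcp_user.keytab hdfs:///user/odcp_user/data.raw hdfs:///user/odcp_user/data-copy.raw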
Retrying a Failed Copy Job
If a copy job fails, you can retry it. When retrying the job, the source and destination are automatically synchronized, so odcp doesn't re-transfer file parts that already reached the destination successfully.
Use the following:
odcp --retry <source> <target>
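For example, to retry a failed copy between two HDFS locations (placeholder paths):
odcp --retry hdfs:///user/oracle/data/large-file.raw hdfs:///user/oracle/backup/large-file.raw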
Debugging odcp
You must configure the cluster to enable debugging for odcp.
Configuring a Cluster to Enable Debugging
Collecting Transfer Rates
You can collect the transfer rates when debugging is enabled.
Transfer rates are reported after every:
- Read chunk operation
- Write or upload chunk operation
The summary throughput is reported after a chunk transfer is completed. The summary throughput includes all:
- Read operations
- Write or upload operations
- Spark framework operations (task distribution, task management, and so on)
Use the following command to collect output rates:
get-transfer-rates.sh application_id
Output Example:
./get-transfer-rates.sh application_1476272395108_0054 2>/dev/null
Action,Speed [MBps],Start time,End time,Duration [s],Size [B]
Download from OSS,2.5855451864420473,2016-10-31 11:34:48,2016-10-31 11:38:06,198.024,536870912
Download from OSS,2.548912231791706,2016-10-31 11:34:47,2016-10-31 11:38:08,200.87,536870912
Download from OSS,2.53447780846872,2016-10-31 11:34:47,2016-10-31 11:38:09,202.014,536870912
Download from OSS,2.5130931169717226,2016-10-31 11:34:48,2016-10-31 11:38:11,203.733,536870912
Write to HDFS,208.04550995530275,2016-10-31 14:00:30,2016-10-31 14:00:33,2.46099999999999967435,536870912
Write to HDFS,271.76220806794055,2016-10-31 14:00:38,2016-10-31 14:00:40,1.88400000000000001398,536870912
Write to HDFS,277.5067750677507,2016-10-31 14:00:43,2016-10-31 14:00:45,1.84499999999999985045,536870912
Write to HDFS,218.0579216354344,2016-10-31 14:00:44,2016-10-31 14:00:46,2.34800000000000013207,536870912
Write to HDFS,195.56913674560735,2016-10-31 14:00:44,2016-10-31 14:00:47,2.61799999999999978370,536870912
odcp Reference
The odcp command-line utility has the single command odcp, with parameters and options as described below.
Syntax
odcp [<options>]
<source1> [<source2> ...]
<destination>
Parameters
Parameter | Description |
---|---|
<source1> [<source2> ...] | The source can be a file or directory in any of the supported storage systems listed at the start of this topic (HDFS, WebHDFS, SWebHDFS, S3, Oracle Cloud Infrastructure Object Storage, HTTP, or HTTPS). If you specify multiple sources, list them one after the other (see the example after this table). If two or more source files have the same name, nothing is copied and odcp reports an error. Regular expressions are supported through options described in the Options table below. |
<destination> | The destination can be a file or directory in any of the supported storage systems except HTTP and HTTPS, which can be used for sources only. |
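For example, a copy that specifies multiple sources and an HDFS destination directory (placeholder paths):
odcp hdfs:///user/oracle/data/file1.raw hdfs:///user/oracle/data/file2.raw hdfs:///user/oracle/merged/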
Options
Option | Description |
---|---|
 | Destination file part size in bytes. The remainder after dividing |
 | Concatenate the file chunks (default). |
 | Specify the number of executor cores. The default value is |
 | Specify the executors memory limit in gigabytes. The default value is |
--extra-conf | Specify extra configuration options. For example: --extra-conf spark.kryoserializer.buffer.max=128m |
 | Specify files to concatenate to a |
 | Show help for this command. |
 | The full path to the keytab file of the Kerberos principal. (Use in a Kerberos-enabled Spark environment only.) |
 | The Kerberos principal. (Use in a Kerberos-enabled Spark environment only.) |
 | Don't overwrite an existing file. |
 | Don't copy files recursively. |
 | Specify the number of executors. The default value is |
 | Show the progress of the data transfer. |
--retry | Retry if the previous transfer failed or was interrupted. |
 | Destination file part size in bytes. The remainder after dividing |
 | The path to a directory containing an Apache Spark installation. If nothing is specified, |
 | Filters sources by matching the source name with a regular expression. |
--sync | Synchronize the |
 | Enable verbose mode for debugging. |
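For example, a copy that passes an extra Spark setting through --extra-conf (placeholder paths; the setting shown is the example given in the table above):
odcp --extra-conf spark.kryoserializer.buffer.max=128m hdfs:///user/oracle/data/large-file.raw hdfs:///user/oracle/backup/large-file.raw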