Migrate from Big Data Appliance (BDA) or Big Data Cloud Service (BDCS)
Find out how to migrate from Oracle Big Data Appliance or Big Data Cloud Service to Big Data Service
- Back up your Big Data Appliance or Big Data Cloud Service data and metadata to Oracle Object Storage.
- Create a Big Data Service cluster on Oracle Cloud Infrastructure.
- Restore your Big Data Appliance or Big Data Cloud Service data from Oracle Object Storage to Big Data Service.
- We recommend that, even after migrating to OCI, you keep your Big Data Appliance or Big Data Cloud Service clusters (in a stopped state) for at least three months as a backup.
Migrating Resources Using WANdisco LiveData Migrator
Ensure that port 8020 is open at the destination.
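To confirm the port is reachable before you start the migration, you can run a quick connectivity check from a source cluster node. This is a minimal sketch; the destination host name is a placeholder.

  # Verify that the destination NameNode port accepts connections
  nc -zv <destination-namenode-host> 8020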
For more information about WANdisco LiveData Migrator, see the WANdisco documentation.
To migrate resources using WANdisco LiveData Migrator, follow these steps:
Migrating Resources Using BDR
Before you back up your Oracle Big Data Appliance cluster, ensure the following:
- You have administrator access to your Big Data Appliance cluster. You need the administrator credentials for Cloudera Manager.
- You have a Hadoop administrator user with full access to the HDFS data and Hive metadata that's being backed up to Oracle Object Storage.
- You have set up the Oracle Cloud Infrastructure object store to which the HDFS data is copied. For more information, see Overview of Object Storage.
- You have set up your Oracle Cloud Infrastructure tenancy with the following details:
  - The administrator has created a user in Oracle Cloud Infrastructure and has added the user to the required groups.
  - The user has permission to access the Oracle Cloud Infrastructure console.
  - The user has permission to create a bucket. For more information, see Let Object Storage admins manage buckets and objects in Common Policies.
  - The user can inspect the configuration of the Oracle Cloud Infrastructure object store.
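If the backup bucket does not exist yet, a user with the policies above can create it using the OCI CLI. This is a minimal sketch; the bucket name BDA-BACKUP matches the examples later in this topic, and the compartment OCID is a placeholder.

  # Create the Object Storage bucket that will hold the BDA backup
  oci os bucket create --name BDA-BACKUP --compartment-id <compartment-ocid>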
To back up a BDA cluster, follow these steps:
For more information, see Creating a Cluster.
Before you restore your Oracle Big Data Appliance cluster to Oracle Big Data Service, you must have the following:
- A backup of your Big Data Appliance cluster. See Back up BDA Data to Oracle Object Storage.
- A deployed Big Data Service cluster. See Create a Big Data Service Cluster on Oracle Cloud Infrastructure.
- Access to the secret key that has privileges to read the Oracle Object Storage bucket that contains the Big Data Appliance cluster backup.
- Administrator credentials for Cloudera Manager on your Big Data Service cluster.
- An HDFS superuser and Hive administrator with rights to restore data and metadata to the cluster.
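If you still need to create a customer secret key for the user, you can do so with the OCI CLI. This is a minimal sketch; the user OCID and display name are placeholders. Record the returned secret key immediately, because it cannot be retrieved later.

  # Create an S3-compatible customer secret key for the restore user
  oci iam customer-secret-key create --user-id <user-ocid> --display-name bda-restore-key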
Restore the BDA Backup
- Log on to Cloudera Manager on the Big Data Service cluster.
  - Log on to https://your-utility-node-1:7183, where your-utility-node is the public or private IP address for the utility node. If high availability is used, this is the first utility node on the cluster. If high availability is not used, this is the only utility node.
  - Enter the user name admin and the password specified during cluster creation.
- Create an external account in Cloudera Manager for restore.
Use the access key and secret key to create an external account in Cloudera Manager. You set up an external account to allow the cluster to access data in Oracle Object Storage.
To create an external account, follow these steps:
- Log on to Cloudera Manager on the Oracle Big Data Service cluster.
- Go to Administration and click External Accounts.
- On the AWS Credentials tab, click Add Access Key Credentials and specify the following:
  - Name: Specify a name for the credentials. For example, oracle-credential.
  - AWS Access Key ID: Specify a name for the access key. For example, myaccesskey.
  - AWS Secret Key: Enter the secret key value generated earlier when you created a customer secret key.
- Click Add. The Edit S3Guard page appears. Do not select Enable S3Guard.
- Click Save.
- On the page that appears, enable cluster access to S3:
- Select Enable for the cluster name.
- Select the More Secure credential policy and click Continue.
- On the Restart Dependent Services page, select Restart Now, and then click Continue. Restart details are displayed. Restarting the cluster can take a few minutes.
- After the restart, click Continue, and then click Finish.
- Update the s3a endpoint.
  Note: Skip this step if you have already updated the core-site.xml file.
  The endpoint URI enables your Hadoop cluster to connect to the object store that contains your source data. Specify this URI in Cloudera Manager.
To update the endpoint, follow these steps:
- Log on to Cloudera Manager on the Oracle Big Data Service cluster.
- From the list of services on the left, click S3 Connector.
- Click the Configuration tab.
- Update the Default S3 Endpoint property with the following:
https://your-tenancy.compat.objectstorage.your-region.oraclecloud.com
For example, https://oraclebigdatadb.compat.objectstorage.us-phoenix-1.oraclecloud.com
- Save your changes.
- Update the cluster:
- Go to your cluster, click Actions, select Deploy Client Configuration, and then confirm the action.
- When complete, click Close.
- Restart the cluster (click Actions, and then click Restart).
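After the restart, you can optionally confirm from a cluster node that the endpoint and credentials can list the backup bucket. This is a minimal sketch; the keys, tenancy, region, and bucket name are placeholders.

  # List the backup bucket over the S3-compatible endpoint
  hadoop fs \
    -Dfs.s3a.access.key=<access-key-id> \
    -Dfs.s3a.secret.key=<secret-key> \
    -Dfs.s3a.endpoint=https://<your-tenancy>.compat.objectstorage.<your-region>.oraclecloud.com \
    -Dfs.s3a.path.style.access=true \
    -ls s3a://BDA-BACKUP/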
- Create an HDFS replication schedule for restore.
Restore HDFS data that's backed up to Oracle Object Storage. Restore the HDFS data to the HDFS file system root directory to mirror the source.
If Hive has external data that's captured in HDFS and not managed by Hive, create the HDFS replication schedule before you create the Hive replication schedule.
To create an HDFS replication schedule:
- Log in to Cloudera Manager on the Oracle Big Data Service cluster.
- Create an HDFS replication schedule:
- Go to Backup and click Replication Schedules.
- Click Create Schedule and select HDFS Replication.
- Specify details for the replication schedule:
  - Name: Enter a name. For example, hdfs-rep1.
  - Source: Select the credential that you defined earlier. For example, oracle-credential.
  - Source Path: Specify the root location where your data was backed up. For example, s3a://BDA-BACKUP/.
  - Destination: Select HDFS (cluster name).
  - Destination Path: Enter /.
  - Schedule: Select Immediate.
  - Run As Username: Specify a user with access to the data and metadata that's being restored. This is typically a Hadoop superuser and Sentry administrator.
    Note: If you don't have a user with access to the required data and metadata, create one. Do not use the hdfs superuser for this step.
    Note: If Hadoop encryption is used, ensure that the destination directory is created with the appropriate keys and that the command is run as a user who has encrypt access.
- Click Save Schedule. You can monitor the replication on the Replication Schedules page.
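After the HDFS replication schedule completes, you can spot-check the restored data from any cluster node. This is a minimal sketch using standard HDFS commands.

  # Confirm the restored directory layout and overall size
  hdfs dfs -ls /
  hdfs dfs -du -s -h /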
- Create a Hive replication schedule for restore.
To restore Hive data and metadata from Oracle Object Storage to the Hadoop cluster, create a Hive replication schedule in Cloudera Manager.
To create a Hive replication schedule, follow these steps:
- Log on to Cloudera Manager on the Oracle Big Data Service cluster.
- Create the replication schedule:
- Go to Backup and click Replication Schedules.
- Click Create Schedule and select Hive Replication.
- Specify details for the Hive replication schedule:
  - Name: Enter a name. For example, hive-rep1.
  - Source: Specify the credential that you defined earlier. For example, oracle-credential.
  - Destination: Select Hive (cluster name).
  - Cloud Root Path: Select the root location where you backed up your data. For example, s3a://BDA-BACKUP/.
  - HDFS Destination Path: Enter /.
  - Databases: Select Replicate All.
  - Replication Option: Select Metadata and Data.
  - Schedule: Select Immediate.
  - Run As Username: Specify a user with access to the data and metadata that will be restored. This is typically a Hadoop and Hive superuser, and Sentry administrator.
    Note: If you don't have a user with access to the required data and metadata, create one. Do not use the hdfs superuser for this step.
- Click Save Schedule. You can monitor the replication on the Replication Schedules page.
Spark
Review each Spark job and update it based on the new cluster details.
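To confirm that Spark runs correctly on the new cluster before resubmitting your own jobs, you can run the bundled SparkPi example. This is a minimal sketch; the examples JAR path is typical for a CDH parcel installation and may differ on your cluster.

  # Smoke-test Spark on YARN with the SparkPi example
  spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    /opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples*.jar 10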
Yarn
- In Cloudera Manager on the source cluster, go to YARN, click Configuration, and click Fair Scheduler Allocations (Deployed), and then copy that content to the same location on the target cluster.
- If you are unable to copy it, create the queues manually. (In Cloudera Manager, go to Clusters and select Dynamic Resource Pool Configuration.)
Sentry
- Migrate the HDFS data and Hive metadata using BDR, WANdisco, or Hadoop Distcp.
- To export the Sentry data from the source Sentry database and restore it to the destination Sentry database, you need the Sentry meta migration tool. Contact Oracle Support and reference MOS note Doc ID 2879665.1 for the Sentry meta migration tooling.
Migrating Resources Using the Distcp Tool
You can also migrate data and metadata from BDA and import them into Big Data Service using the Distcp tool. Distcp is an open source tool for copying large data sets between distributed file systems, within and across clusters.
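As an orientation, a basic Distcp copy from HDFS to an Object Storage bucket looks like the following. This is a minimal sketch; the source path, bucket, and namespace are placeholders, and it assumes the OCI HDFS connector authentication properties are already set in core-site.xml (otherwise pass them as -D options, as in the import command later in this topic).

  # Copy an HDFS directory to an Object Storage bucket using the OCI HDFS connector
  hadoop distcp hdfs:///data/warehouse oci://<bucket>@<namespace>/backup/warehouse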
To prepare the BDA or BDCS cluster for export, follow these steps:
To export data from HDFS, follow these steps:
Migrate HDFS data incrementally by using distcp to send data from the source to the target at intervals, after data is added, updated, or deleted in the source (a snapshot-based sketch follows this list).
- Ensure that the snapshot names in the source and target clusters are the same.
- Don't delete or change the HDFS data in the target cluster. Doing so can cause the errors mentioned in the next section.
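A snapshot-diff approach is a common way to implement this incremental copy. This is a minimal sketch; the directory, snapshot names, and target NameNode address are placeholders, and it assumes the target holds an identical snapshot s1 from the initial copy.

  # One-time: allow snapshots on the source directory
  hdfs dfsadmin -allowSnapshot /data
  # Take a baseline snapshot, then run the initial copy to the target
  hdfs dfs -createSnapshot /data s1
  # Later, after changes on the source, take a new snapshot and copy only the differences
  hdfs dfs -createSnapshot /data s2
  hadoop distcp -update -diff s1 s2 /data hdfs://<target-namenode>:8020/data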
To export Hive metadata, follow these steps:
You now import the exported data and metadata to Big Data Service.
- Set up a fresh target environment on Big Data Service with the same BDA or BDCS Hadoop version (Hadoop 2.7.x) as the source cluster.
  Note the following:
- Define the Big Data Service cluster on OCI with the same size as the source BDA or BDCS cluster. However, you must review your computing and storage needs before deciding the size of the target cluster.
- For Oracle Cloud Infrastructure VM shapes, see Compute Shapes. BDA or BDCS does not support all shapes.
- If any software other than the BDA or BDCS stack is installed on the source system using the bootstrap script or some other method, you must install and maintain that software on the target system as well.
- Copy the PEM private key (oci_api_key.pem) file to all the nodes of the Big Data Service cluster, and set the appropriate permissions (a minimal sketch follows this list).
- Export the artifacts from the source BDA or BDCS cluster.
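The following is a minimal sketch of copying the key and setting permissions; the host names, the opc user, and the target path are placeholders for your environment.

  # Copy the API signing key to each Big Data Service node and restrict access to it
  for node in <node1> <node2> <node3>; do
    scp oci_api_key.pem opc@${node}:/home/opc/oci_api_key.pem
    ssh opc@${node} 'chmod 600 /home/opc/oci_api_key.pem'
  done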
To import data to HDFS, follow these steps:
Import the metadata files and execute the permissions
- Import metadata files from Object Store to /metadata in HDFS.
  hadoop distcp -libjars ${LIBJARS} \
    -Dfs.client.socket-timeout=3000000 \
    -Dfs.oci.client.auth.fingerprint=<fingerprint> \
    -Dfs.oci.client.auth.pemfilepath=<oci_pem_key> \
    -Dfs.oci.client.auth.passphrase=<passphrase> \
    -Dfs.oci.client.auth.tenantId=<OCID for Tenancy> \
    -Dfs.oci.client.auth.userId=<OCID for User> \
    -Dfs.oci.client.hostname=<HostName. Example: https://objectstorage.us-phoenix-1.oraclecloud.com/> \
    -Dfs.oci.client.multipart.allowed=true \
    -Dfs.oci.client.proxy.uri=<http://proxy-host>:port \
    -Dmapreduce.map.java.opts="$DISTCP_PROXY_OPTS" \
    -Dmapreduce.reduce.java.opts="$DISTCP_PROXY_OPTS" \
    -Dmapreduce.task.timeout=6000000 \
    -skipcrccheck -m 40 -bandwidth 500 \
    -update -strategy dynamic -i oci://<bucket>@<tenancy>/metadata/ /metadata
- Move files to the local directory.
hdfs dfs -get /metadata/Metadata*
- Run the files in parallel in the background or in multiple terminals.
bash Metadataaa & bash Metadataab & bash Metadataac &...
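If you prefer to launch all the generated scripts at once and wait for them to finish, a minimal sketch is shown below; it assumes the Metadata* files are in the current directory.

  # Run every metadata script in the background and wait for all of them
  for f in Metadata*; do
    bash "$f" &
  done
  wait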
To import metadata, follow these steps:
Do the following:
Validating the Migration
- Verify that you see the same set of Hive tables in the target cluster as in the source cluster.
- Connect to the Hive shell.
hive
- Run the following command to list the tables:
show tables;
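To compare the two clusters more systematically, you can dump and diff the table lists. This is a minimal sketch; the database name is an example, and each dump is produced on its respective cluster.

  # On the source cluster
  hive -e 'USE default; SHOW TABLES;' | sort > tables_source.txt
  # On the target cluster
  hive -e 'USE default; SHOW TABLES;' | sort > tables_target.txt
  # Compare the two lists (no output means they match)
  diff tables_source.txt tables_target.txt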