Providing a Dependency Archive
Your Java or Scala applications might need extra JAR files that you can't, or don't want to, bundle in a Fat JAR. Or you might want to include native code or other assets to make available within the Spark runtime.
When the spark-submit options don't work, Data Flow has the option of providing a ZIP archive (archive.zip) along with your application for bundling third-party dependencies. The ZIP archive can be created using a Docker-based tool. The archive.zip is installed on all Spark nodes before the application runs. If you construct the archive.zip correctly, the Python libraries are added to the runtime, and the JAR files are added to the Spark classpath. The libraries added are isolated to one Run, so they don't interfere with other concurrent Runs or later Runs. Only one archive can be provided per Run.
Anything in the archive must be compatible with the Data Flow runtime. For example, Data Flow runs on Oracle Linux using particular versions of Java and Python. Binary code compiled for other operating systems, or JAR files compiled for other Java versions, might cause the Run to fail. Data Flow provides tools to help you build archives with compatible software. However, these archives are ordinary ZIP files, so you're free to create them any way you want. If you use your own tools, you're responsible for ensuring compatibility.
Dependency archives, like your Spark applications, are loaded to Data Flow. Your Data Flow Application definition contains a link to this archive, which can be overridden at runtime. When you run your Application, the archive is downloaded and installed before the Spark job runs. The archive is private to the Run. This means, for example, that you can concurrently run two different instances of the same Application, with different dependencies, but without any conflicts. Dependencies don't persist between Runs, so there aren't any problems with conflicting versions for other Spark applications that you might run.
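For example, the link is typically supplied when you define the Application, and a Run can point at a different archive. The following is a minimal sketch using the OCI CLI; the placeholder values, and the parameter names (in particular --archive-uri), are assumptions to verify against oci data-flow application create --help:
# Sketch: define an Application whose dependency archive lives in Object Storage.
# All OCIDs, bucket, and namespace values are placeholders.
oci data-flow application create \
  --compartment-id <compartment_ocid> \
  --display-name "app-with-dependencies" \
  --language PYTHON \
  --spark-version 3.2.1 \
  --driver-shape VM.Standard2.1 \
  --executor-shape VM.Standard2.1 \
  --num-executors 1 \
  --file-uri oci://my-bucket@my-namespace/my_app.py \
  --archive-uri oci://my-bucket@my-namespace/archive.zip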
Building a Dependency Archive Using the Data Flow Dependency Packager
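The Dependency Packager is distributed as a Docker image (the same image used for validation later in this topic). The following is a minimal sketch, assuming the packager reads requirements.txt and packages.txt from the mounted directory and writes archive.zip back to it when run without the --validate flag; check the image's help output for the exact usage:
# Run from the directory that contains your requirements.txt and packages.txt files.
# For Arm shapes, use the dependency-packager-linux_arm64_v8 image instead.
docker run --platform linux/amd64 --rm -v $(pwd):/opt/dataflow --pull always -it \
  phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_x86_64:latest -p 3.11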
The Structure of the Dependency Archive
Dependency archives are ordinary ZIP files. Advanced users might choose to build archives with their own tools rather than using the Data Flow Dependency Packager. A correctly constructed dependency archive has this general outline:
python
python/lib
python/lib/python3.6/<your_library1>
python/lib/python3.6/<your_library2>
python/lib/python3.6/<...>
python/lib/python3.6/<your_libraryN>
python/lib/user
python/lib/user/<your_static_data>
java
java/<your_jar_file1>
java/<...>
java/<your_jar_fileN>
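If you build the archive with your own tools, one way to produce this layout is with standard command-line utilities. The following is a minimal sketch with placeholder library, data, and JAR names; the archive is zipped from the top level so the python/ and java/ prefixes are preserved:
# Sketch: recreate the expected layout with placeholder names, then zip it so
# the python/ and java/ directories sit at the root of archive.zip.
mkdir -p python/lib/python3.6 python/lib/user java
cp -r /path/to/site-packages/your_library1 python/lib/python3.6/
cp /path/to/your_static_data python/lib/user/
cp /path/to/your_jar_file1.jar java/
zip -r archive.zip python java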
Data Flow extracts archive files under the /opt/dataflow directory.
Validate an Archive.zip File Using the Data Flow Dependency Packager
You can use the Data Flow Dependency Packager to validate an archive.zip file locally, before uploading the file to Object Storage. Navigate to the directory containing the archive.zip file, and run one of the following commands, depending on the shape:
For Arm shapes:
docker run --platform linux/arm64 --rm -v $(pwd):/opt/dataflow --pull always -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_arm64_v8:latest -p 3.11 --validate archive.zip
For x86_64 shapes:
docker run --platform linux/amd64 --rm -v $(pwd):/opt/dataflow --pull always -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_x86_64:latest -p 3.11 --validate archive.zip
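Once the archive validates, upload it to Object Storage so that an Application or Run can reference it. A minimal sketch with the OCI CLI, where the bucket name is a placeholder:
# Sketch: upload the validated archive to Object Storage (placeholder bucket name).
oci os object put --bucket-name my-dependency-bucket --file archive.zip --name archive.zip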
Example Requirements.txt and Packages.txt Files
This example requirements.txt file includes the Data Flow SDK for Python version 2.14.3 in a Data Flow Application:
-i https://pypi.org/simple
certifi==2020.4.5.1
cffi==1.14.0
configparser==4.0.2
cryptography==2.8
oci==2.14.3
pycparser==2.20
pyopenssl==19.1.0
python-dateutil==2.8.1
pytz==2020.1
six==1.15.0
This example requirements.txt file includes a mix of PyPI sources, web sources, and local sources for Python wheel files:
-i https://pypi.org/simple
blis==0.4.1
catalogue==1.0.0
certifi==2020.4.5.1
chardet==3.0.4
cymem==2.0.3
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en-core-web-sm
idna==2.9
importlib-metadata==1.6.0 ; python_version < '3.8'
murmurhash==1.0.2
numpy==1.18.3
plac==1.1.3
preshed==3.0.2
requests==2.23.0
spacy==2.2.4
srsly==1.0.2
thinc==7.4.0
tqdm==4.45.0
urllib3==1.25.9
wasabi==0.6.0
zipp==3.1.0
/opt/dataflow/mywheel-0.1-py3-none-any.whl
This example packages.txt file includes a set of Oracle JDBC JAR files:
ojdbc8-18.3.jar
oraclepki-18.3.jar
osdt_cert-18.3.jar
osdt_core-18.3.jar
ucp-18.3.jar
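As an optional check alongside the --validate option shown earlier, you can list the packaged archive and confirm that JAR files appear under the java/ prefix and Python libraries under python/lib/:
# Optional check: confirm the java/ and python/ prefixes inside the archive.
unzip -l archive.zip | grep -E 'java/|python/'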