
Wheel deployment for PySpark

I am working on a set of patches for the Apache Spark project to ease the deployment of complex Python programs with external dependencies. Deploying a job should be as easy as it sounds, and Wheels make this really easy.

Deployment is never a fascinating task; as developers, we want our code to work in production exactly as it does on our machines. Python was never really good at deployment, but in recent years it has become easier and more standardized to package a project, describe its dependencies in a unified way, and have them installed properly with pip, isolated inside a virtualenv. It is, however, not obvious at first sight for non-Pythonistas, and there are several tasks to complete to make everything automatic for a Python package developer, and so for a PySpark developer as well.

In this blog post I describe some thoughts on how PySpark should allow users to deploy full Python applications, and not just simple Python scripts, by handling Wheels and isolated virtual environments.

The main idea behind this proposal is to let developers control the Python environment deployed on the executors, instead of being locked into whatever is actually installed in the Spark executors' Python environment. If you agree with this approach, please add a comment on the JIRA ticket to help speed up its integration into Spark 2.x.

Java/Scala Spark Job Deployment

Let’s first start with how a Java or Scala application is sent to the Spark cluster with spark-submit:
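For example (the master URL, class name, jar and Maven coordinates below are just placeholders):

    spark-submit \
        --master spark://spark-master:7077 \
        --class com.example.MyJob \
        --packages org.apache.commons:commons-csv:1.4 \
        my-spark-job.jar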

The application is compiled into a package format named Jar, which holds the compiled Java/Scala classes. Dependencies can be automatically downloaded by Spark from a central Maven repository using the --packages argument. Also note that there is an option to change the entry point for the execution of the code inside the Jar, with the --class argument.

Basically, you package your JVM application, and just send it along with its dependencies to the Spark cluster.

Python Packaging and Deployment (out of Spark)

Before looking at how Spark actually deploys Python scripts, let's see how a simple Python script is deployed nowadays.

Let’s take a script that has some dependencies on other modules:

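For instance, something along these lines (an illustrative example; the imports are what matter):

    import json            # standard library module: always available
    import pandas as pd    # third-party module: must be installed where the script runs

    def summarize(path):
        with open(path) as f:
            records = json.load(f)
        return pd.DataFrame(records).describe()

    if __name__ == "__main__":
        print(summarize("data.json"))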

It is pretty usual for a Python script to import some packages. As long as it only uses modules from the Python Standard Library, there is no problem: the script can expect them to be present on the production machine, provided the Python version is the same (if it is not, there is no guarantee that a given module will be available; for instance, argparse is in Python 2.7 but not in 2.6).

Python has long struggled with this problem, and only recently (well, a few years ago) was a clean and efficient solution standardized, with PEP 453 (installation using the pip tool, 2013) and PEP 427 (the Wheel package format, 2012). Writing a package used to be a nightmare for newcomers, and even today, bootstrapping a state-of-the-art module is not trivial for a new Python developer.

For people who have never bootstrapped a Python project, I highly recommend giving CookieCutter PyPackage a try. It unfolds an empty, fairly standard Python project in a single command.
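For example (the template usually lives at the audreyr/cookiecutter-pypackage repository):

    pip install cookiecutter
    cookiecutter https://github.com/audreyr/cookiecutter-pypackage.git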

For Pythonistas used to the old python setup.py develop, you need to forget this command line. The proper (and safer) way to set up a project is:
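Something along these lines (the exact flags are an example; the point is to go through pip instead of calling setup.py directly):

    pip install -r requirements.txt -e .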

It's a longer command line, I know, but this is what the official documentation recommends. The -e argument stands for an “editable” install, to be used during development. For installation in production, you will not use -e.

Dependencies should be described in a requirements.txt file. I will come back to this subject in a future post. I usually split it into 2 files:

  1. Production dependencies, i.e., the packages needed for my application to run in production. I put all these dependencies in requirements.txt.
  2. Development dependencies, i.e., the packages needed for developing, formatting, testing and publishing the application, which are not needed on the production server, for example: pylint, coverage, isort, nose. For these I use a requirements-dev.txt.

Also, make sure that setup.py references the content of requirements.txt. You can use PBR to automate that, or copy-paste this snippet:
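A minimal version of such a snippet (package name and version are placeholders):

    # setup.py: reuse requirements.txt so pip installs the same dependencies
    from setuptools import setup, find_packages

    with open("requirements.txt") as f:
        requirements = [line for line in f.read().splitlines()
                        if line and not line.startswith("#")]

    setup(
        name="mypackage",         # placeholder name
        version="0.1.0",          # placeholder version
        packages=find_packages(),
        install_requires=requirements,
    )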

I'll also describe how to use PBR in a future post.

However, playing with dependencies at the system level is a mess: you might need administrator privileges to set things up, and you might end up breaking your own system.

Just use virtualenv to isolate your environment.

Virtualenv

I won't go into too much detail about virtualenv; the documentation is pretty good.

Just remember it is always safer to be inside a virtualenv. You can set up all your dependencies without interfering with other programs on your system.

To set up a virtualenv, just use:
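    virtualenv env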

This sets up the virtual environment inside the env directory.

Then enter the virtualenv (on Linux or macOS) with:
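    source env/bin/activate    # activates the virtualenv created in ./env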

Package deployment

Pip and virtualenv happily work together. Once inside a virtualenv, simply run the pip install command line shown above and voilà, all your dependencies are properly installed. Pip automatically downloads packages from a central repository, pypi.python.org.

The only piece I have not mentioned yet is the package format pip uses: the wheel.

Its full description, and how to generate one, will be part of another blog post. In a nutshell, a Wheel is a built (compiled) distribution format for Python modules. Most of the time, Python code can be distributed as-is, since Python is portable by design, provided you use the same Python distribution. But if you embed C code for low-level optimizations, things get messier: you need to compile the object files for your machine, or be sure the precompiled package you retrieve will work with your version of the Python interpreter, your libc, your OS, and so on.

Wheel and pip handle this for you.

Simple packages (i.e., packages that do not embed low-level, compiled code) can be distributed indifferently as a source distribution, a binary distribution or a wheel. But for packages that depend on the system, you need to use a wheel.

Here are some definitions extracted from the Python Glossary:

Source Distribution (or “sdist”)

A distribution format (usually generated using python setup.py sdist) that provides metadata and the essential source files needed for installing by a tool like pip, or for generating a Built Distribution.

Binary Distribution

A specific kind of Built Distribution that contains compiled extensions.

Wheel

A Built Distribution format introduced by PEP 427, which is intended to replace the Egg format. Wheel is currently supported by pip.

Forget about egg files: it is an inferior format compared to the wheel format:

  • Wheel has an official PEP. Egg did not.
  • Wheel is a distribution format, i.e a packaging format. Egg was both a distribution format and a runtime installation format (if left zipped), and was designed to be importable.
  • Wheel archives do not include .pyc files. Therefore, when the distribution only contains python files (i.e. no compiled extensions), and is compatible with Python 2 and 3, it’s possible for a wheel to be “universal”, similar to an sdist.
  • Wheel uses PEP376-compliant .dist-info directories. Egg used .egg-info.
  • Wheel has a richer file naming convention. A single wheel archive can indicate its compatibility with a number of Python language versions and implementations, ABIs, and system architectures.
  • Wheel is versioned. Every wheel file contains the version of the wheel specification and the implementation that packaged it.
  • Wheel is internally organized by sysconfig path type, therefore making it easier to convert to other formats.

Most packages distributed on PyPI come either as a source distribution package, for modules without any system or low-level dependency, or as a series of precompiled wheels, for modules that are closely tied to a specific architecture or operating system. For example, the highly famous numpy package comes with a huge number of precompiled wheels.


When you pip install numpy, it automatically downloads the right version matching your system. That's how magic pip is!

Just remember that:

  • pip rarely downloads source packages from pypi.python.org when wheels are provided; most of the time it finds the right precompiled wheel and installs a library such as numpy at light speed.
  • in the worst case, if pip cannot find a wheel matching your system, it downloads the source distribution package and starts compiling. It generates a new wheel that is automatically saved in a cache on your machine (in ~/.cache/pip/wheels), so the next time you pip install this package in another environment, the wheel is not recompiled.

Let’s go back to Spark to see how scripts are deployed and how it can be changed to deploy Python applications.

Python Script Spark Deployment

The user can send a Python script directly, as it is, without compilation, which is good for most trivial cases.

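For example (script name and master URL are placeholders):

    spark-submit \
        --master spark://spark-master:7077 \
        my_script.py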

Spark will distribute the script to the executors and run the parts of it they need to execute. If the --deploy-mode cluster argument is used, the driver is executed in a fresh environment as well.

But what if we depend on, say, the pandas library? This is a pretty famous one, which has multiple dependencies of its own (numpy, six, ...). There is almost no easy way to deploy this package except asking IT support to install it by default on each node of the Spark cluster.

And what if two PySpark jobs need two different versions of pandas but have to be executed on the same Spark cluster? You can ask the IT department to set up as many environments as needed, in different folders. They may refuse, or tell you it will be set up in a few months…

To get rid of this dependency on the IT department, PySpark should support virtualenv, wheels and wheelhouses.

Python Application Deployment

Let's say we have a good, complex Spark job, with multiple dependencies and refactored code (because refactoring successful code is good). It is no longer a simple, disposable script; it has become an application (or a module, if you prefer).

Here is the workflow I propose. Of course, this will not be available until this pull request and this one are merged into the Spark 2 branch.

1. Packaging the PySpark job like a boss

Your PySpark job has to be packaged in a source distribution package (or better, in a Wheel) with:
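For the source distribution, the classic command is:

    python setup.py sdist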

A tar.gz file is created inside a dist folder. You can now deploy this package using pip.


To create a wheel, use:
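    python setup.py bdist_wheel    # produces a .whl file in the dist folder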

Most of the time you hardly embed any C code in a Spark job; it is high-level code, so a source distribution package will do the job just fine.

When you install the wheel or the source distribution package, requirements.txt will be used to install the dependencies as well.

2. Sending to PySpark

We actually have to deploy two elements to the Spark cluster: a runner script and a job package.

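A sketch of what this submission could look like (file names are placeholders; the virtualenv-related settings are kept in spark-pypi.conf, described below):

    spark-submit \
        --deploy-mode client \
        --properties-file spark-pypi.conf \
        --files mypackage-0.1.0.tar.gz \
        runner-script.py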

We can use a configuration file, spark-pypi.conf, to set some settings:
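For instance, something like this (index_url and install_package are the options referenced later in this post; the enabled switch and the mirror URL are my own shorthand for what the patch could look like):

    # spark-pypi.conf
    spark.pyspark.virtualenv.enabled          true
    spark.pyspark.virtualenv.index_url        https://pypi.mycompany.example/simple
    spark.pyspark.virtualenv.install_package  mypackage-0.1.0.tar.gz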

To use the official Pypi mirror, set:
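    spark.pyspark.virtualenv.index_url    https://pypi.python.org/simple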

But this requires that each node of your Spark cluster has access to the Internet.

Here, I have enabled the use of virtualenv inside the slaves (I based my work on this pull request) and set the URL of a PyPI mirror (because of company firewall rules, for instance). There are a couple of options that allow configuring the way pip executes, whether it upgrades itself, whether virtualenv can use the system packages or not, and so on.

Please also note that in this example the job is executed in client deployment mode. This means the driver code is executed on the same machine, in the same environment, from which the call to spark-submit has been made. This implies you need to be inside a virtualenv with all the dependencies already installed before calling spark-submit, or you will probably end up with some “Unable to find” exceptions.

What is the runner-script.py for?

PySpark does not support a --class argument to define the entry point of a Python application; it expects to find a script on the command line. So I have to provide an empty shell that I call a Python runner, with, for example, the following content:
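Something along these lines (the module and function names are just an illustration; use your package's real entry point):

    # runner-script.py: thin shell that delegates to the package installed on the executors
    from mypackage.main import main

    if __name__ == "__main__":
        main()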

3. How PySpark will handle the job (client mode)

The driver is executed inside the virtualenv of the client. There is no virtualenv creation or installation done for the driver in this case, since the client is in control of the driver execution. The workers, of course, will each create an empty virtualenv, one per executor, and execute their tasks within it.


Inside each virtualenv, PySpark will copy the tar.gz (defined in --files and spark.pyspark.virtualenv.install_package) and execute a command similar to:
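    # run inside the freshly created virtualenv of each executor;
    # the archive name is whatever was shipped with --files
    pip install mypackage-0.1.0.tar.gz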

This automatically runs pip on the source distribution package, and thus downloads the dependencies defined in requirements.txt (provided you have written your setup.py properly).

Then, it will execute the script runner-script.py. The only role of this script is to start the execution from the entry point of the mypackage module.

4. How PySpark will handle the job (cluster mode)

This part will be described later.

Other use cases

Deploy a script with many dependencies to your standalone cluster
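A sketch of the command line (the enabled and requirements properties are my guess at how the final options will be spelled):

    # virtualenv-related property names below are illustrative
    spark-submit \
        --master spark://spark-master:7077 \
        --deploy-mode client \
        --conf spark.pyspark.virtualenv.enabled=true \
        --conf spark.pyspark.virtualenv.requirements=requirements.txt \
        --files requirements.txt \
        my_script.py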

Execution:

  • a virtualenv is created on each worker
  • the dependencies described in requirements.txt are installed in each worker.
  • dependencies are downloaded from the Pypi repository
  • the driver is executed on the client, so this command line should be executed from within a virtualenv.

Deploy a simple runner script along with a source distribution package of the complete job project
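Again a sketch, reusing the same hypothetical properties:

    # ship the runner script plus the sdist of the whole project
    spark-submit \
        --master spark://spark-master:7077 \
        --deploy-mode client \
        --conf spark.pyspark.virtualenv.enabled=true \
        --conf spark.pyspark.virtualenv.install_package=mypackage_sdist.tar.gz \
        --files mypackage_sdist.tar.gz \
        runner-script.py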

Execution:

  • a virtualenv is created on each worker
  • the package mypackage_sdist.tar.gz is installed with pip, so if setup.py references requirements.txt properly, all the dependencies are installed on each worker.
  • dependencies are downloaded from the Pypi repository
  • the driver is executed on the client, so this command line should be executed from within a virtualenv.
  • the runner script simply calls an entry point within mypackage.

Deploy a wheelhouse package to your YARN cluster with:
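A sketch (the wheelhouse and enabled property names are hypothetical; the index_url option is the one mentioned below):

    spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --conf spark.pyspark.virtualenv.enabled=true \
        --conf spark.pyspark.virtualenv.wheelhouse=wheelhouse.zip \
        --conf spark.pyspark.virtualenv.index_url=https://pypi.python.org/simple \
        --files wheelhouse.zip,requirements.txt \
        runner-script.py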

Execution:

  • a virtualenv is created on each worker
  • the dependencies described in requirements.txt  are installed in each worker
  • dependencies are found in the wheelhouse archive; if not found there, they are downloaded from the PyPI repository (to avoid this, remove the spark.pyspark.virtualenv.index_url option)
  • the driver is executed on the cluster, so this command line does not have to be executed from within a virtualenv.

To deploy against an internal PyPI mirror (an HTTPS mirror without certificates), force a pip upgrade (it is a good practice to always be at the latest version of pip), and inject some wheels manually into the PYTHONPATH:
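A sketch of what this could look like (the mirror URL, wheel file names, and the trusted_host/upgrade_pip properties are illustrative):

    # property names other than index_url and install_package are illustrative
    spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --conf spark.pyspark.virtualenv.enabled=true \
        --conf spark.pyspark.virtualenv.index_url=https://pypi.internal.example/simple \
        --conf spark.pyspark.virtualenv.trusted_host=pypi.internal.example \
        --conf spark.pyspark.virtualenv.upgrade_pip=true \
        --files wheelhouse.zip,requirements.txt \
        --py-files extra_lib_one.whl,extra_lib_two.whl \
        runner-script.py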

Execution:

  • a virtualenv is created on each worker
  • the pip tool is updated to the latest version
  • the dependencies described in requirements.txt are installed on each worker
  • dependencies are found in the wheelhouse archive; if not found there, they are downloaded from the PyPI mirror
  • the two wheels passed via --py-files are added to the PYTHONPATH. You can use this to avoid describing them in requirements.txt and send them directly. This might be useful for development; for production, however, you would rather have these dependency projects available on an internal repository and referenced by URL.
  • the driver is executed on the cluster, so this command line does not have to be executed from within a virtualenv.

Wheelhouse?

Wheelhouse support is an extension of this virtualenv and PyPI support. What if your company does not allow you to download packages from the official PyPI repository (and its mirrors), and does not want to set up a PyPI mirror? For this case, my proposal provides support for a wheelhouse: a way to package ALL the wheels, of ALL the needed formats, inside a single archive, so no Internet connectivity is needed when deploying the job inside a Spark cluster.
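In a nutshell, building a wheelhouse boils down to something like this (the details will come in the later post):

    pip wheel -r requirements.txt --wheel-dir wheelhouse/   # build a wheel for every dependency
    zip -r wheelhouse.zip wheelhouse/                       # single archive to ship to the cluster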

Description of how to package your job Wheelhouse will be published in a future post.

Conclusion

This blog post only describes a proposal for virtualenv and Wheel support in PySpark. The final result may be different once integrated. I really hope this will be merged into trunk, since it greatly simplifies the deployment of concurrent PySpark jobs on a Spark cluster; in particular, you no longer have to rely on your IT department to install a Python library or a new environment on all executors.

Let me know if you have any comments, or add a comment on the JIRA ticket.
