I am working on a set of patches for the Apache Spark project job to ease the way to deploy complex Python program with external dependencies. One should be able to deploy job as easy as it should be, and Wheels make this job really easy.
Deployment is never a fascinating task to do, we as developer want our code to work in production exactly how it does on our machine. Python was never really good at deployment, but in recent years, it became easier and standardized to package a project, describes in a unified way its dependencies and have them installed properly with Pip, isolated inside virtualenv. It is however not obvious at first sight for non-Pythonista experts and there are several tasks to do to make everything automatic for Python package developer, and so, for a PySpark developer as well.
I describe in this blog post some thoughts on how PySpark should allow users to deploy Python applications and no more simple Python scripts, by handling Wheels and isolated virtual environments.
The main idea behind this proposal is to let developers handle the Python environment to deploy on executors instead of being jailed by what is actually installed on the Spark’s Executors Python envionment. If you agree with this approach, please add a comment in the JIRA ticket for speeding up the integration inside Spark 2.x soon.