
Draft for a Scalable Python Data Processing Framework

Design Goals

This post describes an idea for a data processing framework built on Python, inspired by state-of-the-art actor systems such as Akka. It is a bit like a restricted version of a Lambda Architecture.

This could be used in ETL, data extraction or any custom warehouse process where data is pushed or pulled from one side, needs some obscure processing that may involve fetching more data from somewhere else, and is then stored in a database or storage area.

I don’t have a name for this framework yet. I like how “Akka” is a short palindrome, so maybe I’ll find a nice palindrome name in the near future.

I’ll start with a high-level overview of the Lambda Architecture and the Actor Model, where I found some inspiration, and then describe how I would like my system to deviate from these models.

Lambda Architecture

A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods.

What I like in this definition is the acknowledgement of the differences between batch processing of data and stream processing of data.

Here is the overall architecture diagram of a Lambda Architecture:

There seems to be no single product that solves this problem, so some stacks are starting to emerge as reference implementations. A good one is SMACK. You can find good presentations of this stack here and here.

The SMACK Stack

What interests me most in this stack is that Akka is needed at all. SMACK stands for “Spark, Mesos, Akka, Cassandra and Kafka”. Communication is done using Kafka, Cassandra stores the data, Spark does the computations, and Mesos deploys jobs and supports the cluster (the acronym leaves out the other necessary elements, such as Zookeeper for Kafka, or Marathon and Chronos for Mesos).

The fact is that Spark provides, out of the box, support for both Batch AND Streaming processing.

But what is the difference between Batch and Stream Processing?

Batch Processing

Analysis after data has been accumulated

Stream Processing

Analysis as the data arrives. The results of the analysis do not stay meaningful for long.

Here is the relationship between these two concepts on a time diagram:

So, streaming is meant to be as quick as possible. Most of the time it updates a view or a model in some storage, which may be visualized later or immediately in a metrics dashboard.

Batch processing is a job you start less often, and it may need access to historical data and therefore to a huge amount of data. Results may be saved in a database or in a report to be used later on.

Having both in the same application is the goal of the Lambda Architecture.
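To make the contrast concrete, here is a small sketch in plain Python (no Spark involved). The function names `batch_mean` and `streaming_mean` are made up for this illustration: the first waits for all the data, the second refreshes its result every time a value arrives.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: the data is already accumulated, analyse it in one go."""
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated result as each value arrives."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        yield total / count  # the "view" is refreshed on every arrival


readings = [10.0, 20.0, 30.0, 40.0]
print(batch_mean(readings))            # a single answer, once everything is in
print(list(streaming_mean(readings)))  # one answer per arriving value
```

The last value produced by the streaming version matches the batch result, but every intermediate value was available earlier, which is the whole point of streaming.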

Why the need for Akka?

So in a way, the picture is complete. There is no need for another layer. But promoters of the SMACK stack add Akka, which seems to give a higher level of abstraction to drive streaming or batch jobs.

In other words, with Akka you code “representations” of elements that might be local or remote, each representing a certain “data processing concept”.

Akka for Python?

There are a bunch of Python modules that expose the Actor model to the Python world:

But they look like “academic” implementations of the Actor model, with all the features. I don’t really want an Akka for Python. I want to see how to take inspiration from this stack, which recently became famous as a “reference implementation” of the Lambda Architecture for Big Data on the JVM and may well be tremendously successful in the future.

I want to write a framework that is practical, easy to write for and, more importantly, easy to debug and maintain. That’s the key for me: a huge feature base is pointless when maintaining anything written for the framework is a pain. I prefer a smaller feature list and something easy to debug and support.

And yes, I want this to be in Python, since the JVM already has everything it needs.

Actor Pattern

An actor is a unit of code organization that allows creating concurrent, scalable and fault-tolerant applications. In the rest of this page, I will use the Actor model as a reference for presenting the proposed design, with emphasis on the differences from this well-documented model:

  • Python 3.5 and higher
    • because we all love Python and JVM people already have Akka.
    • Python 3.5 to be future-proof. Let’s forget about the past.
  • Twisted/AsyncIO, because Twisted is not so complicated to handle and now has good Python 3 support
  • Distributed by Design, automatic scaling
  • Data and messages are immutable (like Spark RDDs)
  • Stateless, to make scalability trivial
  • Elastic and Dynamic, adaptive load-balance
  • Fault tolerant and self-healing

At the very bottom level, an actor is an extension of the object model and an alternative to threaded functions, callbacks or other concurrent event handling. In practice, you implement the Actor model on top of one of these concurrent architectures.

Actors usually react to messages sent from their surroundings, using almost no CPU when no message is sent. Messages are queued and handled one at a time.

When an actor receives a message, it can:

  • create new actors
  • send messages to actors whose address it has
  • choose how it will react to new messages (e.g. Become/Unbecome with a “behavior” stack)
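The mailbox behaviour described above can be sketched with nothing more than asyncio. This is only an illustration of the classic pattern, not the API of Akka or of any existing Python actor library: the `EchoActor` class, its `handled` list and the `None` stop sentinel are all invented for the example (and `asyncio.run` needs Python 3.7 or later).

```python
import asyncio


class EchoActor:
    """Hypothetical minimal actor: a mailbox and a loop handling one message at a time."""

    def __init__(self):
        self.mailbox = asyncio.Queue()
        self.handled = []  # kept only so the demo can show what was processed

    async def run(self):
        while True:
            # The actor sleeps here, using no CPU, until a message is queued.
            message = await self.mailbox.get()
            if message is None:  # sentinel chosen for this sketch to stop the actor
                break
            self.handled.append(message.upper())


async def main():
    actor = EchoActor()
    task = asyncio.ensure_future(actor.run())
    for message in ("hello", "world", None):
        await actor.mailbox.put(message)  # "sending" is just queuing
    await task
    return actor.handled


result = asyncio.run(main())
print(result)
```

Because the loop takes one message from the queue at a time, the handler never needs a lock, even when many senders push messages concurrently.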

The Actor model is well described in the documentation of the inspiring Python framework Pykka.

Limits of the Actor model

  • Actors do not compose: they tend to be responsible for their underlying children. In other words, I cannot easily reuse an actor in a different but highly similar context
  • Actors are stateful and can change behavior over time. This introduces a lot of complexity that we don’t really need all of the time
  • Type safety is hard to implement, especially when you begin morphing behavior.

Major Deviations from the Actor Model

Here are some deviations from the Actor Model that I would like to make:

  • Actors should be stateless
    • No changing behavior
    • It doesn’t gracefully degrade
    • It cannot be used to implement a Finite State Machine
    • Easier to debug and support
  • Messages are split into 2 categories:
    • an append-only JSON structure called “Context” (each piece of data is named, like in a dictionary)
    • a submessage called “SubContext”
  • An actor doesn’t “know” the rest of the graph; it simply handles the message and reacts to it.
  • An actor does not “supervise” other actors to react to errors. Errors are handled by the root node
  • Message structure and dependencies are automatically checked upfront by the Data Dependency Checker.

The reason behind these variations is that I really want to focus on ease of development and maintainability. In the end, even the most wonderful system can get so tricky to support that it will die alone in the dark.

I want this to look like a Lambda Architecture with easy change and maintenance features.

Feature Descriptions

Append-only immutability

Data crossing the graph is immutable in the sense that it cannot be overwritten. All data is held in a single structure called “Context”, with accessors to add named data to a dictionary.
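A first rough sketch of what such a structure could look like is below. Nothing here is final: the class layout and the method names `add`, `get` and `names` are assumptions made for the illustration, and the real Context may well end up different.

```python
class Context:
    """Hypothetical append-only container: names can be added, never overwritten."""

    def __init__(self, initial=None):
        self._data = dict(initial or {})

    def add(self, name, value):
        # Refuse to replace an existing entry: this is the "append-only" rule.
        if name in self._data:
            raise KeyError("'{}' is already set and cannot be overwritten".format(name))
        self._data[name] = value

    def get(self, name):
        return self._data[name]

    def names(self):
        return set(self._data)


ctx = Context({"d1": 42})
ctx.add("d2", ctx.get("d1") * 2)  # adding a new name is fine

try:
    ctx.add("d1", 0)              # overwriting an existing name is rejected
    overwritten = True
except KeyError:
    overwritten = False

print(sorted(ctx.names()), overwritten)
```

Since nothing is ever replaced, the Context seen by a node always contains everything the previous nodes produced, which is what makes restarting from an arbitrary node possible.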

Process Definition

Actors are created by a factory and organized as a Directed Acyclic Graph (DAG) with a single root node called “supervisor”:

[Image: example of a directed acyclic graph]

We can define an addressing scheme to reach each node: a/b/d, for instance.

This makes it easy to change the configuration of a given node:

The supervisor is responsible for handling errors. Since the context is immutable (append-only), we can restart the graph from anywhere.
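As a rough illustration only (the `Node` class, the `resolve` helper and the `batch_size` setting are all hypothetical, and I assume here that an address starts with the name of the root), resolving an address and changing a setting could look like this:

```python
class Node:
    """Hypothetical graph node: a name, a config dict and named children."""

    def __init__(self, name, **config):
        self.name = name
        self.config = dict(config)
        self.children = {}

    def connect(self, child):
        self.children[child.name] = child
        return child


def resolve(root, address):
    """Walk an address such as 'a/b/d' from the root and return the node it names."""
    parts = address.strip("/").split("/")
    if parts[0] != root.name:
        raise KeyError("address must start at root '{}'".format(root.name))
    node = root
    for part in parts[1:]:
        node = node.children[part]  # KeyError if the path does not exist
    return node


a = Node("a")
b = a.connect(Node("b"))
c = a.connect(Node("c"))
d = Node("d", batch_size=10)
b.connect(d)
c.connect(d)  # a DAG, not a tree: d is reachable through both b and c

resolve(a, "a/b/d").config["batch_size"] = 100
print(resolve(a, "a/c/d").config)
```

Note that in a DAG the same node can have several addresses (a/b/d and a/c/d here), so a change made through one address is visible through the other.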

Error handling

Error handling is greatly inspired by the approach proposed by Akka, where a supervisor takes care of the failure of one of its nodes, but it happens only at the supervisor level, because the supervisor is the only one that knows the complete graph.

This greatly simplifies the design of the actors, but does not allow the same amount of flexibility as Akka.

Actors react to messages asynchronously and without locks.
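Here is a deliberately simplified sketch of supervisor-level error handling. It assumes a linear chain of steps instead of a full DAG, uses a plain dict as the context, and the names `run_graph`, `load` and `flaky_double` are invented for the example. The point it tries to show: because earlier results are never overwritten, the supervisor can retry the failing step alone, without re-running what came before.

```python
def run_graph(steps, context, max_retries=2):
    """Hypothetical supervisor loop: run steps in order, retrying a failed step in place."""
    index = 0
    retries = 0
    while index < len(steps):
        try:
            produced = steps[index](context)  # a step only reads the context
        except Exception:
            retries += 1
            if retries > max_retries:
                raise                         # give up: the error reaches the caller
            continue                          # restart from the failed step only
        for key, value in produced.items():
            if key in context:                # append-only: never overwrite
                raise KeyError(key)
            context[key] = value
        index += 1
        retries = 0
    return context


calls = {"load": 0, "double": 0}              # only here to observe the demo


def load(context):
    calls["load"] += 1
    return {"d1": 21}


def flaky_double(context):
    calls["double"] += 1
    if calls["double"] == 1:
        raise ConnectionError("simulated transient failure")
    return {"d2": context["d1"] * 2}


final = run_graph([load, flaky_double], {})
print(final, calls)
```

The actors themselves contain no error handling at all, which is the simplification I am after; the price is that every recovery policy has to live in one place.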

Data Dependency Checks

Let’s imagine you have an actor A that reads a given message. It expects to find a piece of data d1, performs some operations and outputs a message d2.

It is connected to another actor B that needs to read d2. What if you put B in a graph where nothing creates a message d2?

The Data Dependency Checker verifies that the graph you create is logically consistent.
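One possible way to do this check, sketched under the assumption that each actor declares the names it reads and writes (the `reads`/`writes` declaration format and the `check_dependencies` function are hypothetical, not a settled design):

```python
def check_dependencies(actors, edges, initial=()):
    """Return (actor, missing_name) pairs; an empty list means the graph is consistent.

    actors: {name: {"reads": set_of_names, "writes": set_of_names}}
    edges:  iterable of (upstream, downstream) pairs forming a DAG
    """
    parents = {name: set() for name in actors}
    for upstream, downstream in edges:
        parents[downstream].add(upstream)

    def available(name, seen=None):
        # Everything written by any ancestor of `name`, plus the initial data.
        seen = set() if seen is None else seen
        names = set(initial)
        for parent in parents[name]:
            if parent not in seen:
                seen.add(parent)
                names |= actors[parent]["writes"] | available(parent, seen)
        return names

    problems = []
    for name in sorted(actors):
        for needed in sorted(actors[name]["reads"] - available(name)):
            problems.append((name, needed))
    return problems


actor_a = {"reads": {"d1"}, "writes": {"d2"}}
actor_b = {"reads": {"d2"}, "writes": {"d3"}}

# A feeds B: every name B needs is produced upstream.
good = check_dependencies({"A": actor_a, "B": actor_b}, [("A", "B")], initial={"d1"})

# B used alone: nobody creates d2, so the checker reports it.
bad = check_dependencies({"B": actor_b}, [], initial={"d1"})

print(good, bad)
```

The check runs before any data flows through the graph, so a missing d2 shows up when the graph is built rather than as a runtime failure deep inside a job.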