Motivation

Do you want to write code against a Pandas-like dataframe or a NumPy-like array, and choose the framework to use only at the end? Do you want to pick the best framework after simply running performance measurements? This project unifies multiple Pandas-compatible or NumPy-compatible components, so you can write a single code base compatible with all of them.

This project weaves your code with the selected framework at runtime.

Synopsis

With a few parameters and virtual classes, it is possible to write code once and execute it:

  • With or without multicore
  • With or without cluster (multi nodes on Dask or Spark)
  • With or without GPU

To do that, the project creates some virtual classes, adds some methods to other classes, etc.

To reduce confusion, you must use the classes VDataFrame and VSeries (the prefix V stands for Virtual). These classes expose the methods .to_pandas() and .compute() in every mode, but they are the real classes of the selected framework.
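
For example, whatever the selected framework, a VDataFrame accepts .compute() and .to_pandas(). A minimal sketch, relying only on the behaviour described above (the column name and values are illustrative):

from virtual_dataframe import VDataFrame

# npartitions is meaningful for the Dask modes and ignored by the others
df = VDataFrame({"a": [1, 2, 3]}, npartitions=2)

result = df.compute()   # assumed: returns the data unchanged with eager backends,
                        # triggers the execution with lazy ones
print(df.to_pandas())   # always a classical pandas object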

Depending on some parameters, the real classes may be pandas.DataFrame, modin.pandas.DataFrame, cudf.DataFrame, pyspark.pandas.DataFrame without GPU, pyspark.pandas.DataFrame with GPU, dask.dataframe.DataFrame with pandas, or dask.dataframe.DataFrame with cudf (with pandas or cudf for each partition).

Or, for NumPy, the real classes may be numpy.ndarray, cupy.ndarray or dask.array.
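
This works because the array backends share a largely compatible API. A minimal sketch of the idea with plain NumPy (the same function body would also accept cupy or dask arrays):

import numpy as np

def normalize(a):
    # uses only methods common to numpy.ndarray, cupy.ndarray and dask arrays
    return (a - a.mean()) / a.std()

print(normalize(np.arange(10.0)))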

A new @delayed decorator can be used, with or without Dask.

To manage the initialisation of a Dask or Spark cluster, you must use VClient(), a connector to the cluster. This alias can be automatically initialized from some environment variables.

# Sample code, compatible with pandas, cudf, dask, dask_modin and dask_cudf
from virtual_dataframe import *

TestDF = VDataFrame

with VClient():  # connect to the cluster selected via environment variables
    @delayed  # works with or without Dask
    def my_function(data: TestDF) -> TestDF:
        return data

    rc = my_function(VDataFrame({"data": [1, 2]}, npartitions=2))
    print(rc.to_pandas())

With this framework, you can select the environment used to run or debug your code.

env                                             Environment
VDF_MODE=pandas                                 Only Python with classical pandas
VDF_MODE=numpy                                  Alias of pandas
VDF_MODE=cudf                                   Python with local cuDF (GPU)
VDF_MODE=cupy                                   Alias of cudf
VDF_MODE=dask                                   Dask with multiple local processes and pandas
VDF_MODE=dask_array                             Alias of dask
VDF_MODE=dask_cudf                              Dask with multiple local processes and cuDF
VDF_MODE=dask DEBUG=True                        Dask with a single thread and pandas
VDF_MODE=dask_cudf DEBUG=True                   Dask with a single thread and cuDF
VDF_MODE=dask VDF_CLUSTER=dask://.local         Dask with a local cluster and pandas
VDF_MODE=dask_cudf VDF_CLUSTER=dask://.local    Dask with a local CUDA cluster and cuDF
VDF_MODE=dask VDF_CLUSTER=dask://...:ppp        Dask with a remote cluster and pandas
VDF_MODE=dask_cudf VDF_CLUSTER=dask://...:ppp   Dask with a remote cluster and cuDF
VDF_MODE=dask_modin                             Dask with Modin
VDF_MODE=dask_modin VDF_CLUSTER=dask://.local   Dask with a local cluster and Modin
VDF_MODE=dask_modin VDF_CLUSTER=dask://...:ppp  Dask with a remote cluster and Modin
VDF_MODE=pyspark VDF_CLUSTER=spark://.local     PySpark with a local cluster
VDF_MODE=pyspark VDF_CLUSTER=spark://...:ppp    PySpark with a remote cluster
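
These variables are normally set in the shell, a Dockerfile or a conda environment before launching the process. As a sketch, assuming VDF_MODE is read when virtual_dataframe is first imported, a mode can also be forced from Python for a quick test:

import os

# Assumption: the mode must be selected before the first import of
# virtual_dataframe; "pandas" is just an illustrative choice.
os.environ["VDF_MODE"] = "pandas"

from virtual_dataframe import *

print(VDataFrame({"data": [1, 2]}, npartitions=2).to_pandas())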

For pyspark with GPU, read this.

The real compatibility between the different simulations of Pandas depends on the implementation of modin, cudf, pyspark or dask. Sometimes, you can use the VDF_MODE variable to adapt some parts of the code to the selected backend.
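
As a sketch of such a backend-specific adjustment, branching on the same VDF_MODE environment variable described in the table above (the cuDF branch is purely illustrative):

import os

# branch on the selected backend for the rare features that differ
if os.environ.get("VDF_MODE", "pandas") in ("cudf", "dask_cudf"):
    pass  # hypothetical cuDF-specific code path
else:
    pass  # code path for the other pandas-compatible backends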

It is not always easy to write code compatible with every scenario, but it is possible. Generally, just adding .compute() and/or .to_pandas() at the end of the ETL is enough. But you must use only the features common to all frameworks.
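
A minimal sketch of that pattern, using only the API shown in the sample above (the filter itself is illustrative):

from virtual_dataframe import VDataFrame, VClient, delayed

@delayed
def etl(data: VDataFrame) -> VDataFrame:
    # use only features common to all frameworks
    return data[data.data > 1]

with VClient():
    rc = etl(VDataFrame({"data": [1, 2, 3]}, npartitions=2))
    print(rc.to_pandas())  # materialize the result only at the very end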

After this effort, it is possible to compare the performance of the different technologies, or to propose a component compatible with different scenarios.
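
For instance, a hypothetical micro-benchmark: run the same script with different VDF_MODE values and compare the elapsed times (the transformation and the data size are illustrative):

import time
from virtual_dataframe import VDataFrame, VClient, delayed

@delayed
def transform(data: VDataFrame) -> VDataFrame:
    return data.assign(double=data.data * 2)

with VClient():
    start = time.perf_counter()
    transform(VDataFrame({"data": list(range(100_000))}, npartitions=2)).to_pandas()
    print(f"elapsed: {time.perf_counter() - start:.3f}s")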

For the deployment of your project, you can select the best framework for your process (in a Dockerfile or a virtual environment) with only one or two environment variables. Maybe you can use Modin all the time, except for end-of-year periods, when the use of a GPU is preferable.

With a conda environment, you can use conda env config vars set to apply these variables whenever you activate the environment.

Alternative

  • The project Fugue proposes another approach for distributing a job with Spark, Dask or Ray.