## Motivation
With a Pandas-like dataframe or a NumPy-like array, do you want to write your code once and choose the framework at the end? Do you want to be able to pick the best framework after simply running performance measurements? This framework unifies several Pandas-compatible or NumPy-compatible components, so that a single piece of code can be written that is compatible with all of them.
This project weaves your code with the selected framework at runtime.
## Synopsis
With a few parameters and virtual classes, it's possible to write code once and execute it:
- With or without multicore
- With or without a cluster (multiple nodes with Dask or Spark)
- With or without GPU

To do that, we create some virtual classes, add some methods to other classes, etc.
To avoid confusion, you must use the classes `VDataFrame` and `VSeries` (the prefix `V` is for *Virtual*). These classes expose the methods `.to_pandas()` and `.compute()` in every version, but they are the real classes of the selected framework.
Depending on the parameters, the real classes may be `pandas.DataFrame`, `modin.pandas.DataFrame`, `cudf.DataFrame`, `pyspark.pandas.DataFrame` without GPU, `pyspark.pandas.DataFrame` with GPU, `dask.dataframe.DataFrame` with Pandas, or `dask.dataframe.DataFrame` with cuDF (with Pandas or cuDF for each partition).

For NumPy, the real classes may be `numpy.ndarray`, `cupy.ndarray` or `dask.array`.
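A minimal sketch of what this looks like in practice (the constructor arguments mirror the project's sample further below; treating `npartitions` as accepted, and ignored, by the non-Dask backends is an assumption):

```python
from virtual_dataframe import VDataFrame, VSeries

# Both objects are instances of the real classes of the selected backend
# (pandas, cuDF, Dask, ...), not wrappers around them.
vdf = VDataFrame({"a": [1, 2, 3]}, npartitions=1)
vs = VSeries([1.0, 2.0, 3.0], npartitions=1)

# .compute() and .to_pandas() exist in every mode, so these lines run
# unchanged whether the backend is lazy (Dask) or eager (pandas, cuDF).
print(vdf.compute())          # materialize the result (typically a no-op outside Dask)
print(vs.to_pandas().sum())   # always ends up as a plain pandas object
```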
A new `@delayed` decorator can be used, with or without Dask.
To manage the initialisation of a Dask or Spark cluster, you must use `VClient()`, a connector to the cluster. This alias can be initialized automatically from some environment variables.
```python
# Sample of code, compatible with Pandas, cudf, dask, dask_modin and dask_cudf
from virtual_dataframe import *

TestDF = VDataFrame

with (VClient()):
    @delayed
    def my_function(data: TestDF) -> TestDF:
        return data

    rc = my_function(VDataFrame({"data": [1, 2]}, npartitions=2))
    print(rc.to_pandas())
```
With this framework, you can select the environment used to run or debug your code.
env | Environment |
---|---|
VDF_MODE=pandas | Only Python with classical pandas |
VDF_MODE=numpy | Alias of pandas |
VDF_MODE=cudf | Python with local cuDF (GPU) |
VDF_MODE=cupy | Alias of cudf |
VDF_MODE=dask | Dask with multiple local processes and pandas |
VDF_MODE=dask_array | Alias of dask |
VDF_MODE=dask_cudf | Dask with multiple local processes and cuDF |
VDF_MODE=dask DEBUG=True | Dask with a single thread and pandas |
VDF_MODE=dask_cudf DEBUG=True | Dask with a single thread and cuDF |
VDF_MODE=dask VDF_CLUSTER=dask://.local | Dask with a local cluster and pandas |
VDF_MODE=dask_cudf VDF_CLUSTER=dask://.local | Dask with a local CUDA cluster and cuDF |
VDF_MODE=dask VDF_CLUSTER=dask://...:ppp | Dask with a remote cluster and pandas |
VDF_MODE=dask_cudf VDF_CLUSTER=dask://...:ppp | Dask with a remote cluster and cuDF |
VDF_MODE=dask_modin | Dask with modin |
VDF_MODE=dask_modin VDF_CLUSTER=dask://.local | Dask with a local cluster and modin |
VDF_MODE=dask_modin VDF_CLUSTER=dask://...:ppp | Dask with a remote cluster and modin |
VDF_MODE=pyspark VDF_CLUSTER=spark://.local | PySpark with a local cluster |
VDF_MODE=pyspark VDF_CLUSTER=spark://...:ppp | PySpark with a remote cluster |
For pyspark with GPU, read this.
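As an illustration of switching backends without touching the code, here is a hedged sketch in Python; it assumes `VDF_MODE` is read when `virtual_dataframe` is imported, so the variable is set beforehand (in practice it would usually come from the shell, a Dockerfile or the conda environment):

```python
import os

# Select the backend before importing virtual_dataframe.
# Any value from the table above should work: "pandas", "cudf", "dask", ...
os.environ.setdefault("VDF_MODE", "pandas")

from virtual_dataframe import VDataFrame, VClient

with VClient():
    df = VDataFrame({"data": [1, 2]}, npartitions=2)
    print(df.to_pandas())
```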
The real compatibility between the different Pandas implementations depends on the modin, cudf, pyspark or dask implementations. Sometimes, you can use the `VDF_MODE` variable to adapt some parts of the code to the selected backend.
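For example, a hypothetical adaptation that reads the variable straight from the environment (the package may also expose the selected mode directly; reading `os.environ` is simply the conservative option here):

```python
import os

# The backend selected at deployment time (same values as in the table above).
VDF_MODE = os.environ.get("VDF_MODE", "pandas")

def load(path: str):
    # Hypothetical per-backend branch: Dask's read_csv accepts a blocksize
    # argument that the eager backends do not, so only the loading step is
    # adjusted here while the rest of the ETL stays identical for every mode.
    if VDF_MODE in ("dask", "dask_cudf"):
        import dask.dataframe as dd
        return dd.read_csv(path, blocksize="64MB")
    if VDF_MODE == "cudf":
        import cudf
        return cudf.read_csv(path)
    import pandas as pd
    return pd.read_csv(path)
```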
It's not always easy to write code compatible with all scenarios, but it's possible.
Generally, just adding `.compute()` and/or `.to_pandas()` at the end of the ETL is enough. But you must use only the features common to all frameworks.
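For instance, a minimal sketch of such an ETL, written in the style of the sample above and restricted to operations (`groupby`, `sum`) shared by all the listed backends:

```python
from virtual_dataframe import VDataFrame, VClient, delayed

with VClient():
    @delayed
    def aggregate(data: VDataFrame) -> VDataFrame:
        # Only features common to pandas, cuDF, modin, pyspark.pandas and Dask.
        return data.groupby("key").sum()

    df = VDataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]}, npartitions=2)
    rc = aggregate(df)

    # As in the sample above, the trailing .to_pandas() is the only step that
    # materializes the result, whatever the selected backend.
    print(rc.to_pandas())
```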
After this effort, it's possible to compare the performance of the different technologies, or to deliver a component compatible with different scenarios.
When deploying your project, you can select the best framework for your process (in a Dockerfile or a virtual environment) with only one or two environment variables. Maybe you can use Modin most of the time, except during end-of-year peaks, when using a GPU is preferable.
With a conda environment, you can set these variables automatically when the environment is activated.
## Alternative
- The project Fugue proposes another approach to distributing a job with Spark, Dask or Ray.