## Compatibility

This project is just a wrapper, so it inherits the limitations and bugs of the frameworks it wraps. Sorry for that.

### Limitations of pandas-like frameworks

#### pandas

All data must fit in DRAM.

#### modin

See the modin documentation for the pandas APIs it does and does not support.

#### cudf

- All data must fit in VRAM.
- All data types in cuDF are nullable.
- Iterating over a cuDF `Series`, `DataFrame` or `Index` is not supported.
- Join (or merge) and groupby operations in cuDF do not guarantee output ordering; sort explicitly if you depend on row order (see the sketch below).
- The order of operations is not always deterministic.
- cuDF does not support duplicate column names.
- cuDF also supports `.apply()`, but it relies on Numba to JIT-compile the UDF and execute it on the GPU.
- `.apply(result_type=...)` is not supported.
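For example, when row order matters after a merge, sort explicitly. A minimal sketch using the wrapper's pandas-style API (the dict-based `VDataFrame` constructor is assumed here):

```python
import virtual_dataframe as vdf

left = vdf.VDataFrame({"key": [1, 2, 3], "l": ["a", "b", "c"]})
right = vdf.VDataFrame({"key": [3, 2, 1], "r": ["x", "y", "z"]})

# In cuDF mode, merge() gives no ordering guarantee, so force a
# deterministic order before any order-sensitive step.
merged = left.merge(right, on="key").sort_values("key")
```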

#### dask

- `transpose()` and `MultiIndex` are not implemented.
- Column assignment doesn't support values of type `list` (see the sketch below for the workaround).
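A minimal sketch of the workaround for list assignment (hypothetical column names; any aligned series-like expression works):

```python
import virtual_dataframe as vdf

df = vdf.VDataFrame({"a": [1, 2, 3]})

# df["b"] = [10, 20, 30]   # not supported in dask mode (list assignment)
df["b"] = df["a"] * 10     # a Series expression works everywhere
```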

#### dask_cudf

- See cudf and dask above.
- Categories with strings are not implemented.

#### pyspark

See the pyspark documentation (pandas API on Spark) for its limitations.

### Limitations of numpy-like frameworks

#### numpy

All data must fit in RAM.

#### cupy

See the CuPy documentation on the differences between CuPy and NumPy. In addition, the following are not implemented (a workaround is sketched below):

- `block()`
- `delete()`
- `insert()`
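Where `insert()` and `delete()` are missing, boolean masks and `concatenate()` cover most uses. A minimal sketch (`xp` stands for whichever array module is active, numpy or cupy):

```python
import numpy as xp  # or: import cupy as xp

a = xp.arange(10)

# Equivalent of np.delete(a, 3): keep everything except index 3.
mask = xp.ones(a.shape[0], dtype=bool)
mask[3] = False
a_without = a[mask]

# Equivalent of np.insert(a, 3, 99): concatenate the pieces.
a_with = xp.concatenate([a[:3], xp.asarray([99]), a[3:]])
```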

#### dask.array

See the Dask documentation for its NumPy API coverage. Among others, the following functions are not implemented:

- `identity()`
- `asfarray()`
- `asfortranarray()`
- `ascontiguousarray()`
- `asarray_chkfinite()`
- `require()`
- `column_stack()`
- `row_stack()`
- `*split*()`
- `resize()`
- `trim_zeros()`
- `in1d()`
- `intersect1d()`
- `setdiff1d()`
- `setxor1d()`
- `fromiter()`

For compatibility between numpy and cupy, see the CuPy documentation ("Differences between CuPy and NumPy").

### File format compatibility

To stay compatible with every framework, use only the common features. The wrapper accepts several functions to read or write files, but it emits a warning if you use a function that is not compatible with the other frameworks (see the example after the lists below).

The following read/write functions are wrapped. Their support differs between pandas, cudf, modin, dask, dask_modin, dask_cudf and pyspark:

- `vdf.read_csv()`, `VDataFrame.to_csv()`, `VSeries.to_csv()`
- `vdf.read_excel()`, `VDataFrame.to_excel()`, `VSeries.to_excel()`
- `vdf.read_feather()`, `VDataFrame.to_feather()`
- `vdf.read_fwf()`
- `vdf.read_hdf()`, `VDataFrame.to_hdf()`, `VSeries.to_hdf()`
- `vdf.read_json()`, `VDataFrame.to_json()`, `VSeries.to_json()`
- `vdf.read_orc()`, `VDataFrame.to_orc()`
- `vdf.read_parquet()`, `VDataFrame.to_parquet()`
- `vdf.read_sql_table()`, `VDataFrame.to_sql()`, `VSeries.to_sql()`

The array load/save functions, whose support differs between numpy, cupy and dask.array:

- `vpd.load()` (`.npy`)
- `vpd.save()` (`.npy`)
- `vpd.savez()` (`.npz`)
- `vpd.loadtxt()`
- `vpd.savetxt()`
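For example, a round trip with two of the wrapped functions (a minimal sketch; the file names are hypothetical, and the distributed modes may expect glob patterns or directories rather than a single file):

```python
import virtual_dataframe as vdf

# The wrapper emits a warning if the function you call is not
# supported by every framework.
df = vdf.read_csv("input*.csv")
df.to_parquet("output.parquet")
```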

### Cross framework compatibility

|       | small data | middle data | big data |
|-------|------------|-------------|----------|
| 1-CPU | pandas, numpy (Limits: +) | | |
| n-CPU | modin, numpy (Limits: +) | dask, dask_modin or pyspark, and dask.array (Limits: ++) | dask, dask_modin or pyspark, and dask.array (Limits: ++) |
| GPU   | cudf, cupy (Limits: ++) | dask_cudf, pyspark+spark-rapids, and dask.array (Limits: +++) | dask_cudf, pyspark+spark-rapids, and dask.array (Limits: +++) |

When you develop, choose the level of compatibility you need with the other frameworks. Each cell is strongly compatible with the cells above it and to its left.
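The same code can then be moved between cells without rewriting. A minimal sketch, assuming the backend is selected with the `VDF_MODE` environment variable before the wrapper is imported:

```python
import os

# Assumed selection mechanism; adapt to your configuration.
os.environ["VDF_MODE"] = "pandas"  # or "modin", "cudf", "dask", "dask_cudf", ...

import virtual_dataframe as vdf

df = vdf.VDataFrame({"a": [1, 2, 3]})
total = df["a"].sum()  # in the dask-based modes this stays lazy until computed
```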

#### No need for a GPU?

If you don't need a GPU, develop for dask and use the modes shown in bold.

|       | small data | middle data | big data |
|-------|------------|-------------|----------|
| 1-CPU | **pandas, numpy** (Limits: +) | | |
| n-CPU | **modin, numpy** (Limits: +) | **dask, dask_modin or pyspark, and dask.array** (Limits: ++) | **dask, dask_modin or pyspark, and dask.array** (Limits: ++) |
| GPU   | cudf, cupy (Limits: ++) | dask_cudf, pyspark+spark-rapids, and dask.array (Limits: +++) | dask_cudf, pyspark+spark-rapids, and dask.array (Limits: +++) |

You can ignore this API (see the note below):

- `VDataFrame.apply_rows()`
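`apply_rows()` is the cuDF-style entry point for row-wise UDFs compiled with Numba; on the CPU frameworks, a plain pandas-style `apply()` expresses the same computation. A minimal sketch (hypothetical column names):

```python
import virtual_dataframe as vdf

df = vdf.VDataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-wise UDF without apply_rows(): fine when no GPU is involved.
df["c"] = df.apply(lambda row: row["a"] + row["b"], axis=1)
```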

#### No need for big data?

If you don't need big data, develop for cudf and use the modes shown in bold.

|       | small data | middle data | big data |
|-------|------------|-------------|----------|
| 1-CPU | **pandas, numpy** (Limits: +) | | |
| n-CPU | **modin, numpy** (Limits: +) | dask, dask_modin or pyspark, and dask.array (Limits: ++) | dask, dask_modin or pyspark, and dask.array (Limits: ++) |
| GPU   | **cudf, cupy** (Limits: ++) | dask_cudf, pyspark+spark-rapids, and dask.array (Limits: +++) | dask_cudf, pyspark+spark-rapids, and dask.array (Limits: +++) |

You can ignore these APIs (illustrated below):

- `@delayed`
- `map_partitions()`
- `categorize()`
- `compute()`
- `npartitions=...`
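For reference, those APIs look roughly like this when you do target the big-data modes (a minimal sketch; it assumes the wrapper re-exports `delayed` and `compute`, and that the extra keywords are ignored by the small-data backends):

```python
import virtual_dataframe as vdf

@vdf.delayed                     # assumed re-export; no-op outside the dask modes
def add_one(data):
    return data + 1

df = vdf.VDataFrame({"a": [1, 2, 3]}, npartitions=2)
result, = vdf.compute(add_one(df))  # dask-style compute, assumed tuple result
```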

#### Need all the possibilities?

To be compatible with all modes, develop for dask_cudf...

|       | small data | middle data | big data |
|-------|------------|-------------|----------|
| 1-CPU | **pandas, numpy** (Limits: +) | | |
| n-CPU | **modin, numpy** (Limits: +) | **dask, dask_modin or pyspark, and dask.array** (Limits: ++) | **dask, dask_modin or pyspark, and dask.array** (Limits: ++) |
| GPU   | **cudf, cupy** (Limits: ++) | **dask_cudf, pyspark+spark-rapids, and dask.array** (Limits: +++) | **dask_cudf, pyspark+spark-rapids, and dask.array** (Limits: +++) |

...and accept all the limitations.