Compatibility
This project is just a wrapper, so it inherits the limitations and bugs of the projects it wraps. Sorry for that.
Limitations of Pandas-like frameworks:

| Framework | Limitations |
|---|---|
| pandas | All data must be in DRAM |
| modin | Read this |
| cudf | All data must be in VRAM<br>All data types in cuDF are nullable<br>Iterating over a cuDF Series, DataFrame or Index is not supported<br>Join (or merge) and groupby operations do not guarantee output ordering<br>The order of operations is not always deterministic<br>Duplicate column names are not supported<br>`.apply()` is supported, but it relies on Numba to JIT-compile the UDF and execute it on the GPU<br>`.apply(result_type=...)` is not supported |
| dask | `transpose()` and MultiIndex are not implemented<br>Column assignment doesn't support type `list` |
| dask_cudf | See cudf and dask<br>Categories with strings are not implemented |
| pyspark | Read this |
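Because cuDF joins and groupbys give no output-ordering guarantee, portable code should sort explicitly after those operations. A minimal sketch with plain pandas (the same pattern applies unchanged to the cuDF API):

```python
import pandas as pd

left = pd.DataFrame({"key": [2, 1, 3], "a": [20, 10, 30]})
right = pd.DataFrame({"key": [3, 2, 1], "b": ["c", "b", "a"]})

# Row order after a merge is backend-dependent (cuDF makes no guarantee),
# so sort on the key and reset the index to get a deterministic frame.
merged = (
    left.merge(right, on="key")
        .sort_values("key")
        .reset_index(drop=True)
)
```

Done this way, the result is identical on every backend, at the cost of an extra sort.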
Limitations of Numpy-like frameworks:

| Framework | Limitations |
|---|---|
| numpy | All data must be in RAM |
| cupy | Read this<br>`block()`, `delete()` and `insert()` are not implemented |
| dask.array | Read this<br>`identity()`, `asfarray()`, `asfortranarray()`, `ascontiguousarray()`, `asarray_chkfinite()`, `require()`, `column_stack()`, `row_stack()`, `*split*()`, `resize()`, `trim_zeros()`, `in1d()`, `intersect1d()`, `setdiff1d()`, `setxor1d()` and `fromiter()` are not implemented |
For compatibility between numpy and cupy, see here.
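One way to write array code that runs on either backend is to dispatch through `cupy.get_array_module()`, which returns the module (numpy or cupy) owning a given array. A hedged sketch, falling back to numpy when cupy is absent:

```python
import numpy as np

try:
    import cupy as cp  # optional GPU backend
except ImportError:
    cp = None

def get_array_module(x):
    """Return the module (numpy or cupy) that owns array x."""
    if cp is not None:
        return cp.get_array_module(x)
    return np

def normalize(x):
    # Every call goes through the module that created x, so the same
    # function works for numpy arrays on CPU and cupy arrays on GPU.
    xp = get_array_module(x)
    return (x - xp.mean(x)) / xp.std(x)
```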
File format compatibility

To be compatible with all frameworks, use only the common features. Some functions to read or write files are accepted, but a warning is emitted if you use a function that is not compatible with the other frameworks.
| read_... / to_... | pandas | cudf | modin | dask | dask_modin | dask_cudf | pyspark |
|---|---|---|---|---|---|---|---|
| vdf.read_csv | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| VDataFrame.to_csv | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| VSeries.to_csv | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| vdf.read_excel | ✓ | ✓ | ✓ | | | | |
| VDataFrame.to_excel | ✓ | ✓ | ✓ | | | | |
| VSeries.to_excel | ✓ | ✓ | ✓ | | | | |
| vdf.read_feather | ✓ | ✓ | ✓ | | | | |
| VDataFrame.to_feather | ✓ | ✓ | ✓ | | | | |
| vdf.read_fwf | ✓ | ✓ | ✓ | ✓ | | | |
| vdf.read_hdf | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| VDataFrame.to_hdf | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| VSeries.to_hdf | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| vdf.read_json | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| VDataFrame.to_json | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| VSeries.to_json | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| vdf.read_orc | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| VDataFrame.to_orc | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| vdf.read_parquet | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| VDataFrame.to_parquet | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| vdf.read_sql_table | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| VDataFrame.to_sql | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| VSeries.to_sql | ✓ | ✓ | ✓ | ✓ | ✓ | | |
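CSV, JSON, ORC and Parquet are the only formats the table marks as supported by every backend, so preferring them keeps code portable. A minimal round-trip sketch using plain pandas (with the wrapper, `vdf.read_csv` and `VDataFrame.to_csv` from the table play the same roles):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# CSV is supported by every backend in the table, making it a safe
# interchange format when the target framework is not known in advance.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    df.to_csv(path, index=False)
    back = pd.read_csv(path)
```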
| load... / save... | numpy | cupy | dask.array |
|---|---|---|---|
| vpd.load() (npy) | ✓ | ✓ | ✓ |
| vpd.save() (npy) | ✓ | ✓ | ✓ |
| vpd.savez() (npz) | ✓ | ✓ | |
| vpd.loadtxt() | ✓ | ✓ | |
| vpd.savetxt() | ✓ | ✓ | |
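The `.npy` format is the only one the table marks for all three array backends; it preserves dtype and shape exactly. A sketch of the round-trip with plain numpy (using an in-memory buffer instead of a file):

```python
import io
import numpy as np

arr = np.arange(6).reshape(2, 3)

# np.save writes the .npy header (dtype, shape) followed by the raw data,
# so the loaded array is identical to the original.
buf = io.BytesIO()
np.save(buf, arr)
buf.seek(0)
loaded = np.load(buf)
```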
Cross framework compatibility

| | small data | middle data | big data |
|---|---|---|---|
| 1-CPU | pandas, numpy<br>Limits: + | | |
| n-CPU | modin, numpy<br>Limits: + | dask, dask_modin or pyspark, and dask.array<br>Limits: ++ | |
| GPU | cudf, cupy<br>Limits: ++ | dask_cudf, pyspark+spark-rapids, and dask.array<br>Limits: +++ | |

To develop, you can choose the level of compatibility with the other frameworks. Each cell is strongly compatible with the cells above and to its left.
No need for a GPU?

If you don't need a GPU, develop for dask and use the modes in bold.

| | small data | middle data | big data |
|---|---|---|---|
| 1-CPU | **pandas, numpy**<br>Limits: + | | |
| n-CPU | **modin, numpy**<br>Limits: + | **dask, dask_modin or pyspark, and dask.array**<br>Limits: ++ | |
| GPU | cudf, cupy<br>Limits: ++ | dask_cudf, pyspark+spark-rapids, and dask.array<br>Limits: +++ | |
You can ignore this API:

- `VDataFrame.apply_rows()`
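`apply_rows()` is the cuDF-specific path that JIT-compiles a kernel for the GPU; without a GPU target, the ordinary `.apply()` covers the same need on every backend. A small pandas sketch of the portable alternative:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-wise computation with .apply(): slower than a GPU kernel,
# but available on every Pandas-like backend.
df["total"] = df.apply(lambda row: row["a"] + row["b"], axis=1)
```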
No need for big data?

If you don't need big data, develop for cudf and use the modes in bold.
| | small data | middle data | big data |
|---|---|---|---|
| 1-CPU | **pandas, numpy**<br>Limits: + | | |
| n-CPU | **modin, numpy**<br>Limits: + | dask, dask_modin or pyspark, and dask.array<br>Limits: ++ | |
| GPU | **cudf, cupy**<br>Limits: ++ | dask_cudf, pyspark+spark-rapids, and dask.array<br>Limits: +++ | |
You can ignore these APIs:

- `@delayed`
- `map_partitions()`
- `categorize()`
- `compute()`
- `npartitions=...`
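These APIs exist so the same source can also run under dask; presumably, in the non-dask modes the wrapper reduces them to pass-throughs. A pure-Python sketch of that assumption (no dask required; with real dask, `delayed` builds a lazy task and `compute()` triggers it):

```python
# Assumption: in non-dask modes, @delayed is an identity decorator
# and computations run eagerly.

def delayed(func):
    """Non-dask stand-in for @delayed: returns the function unchanged."""
    return func

@delayed
def add(a, b):
    return a + b

# Under dask this would build a lazy task to be resolved by compute();
# here the call simply executes immediately.
result = add(1, 2)
```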
Need all possibilities?

To be compatible with all modes, develop for dask_cudf and accept all the limitations.

| | small data | middle data | big data |
|---|---|---|---|
| 1-CPU | **pandas, numpy**<br>Limits: + | | |
| n-CPU | **modin, numpy**<br>Limits: + | **dask, dask_modin or pyspark, and dask.array**<br>Limits: ++ | |
| GPU | **cudf, cupy**<br>Limits: ++ | **dask_cudf, pyspark+spark-rapids, and dask.array**<br>Limits: +++ |