Technical point of view

virtual_dataframe framework patch others frameworks to unify the API.

VDF_MODE:

Pandas like
- pandas
- cudf
- modin
- dask
- dask_cudf
- pyspark
Numpy like
- numpy
- cupy
- dask_array

Pandas like frameworks

Pandas

Add vdf.BackEndDataFrame = pandas.DataFrame
Add vdf.BackEndSeries = pandas.Series
Add vdf.BackEndArray = numpy.ndarray
Add vdf.BackEndPandas = pandas
Add vdf.FrontEndPandas = pandas
Add vdf.FrontEndNumpy = numpy
Add vdf.compute() to return a tuple of args and be compatible with dask.compute()
Add vdf.concat() an alias of panda.concat()
Add vdf.delayed() to delay a calland be compatible with dask.delayed()
Add vdf.persist() to parameters and empty image and be compatible with dask.persist()
Add vdf.visualize() to return an empty image and be compatible with dask.visualize()
Add vdf.from_pandas() to return df and be compatible with `dask.from_pandas()
Add vdf.from_backend() an alias of from_pandas()
Add vdf.numpy an alias of numpy module
Remove extra parameters used by Dask in:
*.to_csv(), *.to_excel(), *.to_feather(), *.to_hdf(), *.to_json()
Update the pandas API to accept glob filename in:
vdf.read_csv(), vdf.read_excel(), vdf.read_feather(), vdf.read_fwf(), vdf.read_hdf, vdf.read_json(), vdf.read_orc(), vdf.read_parquet(), vdf.read_sql_table()
DF.to_csv(), DF.to_excel(), DF.to_feather(), DF.to_hdf(), DF.to_json(),
Series.to_csv(), Series.to_excel(), Series.to_hdf(), Series.to_json()
Add methods with _not_implemented
DF.to_fwf()
Add DF.to_pandas() to return self
Add DF.to_backend() to return self
Add DF.to_ndarray() an alias to to_numpy()
Add DF.apply_rows() to be compatible with cudf.apply_rows()
Add DF.map_partitions() to be compatible with dask.map_partitions()
Add DF.compute() to return self and be compatible with `dask.DataFrame.compute()
Add DF.repartition() to return self and be compatible with `dask.DataFrame.repartition()
Add DF.visualize() to return visualize(self) and be compatible with `dask.DataFrame.visualize()
Add DF.categorize() to return self and be compatible with `dask.DataFrame.categorize()
Add Series.to_pandas() to return self
Add Series.to_backend() to return self
Add Series.to_ndarray() alias of to_numpy
Add Series.compute() to return self and be compatible with `dask.Series.compute()
Add Series.map_partitions() to return self.map() and be compatible with `dask.Series.map_partitions()
Add Series.persist() to return self and be compatible with 'dask.Series.persist()
Add Series.repartition() to return self and be compatible with 'dask.Series.repartition()
Add Series.visualize() to return visualize(self) and be compatible with 'dask.Series.visualize()

cudf

Add vdf.BackEndDataFrame = cudf.DataFrame
Add vdf.BackEndSeries = cudf.Series
Add vdf.BackEndArray = cupy.ndarray
Add vdf.BackEndPandas = cudf
Add vdf.FrontEndPandas = cudf
Add vdf.FrontEndNumpy = cupy
Add vdf.compute() to return an tuple of args and be compatible with dask.compute()
Add vdf.concat() an alias of panda.concat()
Add vdf.delayed() to delay a calland be compatible with dask.delayed()
Add vdf.persist() to parameters and empty image and be compatible with dask.persist()
Add vdf.visualize() to return an empty image and be compatible with dask.visualize()
Add vdf.from_pandas() to return df and be compatible with `dask.from_pandas()
Add vdf.from_backend() an alias of from_pandas()
Add vdf.numpy an alias of cupy module
Remove extra parameters used by Dask in:
*.to_csv(), *.to_excel(), *.to_feather(), *.to_hdf(), *.to_json()
Update the pandas API to accept glob filename in:
vdf.read_csv(), vdf.read_feather(), vdf.read_json()
DF.to_csv(), DF.to_excel(), DF.to_feather(), DF.to_hdf(), DF.to_json(),
Series.to_hdf(), Series.to_json()
Add methods with _not_implemented
vdf.read_excel(), vdf.read_fwf(), vdf.read_sql_table()
DF.to_csv(), DF.to_excel()
Add pandas.DataFrame.to_pandas() to return self
Add DF.to_backend() to return self
Add DF.to_ndarray() to convert DataFrame to cupy.ndarray
Add DF.map_partitions() to be compatible with dask.map_partitions()
Add DF.compute() to return self and be compatible with `dask.DataFrame.compute()
Add DF.repartition() to return self and be compatible with `dask.DataFrame.repartition()
Add DF.visualize() to return visualize(self) and be compatible with `dask.DataFrame.visualize()
Add DF.categorize() to return self and be compatible with `dask.DataFrame.categorize()
Add pandas.Series.to_pandas() to return self
Add Series.to_backend() to return self
Add Series.to_ndarray() alias of to_numpy
Add Series.compute() to return self and be compatible with `dask.Series.compute()
Add Series.map_partitions() to return self.map() and be compatible with `dask.Series.map_partitions()
Add Series.persist() to return self and be compatible with 'dask.Series.persist()
Add Series.repartition() to return self and be compatible with 'dask.Series.repartition()
Add Series.visualize() to return visualize(self) and be compatible with 'dask.Series.visualize()

modin or dask_modin

Set MODIN_ENGINE=dask for dask_modin
Set MODIN_ENGINE=python for modin
Add vdf.BackEndDataFrame = modin.pandas.DataFrame
Add vdf.BackEndSeries = modin.pandas.Series
Add vdf.BackEndArray = numpy.ndarray
Add vdf.BackEndPandas = modin.pandas
Add vdf.FrontEndPandas = modin.pandas
Add vdf.FrontEndNumpy = numpy
Add vdf.compute() to return a tuple of args and be compatible with dask.compute()
Add vdf.concat() an alias of modin.pandas.concat()
Add vdf.delayed() to delay a calland be compatible with dask.delayed()
Add vdf.persist() to parameters and empty image and be compatible with dask.persist()
Add vdf.visualize() to return an empty image and be compatible with dask.visualize()
Add vdf.from_pandas() to return modin DataFrame or Series and be compatible with `dask.from_pandas()
Add vdf.from_backend() an alias of from_pandas()
Add vdf.numpy an alias of numpy module
Remove extra parameters used by Dask in:
*.to_csv(), *.to_excel(), *.to_feather(), *.to_hdf(), *.to_json()
Add warning when using:
read_excel(), read_feather(), read_fwf(), read_hdf(), read_sql_table()
DF.to_excel(), DF.to_feather(), DF.to_hdf(), DF.to_sql()
Series.to_csv(), Series.to_excel(), Series.to_hdf(), Series.to_json()
Update the pandas API to accept glob filename in:
vdf.read_excel(), vdf.read_feather(), vdf.read_fwf(), vdf.read_hdf, vdf.read_orc()
DF.to_excel(), DF.to_feather(), DF.to_hdf(), DF.to_sql()
Series.to_csv(), Series.to_excel(), Series.to_hdf(), Series.to_json()
Add methods with _not_implemented
DF.to_orc()
Add DF.to_pandas() to convert to panda.DataFrame
Add DF.to_backend() to return self
Add DF.to_ndarray() an alias to to_numpy()
Add DF.apply_rows() to be compatible with cudf.apply_rows()
Add DF.map_partitions() to be compatible with dask.map_partitions()
Add DF.compute() to return self and be compatible with `dask.DataFrame.compute()
Add DF.repartition() to return self and be compatible with `dask.DataFrame.repartition()
Add DF.visualize() to return visualize(self) and be compatible with `dask.DataFrame.visualize()
Add DF.categorize() to return self and be compatible with `dask.DataFrame.categorize()
Add Series.to_pandas() to return modin.pandas.Series.to_pandas()
Add Series.to_backend() to return self
Add Series.to_ndarray() alias of to_numpy
Add Series.compute() to return self and be compatible with `dask.Series.compute()
Add Series.map_partitions() to return self.map() and be compatible with `dask.Series.map_partitions()
Add Series.persist() to return self and be compatible with 'dask.Series.persist()
Add Series.repartition() to return self and be compatible with 'dask.Series.repartition()
Add Series.visualize() to return visualize(self) and be compatible with 'dask.Series.visualize()
And all patch in pandas

dask

Add vdf.BackEndDataFrame = pandas.DataFrame
Add vdf.BackEndSeries = pandas.Series
Add vdf.BackEndArray = numpy.ndarray
Add vdf.BackEndPandas = pandas
Add vdf.FrontEndPandas = dask.dataframe
Add vdf.FrontEndNumpy = dask.array
Add vdf.concat() an alias of dask.dataframe.multi.concat()
Add vdf.from_pandas() an alias of dask.dataframe.from_pandas()
Add vdf.from_backend() an alias of from_pandas()
Add vdf.numpy an alias of numpy module
Add warning in:
read_fwf(), read_hdf(), read_sql_table()
Add methods with _not_implemented
read_excel(), read_feather()
DF.to_excel(), DF.to_feather(), DF.to_fwf()
Add DF.to_pandas() to return self.compute()
Add DF.to_backend() an alias of to_pandas()
Add DF.to_numpy() to return self.compute().to_numpy()
Add DF.to_ndarray() an alias to dask.DataFrame.to_dask_array()
Add DF.apply_rows() to be compatible with cudf.apply_rows()
Patch DF.to_sql() and Series.to_sql() to accept con or uri
Add Series.to_pandas() to return self.compute()
Add Series.to_backend() an alias of to_pandas()
Add Series.to_numpy() to return self.compute().to_numpy()
Add Series.to_ndarray() alias of dask.dataframe.Series.to_dask_array()
And all patch in pandas

dask_cudf

Add vdf.BackEndDataFrame = cudf.DataFrame
Add vdf.BackEndSeries = cudf.Series
Add vdf.BackEndArray = cudf
Add vdf.BackEndPandas = pandas
Add vdf.FrontEndPandas = dask_cudf
Add vdf.FrontEndNumpy = cupy
Add vdf.compute() to dask.compute()
Add vdf.concat() to dask.dataframe.multi.concat()
Add vdf.delayed() to dask.delayed()
Add vdf.persist() to dask.persist()
Add vdf.visualize() to dask.visualize()
Add vdf.from_pandas() to dask_cudf.from_cudf()
Add vdf.from_backend() to dask_cudf.from_cudf()
Add vdf.numpy an alias of cupy module
Add a warning in:
Series.to_hdf(), Series.to_json()
Add methods with _not_implemented
read_excel(), read_feather(), read_fwf(), read_hdf(), read_sql_table()
DF.to_excel(), DF.to_feather(), DF.to_fwf(), DF.to_hdf(),DF.to_sql(),
Series.to_csv(), Series.to_excel(),
Add DF.to_pandas() to return self.compute().to_pandas()
Add DF.to_backend() to return self.compute() and return cudf.DataFrame
Add DF.to_numpy() to self.compute().to_numpy()
Add DF.to_ndarray() an alias to self.compute() and return cudf.DataFrame
Add Series.to_pandas() to return self.compute().to_pandas()
Add Series.to_backend() to return self.compute() and return cudf.Series
Add Series.to_numpy() to return self.compute().to_numpy()
Add Series.to_ndarray() to return a cudf.Series
Add Series.compute() to return self and be compatible with `dask.Series.compute()
Add Series.map_partitions() to return self.map() and be compatible with `dask.Series.map_partitions()
Add Series.persist() to return self and be compatible with 'dask.Series.persist()
Add Series.repartition() to return self and be compatible with 'dask.Series.repartition()
Add Series.visualize() to return visualize(self) and be compatible with 'dask.Series.visualize()
And all patch in cudf

pyspark

Add vdf.BackEndDataFrame = pandas.DataFrame
Add vdf.BackEndSeries = pandas.Series
Add vdf.BackEndArray = numpy.ndarray
Add vdf.BackEndPandas = pandas
Add vdf.FrontEndPandas = pyspark.pandas
Add vdf.FrontEndNumpy = numpy
Add vdf.compute() to return a tuple of args and be compatible with dask.compute()
Add vdf.concat() an alias of pyspark.pandas.concat()
Add vdf.delayed() to delay a call and be compatible with dask.delayed()
Add vdf.persist() to persist the current DF
Add vdf.visualize() to return an empty image and be compatible with dask.visualize()
Add vdf.from_backend() an alias of from_pandas()
Add vdf.numpy an alias of numpy module
Remove extra parameters used by Dask in:
*.to_csv(), *.to_excel(), *.to_feather(), *.to_hdf(), *.to_json()
from_pandas()
Add warning in:
read_excel(), reql_sql_table()
Update the pandas API to accept glob filename in:
vdf.read_csv(), vdf.read_excel(), vdf.read_json(), vdf.read_orc()
DF.to_csv(), DF.to_excel(), DF.to_feather(), DF.to_hdf(), DF.to_json(),
Series.to_csv(), Series.to_excel(), Series.to_hdf(), Series.to_json()
Add methods with _not_implemented
vdf.read_feather(), vdf.read_fwf(), vdf.read_hdf()
DF.to_sql(), Series.to_sql()
Add DF.to_backend() an alias of to_pandas()
Add DF.to_ndarray() an alias to to_numpy()
Add DF.apply_rows() to be compatible with cudf.apply_rows()
Add DF.categorize() to return self and be compatible with `dask.DataFrame.categorize()
Add DF.compute() to return self and be compatible with `dask.DataFrame.compute()
Add DF.map_partitions() to be compatible with dask.map_partitions()
Add DF.persist() to return self and be compatible with `dask.DataFrame.visualize()
Add DF.repartition() to return self and be compatible with `dask.DataFrame.repartition()
Add DF.visualize() to return visualize(self) and be compatible with `dask.DataFrame.visualize()
Add Series.to_backend() alias of to_pandas()
Add Series.to_ndarray() alias of to_numpy()
Add Series.compute() to return self and be compatible with `dask.Series.compute()
Add Series.map_partitions() to return self.map() and be compatible with `dask.Series.map_partitions()
Add Series.persist() to return self and be compatible with 'dask.Series.persist()
Add Series.repartition() to return self and be compatible with 'dask.Series.repartition()
Add Series.visualize() to return visualize(self) and be compatible with 'dask.Series.visualize()

Numpy like familly

Numpy

It's not possible to update some method in numpy.ndarray.

vdf.numpy is an alias of numpy
Add vdf.numpy.asnumpy(ar) to return ar
Add vdf.numpy.asndarray(ar) to return ar.to_numpy()
Add vdf.numpy.compute(...) to return a tuple with parameters
Add vdf.numpy.compute_chunk_sizes(ar) to return ar
Add vdf.numpy.rechunk(ar) to return ar
Add vdf.numpy.arange(), remove the parameter chunks, invoke numpy.arange() and return a view with Vndarray
Add vdf.numpy.from_array(), remove the parameter chunks, invoke numpy.arange() and return a view with Vndarray
Add vdf.numpy.load() to remove the parameter chunks
Add vdf.numpy.save() to remove the parameter chunks
Add vdf.numpy.savez() to remove the parameter chunks
Add vdf.numpy.random.* to remove the parameter chunks

cupy

vdf.numpy is an alias of cupy
Add vdf.numpy.asndarray(ar) to return ar.to_numpy()
Add vdf.numpy.compute(...) to return a tuple with parameters
Add vdf.numpy.compute_chunk_sizes(ar) to return ar
Add vdf.numpy.rechunk(ar) to return ar
Add vdf.numpy.arange(), remove the parameter chunks, invoke numpy.arange() and return a view with Vndarray
Add vdf.numpy.from_array(), remove the parameter chunks, invoke numpy.arange() and return a view with Vndarray
Add vdf.numpy.load() to remove the parameter chunks
Add vdf.numpy.save() to remove the parameter chunks
Add vdf.numpy.savez() to remove the parameter chunks
Add vdf.numpy.random.* to remove the parameter chunks

dask_array

vdf.numpy is an alias of dasl.array
Add vdf.numpy.asarray(ar) to return array of numpy or cupy
Add vdf.numpy.asndarray(ar) to return ar.to_numpy()
Add vdf.numpy.compute(...) to return a tuple with parameters
Add vdf.numpy.compute_chunk_sizes(ar) to return ar
Add vdf.numpy.rechunk(ar) to return ar
Add vdf.numpy.arange(), remove the parameter chunks, invoke numpy.arange() and return a view with Vndarray
Add vdf.numpy.from_array(), remove the parameter chunks, invoke numpy.arange() and return a view with Vndarray
Add vdf.numpy.load() to remove the parameter chunks
Add vdf.numpy.save() to remove the parameter chunks
Add vdf.numpy.savez() to remove the parameter chunks
Add vdf.numpy.random.* to remove the parameter chunks