API
| api | comments |
|---|---|
| VDF_MODE | The current mode |
| Mode | The enumeration of the different modes |
| FrontEndPandas | pandas, cudf, modin.pandas, dask_cudf, dask.dataframe or pyspark.pandas |
| FrontEndNumpy | numpy, cupy or dask.array |
| BackEndDataFrame | pandas.DataFrame or cudf.DataFrame |
| BackEndSeries | pandas.Series, cudf.Series or modin.pandas.Series |
| BackEndNDArray | numpy.ndarray or cupy.ndarray |
| BackEndPandas | pandas, cudf or modin.pandas |
| BackEndNumpy | numpy or cupy |
| vdf.@delayed | Delayed function (do nothing or dask.delayed) |
| vdf.concat(...) | Concatenate VDataFrames |
| vdf.read_csv(...) | Read a VDataFrame from CSV files (glob pattern) |
| vdf.read_excel(...)* | Read a VDataFrame from Excel files (glob pattern) |
| vdf.read_fwf(...)* | Read a VDataFrame from fixed-width files (glob pattern) |
| vdf.read_hdf(...)* | Read a VDataFrame from HDF files (glob pattern) |
| vdf.read_json(...) | Read a VDataFrame from JSON files (glob pattern) |
| vdf.read_orc(...) | Read a VDataFrame from ORC files (glob pattern) |
| vdf.read_parquet(...) | Read a VDataFrame from Parquet files (glob pattern) |
| vdf.read_sql_table(...)* | Read a VDataFrame from SQL |
| vdf.from_pandas(pdf, npartitions=...) | Create a VDataFrame from a pandas DataFrame |
| vdf.from_backend(vdf, npartitions=...) | Create a VDataFrame from a backend dataframe |
| vdf.compute([...]) | Compute multiple @delayed functions |
| VDataFrame(data, npartitions=...) | Create a DataFrame in memory (mainly for tests) |
| VSeries(data, npartitions=...) | Create a Series in memory (mainly for tests) |
| VLocalCluster(...) | Create a local cluster (Dask, CUDA or Spark) |
| VDataFrame.compute() | Compute the virtual dataframe |
| VDataFrame.persist() | Persist the dataframe in memory |
| VDataFrame.repartition() | Rebalance the dataframe |
| VDataFrame.visualize() | Create an image with the graph |
| VDataFrame.to_pandas() | Convert to pandas dataframe |
| VDataFrame.to_backend() | Convert to backend dataframe |
| VDataFrame.to_csv() | Save to files (glob pattern) |
| VDataFrame.to_excel()* | Save to files (glob pattern) |
| VDataFrame.to_feather()* | Save to files (glob pattern) |
| VDataFrame.to_hdf()* | Save to files (glob pattern) |
| VDataFrame.to_json() | Save to files (glob pattern) |
| VDataFrame.to_orc() | Save to files (glob pattern) |
| VDataFrame.to_parquet() | Save to files (glob pattern) |
| VDataFrame.to_sql()* | Save to a SQL table |
| VDataFrame.categorize() | Detect all categories |
| VDataFrame.apply_rows() | Apply a row-wise kernel (GPU template) |
| VDataFrame.map_partitions() | Apply a function to each partition |
| VDataFrame.to_ndarray() | Convert to a numpy, cupy or dask.array ndarray |
| VDataFrame.to_numpy() | Convert to a numpy ndarray |
| VSeries.compute() | Compute the virtual series |
| VSeries.persist() | Persist the series in memory |
| VSeries.repartition() | Rebalance the series |
| VSeries.visualize() | Create an image with the graph |
| VSeries.to_pandas() | Convert to pandas series |
| VSeries.to_backend() | Convert to backend series (pandas, cudf, dask.Series) |
| VSeries.to_numpy() | Convert to numpy ndarray |
| VSeries.to_ndarray() | Convert to numpy backend (numpy, cupy or dask.array) |
| VClient(...) | The connection to the cluster |
| import vdf.numpy | Import numpy, cupy or dask.array |
| vdf.numpy.array(..., chunks=...) | Create a numpy-like array |
| vdf.numpy.arange(...) | Return evenly spaced values within a given interval. |
| vdf.numpy.zeros(...) | Return a new array of given shape and type, filled with zeros. |
| vdf.numpy.zeros_like(...) | Return an array of zeros with the same shape and type as a given array. |
| vdf.numpy.asnumpy(d) | Convert to numpy.ndarray |
| vdf.numpy.save(d) | Save to npy format |
| vdf.numpy.load(d) | Load from npy format |
| vdf.numpy.random.xxx(..., chunks=...) | Generate random values |
| vdf.numpy.compute(ar) | Equivalent of ar.compute() |
| vdf.numpy.rechunk(ar) | Alias of ar.rechunk() |
| vdf.numpy.asnumpy(df) | Return a numpy ndarray from a DataFrame |
| vdf.numpy.asndarray(df) | Return a ndarray from a DataFrame (may be numpy, cupy or dask.array) |
* Some frameworks do not implement this method.
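To make the table concrete, here is a minimal, hedged sketch of the DataFrame side of the API. The import alias, the @vdf.delayed spelling and the assumption that vdf.compute() returns a tuple (like dask.compute()) are interpretations of the table above, not verbatim from it, and may need adjusting.

```python
# Minimal sketch, assuming VDF_MODE has already been selected
# (e.g. pandas, cudf or dask) before the import.
import virtual_dataframe as vdf

# In-memory VDataFrame (mainly for tests); npartitions only matters
# for distributed modes.
df = vdf.VDataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, npartitions=2)

@vdf.delayed  # dask.delayed in dask modes, a no-op otherwise
def add_total(d):
    return d.assign(total=d["a"] + d["b"])

# Assumed to accept one or more @delayed results and return a tuple,
# like dask.compute().
result, = vdf.compute(add_total(df))
print(result)
```

The same code is expected to run unchanged when VDF_MODE selects another framework.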
With a numpy-like framework, it is impossible to add methods such as .compute() to numpy.ndarray or cupy.ndarray.
To stay compatible, you must use a module-level equivalent, like vnp.compute(a)[0] in place of a.compute().
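For example, a small sketch of that pattern with the numpy-like API from the table; the import path and the vnp alias are assumptions based on the "import vdf.numpy" row above.

```python
# Sketch only: the exact import path may differ in your setup.
import virtual_dataframe.numpy as vnp

a = vnp.array([1.0, 2.0, 3.0, 4.0], chunks=2)  # numpy, cupy or dask.array
b = a * 2

# numpy.ndarray/cupy.ndarray have no .compute() method, so use the
# module-level function; it is assumed to return a tuple, hence the [0].
result = vnp.compute(b)[0]

print(vnp.asnumpy(result))  # always a numpy.ndarray
```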
You can read a sample notebook here for Pandas or here for Numpy, for an example of how to use the API.
Keep in mind that the current front-end frameworks are exposed as FrontEndPandas and FrontEndNumpy,
and the backend APIs (used inside dask) as BackEndPandas and BackEndNumpy.
To maintain this relationship, use:
- df.to_backend() in place of df.to_pandas()
- df.to_ndarray() in place of df.to_numpy()
- vdf.numpy.asndarray(df) in place of numpy.asarray(df)
The .to_backend() can return a Pandas or a Cudf dataframe.
The .to_ndarray() can return a Numpy, a Cupy or a dask.array array.
The .asndarray(...) can return a Numpy, a Cupy or a dask.array array.
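A short, hedged illustration of these conversion rules (same assumed imports as in the sketches above):

```python
import virtual_dataframe as vdf
import virtual_dataframe.numpy as vnp  # import path assumed

df = vdf.VDataFrame({"x": [1.0, 2.0, 3.0]}, npartitions=2)

backend_df = df.to_backend()   # pandas or cudf dataframe, never forced to pandas
arr = df.to_ndarray()          # numpy, cupy or dask.array
arr2 = vnp.asndarray(df)       # module-level equivalent of the line above
```

Preferring these calls over to_pandas(), to_numpy() and numpy.asarray(df) keeps the code portable when VDF_MODE changes the backend.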
Each API proposes a specific version for each framework. For example:
- to_pandas() with pandas returns self
- @delayed uses the dask @delayed or does nothing, and applies the code when the function is called. In the first case, the function returns a part of the dask graph; in the second case, the function returns the result immediately.
- read_csv("file*.csv") reads each file in parallel with dask, spark or pyspark on each worker, or reads each file sequentially and combines the dataframes with a framework without distribution (pandas, modin, cudf)
- save_csv("file*.csv") writes each file in parallel with dask, spark or pyspark, or writes one file "file.csv" (without the star) with a framework without distribution (pandas, modin, cudf)
- ...
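For instance, a sketch of the glob-based I/O behaviour described above (file names are purely illustrative):

```python
import virtual_dataframe as vdf

# With dask or pyspark each worker reads matching files in parallel; with
# pandas, modin or cudf the files are read one by one and concatenated.
df = vdf.read_csv("data/in-*.csv")

# With a distributed mode this writes one file per partition; without
# distribution a single file is written (the star is dropped from the name).
df.to_csv("data/out-*.csv")
```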
All adjustments are described here.