Cluster
To connect to a cluster, set VDF_CLUSTER to the protocol, the host and, optionally, the port:
dask://localhost:8787
spark://localhost:7077
Alternatively, use:
- DASK_SCHEDULER_SERVICE_HOST and DASK_SCHEDULER_SERVICE_PORT, or
- SPARK_MASTER_HOST and SPARK_MASTER_PORT
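The variables above can be combined into a single cluster URL. The following is a minimal sketch using only the standard library, not the library's actual implementation: the helper name `resolve_cluster_url`, the precedence (VDF_CLUSTER first, then the Dask pair, then the Spark pair) and the handling of a missing port are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_cluster_url(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Hypothetical helper: build a cluster URL from environment variables."""
    # Assumption: an explicit VDF_CLUSTER wins over the host/port pairs.
    if env.get("VDF_CLUSTER"):
        return env["VDF_CLUSTER"]
    for scheme, host_var, port_var in (
        ("dask", "DASK_SCHEDULER_SERVICE_HOST", "DASK_SCHEDULER_SERVICE_PORT"),
        ("spark", "SPARK_MASTER_HOST", "SPARK_MASTER_PORT"),
    ):
        host = env.get(host_var)
        if host:
            port = env.get(port_var)
            # Assumption: the port may be omitted, as in VDF_CLUSTER itself.
            return f"{scheme}://{host}:{port}" if port else f"{scheme}://{host}"
    # Nothing set: no cluster is requested.
    return None
```

Passing a plain dictionary instead of `os.environ` makes it easy to check which URL a given environment would produce.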
VDF_MODE | DEBUG | VDF_CLUSTER | Scheduler |
---|---|---|---|
pandas | - | - | No scheduler |
cudf | - | - | No scheduler |
modin | - | - | No scheduler |
dask | Yes | - | synchronous |
dask | No | - | thread |
dask | No | dask://threads | thread |
dask | No | dask://processes | processes |
dask | No | dask://.local | LocalCluster |
dask_modin | No | - | LocalCluster |
dask_modin | No | dask://.local | LocalCluster |
dask_modin | No | dask://<host>:<port> | Dask cluster |
dask_cudf | No | dask://.local | LocalCUDACluster |
dask_cudf | No | dask://<host>:<port> | Dask cluster |
pyspark | No | spark:local[*] | Spark local cluster |
pyspark | No | spark://.local | Spark local cluster |
pyspark | No | spark://<host>:<port> | Spark cluster |
A special host name ending with .local can be used to start a LocalCluster, LocalCUDACluster or Spark local[*] when your program starts.
An instance of the local cluster is started and injected into the Client.
Sample:
from virtual_dataframe import VClient
with VClient():
    # Now, use the scheduler
    pass
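The `.local` convention can be illustrated with a small check on the cluster URL. This is only a sketch of the rule as described here; `is_local_cluster` is a hypothetical helper, not part of virtual_dataframe.

```python
from urllib.parse import urlparse


def is_local_cluster(url: str) -> bool:
    """Hypothetical helper: True when the URL's host name ends with '.local'."""
    # urlparse() returns None for the host when the URL has no '//host' part.
    host = urlparse(url).hostname or ""
    return host.endswith(".local")
```

With this rule, `dask://.local` and `spark://.local` request a local cluster, while `dask://localhost:8787` points at an already running scheduler.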
If you want to manage the parameters of Local(CUDA)Cluster or SparkCluster, use the alternative VLocalCluster().
from virtual_dataframe import VClient, VLocalCluster
with VClient(VLocalCluster(params=...)):
    # Now, use the scheduler
    pass
Dask local cluster
To update the parameters of the implicit Local(CUDA)Cluster:
- you can use the Dask config file,
local:
  scheduler-port: 0
  device_memory_limit: 5G
- you can set some environment variables for dask,
export DASK_LOCAL__SCHEDULER_PORT=0
export DASK_LOCAL__DEVICE_MEMORY_LIMIT=5g
- or, for Domino Data Lab,
export DASK_SCHEDULER_SERVICE_HOST=...
export DASK_SCHEDULER_SERVICE_PORT=7077
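The `DASK_LOCAL__...` variables follow Dask's convention for configuration through the environment: the `DASK_` prefix is dropped, the name is lower-cased, and a double underscore marks a nesting level. The sketch below reproduces that naming rule with the standard library only; `dask_env_to_config_key` is a hypothetical name, and the function illustrates the mapping rather than Dask's own code.

```python
def dask_env_to_config_key(name: str) -> str:
    """Sketch of the DASK_* naming rule: strip the prefix, lower-case the
    name, and turn each double underscore into a nesting dot."""
    prefix = "DASK_"
    if not name.startswith(prefix):
        raise ValueError(f"not a Dask variable: {name}")
    return name[len(prefix):].lower().replace("__", ".")
```

Dask treats hyphens and underscores in configuration keys as equivalent, so the `local.scheduler_port` key obtained from `DASK_LOCAL__SCHEDULER_PORT` refers to the same setting as `scheduler-port` under `local:` in the config file.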
Spark cluster
To configure the Spark cluster:
- use a spark.conf file with the Spark properties,
- use environment variables like export spark.app.name=MyApp,
- for VLocalCluster, use the classical parameters, replacing each dot with _:
from virtual_dataframe import VClient, VLocalCluster
with VClient(VLocalCluster(
        spark_app_name="MyApp",
        spark_master="local[*]",
)):
    # Now, use the scheduler
    pass
- or, for Domino Data Lab,
export SPARK_MASTER_HOST=...
export SPARK_MASTER_PORT=7077
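The dot-to-underscore rule for VLocalCluster parameters can be applied mechanically to an existing set of Spark properties. A minimal sketch, assuming the properties are held in a plain dictionary; `spark_props_to_kwargs` is a hypothetical helper, not part of virtual_dataframe.

```python
from typing import Dict


def spark_props_to_kwargs(props: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical helper: rename Spark properties to keyword-argument
    names by replacing every dot with an underscore."""
    return {key.replace(".", "_"): value for key, value in props.items()}


# Example input using the two properties shown in this section.
kwargs = spark_props_to_kwargs({
    "spark.app.name": "MyApp",
    "spark.master": "local[*]",
})
```

The resulting dictionary has the keys `spark_app_name` and `spark_master`, which match the parameter names used in the VLocalCluster example in this section, so it could presumably be passed as `VLocalCluster(**kwargs)`.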
Spark cluster with GPU
To use Spark with RAPIDS, download the file rapids-4-spark_2.12-22.10.0.jar (see here).
Then, in the file spark.conf, add:
spark.jars=rapids-4-spark_2.12-22.10.0.jar
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.concurrentGpuTasks=1