## Cluster

To connect to a cluster, set `VDF_CLUSTER` with the protocol, the host and, optionally, the port:

- `dask://localhost:8787` or `spark://localhost:7077`
- or, alternatively, use `DASK_SCHEDULER_SERVICE_HOST` and `DASK_SCHEDULER_SERVICE_PORT`
- or `SPARK_MASTER_HOST` and `SPARK_MASTER_PORT`
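For example, to target a remote Dask scheduler (the host and port below are placeholders for your own cluster, not values prescribed by `virtual_dataframe`):

```shell
# Illustrative only: replace the host and port with your own scheduler's.
export VDF_MODE=dask
export VDF_CLUSTER=dask://localhost:8787
echo "$VDF_CLUSTER"
```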
| VDF_MODE | DEBUG | VDF_CLUSTER | Scheduler |
|---|---|---|---|
| pandas | - | - | No scheduler |
| cudf | - | - | No scheduler |
| modin | - | - | No scheduler |
| dask | Yes | - | synchronous |
| dask | No | - | thread |
| dask | No | dask://threads | thread |
| dask | No | dask://processes | processes |
| dask | No | dask://.local | LocalCluster |
| dask_modin | No | - | LocalCluster |
| dask_modin | No | dask://.local | LocalCluster |
| dask_modin | No | dask://<host>:<port> | Dask cluster |
| dask_cudf | No | dask://.local | LocalCUDACluster |
| dask_cudf | No | dask://<host>:<port> | Dask cluster |
| pyspark | No | spark://local[*] | Spark local cluster |
| pyspark | No | spark://.local | Spark local cluster |
| pyspark | No | spark://<host>:<port> | Spark cluster |
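The table above can be read as a selection function. The sketch below is a hypothetical helper for illustration only, not the library's actual code; the fallbacks chosen for combinations not listed in the table are assumptions.

```python
def select_scheduler(mode, debug=False, vdf_cluster=None):
    """Mirror the VDF_MODE / VDF_CLUSTER table (illustrative sketch only)."""
    if mode in ("pandas", "cudf", "modin"):
        return "no scheduler"
    if mode == "dask":
        if debug:
            return "synchronous"
        if vdf_cluster is None or vdf_cluster == "dask://threads":
            return "thread"
        if vdf_cluster == "dask://processes":
            return "processes"
        if vdf_cluster.endswith(".local"):
            return "LocalCluster"
        return "Dask cluster"
    if mode == "dask_modin":
        if vdf_cluster is None or vdf_cluster.endswith(".local"):
            return "LocalCluster"
        return "Dask cluster"
    if mode == "dask_cudf":
        # Default (no VDF_CLUSTER) is assumed to match the ".local" row.
        if vdf_cluster is None or vdf_cluster.endswith(".local"):
            return "LocalCUDACluster"
        return "Dask cluster"
    if mode == "pyspark":
        if vdf_cluster is None or "local" in vdf_cluster:
            return "Spark local cluster"
        return "Spark cluster"
    raise ValueError(f"unknown VDF_MODE: {mode}")
```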
A special host name ending with `.local` can be used to start a `LocalCluster`, a `LocalCUDACluster` or a Spark `local[*]` cluster when your program starts. An instance of the local cluster is started and injected into the `Client`.
Sample:

```python
from virtual_dataframe import VClient

with VClient():
    # Now, use the scheduler
    pass
```
If you want to manage the parameters of the `Local(CUDA)Cluster` or the `SparkCluster`, use the alternative `VLocalCluster()`:

```python
from virtual_dataframe import VClient, VLocalCluster

with VClient(VLocalCluster(params=...)):
    # Now, use the scheduler
    pass
```
## Dask local cluster

To update the parameters of the implicit `Local(CUDA)Cluster`:

- you can use the Dask config file:

```yaml
local:
  scheduler-port: 0
  device_memory_limit: 5G
```

- you can set some environment variables for Dask:

```shell
export DASK_LOCAL__SCHEDULER_PORT=0
export DASK_LOCAL__DEVICE_MEMORY_LIMIT=5g
```

- or, for Domino Datalab:

```shell
export DASK_SCHEDULER_SERVICE_HOST=...
export DASK_SCHEDULER_SERVICE_PORT=7077
```
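The `DASK_LOCAL__…` variables follow Dask's convention: strip the `DASK_` prefix, split on `__` for nesting, lower-case the keys, and parse values as Python literals where possible. A simplified sketch of that mapping (not Dask's actual implementation, which also canonicalizes `-` and `_` in key names):

```python
import ast

def env_to_dask_config(env):
    """Sketch of how DASK_* environment variables become nested config keys."""
    config = {}
    for name, raw in env.items():
        if not name.startswith("DASK_"):
            continue
        keys = name[len("DASK_"):].lower().split("__")
        try:
            value = ast.literal_eval(raw)  # "0" -> 0; "5g" stays a string
        except (ValueError, SyntaxError):
            value = raw
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return config
```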
## Spark cluster

To configure the Spark cluster:

- use a `spark.conf` file with the Spark properties
- use environment variables like `export spark.app.name=MyApp`
- or, for `VLocalCluster`, use the classical parameters, replacing each dot with an underscore (`_`):

```python
from virtual_dataframe import VClient, VLocalCluster

with VClient(VLocalCluster(
        spark_app_name="MyApp",
        spark_master="local[*]",
)):
    # Now, use the scheduler
    pass
```
- or, for Domino Datalab:

```shell
export SPARK_MASTER_HOST=...
export SPARK_MASTER_PORT=7077
```
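The underscore-to-dot convention for `VLocalCluster` parameters can be sketched as follows (a hypothetical helper for illustration, not the library's code):

```python
def spark_kwargs_to_properties(**kwargs):
    """Turn Python-friendly keyword names into Spark property names.

    Note: this simple mapping is ambiguous for the rare Spark
    properties whose names themselves contain underscores.
    """
    return {name.replace("_", "."): value for name, value in kwargs.items()}
```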
## Spark cluster with GPU

To use Spark with RAPIDS, download the file
`rapids-4-spark_2.12-22.10.0.jar`
(see here).
Then, in the `spark.conf` file, add:

```properties
spark.jars=rapids-4-spark_2.12-22.10.0.jar
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.concurrentGpuTasks=1
```
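The entries above are plain `key=value` lines. A minimal parser sketch (a hypothetical helper assuming the `=`-separated format shown here; Spark itself also accepts whitespace-separated properties) can be used to sanity-check the file before launching:

```python
def parse_spark_conf(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props
```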