Best practices
To write code that runs efficiently on every supported framework, follow these best practices.
- Avoid very large partitions: each partition should fit comfortably in a worker's available memory (see the loading sketch after this list).
- Avoid very large graphs: a graph with too many tasks adds scheduling overhead.
- Learn techniques for customization: they can improve the efficiency of your processing.
- Stop using Dask when it is no longer needed: for example, once you are iterating on a much smaller amount of data, switch to a local DataFrame with `to_backend()` (see the example at the end of this section).
- Persist in distributed RAM when you can: data kept in the cluster's memory is much faster to access than data recomputed or reread from disk (illustrated after this list).
- Processes and threads: separate numeric work from text processing to maintain efficiency; numeric libraries release the GIL and run well in threads, while text and Python-object work scales better with processes.
- Load data with Dask or Spark: for instance, if you need to work with large Python objects, let Dask create them, instead of creating them outside Dask or Spark and then handing them over to the framework (see the loading sketch after this list).
- Avoid using `VDataFrame` or `VSeries` outside the unit tests.
- Avoid calling `compute()` repeatedly: each call may re-execute shared parts of the task graph and lower performance (illustrated after this list).
- Avoid iterating over the rows or columns of a `DataFrame`: use `apply()` or `apply_rows()` to distribute the work across the cluster and the GPU (see the kernel sketch after this list).
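The first points above are mostly about how data enters the framework. As a minimal sketch, using plain Dask for illustration (the file path, `blocksize`, and partition sizes are assumptions), loading with the framework lets it choose right-sized partitions, and repartitioning keeps both partition size and graph size under control:

```python
import dask.dataframe as dd

# Let Dask build the partitioned DataFrame itself, instead of creating a huge
# pandas DataFrame locally and handing it over to the framework afterwards.
df = dd.read_csv("data/*.csv", blocksize="128MB")  # hypothetical path; blocksize caps partition size

# If partitions are still too large for a worker's memory, rebalance them.
# Fewer, well-sized partitions also keep the task graph small.
df = df.repartition(partition_size="100MB")
```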
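Persisting and batching computations go together: persist once, then evaluate several results in a single pass. A sketch, again with plain Dask (the Parquet file and `amount` column are hypothetical):

```python
import dask
import dask.dataframe as dd

df = dd.read_parquet("sales.parquet")  # hypothetical dataset
df = df.persist()  # materialize once in distributed RAM; later accesses are fast

# One compute() call evaluates both results over the shared graph,
# instead of re-traversing it for each individual .compute().
total, average = dask.compute(df["amount"].sum(), df["amount"].mean())
```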
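For row-wise logic, `apply_rows()` compiles a kernel rather than looping in Python. The sketch below uses cuDF directly (the GPU library behind `apply_rows()`); the column names and kernel body are illustrative:

```python
import numpy as np
import cudf

df = cudf.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

def kernel(a, b, out):
    # Compiled by Numba and executed on the GPU over blocks of rows:
    # no Python-level iteration over the DataFrame.
    for i, (x, y) in enumerate(zip(a, b)):
        out[i] = x + y

df = df.apply_rows(kernel,
                   incols=["a", "b"],
                   outcols={"out": np.float64},
                   kwargs={})
```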
If you know that the volume of data fits on a single node, you can convert a distributed DataFrame to a `BackendDataFrame` and continue to manipulate the data locally. Use `.to_backend()` to convert a `DataFrame` to a local Pandas-like `DataFrame`, or `.to_ndarray()` to convert a `DataFrame` to a local NumPy-like array.
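A minimal sketch of that workflow, assuming the `virtual_dataframe`-style API used in this document; the import name, `read_csv` helper, file path, and `key` column are assumptions for illustration only:

```python
import virtual_dataframe as vdf  # import name assumed for this sketch

big = vdf.read_csv("events/*.csv")   # hypothetical loader: stays distributed while data is large
small = big.groupby("key").sum()     # reduction that shrinks the data to one node's worth

local_df = small.to_backend()    # local Pandas-like DataFrame, per the section above
local_arr = small.to_ndarray()   # local NumPy-like array
```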