A quick comparison between Spark and Dask

Which one should I use: Apache Spark or Dask?

A Quick Comparison between Apache Spark and Dask


When a dataset is too large for a single local machine to process, we cannot run our processing and modeling on it there, so we need large-scale distributed data processing and machine learning engines. This is where our two buddies can help: Apache Spark and Dask. In case you are not familiar with Apache Spark or Dask, here is a quick introduction.


“Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in different programming languages such as Scala, Java, Python, and R.” It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

Dask is a flexible library for parallel computing in Python. Dask is composed of two parts: dynamic task scheduling, which is optimized for computation, and "big data" collections such as parallel arrays, DataFrames, and lists. These parallel collections run on top of the dynamic task schedulers.
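To make the idea of "collections on top of a task scheduler" a little more concrete, here is a toy sketch written with only the Python standard library. It is not Dask's implementation: the `run` helper and the graph keys (`a`, `b`, `factor`, `total`, `scaled`) are made up for illustration, and the sketch assumes every task argument is itself a key in the graph. It only shows the general pattern of describing work as a graph of small tasks and letting a scheduler run whatever is ready.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul

# Toy task graph: each key maps to plain data or to (function, *argument_keys).
# All names here are hypothetical and exist only for this sketch.
graph = {
    "a": 1,
    "b": 2,
    "factor": 10,
    "total": (add, "a", "b"),
    "scaled": (mul, "total", "factor"),
}


def run(graph, max_workers=4):
    """Execute every task in `graph`, running ready tasks on a thread pool."""
    results = {}
    pending = dict(graph)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # A task is ready once all the keys it depends on have results.
            ready = [
                key for key, task in pending.items()
                if not isinstance(task, tuple)
                or all(arg in results for arg in task[1:])
            ]
            if not ready:
                raise ValueError("graph has a cycle or refers to a missing key")
            futures = {}
            for key in ready:
                task = pending.pop(key)
                if isinstance(task, tuple):
                    func, *arg_keys = task
                    args = [results[arg] for arg in arg_keys]
                    futures[key] = pool.submit(func, *args)
                else:
                    results[key] = task  # plain data, nothing to compute
            # Wait for this wave of tasks before looking for the next one.
            for key, future in futures.items():
                results[key] = future.result()
    return results


results = run(graph)
print(results["scaled"])
```

Dask's real schedulers are far more capable than this, but the same split applies: the collections describe the work as tasks, and the scheduler decides when and where each task runs.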


Both are used for large-scale data processing and both use distributed computing, so which one should I use? Here is a quick comparison between Spark and Dask.
  1. Dask is smaller and lighter weight than Spark. It has fewer features of its own and is instead used together with libraries such as NumPy, pandas, and scikit-learn to gain high-level functionality.
  2. Spark is written in Scala and also supports other languages such as R, Python, and Java, whereas Dask is written in Python and only supports Python.
  3. Spark has its own ecosystem and is well integrated with other Apache projects, whereas Dask is a component of the larger Python ecosystem. Dask's main aim is to build on and enhance libraries like pandas, NumPy, and scikit-learn.
  4. Spark is older and has become a dominant, well-trusted tool in the Big Data world, whereas Dask is younger and is an extension of the well-trusted NumPy/pandas/scikit-learn/Jupyter stack.
  5. Spark DataFrame has its own API and memory model, and Spark also implements a large subset of SQL, including complex queries. Dask DataFrame reuses the pandas API and memory model; it implements neither SQL nor a query optimizer.
  6. For machine learning, Spark has MLlib, which is easy to use with Spark's MapReduce-style system, whereas Dask relies on and interoperates with existing popular machine learning and data science libraries like scikit-learn and XGBoost.
  7. Spark does not natively support multi-dimensional arrays, whereas Dask arrays follow the NumPy model.
  8. Spark can process graphs using the GraphX library, whereas Dask does not have any library or model for graph processing.

A Quick Demonstration of Dask

In this demonstration, we will see how Dask uses pandas, NumPy, and scikit-learn functionality.

Dask for machine learning modeling

Conclusion

So if you prefer Scala or SQL, you already have JVM infrastructure, and you are looking for an all-in-one solution, then you should choose Apache Spark.
If you prefer Python, your business use case is complex and does not cleanly fit the Spark computation model, and you are looking for a lighter-weight transition from local computing to cluster computing, then you should choose Dask.
When using pandas, NumPy, or other Python computations, if you run into memory issues, storage limitations, or CPU limits on a single machine, Dask can help you scale up to all the cores on a single machine, or scale out across all the cores and memory of your cluster.
Beyond this, there are many situations where you can use both. For more information, please read the official documentation by the Dask community (https://dask.org/).
We have seen how Dask differs from Apache Spark and how Dask reuses popular Python APIs such as pandas, NumPy, and scikit-learn to gain high-level functionality.

