The extraordinary functional capabilities of Apache Spark have made it a standalone top-level project of the Apache Software Foundation, one that delivers processing speed and efficiency like never before. Let's have a look at some of the functional features that Spark offers.
Powerful Caching and Disk Persistence Capabilities
Spark allows you to cache even raw data, and you can keep it in memory if you want to. You are not forced to keep all of the data in memory; you might want some of it on disk as well, for which Spark offers something called graceful degradation. You can keep your data entirely in memory, entirely on disk, or split between the two. If everything goes to disk, Spark would probably be no better than Hadoop, but it gives you these options, and they are embedded within the APIs themselves.
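These memory and disk options are exposed through Spark's storage levels. A minimal PySpark sketch, assuming a local Spark installation is available (the file name `data.txt` and variable names are illustrative):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persistence-demo")

lines = sc.textFile("data.txt")  # hypothetical input file

# Keep everything in memory (partitions that don't fit are recomputed):
lines.persist(StorageLevel.MEMORY_ONLY)

# Or degrade gracefully: memory first, spill the rest to disk:
# lines.persist(StorageLevel.MEMORY_AND_DISK)

# Or keep it entirely on disk:
# lines.persist(StorageLevel.DISK_ONLY)

print(lines.count())  # the first action materializes and caches the RDD
```

`cache()` is simply shorthand for `persist(StorageLevel.MEMORY_ONLY)`; you pick a different level only when you want the disk fallback described above.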
Faster Batch Processing
Spark enables faster batch processing because all of the processing happens in memory, so even batch jobs run much faster in Spark. If you have seen processing in Storm, you will notice the word 'batch' is never used there, as Storm is a true real-time processing system.
Real-time Stream Processing
Real-time stream processing means processing that happens on data streams as they arrive. Spark Streaming works on exactly such streams: for instance, you can point it at a specific directory, and as and when new data comes in, it is immediately computed and you get immediate statistics. A sliding word count is a good example: it keeps a running word count over the new data that is continuously pouring in. Whatever algorithm you want to apply to live data, Spark Streaming is where it will find its use.
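In Spark Streaming this kind of sliding word count is typically done with windowed operations such as `reduceByKeyAndWindow` over a DStream. The underlying idea can be sketched in plain Python, with the stream of incoming micro-batches simulated as a list (all names here are illustrative):

```python
from collections import Counter, deque

def sliding_word_count(batches, window_size):
    """Yield word counts over the last `window_size` micro-batches."""
    window = deque(maxlen=window_size)  # old batches fall off automatically
    for batch in batches:
        window.append(Counter(batch.split()))
        # Combine the counts of every batch still inside the window
        total = Counter()
        for counts in window:
            total += counts
        yield dict(total)

# Simulated stream of incoming micro-batches
stream = ["spark spark streaming", "spark is fast", "hadoop batch"]
results = list(sliding_word_count(stream, window_size=2))
```

Each new batch updates the statistics immediately, and counts from batches that have slid out of the window are dropped, which is exactly the behaviour described above.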
Faster Decision Making
The diverse libraries and faster batch as well as real-time data processing in Spark lead to a lot of saving in terms of time. Compared to Hadoop, Spark can process data up to 100 times faster for in-memory workloads, which clearly indicates how quickly it can analyze your data, and thereby how much faster you can make decisions.
Support for Iterative Algorithms
Since your data is in memory, you can readily run iterative algorithms, which is a basic limitation with Hadoop. Hadoop, being a distributed processing system with no sharing of data between jobs, does not lend itself to iterative algorithms. Is there any way you could support iterative algorithms in Hadoop? The answer is yes, with Iterative MapReduce, but every iteration then pays the cost of reading from and writing to disk.
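PageRank is the classic iterative example: every iteration re-reads the same link structure, which Spark would keep cached in memory while Hadoop would re-read it from disk on each MapReduce pass. The iteration pattern can be sketched in plain Python over a toy graph (the graph and names are illustrative; in Spark, `links` would be a cached RDD):

```python
# Toy link structure -- in Spark this would be a cached, in-memory RDD
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

ranks = {page: 1.0 for page in links}

for _ in range(20):  # repeated passes over the SAME dataset
    contribs = {page: 0.0 for page in links}
    for page, neighbors in links.items():
        share = ranks[page] / len(neighbors)
        for n in neighbors:
            contribs[n] += share
    # Standard 0.85 damping, as in the original PageRank formulation
    ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}
```

Because the same `links` dataset is touched on every pass, keeping it in memory (rather than re-materializing it from disk per iteration, as chained MapReduce jobs must) is what makes such algorithms practical in Spark.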
Interactive Data Analysis
Interactive data analysis is yet another very good feature of Spark. In Hadoop, after writing a MapReduce job and submitting it, you have to wait until it either fails or gives some result. Whatever happens in between is visible only in the form of counters or something similar. You may want to see what happened after the mapper, what happened after the reducer, and how exactly the data transforms when you take the output. You can do this, but it is much more painful and far from interactive: every time you want to try something, you have to make a code change, build it, submit the job, and only then see what you wanted. It's certainly not interactive.
However, when it comes to Spark, the best thing is the Spark shell. With it, you can see how exactly your data is transforming stage by stage, which gives you an interactive way to perform your data analysis. You can verify your theory first: determine what you want to do, whether it will give you the results you need, and whether it is going to be fast. Once you are quite sure about it, you can go ahead and create a separate Scala or Java app and submit your jobs, which makes your data analysis interactive, fast and effective.
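For example, in the PySpark shell (started with the `pyspark` command; the file name and transformations below are illustrative), you can peek at the data after each stage before committing to a full application:

```python
>>> lines = sc.textFile("access.log")      # hypothetical log file
>>> lines.take(2)                          # peek at the raw input
>>> errors = lines.filter(lambda l: "ERROR" in l)
>>> errors.take(2)                         # verify the filter before going on
>>> errors.count()                         # quick statistic, computed on demand
```

Each `take` or `count` runs immediately on the cluster, so every step of the transformation can be checked interactively before it goes into a packaged job.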
Got a question for us? Mention it in the comments section and we will get back to you.