Hadoop was the first real entry into the big data race, and though it got off to a somewhat slow start, it has emerged as the winning big data infrastructure. The Hadoop ecosystem has evolved to include a comprehensive set of tools, most of which are open source and backed by deep-pocketed, deeply committed vendors. This vendor support, combined with a strong and dedicated open source community, has helped propel a number of big data engines into the race for success.
Hadoop is Becoming Mainstream
The Hadoop ecosystem as a whole is expected to grow by an impressive 26 percent between this year and 2023. When it comes to engines, however, not all are created equal. Among the front-runners are Spark, Presto, Hive, Tez, and Impala, and each has its own strengths and weaknesses.
For instance, if you’re just running numbers on historical data in the back room for the purposes of R&D or product development, a slower but massively scalable and highly reliable batch processing tool is your best bet. But if you’re dealing with streaming data in real time (which most big data initiatives involve nowadays), you definitely need speed more than anything else.
Speed & Performance: How the Hadoop Engines Stack Up
When it comes to speed and performance, how do these leading big data tools stack up? Recently, one of the companies heavily invested in big data asked just this question. Here’s what you need to know.
• All of the engines (Hive, Impala, Presto, Spark) are stable and capable of supporting critical workloads. Each has strengths and weaknesses, but all are capable of performing at scale. You’ll need a criterion other than speed to choose among these options.
• What constitutes a good Hadoop engine? One that can process billions or trillions of rows of data without errors while keeping response times in the tens or hundreds of seconds. All of these engines can do that.
• There are significant differences between version releases. If you are still working with an older version of your Hadoop engine, upgrading to the latest version should bring significant improvements, particularly in speed and performance. All of these engines became two to four times faster over the past six months.
• Big data isn’t everything. There are many use cases out there for small data (that is, queries on small sets of data). Presto and Impala are particularly strong in small data processing.
• When it comes to accommodating lots of users, Impala and Presto are ideal, and Hive and Spark both have made significant strides in terms of supporting larger numbers of users. Overall, Impala is the go-to engine in enterprises with large numbers of users.
• It’s not uncommon to use two Hadoop engines in the same environment. For example, some enterprises use Hive for long-running queries, and almost all Hive users also use Tez.
• MapReduce is all but gone. While MapReduce was a stable and reasonably reliable workhorse in the early days, there are simply too many tools today that are easier to use and much faster.
• Spark no longer offers many advantages over Hive. Hive is now just as fast, and its latest version is just as reliable. Yet Spark has so much more market penetration that it will likely emerge as the overall winner in the coming years.
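To make these rules of thumb concrete, here is a minimal sketch in Python that maps a few workload traits to the engines the points above recommend. The `Workload` fields and the `suggest_engines` function are hypothetical names invented for illustration. They are not part of any Hadoop tool, and the mapping is only a rough reading of the list above, not a benchmark result.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical description of a workload (illustrative fields only)."""
    small_queries: bool = False   # queries over small sets of data
    many_users: bool = False      # large number of concurrent users
    long_running: bool = False    # long-running batch-style queries


def suggest_engines(w: Workload) -> List[str]:
    """Return candidate engines in rough order of preference.

    Assumption: this simply encodes the rules of thumb from the list
    above; real selection should rest on your own tests.
    """
    suggestions: List[str] = []
    if w.many_users:
        # Impala is described as the go-to engine for many users, then Presto.
        suggestions += ["Impala", "Presto"]
    if w.small_queries:
        # Presto and Impala are described as strong on small data.
        suggestions += ["Presto", "Impala"]
    if w.long_running:
        # Hive is used for long-running queries, typically together with Tez.
        suggestions.append("Hive on Tez")
    if not suggestions:
        # No distinguishing trait: all four engines are considered stable.
        suggestions = ["Hive", "Impala", "Presto", "Spark"]
    # Remove duplicates while preserving order.
    return list(dict.fromkeys(suggestions))


if __name__ == "__main__":
    print(suggest_engines(Workload(many_users=True, long_running=True)))
```

A workload with many users and long-running queries, for example, would yield Impala and Presto first, followed by Hive on Tez, which matches the observation that enterprises often run two engines side by side.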
One thing, however, can make a significant difference in the speed and performance of your big data workloads, and that’s your storage infrastructure. The Bigstep Metal Cloud offers significantly higher levels of speed and performance when working with big data. Learn more about us now.