The State of Open Source Data Tech

 2019-06-02

Open Source data projects such as Apache Hadoop, Spark and Kafka have been making headlines and attracting millions in venture capital. So why are some of the companies behind them going bust? Is it all hype? Should you have open source data tech in your stack?

In this post, I evaluate the state of Open Source data projects, particularly those that are offered as commercial products by some of Silicon Valley’s biggest “unicorns”.

Apache Hadoop (Hortonworks, Cloudera, MapR)

I’m calling it. Hadoop is dead. It was probably dead 18 months ago, but I wasn’t as convinced then as I am today. The enterprise version of Hadoop came from three major vendors - Cloudera, Hortonworks and MapR. (For simplicity, let’s agree IBM BigInsights and Pivotal HD did not exist - it’s easier that way :)). Cloudera saw its stock price tank in April 2018 following a weak sales outlook. In early 2019, Cloudera merged with Hortonworks, creating a Hadoop “powerhouse”. But even with a combined customer base and operating efficiencies, the new Cloudera remains unprofitable and the stock price continues to plummet. MapR recently announced it is laying off hundreds of direct sales staff and moving to an indirect sales model. It remains unclear if MapR will be able to secure funding to continue operations.

Whilst I feel for the people involved in building some of this great technology, we ought to ask ourselves whether the financial performance of these start-ups is an indicator of failed customer success. I argue it is. Most organisations that implemented Hadoop failed to realise the intended business benefits from it. There is sufficient Gartner research to corroborate this. So what went wrong? Many organisations trying to implement Hadoop clusters simply struggled to find the necessary talent to keep the tech running. Public cloud adoption meant that it was easier to use other technologies (such as blob storage instead of HDFS) to build a data lake and process data. Simply put, these distributions did not solve customer problems nearly as well as their alternatives did.

Should you have a Hadoop distro in your Data landscape? No, unless it is used for ephemeral compute clusters (think AWS EMR or Google Cloud Dataproc). Even then, I find there are cloud-native services better suited for distributed processing needs. If you’re running a long-running cluster for general-purpose data analytics, you’ll gain significant cost and operational efficiencies by moving to blob storage and/or other platforms (Google BigQuery, Snowflake).

Apache Spark (Databricks)

Apache Spark started its life and gained relevance as a credible, fast alternative to MapReduce. Databricks offers a managed Spark platform with other goodies, which it calls the UAP (Unified Analytics Platform). Whilst the initial platform focussed on ease of use for Spark users by providing a managed service, Databricks has added many other features in an effort to provide an end-to-end platform for analytics. Most notably, the additions of MLflow and Delta Lake are specifically geared towards making Databricks a complete analytics solution rather than just a managed Spark platform. In February of this year, Databricks raised $250m in Series E funding at a valuation of $2.75B, so it clearly is still appealing to its investors. We don’t know if Databricks is making money or if it’s cash flow positive (because it’s not listed yet), so I remain skeptical of its long-term viability and product-market fit. But that’s just one part of evaluating the platform. Databricks, like its Hadoop counterparts, competes with public cloud services that provide ML services in various shapes and forms. Its long-term success, I argue, therefore depends on its ability to scale its user base by appealing to the lowest common denominator, strengthening its ML libraries and making data science pipeline development easier.

Should you have Databricks in your Data landscape? Yes, especially if you’re running Microsoft Azure. Databricks enjoys first-class citizen status on Azure. It is incredibly popular with the data science community and, at the time of writing this blog, has a unique feature set designed for that community. Delta Lake brings database-like (almost “Snowflake-esque”) features to the platform, allowing the creation of feature stores for ML development.

Apache Kafka (Confluent)

Kafka is kool! Talk to a Confluent engineer about Kafka and they will tell you Kafka is a distributed queue, a data storage platform, a commit log and a stream processing engine. And they’re (mostly) right. I often meet customers who view Kafka as an ingestion tool (which it can be, with Kafka Connect, but that’s not the only thing it can do). Kafka allows developers to publish message payloads on topics. A topic can have many publishers and many subscribers - thus enabling a fan-in, fan-out architecture. Confluent provides an enterprise-grade Kafka with a few goodies worth taking a look at. KSQL allows SQL-like stream processing on messages in a topic, and Schema Registry manages the schemas of those messages.
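
To make the fan-in, fan-out idea concrete, here is a minimal in-memory sketch in Python. It is a toy model, not Kafka's API: the Topic class, the publish and poll methods and the topic and subscriber names are all made up for illustration, and it ignores partitions, brokers and durability entirely. It only shows the shape of the idea: many publishers append to one log, and each subscriber reads that log at its own pace.

```python
from collections import defaultdict


class Topic:
    """Toy stand-in for a topic: an append-only log (hypothetical, not Kafka's API)."""

    def __init__(self, name):
        self.name = name
        self.log = []                    # append-only list of (publisher, payload)
        self.offsets = defaultdict(int)  # next unread position, per subscriber

    def publish(self, publisher, payload):
        """Append a message to the log and return its offset."""
        self.log.append((publisher, payload))
        return len(self.log) - 1

    def poll(self, subscriber):
        """Return every message this subscriber has not read yet."""
        start = self.offsets[subscriber]
        unread = self.log[start:]
        self.offsets[subscriber] = len(self.log)
        return unread


topic = Topic("sensor-readings")

# Fan-in: several publishers write to the same topic.
topic.publish("device-a", {"temp": 21.5})
topic.publish("device-b", {"temp": 19.0})

# Fan-out: each subscriber reads the whole log independently of the others.
analytics = topic.poll("analytics")
alerting = topic.poll("alerting")

# A later message is delivered only as the unread tail for each subscriber.
topic.publish("device-a", {"temp": 22.1})
analytics_next = topic.poll("analytics")
```

Because reading does not remove messages from the log, adding a new subscriber later costs the publishers nothing, which is roughly why the commit log description above fits as well as the queue one.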

Should you have Confluent in your Data landscape? Maybe. If you have devices (IoT use cases) in the field and you want to use Kafka as an event sourcing/log aggregation platform, then Confluent is a good candidate. If you’re looking for a data storage solution to act as an intermediary between micro-services, then you may want to look at Confluent. If you care about time-ordered events, then you need Confluent. However, if your only use case is data ingestion for analytics, there are easier ways to accomplish this than using Confluent.

Apache Cassandra (DataStax)

I like Cassandra. If you want super fast, low-latency, sub-second responses to SQL-like queries on structured data, you need to look at Cassandra. It is arguably the most underrated “Big Data” Apache project out there. DataStax provides enterprise-grade Cassandra along with other goodies, including a graph database (what used to be open source TitanDB) and full-text search. DataStax, like Confluent, solves a niche problem really well with a suite of technologies that are generally very cumbersome to run and manage (think TitanDB). DataStax has raised approximately $290m in funding and it is still unclear if it’s profitable or cash flow positive. Whether or not it captures enough market share to sustain product development efforts around Cassandra remains to be seen.

Should you have DataStax in your Data landscape? Maybe. If you have a bunch of Cassandra or graph DB use cases, then it is probably a good fit. Over time, DataStax has become many things to many people. From a Cassandra niche, it has gone on to add support for graph, full-text search and structured SQL with Spark. It still remains the best enterprise-ready product for Cassandra workloads. Operational reporting environments (such as cyber security) can benefit from Cassandra’s high-performance, linearly scaling database.

In the last decade, we’ve seen a rise in OSS projects and their deployments in the enterprise. However, many OSS projects continue to solve niche problems and are often eclipsed by services available on public cloud. With a growing set of public cloud services on AWS, Google Cloud Platform and Microsoft Azure, the role OSS plays continues to change, and we must evaluate each project for its niche and for the alternatives available on public cloud.

What open source projects are you currently looking at and why? Comments, criticisms and counter views welcome.