Have you heard of fast big data?
In my work with big data deployments, I've seen a shift toward supporting low-latency, real-time capabilities. Many people talk about “streaming apps,” but the term I prefer is “continuous apps.”
Databricks introduced the term at Spark Summit East, defining a continuous app as “an end to end application that acts on real time data.” My thinking on the topic owes a great deal to a framework that Tyler Akidau, tech lead for internal streaming systems at Google, introduced in two highly recommended posts.
Continuous Apps in Action
Big data tends to naturally capture continuous flows of events and activities, like app activity, clickstreams, sensor readings and more. These flows might be used as input to a continuous app that does something like mobile personalization, analytics or prediction of mean time to failure for a particular part.
Here’s an example: When an emergency such as a fire, earthquake, flood or something similar happens, agencies from various jurisdictions get involved. A continuous application can gather information from first responders, weather, news, cameras and images and get the right information to the right people, ensuring that everyone is working with the latest data. This distribution of “fast big data” to the right people can save lives.
Many data sources support this use case: static information such as maps, streaming data such as the GPS coordinates of emergency vehicles in motion, and frequently updated information such as news and weather feeds. And the system can’t break when a single data feed changes or fails — responders need the best information available at any given time.
Real-Time Is In the Eye of the Beholder
What does real-time mean for continuous apps? The definition varies depending on the application and consumer of the application.
If you’re a high-frequency trader on Wall Street, real time might be 10 microseconds. A low-latency algorithm in a data center might need 50 microseconds. For many other applications, real time could mean a few hundred milliseconds, or even five minutes, depending on the needs of the user. Analytics often benefit from having data every 15 minutes or every hour rather than in an overnight batch, but updating analytics more frequently than every few minutes offers diminishing returns in most cases.
In other words, real time is very much in the eye of the beholder, and that applies throughout the design of a continuous app. It requires thinking deeply about your particular use case and what best serves the business requirement you’re designing for, then trading that off against the increased complexity that comes with reducing latency.
How Much Data Is Enough?
A continuous app starts with an event, but then you need to make a decision: when do you have enough data to provide a meaningful response? A watermark is a way of representing the point in event time up to which the system believes it has seen enough data to perform a useful calculation; events that arrive with earlier timestamps are considered late. You can think of the watermark as the allowed lag between event time (when something happened) and processing time (when you have “enough” data for your particular use case).
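To make the idea concrete, here is a minimal, engine-agnostic sketch of watermark tracking. The five-second “allowed lateness,” the `Event` type, and the `WatermarkTracker` class are all illustrative assumptions, not any particular framework’s API:

```python
from dataclasses import dataclass

# Assumption for illustration: how far behind the newest event
# a straggler may arrive and still be counted.
ALLOWED_LATENESS = 5  # seconds

@dataclass
class Event:
    key: str
    event_time: int  # seconds since some epoch

class WatermarkTracker:
    """Tracks the watermark as (max event time seen) - allowed lateness."""

    def __init__(self, allowed_lateness: int):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0

    @property
    def watermark(self) -> int:
        return self.max_event_time - self.allowed_lateness

    def observe(self, event: Event) -> bool:
        """Return True if the event is on time, False if it arrived
        behind the current watermark (i.e., too late to include)."""
        on_time = event.event_time >= self.watermark
        self.max_event_time = max(self.max_event_time, event.event_time)
        return on_time

tracker = WatermarkTracker(ALLOWED_LATENESS)
print(tracker.observe(Event("gps", 100)))  # True: advances max event time to 100
print(tracker.observe(Event("gps", 97)))   # True: within the 5-second lateness budget
print(tracker.observe(Event("gps", 90)))   # False: behind the watermark (95)
```

A larger lateness budget admits more stragglers at the cost of waiting longer before results can be considered final — the same tradeoff discussed below with window sizes.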
There’s a tradeoff between small windows, which give you fast but incomplete or rough analysis, and very long windows, which give more complete results but potentially long delays before you get insights. It’s a balancing act, and, as usual, the right choice depends on the use case you have in mind.
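The window-size tradeoff can be sketched with a toy tumbling-window count. The integer-second timestamps and the `tumbling_window_counts` helper are assumptions for illustration only, not a specific streaming API:

```python
from collections import defaultdict

def tumbling_window_counts(event_times, window_size):
    """Assign each event timestamp to a fixed-size tumbling window
    (keyed by window start time) and count events per window."""
    counts = defaultdict(int)
    for t in event_times:
        window_start = (t // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

timestamps = [1, 3, 7, 12, 14, 21]

# Small windows: results arrive quickly, but each window sees little data.
print(tumbling_window_counts(timestamps, 5))   # {0: 2, 5: 1, 10: 2, 20: 1}

# Large windows: more complete aggregates, but insights arrive later.
print(tumbling_window_counts(timestamps, 15))  # {0: 5, 15: 1}
```

With 5-second windows you get an answer every 5 seconds, each based on only a couple of events; with 15-second windows each result summarizes more data but you wait three times as long for it.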
Want to learn more about continuous apps? Check in next week for the next article in this series.
Title image by Eva Blue