Continuous apps have emerged as a way to capture and make timely use of the never-ending flow of big data.
These end-to-end applications act on real time data, and as such, have different requirements than traditional apps.
When designing a continuous app, keep these four criteria in mind:
Look Out for Late Data
In distributed systems, data failures and delays are inevitable. Therefore, design for and expect late arriving data.
Early versions of continuous applications are frequently built on overly optimistic assumptions about delays.
Systems must explicitly deal with late arriving data, especially in circumstances where devices are intermittently connected, such as mobile devices and cell phones. Users may be occasionally offline and data may come in hours or even days after an event occurred.
Generating Early Output
Generating output early can provide benefits. For example, the app might not have all the data it needs, but if 95 percent of the data is in, an approximation may be useful. In essence the app is saying, “Here’s an estimate, the final number’s still coming.”
Consider having an on-time version, which generates output when the app thinks it has all the data, and then something like errata for late data, where the app thought it had all the data but was wrong.
Cases will crop up very often where a real time application makes a rough analysis based on incomplete information.
When you launch a new ad campaign, it’s more important to know early on if it’s working well, even if only 90 percent of the data is in. This information allows you to stop or fix things that aren't working as they should (e.g., if someone made a mistake when setting up the advertisement’s creative).
Designers face the question of how to maintain the state of the application.
Any system — including continuous applications — wants to process data exactly once. You don’t want duplicates or omissions.
State management allows you to efficiently process everything exactly once, even in the face of failures and recovery. More deeply, you'll need to access state to provide context on events: what is the history of this user, what are his or her interests, and what are the characteristics of this section of a website?
Interactions need context to make sense.
A number of different mechanisms are available for state management. You might have distributed checkpoints for computation, with data on each streaming processing node that stores its local state, e.g., using RocksDB. You might have an external NoSQL or relational database, a distributed file system such as HDFS or Amazon S3, or a distributed queue like Kafka.
Choices abound for how to keep state and how long to keep it. Most continuous apps need an analytic ecosystem with more than one way to handle state. If you have a relatively inexpensive method of keeping deep history, there's less of a question. Expensive storage, such as data cached in RAM on servers, requires a more constrained amount of state and possibly a recovery strategy if reconstruction is required.
One Size Code Fits All
Early versions of continuous applications required writing separate versions of application code for batch and real time processing. This created obvious problems, with two versions of code to maintain when you want to change something.
An important breakthrough in more recent platforms is the use of a common set of APIs. With this approach, application logic can remain consistent, whether building a continuous app or a batch app. Having a common set of code that runs the same logic provides huge value and prevents many problems.
You can get more information about platforms that support this capability here (registration required), which discusses examples tied to technologies such as Apache Kafka, Apache Beam, Spark Streaming and Apache Flink, among others.
Hopefully knowing these four criteria will help you design continuous apps that effectively act on real time data.
Editor's Note: This is the second in a three part series.