Topic: How to speed up the Lambda architecture using Fast Data
When it comes to automotive big data, we often deal with just two of the four Vs – Velocity and Volume. Since IDL-specified, known data streams are channeled to the back end, Variety (document variety) doesn’t apply. Plus, vehicles have to log on to the back end through an OAuth2-based authentication process. This ensures the right level of reliability and also eliminates the need for another V – Veracity (the accuracy and reliability of the data).
Picture source: http://www.intergen.co.nz/blog/Tim-Mole/dates/2013/8/musings-from-a-big-data-conference/
Consider, by contrast, a big data context with high variability: unstructured data and a large number of document formats. Aside from persisting the raw data in a distributed file system such as HDFS or Amazon S3, a comprehensive implementation of integration and normalization procedures is required, since we want to evaluate this variable data using analyses that are as generic as possible:
Picture source: Bastian Kraus
Complexity increases similarly in each batch processing job and in each stream processing pipeline, since each piece of raw data first has to be integrated into the target analysis format. Here it is, shown in a highly simplified way:
Picture source: Bastian Kraus
By contrast, in the automotive context, the structure of the data to be integrated is fixed. These data formats do not change after SOP (Start of Production) and remain the same for the entire life cycle of a vehicle series, so the integration and normalization processes only have to be created once; after that, they don’t change. It goes without saying that, in this context too, we want to implement data analyses and subsequent processes as generically as possible and with a low level of complexity. This boosts their execution efficiency, improves their maintainability and reduces the frequency of errors. Generally, in contexts with high data variability, raw data is first integrated and normalized. It follows that this normalization and integration should be done as the data is ingested into the system, in order to reduce the complexity of all downstream processes. We now want to integrate the procedure outlined above into the Lambda architecture pattern we often use.
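To make the idea of “integrate once, at ingestion” concrete, here is a minimal sketch. The raw record layout, field names and units are illustrative assumptions, not a real vehicle data format; the point is that the conversion into the canonical analysis schema happens a single time, so every downstream job sees the same structure.

```python
from datetime import datetime, timezone

# Hypothetical raw record as it might arrive from a vehicle gateway;
# field names and units are assumptions made for this illustration.
RAW = {"ts": "2016-12-01T08:15:00Z", "vin": "WDB1234567890", "speed_kmh": "87.5"}

def normalize(raw: dict) -> dict:
    """Integrate a raw record into the canonical analysis format once,
    at ingestion time, so all downstream analyses share one schema."""
    return {
        "timestamp": datetime.strptime(raw["ts"], "%Y-%m-%dT%H:%M:%SZ")
                             .replace(tzinfo=timezone.utc),
        "vin": raw["vin"],
        "speed_ms": float(raw["speed_kmh"]) / 3.6,  # unit normalization: km/h -> m/s
    }

record = normalize(RAW)
```

Because this runs once at ingestion, none of the batch jobs or stream pipelines need to repeat the parsing and unit conversion.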
The Lambda architecture is subdivided into three layers: the speed layer, the batch layer and the serving layer. The batch layer contains the master data set, also known as “The Truth”, which is processed into higher-level views in the serving layer via batch processing. The speed layer contains stream processing analyses which draw their data from the live data stream and likewise provide their results to the serving layer. In recent years, the Fast Data paradigm has been gaining importance and is considered the next step in the evolution of Big Data. In this approach, we attempt to extract information from the live data stream in near real time and to derive decisions directly from it. As you can see, the two approaches can be combined by moving data integration forward, to the point where data is ingested into the system.
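The interplay of the layers can be sketched in a few lines. This is not a framework, just an illustration under simplifying assumptions: a list stands in for the master data set of the batch layer, and a running aggregate stands in for a speed-layer view. Both are fed from the same already-integrated event stream.

```python
# Batch layer: immutable, append-only master data set ("The Truth").
master_dataset = []
# Speed layer: a running aggregate maintained in near real time.
live_view = {"count": 0, "speed_sum": 0.0}

def ingest(event: dict) -> None:
    """Feed one integrated event into both layers of the Lambda pattern."""
    master_dataset.append(event)              # batch path: keep the full truth
    live_view["count"] += 1                   # speed path: update the live view
    live_view["speed_sum"] += event["speed_ms"]

for e in [{"speed_ms": 20.0}, {"speed_ms": 30.0}]:
    ingest(e)

# A serving-layer-style query against the speed layer's view:
avg_speed = live_view["speed_sum"] / live_view["count"]
```

Because integration happened before `ingest`, both paths consume the same schema and neither needs its own parsing logic.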
Picture source: Bastian Kraus
This approach places a few additional requirements on the back end so that it can handle the higher load:
- Dynamic, load-driven scalability of the ingestion and integration micro-services (both scaling up and scaling down)
- Highly robust integration logic, since otherwise the truth is corrupted
- The central buffer memory between ingestion, integration and forwarding must be dimensioned appropriately for the large quantities of data and load peaks (throughput and buffer size)
- The integration interfaces must implement backpressure in the data stream in order not to overload downstream systems
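The last requirement, backpressure, can be demonstrated with a bounded buffer, here using Python’s standard-library `queue.Queue` as a stand-in for the central buffer between ingestion and forwarding (buffer size and the simulated processing delay are arbitrary illustrative values):

```python
import queue
import threading
import time

# Central buffer between ingestion and forwarding; maxsize bounds it.
buffer = queue.Queue(maxsize=4)
consumed = []

def consumer():
    """Simulates a slow downstream system."""
    while True:
        item = buffer.get()
        if item is None:          # sentinel: stream finished
            break
        time.sleep(0.01)          # artificial processing delay
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(10):
    buffer.put(i)                 # blocks while the buffer is full -> backpressure
buffer.put(None)
t.join()
```

Because `put()` blocks once the buffer is full, the producer is automatically throttled to the consumer’s pace instead of overwhelming it; real stream frameworks implement the same principle with bounded channels or reactive-streams demand signaling.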
If the back end fulfills these requirements, we can work with both our truth and our live data stream anywhere, using a well-defined, cross-language and efficient data format. Examples include Apache Thrift (https://thrift.apache.org/), Avro (https://avro.apache.org/) and Google protocol buffers (https://developers.google.com/protocol-buffers/).
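What all of these formats share is a fixed, well-defined layout that producers and consumers agree on in advance. As a dependency-free stand-in for Thrift, Avro or protocol buffers, the following sketch encodes that idea with Python’s standard-library `struct` module; the record layout (a 17-byte VIN, a float64 timestamp, a float32 speed, big-endian) is an assumption made purely for illustration:

```python
import struct

# Fixed binary layout shared by writer and reader, as in an IDL-specified
# format: 17-byte ASCII VIN, float64 Unix timestamp, float32 speed (m/s).
RECORD_FMT = ">17sdf"

def encode(vin: str, ts: float, speed_ms: float) -> bytes:
    """Serialize one record into the fixed binary layout."""
    return struct.pack(RECORD_FMT, vin.encode("ascii"), ts, speed_ms)

def decode(payload: bytes) -> tuple:
    """Deserialize one record; works without any per-message metadata."""
    vin, ts, speed_ms = struct.unpack(RECORD_FMT, payload)
    return vin.decode("ascii"), ts, speed_ms

payload = encode("WDB0000000000XY12", 1480578900.0, 24.3)
vin, ts, speed = decode(payload)
```

The real formats add schema evolution, code generation for many languages and compact variable-length encodings on top of this basic principle, which is why they are preferable in production.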
Aside from the reduced run-time and algorithmic complexity of the analysis code and processes, data analysts and developers no longer have to worry about normalization and integration. Moreover, this eliminates a major single point of failure that could otherwise produce false analysis results due to faulty integration code.
Author: Bastian K. (formerly Team Big Data, Concept Development & Tooling)
Picture sources: http://www.intergen.co.nz/blog/Tim-Mole/dates/2013/8/musings-from-a-big-data-conference/; Bastian K.
Topic December 2016: “H2O, when should I go to work?” – How to use a relatively simple method to develop a prototype for a machine-learning application using H2O