In my last blog article “When should I go to work?”, I showed how to build a random forest model in order to count the number of cars in an image.
In data analytics, a typical task is to first build a prototype model in R or Python. To bring this model to production, it is often necessary to rebuild the same model, for example in Java, for better integration with the given stack. This is really an easy task. In the following, I will show an example of how this can be implemented in a real-time processing pipeline.
We have built the model using R as follows:
Now we can easily extract this model as a POJO (Plain Old Java Object)1 from H2O and save it to the current working directory. The generated Java file has only a single dependency, so it is ideally suited to be embedded in a framework such as Apache Spark2.
It consists of many classes: one class for each tree, and additionally one class per category for each tree. Furthermore, we have trained our model on images of a fixed size, so we have 8800 features, i.e. one for each pixel. An example looks like:
This is quite a lot of source code, but it is a nice example that shows the very frequent use of the conditional operator, which is used to rebuild the decision tree. Now we can compile this Java class and make predictions using the predictCSV class offered by H2O3 4. Alternatively, we can call the generated Java classes directly as follows:
Here DRF_model_R_1478847643061_1 is the name of the downloaded POJO, and data is the array of features for which we want a prediction. We integrate this in a Prediction class which takes the features as input.
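The call pattern can be sketched as follows. Since the generated POJO cannot be reproduced here, a small hypothetical ModelStub with the same score0(double[], double[]) signature stands in for the generated class; in the real code, new DRF_model_R_1478847643061_1() would be used instead, and the probabilities hard-coded below are purely illustrative.

```java
// Hypothetical stand-in for the generated POJO. H2O-generated models fill
// preds[0] with the predicted class label and preds[1..n] with the
// per-class probabilities.
class ModelStub {
    public double[] score0(double[] data, double[] preds) {
        preds[1] = 0.1; preds[2] = 0.7; preds[3] = 0.2; // illustrative: P(0), P(1), P(2) cars
        // predicted label = index of the largest probability
        int best = 1;
        for (int i = 2; i < preds.length; i++) {
            if (preds[i] > preds[best]) best = i;
        }
        preds[0] = best - 1;
        return preds;
    }
}

class Prediction {
    private final ModelStub model = new ModelStub(); // real code: new DRF_model_R_1478847643061_1()

    /** Returns the estimated number of cars (0, 1 or 2) for one feature vector. */
    public int predict(double[] features) {
        double[] preds = new double[4]; // label + one probability per class
        model.score0(features, preds);
        return (int) preds[0];
    }

    public static void main(String[] args) {
        double[] features = new double[8800]; // one gray value per pixel
        System.out.println(new Prediction().predict(features)); // prints 1 (the stub's argmax)
    }
}
```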
Data processing pipeline
The data processing pipeline can be visualized as follows:
The individual steps are described in more detail below.
The webcam images from BayernInfo look as follows:
In order to obtain the features, i.e. download these pictures needed for the prediction, we can implement this in Java as follows:
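A minimal sketch of this download step could look like the following; only JDK classes (ImageIO, ByteArrayOutputStream) are needed, and the concrete webcam URL is left out, as it is specific to the BayernInfo page:

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URL;
import javax.imageio.ImageIO;

public class WebcamFetcher {

    /** Downloads the current webcam frame and returns it as a JPEG byte array. */
    public static byte[] fetch(String imageUrl) throws IOException {
        BufferedImage image = ImageIO.read(new URL(imageUrl));
        return toByteArray(image);
    }

    /** Serializes an image to a byte array for further processing. */
    public static byte[] toByteArray(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(image, "jpg", out);
        return out.toByteArray();
    }
}
```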
Since the webcam pictures on the BayernInfo page are only updated every 60 seconds, we cannot achieve real-time streaming in the millisecond range. The above code is therefore called once per minute, which gives us the number of cars seen on each new picture, so a (nearly) real-time processing can be achieved5. Here we additionally convert the given image to a byte array for further processing and send this byte array to the Apache Kafka topic in.webcam.data. Apache Kafka is an open-source real-time stream processing framework written in Scala and Java6 7.
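The minute-by-minute fetch-and-send loop can be sketched as follows. This is only a sketch: the broker address localhost:9092 is an assumption, the WebcamFetcher.fetch helper (returning the current frame as a byte array) is hypothetical, the webcam URL is omitted, and the kafka-clients dependency is required:

```java
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WebcamProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        Producer<String, byte[]> producer = new KafkaProducer<>(props);

        // poll the webcam once per minute, matching its update interval
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try {
                byte[] image = WebcamFetcher.fetch("..."); // hypothetical helper; URL omitted
                String key = String.valueOf(System.currentTimeMillis());
                producer.send(new ProducerRecord<>("in.webcam.data", key, image));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 60, TimeUnit.SECONDS);
    }
}
```

Using the receive timestamp as the message key pays off later: it lets the streaming job group all sub-images of one frame together.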
I will give a short explanation of how you can use Apache Kafka locally8:
- Download the latest version of Apache Kafka (I used 0.10.2.0)9.
- In order to work with Kafka, one first has to start ZooKeeper: zookeeper-server-start.bat ../../config/zookeeper.properties in the bin/windows folder of Kafka.
- Start Kafka: kafka-server-start.bat ../../config/server.properties.
- Create the topic “in.webcam.data” with the following command: kafka-topics.bat --create --topic in.webcam.data --zookeeper localhost:2181 --partitions 1 --replication-factor 1
I do this only in order to run my showcase; in a production environment, Apache Kafka should of course run on a cluster. Furthermore, it is recommended to use e.g. Apache Thrift10 to send and/or receive more complex data structures via a topic.
As already described in my previous blog post, the following processing is carried out on every image: cut off the header, take only the left side of the image, convert it to gray-scale and divide it into sub-images. I wrote an ImageDivider class for this, which you will find on GitHub together with the remaining source code.
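A simplified sketch of what such an ImageDivider might do is shown below. The header height and the 4 × 4 grid (yielding the 16 sub-images used later) are assumptions for illustration; the real implementation is on GitHub:

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the preprocessing: cut off the header, keep the
// left half, gray-scale the result, and split it into a 4 x 4 grid.
public class ImageDivider {

    public static List<BufferedImage> divide(BufferedImage image, int headerHeight) {
        // 1. cut off the header and keep only the left half of the image
        BufferedImage cropped = image.getSubimage(
                0, headerHeight, image.getWidth() / 2, image.getHeight() - headerHeight);

        // 2. convert to gray-scale by redrawing into a TYPE_BYTE_GRAY image
        BufferedImage gray = new BufferedImage(
                cropped.getWidth(), cropped.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        gray.getGraphics().drawImage(cropped, 0, 0, null);

        // 3. divide into 16 (4 x 4) sub-images
        List<BufferedImage> subImages = new ArrayList<>();
        int w = gray.getWidth() / 4;
        int h = gray.getHeight() / 4;
        for (int row = 0; row < 4; row++) {
            for (int col = 0; col < 4; col++) {
                subImages.add(gray.getSubimage(col * w, row * h, w, h));
            }
        }
        return subImages;
    }
}
```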
Spark Streaming Job
The main part of the Spark Streaming job looks as follows:
The steps can be explained as follows:
- Consume images from the topic in.webcam.data every 60 seconds; the interval can be set in the Kafka consumer config, see getConsumerConfig (TrafficAnalysisApplication class) on GitHub.
- Divide the images into sub-images by calling the ImageDivider class.
- Run the prediction on all sub-images, estimating 0, 1 or 2 cars on each.
- We get 16 sub-images, and all sub-images share the same key, i.e. the timestamp when the image was received, so we can call reduceByKey to estimate the number of cars on the whole image.
- Save the result to the Elasticsearch index analysis/log.
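Putting these steps together, the core of the job might look roughly like the following sketch. It assumes Spark 2.x with the Kafka 0.10 integration, and the helpers divide, predict and saveToElasticsearch are trivial stand-ins for the corresponding code on GitHub:

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.imageio.ImageIO;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import scala.Tuple2;

public class TrafficAnalysisJob {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("traffic-analysis").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));

        JavaInputDStream<ConsumerRecord<String, byte[]>> stream =
                KafkaUtils.createDirectStream(jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, byte[]>Subscribe(
                                Collections.singleton("in.webcam.data"), getConsumerConfig()));

        stream
                // emit one (timestamp, estimated cars) pair per sub-image
                .flatMapToPair(record -> divide(toImage(record.value())).stream()
                        .map(sub -> new Tuple2<>(record.key(), predict(sub)))
                        .iterator())
                // the sub-images of one frame share the timestamp key, so
                // summing per key gives the car count for the whole image
                .reduceByKey(Integer::sum)
                .foreachRDD(rdd -> rdd.foreach(result -> saveToElasticsearch(result)));

        jssc.start();
        jssc.awaitTermination();
    }

    private static Map<String, Object> getConsumerConfig() {
        Map<String, Object> config = new HashMap<>();
        config.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        config.put("group.id", "traffic-analysis");
        config.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        config.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        return config;
    }

    private static BufferedImage toImage(byte[] bytes) {
        try {
            return ImageIO.read(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // stand-in: the real ImageDivider returns the 16 gray-scale sub-images
    private static List<BufferedImage> divide(BufferedImage image) {
        return Collections.singletonList(image);
    }

    // stand-in: the real method feeds the pixel values to the H2O POJO
    private static int predict(BufferedImage subImage) {
        return 0;
    }

    // stand-in: the real method indexes the result into analysis/log
    private static void saveToElasticsearch(Tuple2<String, Integer> result) {
        System.out.println(result);
    }
}
```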
“Elasticsearch is a distributed, JSON-based search and analytics engine”12, and the current version (5.2.2) can be downloaded from https://www.elastic.co/downloads/elasticsearch. Since it handles JSON natively, we use the Jackson project13 to map the data to this format. To visualize the data and to configure and manage Elasticsearch, Kibana can naturally be used within the so-called ELK stack. You can get the current version (5.2.2) from https://www.elastic.co/downloads/kibana.
Now we can create a line chart in Kibana (see the GitHub README), which looks like this:
Here we can see the estimated number of cars per time unit. It is possible to set auto-refresh to “1 minute”, so every time a new image is received from the webcam we get the most likely number of cars on it.
It was shown how we can take a previously built prototype in R and move it to production in order to create a real-time streaming application. We can apply this kind of traffic analysis to more webcams in Bavaria; thus, it is possible to create a service that estimates the likely traffic along a planned route. Furthermore, it is conceivable to carry out a kind of anomaly or outlier detection on these data, for example to identify traffic accidents or similar abnormal events. Moreover, we could use advanced tools such as OpenCV14 or TensorFlow15 to improve the detection of vehicles on the images and thus the counting.
Author: Andreas W. (Entwicklung Konzepte & Tooling)
Picture sources: Andreas W.