Demo of Apache Kafka on Heroku

Data processing pipeline architecture using Apache Kafka on Heroku

Get Started

Overview

This system consumes data from the Twitter Streaming API, transforms that data through a series of Heroku apps, and generates a dynamic visualization of the results.

The architecture uses five Heroku apps, each serving a different role in the data pipeline.

  • Data Ingest: Reads from the Twitter Streaming API and produces messages to a Kafka topic built for high-volume ingest
  • Data Fanout: Consumes the ingested messages and fans them out to discrete per-keyword Kafka topics
  • Aggregate Statistic Calculation: Consumes messages from the keyword Kafka topics, calculates aggregate mention counts, and produces them to a topic
  • Related Terms Generation: Consumes messages from the keyword Kafka topics and produces related words and related-word counts to a topic
  • Visualization: Consumes messages from the aggregate and related-words Kafka topics and generates the dynamic stream visualizations in a web application
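
At a glance, data flows through the pipeline as sketched below. Topic names in brackets are illustrative: the per-keyword naming convention is described under Configuration Options (shown here with the example keyword 'dog'), and the ingest topic's actual name is set during deploy.

    Twitter Streaming API → Data Ingest → [ingest] → Data Fanout → [dog-keyword]

    [dog-keyword] → Aggregate Statistic Calculation → [dog-aggregate]    → Visualization
    [dog-keyword] → Related Terms Generation        → [dog-relatedwords] → Visualization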

The instructions below deploy the full system. See the Architecture section further below for a diagram of how data flows through it.

Prerequisites

Deploy

  1. Set up Data Ingest by following the instructions in the kafka-tweet-producer README. You will...
    • Deploy the code to Heroku
    • Configure the Heroku app
    • Create the Apache Kafka on Heroku cluster that will be used by all apps in the data processing pipeline
    • Create a Twitter App so you can connect to the Twitter Streaming API with valid credentials
  2. Set up Data Fanout by deploying the kafka-twitter-fanout repo to a Heroku app. Follow the instructions in the repo's README. You will...
    • Deploy the code to Heroku
    • Configure the Heroku app
    • Attach the previously created Kafka on Heroku add-on to this app (a CLI sketch follows this list)
  3. Set up Aggregate Statistics Calculation by deploying the kafka-twitter-aggregate repo to a Heroku app. Follow the instructions in the repo's README. You will...
    • Deploy the code to Heroku
    • Configure the Heroku app
    • Attach the previously created Kafka on Heroku add-on to this app
  4. Set up Related Terms Generation by deploying the kafka-twitter-relatedwords repo to a Heroku app. Follow the instructions in the repo's README. You will...
    • Deploy the code to Heroku
    • Configure the Heroku app
    • Attach the previously created Kafka on Heroku add-on to this app
  5. Set up the Visualization web app by deploying the kafka-demo repo to a Heroku app. Follow the instructions in the repo's README. You will...
    • Deploy the code to Heroku
    • Configure the Heroku app
    • Attach the previously created Kafka on Heroku add-on to this app
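
As a concrete sketch of the recurring add-on steps above, using the Heroku CLI: the cluster is provisioned once (on the Data Ingest app) and then attached to each downstream app. The app names, the basic-0 plan, and the generated add-on name shown here are examples; substitute your own values.

    # Provision the Kafka cluster once, on the Data Ingest app
    # (app names and the basic-0 plan are examples)
    heroku addons:create heroku-kafka:basic-0 -a my-kafka-ingest

    # Look up the add-on name Heroku generated (e.g. kafka-lively-12345)
    heroku addons -a my-kafka-ingest

    # Attach the same cluster to each of the other four apps
    heroku addons:attach kafka-lively-12345 -a my-kafka-fanout
    heroku addons:attach kafka-lively-12345 -a my-kafka-aggregate
    heroku addons:attach kafka-lively-12345 -a my-kafka-relatedwords
    heroku addons:attach kafka-lively-12345 -a my-kafka-visualization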

Configuration Options

Switch palette between Heroku and Salesforce colors

By default, the Visualization web app uses a Heroku-inspired color palette. To switch to a Salesforce-inspired palette, set the SALESFORCE_THEME environment variable to true.
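
For example, with the Heroku CLI (the app name is a placeholder for whatever you named your Visualization app):

    heroku config:set SALESFORCE_THEME=true -a my-kafka-visualization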

Change or add tracked keywords

The data visualization has been tested with up to five keywords and looks best with three to five. You may use more or fewer, but the visualization may not behave as expected.

  1. Create the necessary Kafka topics. This may be done with the Heroku CLI or from the Kafka on Heroku dashboard. Three topics are required for each keyword, named <keyword>-keyword, <keyword>-aggregate, and <keyword>-relatedwords (for example, dog-keyword, dog-aggregate, and dog-relatedwords if the keyword is 'dog'). A CLI sketch follows this list.
  2. Update the configuration of the Data Ingest app. The TWITTER_TRACK_TERMS environment variable must be a comma-separated list of all keywords you want to track, e.g. dog,cat,fish
  3. Update the configuration of the Data Fanout app in the same way, setting TWITTER_TRACK_TERMS to the same comma-separated keyword list.
  4. Update the Procfile for the Aggregate Statistics Calculation app to include a process for each keyword (see the note after this list).
  5. Update the Procfile for the Related Terms Generation app to include a process for each keyword.
  6. Update the configuration of the Visualization app in the same way as the Data Ingest and Data Fanout apps, setting TWITTER_TRACK_TERMS to the same comma-separated keyword list.
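
As a concrete sketch of steps 1-3 and 6, adding a keyword 'bird' alongside dog, cat, and fish: the app names below are placeholders, and kafka:topics:create (from the heroku-kafka CLI plugin) is shown with its defaults; see the plugin's help for partition and retention flags, or create the topics from the dashboard instead.

    # Create the three topics the pipeline expects for the new keyword
    heroku kafka:topics:create bird-keyword -a my-kafka-ingest
    heroku kafka:topics:create bird-aggregate -a my-kafka-ingest
    heroku kafka:topics:create bird-relatedwords -a my-kafka-ingest

    # Set the same keyword list on the Data Ingest, Data Fanout,
    # and Visualization apps
    heroku config:set TWITTER_TRACK_TERMS=dog,cat,fish,bird -a my-kafka-ingest
    heroku config:set TWITTER_TRACK_TERMS=dog,cat,fish,bird -a my-kafka-fanout
    heroku config:set TWITTER_TRACK_TERMS=dog,cat,fish,bird -a my-kafka-visualization

For steps 4 and 5, duplicate an existing process line in each app's Procfile and change only the keyword argument; the exact command depends on each repo, so copy whatever pattern its Procfile already uses.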

Architecture

In addition to showing how the system is architected, this diagram also shows how data moves through the system.

The Kafka cluster is represented by the large light purple rectangle. Within that, each named rectangle represents a Kafka topic. The hexagons are Heroku apps that manipulate data. They produce data to and/or consume data from Kafka topics.