
Routing Messages through Kafka

I’m going through a major project with a client that involves migrating to a Kafka-based streaming message platform. This is a heckuva lot of fun. Previously, I’ve written about how Kafka made the message bus dumb, on purpose: find it here.

Building out a  new streaming platform is an interesting challenge.  There are so many ways to handle the logic.  I’ll provide more details as time goes on, but there are at least three ways of dealing with a stream of data.

The Hard Wired Bus

A message bus needs various publishers and subscribers. You can very tightly couple each service by making each one aware of what’s upstream and what’s downstream. Upstream is where the message came from; downstream is where it goes next. When each component is aware of the route a message must take, the system becomes brittle and hard to change over time. Imagine spaghetti.
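
Here’s a minimal sketch of what that coupling looks like, assuming Python and the kafka-python client (the topic names and the enrich step are invented for illustration):

    # A hard-wired service: it knows exactly which topic feeds it and
    # which topic comes next.  Changing the route means changing the code.
    from kafka import KafkaConsumer, KafkaProducer

    UPSTREAM = 'orders_validated'    # hypothetical upstream topic, baked in
    DOWNSTREAM = 'orders_enriched'   # hypothetical downstream topic, baked in

    def enrich(payload):
        # Stand-in for this service's actual work.
        return payload.upper()

    consumer = KafkaConsumer(UPSTREAM, bootstrap_servers='localhost:9092',
                             group_id='enricher')
    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    for msg in consumer:
        producer.send(DOWNSTREAM, enrich(msg.value))  # route knowledge lives here

Multiply that by every service in the stream, and rerouting anything means touching code in several places at once.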

The Smart Conductor

A conductor travels on the train and determines the best route while moving along the tracks. The conductor can inspect every message after every service to determine which service is next in line. This cleans up the function of each service along the line, but it makes the conductor pretty brittle too. The more complex the system gets, the more complex the conductor gets. A rules engine would be a great component to add to the conductor if you choose this path.
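
A rough sketch of the conductor, under the same assumptions (Python, kafka-python, made-up topics and rules):

    # A conductor service: every message returns to the 'conductor' topic
    # after each step, and one central rules table picks the next hop.
    import json
    from kafka import KafkaConsumer, KafkaProducer

    # All routing knowledge lives here - and it grows with every new path.
    RULES = {
        'ingest':      lambda e: 'fraud_check' if e.get('amount', 0) > 1000 else 'archive',
        'fraud_check': lambda e: 'alert' if e.get('suspicious') else 'archive',
    }

    consumer = KafkaConsumer('conductor', bootstrap_servers='localhost:9092',
                             group_id='conductor',
                             value_deserializer=lambda b: json.loads(b.decode('utf-8')))
    producer = KafkaProducer(bootstrap_servers='localhost:9092',
                             value_serializer=lambda e: json.dumps(e).encode('utf-8'))

    for msg in consumer:
        event = msg.value
        producer.send(RULES[event['last_step']](event), event)  # pick the next hop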

The Map Maker

A map maker plots a route for a hiker to follow. In our case, the map maker is a component that sits at the very beginning of every stream. When an event arrives on the map maker’s topic (in Kafka), the map maker determines the best route for the message to take. Using metadata or embedded data in the event, the map maker can send the route down the chain with the event itself. Each service can use a common library to read from its configured topic, let the custom service code do its work, and then use the wrapper again to pass the message downstream. The biggest advantage here is that each service doesn’t care where the message comes from, or where it goes next. This works great for static streams, where the route can be determined up front. If there are decisions downstream, you may need a ‘switch’ service that is allowed to update the route.
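
Here’s one way the map maker and the common wrapper library could look, sketched in Python with kafka-python (the topic names and route are invented):

    # Map-maker pattern: the route travels with the event.  Services pop
    # the next hop off the route and never know the overall topology.
    import json
    from kafka import KafkaConsumer, KafkaProducer

    producer = KafkaProducer(bootstrap_servers='localhost:9092',
                             value_serializer=lambda e: json.dumps(e).encode('utf-8'))

    def map_maker(event):
        # Sits at the start of the stream and stamps the full route up front.
        event['route'] = ['parse', 'geo_lookup', 'store']   # hypothetical hops
        producer.send(event['route'].pop(0), event)

    def run_service(my_topic, work):
        # The common wrapper: read from the configured topic, let the custom
        # service do its work, then forward to whatever the route says is next.
        consumer = KafkaConsumer(my_topic, bootstrap_servers='localhost:9092',
                                 group_id=my_topic,
                                 value_deserializer=lambda b: json.loads(b.decode('utf-8')))
        for msg in consumer:
            event = work(msg.value)
            if event['route']:                              # more hops remain
                producer.send(event['route'].pop(0), event)

A ‘switch’ service is then just a service whose work step rewrites event['route'] before the wrapper forwards it.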

What’s the best path for your application?    Have something better that I haven’t thought of yet?


Inverting the Message Bus

I had a conversation this morning where I (maybe I’m slow) finally realized how Apache Kafka has inverted the responsibility in the world of message passing.

Traditional enterprise service buses (Wikipedia: Enterprise Service Bus) typically have some smarts built in. The bus itself routes messages, transforms messages, and orchestrates actions based on message attributes. This was the first attempt at building a great mediation layer in an enterprise. Some advantages of the traditional ESB were:

  • Producer/consumer language agnostic
  • Input/output format changes (XML, JSON, etc.)
  • Defined routing and actions on messages

The challenges were typical for traditional enterprise software. Scaling was a mess, and licenses could be cost-prohibitive at scale. This meant lower adoption and a general loss of those advantages for smaller projects or customers.

Talk about a huge and complex stack! Look at this picture of the ‘core’ capabilities of an Enterprise Service Bus:

[Image: ESB Component Hive]

Now let’s take a look at Apache Kafka.

[Image: Kafka Diagram]

Ok, that’s a lot of arrows and lines and blocks, oh my.

BUT, the thing to notice here, and it’s SUPER important, is that they’re all outside the Kafka box. Kafka isn’t smart. In fact, Kafka was designed to be dumb. There is no message routing, there are no message format changes, nothing. The big box in the middle is dumb. It scales really well, and it stays dumb.

In fact, the only ‘type’ of communication Kafka has is publish/subscribe. One (or many) clients produce messages to a topic. They send in data. Doesn’t matter if it’s JSON, XML, Yiddish, etc. It goes to the topic. Kafka batches the messages up and ‘persists’ them as a log file. That’s it. A big old data file on disk. The smarts of Kafka come next… One consumer group (which may be MANY actual instances of software, but with the same group ID) subscribes to a topic… or more than one topic. Kafka (with ZooKeeper’s help) remembers which client in the group has seen which block of messages. Ok, that sounds confusing. I’ll try again.

Kafka coordinates which blocks of data go to which client. If the clients are in the same consumer group, each message is only sent to one member of the group. More than one consumer group can subscribe to a topic, so you can have multiple consumer processes for each topic.
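
A minimal sketch of both sides, assuming Python and the kafka-python client (the topic and group IDs are made up):

    # Produce: bytes in.  Kafka doesn't care if it's JSON, XML, or Yiddish.
    from kafka import KafkaConsumer, KafkaProducer

    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    producer.send('events', b'From: a@example.com ...')   # plain text, as-is
    producer.flush()   # block until the batch has gone out

    # Consume: every process started with group_id='indexer' shares the
    # topic's partitions, so each message reaches one member of the group.
    # A different group_id ('auditor', say) gets its own full copy.
    consumer = KafkaConsumer('events',
                             bootstrap_servers='localhost:9092',
                             group_id='indexer',
                             auto_offset_reset='earliest')  # start at the oldest data
    for msg in consumer:
        print(msg.partition, msg.offset, msg.value)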

Now, instead of the message bus sending messages from one function to another, that work is left up to the clients.   For instance, let’s say you have to ingest an email from a mail server and test it to see if there’s a malicious reply-to address.

First, the message comes in as plain text to the ‘email_ingest’ topic. This topic can be published to by many clients reading data from many servers. Let’s assume Logstash. Logstash will send the message in as plain text. After the message is in the ‘email_ingest’ topic, another program will transform that message to JSON. This program subscribes to ‘email_ingest’, pulls each message, transforms it to JSON, and publishes it back to another topic, ‘email_jsonified’.
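
That transform stage might look something like this (a sketch in Python with kafka-python; the actual email parsing is stubbed out):

    # Subscribe to 'email_ingest', turn each plain-text message into JSON,
    # and publish the result to 'email_jsonified'.
    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer('email_ingest', bootstrap_servers='localhost:9092',
                             group_id='jsonifier')
    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    for msg in consumer:
        raw = msg.value.decode('utf-8')
        doc = {'raw': raw, 'bytes': len(raw)}   # stand-in for real email parsing
        producer.send('email_jsonified', json.dumps(doc).encode('utf-8'))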

The last piece of the puzzle is the code that calls the email hygiene function. This piece of code takes the longest, due to calling an external API, so it needs to scale horizontally the most. This function reads from ‘email_jsonified’, calls the external API, and if a malicious IP or reply-to is detected, publishes the message to the last topic, ‘email_alert’. ‘email_alert’ is subscribed to by another Logstash instance, which pushes the message into Elasticsearch for visualization in Kibana.
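
And a sketch of the hygiene stage, with the external API call stubbed out (check_reply_to is a hypothetical stand-in, not a real service):

    # Read 'email_jsonified', check the reply-to address, and publish hits
    # to 'email_alert'.  Run many instances under group_id='hygiene' to
    # scale the slow external call horizontally.
    import json
    from kafka import KafkaConsumer, KafkaProducer

    def check_reply_to(doc):
        # Hypothetical stand-in for the external reputation API.
        return doc.get('reply_to', '').endswith('.bad')

    consumer = KafkaConsumer('email_jsonified', bootstrap_servers='localhost:9092',
                             group_id='hygiene')
    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    for msg in consumer:
        if check_reply_to(json.loads(msg.value)):
            producer.send('email_alert', msg.value)   # on to Logstash/Elasticsearch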

Sounds complicated, right?

The big difference here is that the intelligence moved into the clients. The clients need to handle the orchestration, error handling, reporting, etc. That has some pros and cons. It’s great that clients can now be written in many technologies, and there is more ‘freedom’ for a development group to do their own thing in a language or framework they’re best suited for. That can also be bad. Errors add a new challenge. Dead letter queues can be a pain to manage, but, again, it puts the onus on the client organization (in the case of a distributed set of teams) to handle its own errors.

Kafka scales horizontally on a small footprint really easily. It’s mostly a network-IO-bound system, rather than a CPU- or memory-bound one. It’s important to keep an eye on disk space, memory, and CPU, but they tend not to be an issue if you set up your retention policies in an environment-appropriate manner.

Reach out if you have any questions.

Do you prefer RabbitMQ?  ActiveMQ?  Kafka?  (They’re not the same, but similar!)
