De-Coder’s Ring

Consumable Security and Technology

Tag: neo4j

Neo4J: Driver Comparison


I’m not going to dig into WHY there’s a huge performance difference here, but I do want to illustrate why it’s good to try different approaches.  This ties into my Enough About Enough post from a while ago.  If I only knew Python, I’d be stuck.  Thankfully I don’t, and I was able to whip out a Java program to do this comparison… OK, I’m getting ahead of myself.  Let me start from scratch:

As I’ve written about before, I’m dealing with graph data in Neo4j.  Big shout out to the @Neo4j folks, as they’ve been instrumental in guiding me through some of my Cypher queries and whatnot.  Especially Michael H, who is apparently the Cypher Wizard to beat all Wizards.

Cyber threat intelligence data can be transmitted over TAXII.  TAXII is the transport; the payload is STIX, an XML format that describes aspects of threat intelligence.  Read more in my blog post Stix, Taxii: Understanding Cybersecurity Intelligence.

Since there are some nested structures and relationships in this data that aren’t exactly ‘easy’ to model in an RDBMS or a document store, I decided to shove it into a graph database.  (Honestly, I’ve been looking for a Neo4j project for a while, and this time it worked!)  At Perch Security, our customers have access to threat intelligence data, paid and open source.  We want to give them access to that data in a specific way, so I have to store and query it.  Storage is straightforward, and I can get into that more later, but right now I’m looking at querying this data.

After learning a few tricks for my Cypher (again, thanks @michael.neo on Slack), I plugged it into my Python implementation.  It took a while for me to figure out how to get access to the Cursor so I could stream results and not pull the whole record set.  After all, I’m trying to create 100k records, which turns out to take approximately 500k nodes from Neo4j.

My Data Flow

The gist of my data flow is simple: over time, I’m constantly polling an external data source and shoving the TAXII data into Neo4j.  My relationships are set, I try to consolidate objects where possible, and I’m off to the races.

When I query, I issue one big statement to execute my Cypher with a max results of a lot… basically, give me all the records until I stop reading them.  In other words, I use a limit in my Cypher that’s much higher than what I will actually need.

My code starts streaming the results and, one by one, shoves them into a ‘collector’ type object.  When the collector hits a batch size (5MB+), I add a part to my AWS S3 multipart upload.  When I’m done reading records (either I ran out of records or hit my limit), I force an upload of the rest of the data I have in my collector, finalize the multipart upload, and that’s it.
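A minimal sketch of that collector pattern follows.  It is not my production code: the `Collector` class, the `stream_to_parts` helper and the `upload_part` callback are hypothetical names, and the callback stands in for whatever actually sends a part to S3.  S3 requires every part except the last to be at least 5 MB, which is why the batch size sits at 5MB+.

```python
from typing import Callable, Iterable, Optional

PART_SIZE = 5 * 1024 * 1024  # 5 MB batch size, as described in the post


class Collector:
    """Buffers serialized records and hands off a part once the batch size is hit."""

    def __init__(self, upload_part: Callable[[int, bytes], None],
                 part_size: int = PART_SIZE) -> None:
        # upload_part is a hypothetical callback: (part number, data) -> None.
        self._upload_part = upload_part
        self._part_size = part_size
        self._buffer = bytearray()
        self._part_number = 0

    def add(self, record: bytes) -> None:
        self._buffer += record
        if len(self._buffer) >= self._part_size:
            self.flush()

    def flush(self) -> None:
        """Upload whatever is buffered; does nothing if the buffer is empty."""
        if not self._buffer:
            return
        self._part_number += 1
        self._upload_part(self._part_number, bytes(self._buffer))
        self._buffer.clear()


def stream_to_parts(records: Iterable[bytes],
                    upload_part: Callable[[int, bytes], None],
                    part_size: int = PART_SIZE,
                    limit: Optional[int] = None) -> int:
    """Read records one at a time, stop at the limit, then force the final upload."""
    collector = Collector(upload_part, part_size)
    count = 0
    for record in records:
        if limit is not None and count >= limit:
            break
        collector.add(record)
        count += 1
    collector.flush()  # the leftover data becomes the last, possibly smaller, part
    return count
```

Because `records` is only iterated, never materialized, this works the same whether it is a list or a streaming cursor.  Finalizing the multipart upload would happen after `stream_to_parts` returns.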

My Python code took about 15 minutes.  No lie, 15 minutes.  I tried optimizing, I tried smaller batches, etc., and my results were always linear.  I tried ‘paginating’ (using skip and limit), but that didn’t help… actually, I did skip/limit first, then I went to the big streaming block.
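For reference, skip/limit pagination looks roughly like the sketch below.  The query text, the `Observable` label and the `run_query` callable are placeholders I made up for illustration, not my real query or any driver’s API.  Each page is a separate query, and the server typically still has to produce and discard the skipped rows.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical query: the label and property names are placeholders.
PAGE_QUERY = "MATCH (o:Observable) RETURN o ORDER BY o.id SKIP $skip LIMIT $limit"


def paginate(run_query: Callable[[str, Dict[str, int]], List[Any]],
             page_size: int, max_records: int) -> Iterator[Any]:
    """Yield up to max_records rows, fetching page_size rows per query."""
    skip = 0
    while skip < max_records:
        limit = min(page_size, max_records - skip)
        page = run_query(PAGE_QUERY, {"skip": skip, "limit": limit})
        yield from page
        if len(page) < limit:
            return  # a short page means the data ran out
        skip += len(page)
```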

Ugh. 15 minutes.  I’m seeing visions of a MAJOR problem when we’re in full production mode.  Imagine: I have 300 customers, and I have to run a daily (worst case) batch job for each customer.  Holy cow.  I’m going to have to scale like CRAZY if I’m going to match that… I’m sweating.  The doubt kicks in.  Neo4j wasn’t the right choice!  I wanted a pet project, and it’s going to kick my butt and get me fired!!!!!

Last-ditch effort: I rewrite in Java.

I’ve done a lot of Java in my life, 10+ years full time, but it’s been over 6 years since I’ve REALLY written much Java code.  Man, has it changed.

But I digress.  I download Eclipse, do some googling on Java/Maven/Docker (I ended up using the Azul Docker image, way smaller than the standard Java 8 one: azul/zulu-openjdk:8), and I’m off to the races.  I get to learn how to read from an SQS queue, query Neo4j, and write to S3, all in 4 hours so I don’t get fired.

After a bunch of testing, and getting Docker running and the image uploaded to AWS ECR… I run it.  It runs… and craps out after 15 seconds.  Shoot.  Where did I break something?

I go to my logger output… hmm, no Exceptions.  No bad code blocks or stupid logic (I got rid of those in testing).

I run it again.

15 seconds.

HOLY COW!

15 seconds?  I check my output file.  It looks good. It matches the one created from python.    15 seconds?!

Something is wrong with py2neo, that’s for darn sure.

Would anyone be interested in an example chunk of code to do this?

Email me if you would.

Stix, Taxii: Understanding Cybersecurity Intelligence

Cyber Intelligence Takes Balls


Introduction
I spent years building a packet capture and network forensics tool. Slicing and dicing packets makes sense to me. Headers, payloads, etc. Easy peasy (no, it’s not really easy, but like I said, years). Understanding complex data structures comes with the territory, and so far, I haven’t met a challenge that took me too long to understand.

Then I met TAXII. Then STIX. I forgot how painful XML was.

TAXII: Trusted Automated eXchange of Indicator Information

STIX: Structured Threat Information eXpression

FYI:  All the visualizations and screenshots are grabbed from Neo4j, the top-rated and most used graph database in the world.  My work has some specific requirements that I think are best served by nodes, edges, and finding relationships between data, so I thought I’d give it a shot.  Nice to see a built-in browser that does some pretty fantastic drawing and layouts without any work on my part.  (Docker image to boot!)

Background
TAXII is a set of standards for how to transport intelligence data. The standard (now an OASIS standard) defines the HTTP(S) requests used to query a server and receive intelligence. For most use cases, there are three main phases of interaction with a server:

  1. Discovery – Figure out the ‘other’ end points, this is where you start
  2. Collection Information – Determine how the intelligence is stored. Think of collections as a repository, or grouping of intelligence data within the server.
  3. Poll (pull) – (or push, but I’m focusing on pull). Receive intelligence data for further processing. Poll requests will result in different STIX packages (more to come)

I’m not going to go into details on the interactions here, but the Python library for TAXII does a good enough job to get you started.  It’s not perfectly clear, but it helps.
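As a rough picture of how the three phases chain together, here is a sketch that deliberately avoids any real TAXII library.  The `send` callable, the message type strings and the response shapes (`services`, `collections`, `packages`) are all assumptions made up for illustration; a real client would build and parse TAXII XML messages instead.

```python
from typing import Callable, Dict, Iterator, Tuple

# Hypothetical transport: (url, message type, parameters) -> parsed response.
Send = Callable[[str, str, Dict[str, str]], Dict]


def poll_everything(send: Send, discovery_url: str) -> Iterator[Tuple[str, Dict]]:
    """Walk the three phases and yield (collection name, STIX package) pairs."""
    # 1. Discovery: learn where the other services live.
    services = send(discovery_url, "discovery", {})["services"]

    # 2. Collection information: learn which collections exist.
    info = send(services["collection_information"], "collection_information", {})
    collections = info["collections"]

    # 3. Poll: pull the packages out of each collection.
    for name in collections:
        response = send(services["poll"], "poll", {"collection": name})
        for package in response["packages"]:
            yield name, package
```

Passing the transport in as a callable keeps the flow easy to exercise against a fake server before pointing it at a real one.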

STIX defines some data structures around intelligence data.   Everything is organized in a ‘package’.  The package contains different pieces of information about the package and about the intelligence.  In this article, I’ll focus on ‘observables’ and ‘indicators’.  The items I won’t talk much about are:

  • TTPs:  Tactics, Techniques and Procedures.  What mechanisms are the ‘bad guys’ using.  Software packages, exploit kits, etc.
  • Exploit Target:  What’s being attacked
  • Threat Actor: If known, who/what’s attacking?
  • TLPs, Kill chains, etc

Observables

Observables are the facts.  They are pieces of data that you may see on your network, on a host, in an email, etc.  These can be URLs, email addresses, files (and their corresponding hashes), IP addresses, etc.   A fact is a fact.  There’s no context around it, it’s just a fact.

A URL that can be seen on a network


Indicators

Indicators are the ‘why’ around the facts.  These tell you what’s wrong with an IP address, or give the context and story about an email that was seen.

Context around an observable


In the above pictures, you’ll see a malicious URL (hulk**, seriously, don’t follow it).   The observable component is the URL.  The indicator component tells us that it’s malicious.  The description above tells us that the intelligence center at phishtank.com identified the URL as part of a phishing scheme.
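One way to picture the split between the two is the sketch below.  The class and field names are my own shorthand for illustration, not the STIX schema, and the URL is a harmless placeholder rather than the real one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Observable:
    """A bare fact: something you might see on a network or a host."""
    kind: str   # e.g. "url", "ip", "email", "file_hash"
    value: str


@dataclass
class Indicator:
    """The context: why the attached observables matter."""
    title: str
    description: str
    observables: List[Observable] = field(default_factory=list)


# Placeholder values only; the real URL is intentionally not reproduced.
phish = Indicator(
    title="Malicious URL",
    description="Identified by phishtank.com as part of a phishing scheme",
    observables=[Observable(kind="url", value="http://example.invalid/login")],
)
```

The observable carries no judgment on its own; everything that says ‘this is bad’ lives on the indicator.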

Source of data

All security analysts are well aware of some open source intelligence data: Emerging Threats, PhishTank, etc.  This data is updated regularly and provided in each source’s own format.  Since we’re talking about using TAXII to transport this data, we need an open source/free TAXII source.  Enter http://hailataxii.com

When you make a query against Hailataxii’s discovery end point, you learn the collections and poll URLs.  You also get the inbox URL, but we’re not using that today.  (Coincidentally, HAT’s URLs are all the same.)

Once you query the collection information end point, you see approximately 11 (at the time of writing) collections.  I will list those below.  From there, we can make Poll requests to each collection and start receiving STIX packages (hundreds? thousands?).

STIX Package

Since I’m a network monitoring junkie, I want to see the observables I can monitor.  Specifically IPs and URLs.  Parsing through the data, I find some interesting tidbits.  Some packages have observables at the top level, and some have observables as children of the indicators.  No big deal, we’ll keep it all and start storing/displaying.
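Handling both placements can be sketched like this, assuming the package has already been parsed into plain dictionaries.  The keys (`observables`, `indicators`, `id`, `type`) are hypothetical and do not mirror the real STIX XML element names.

```python
from typing import Dict, Iterator, List, Optional, Sequence, Tuple


def iter_observables(package: Dict) -> Iterator[Tuple[Optional[str], Dict]]:
    """Yield (indicator id or None, observable) for both placements."""
    # Observables sitting directly under the package have no indicator.
    for observable in package.get("observables", []):
        yield None, observable
    # Observables nested under an indicator keep a reference to it.
    for indicator in package.get("indicators", []):
        for observable in indicator.get("observables", []):
            yield indicator["id"], observable


def network_observables(package: Dict,
                        kinds: Sequence[str] = ("ip", "url")
                        ) -> List[Tuple[Optional[str], Dict]]:
    """Keep only the observable types that can be watched for on a network."""
    return [(indicator_id, observable)
            for indicator_id, observable in iter_observables(package)
            if observable.get("type") in kinds]
```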

Once it’s all parsed using some custom Python (what a mess!), I’m able to start loading my nodes and edges.  It’s straightforward: I build nodes for the Community (Hailataxii), the Collection, the Package, Indicators, and Observables.  The observables can be related to the Indicator and/or the Package.
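The node and edge layout can be sketched in memory before anything touches Neo4j.  This is an illustration, not my loader: the relationship names (`HAS_COLLECTION`, `HAS_PACKAGE`, `HAS_INDICATOR`, `HAS_OBSERVABLE`) and the dictionary keys are assumptions, and keying observables by value is just one way to make the same fact collapse into a single node.

```python
from typing import Dict, Set, Tuple

Node = Tuple[str, str]         # (label, key)
Edge = Tuple[Node, str, Node]  # (source, relationship type, target)


def build_graph(community: str, collection: str,
                package: Dict) -> Tuple[Set[Node], Set[Edge]]:
    """Turn one parsed package into a set of nodes and a set of edges."""
    nodes: Set[Node] = set()
    edges: Set[Edge] = set()

    def link(source: Node, relationship: str, target: Node) -> None:
        nodes.update((source, target))
        edges.add((source, relationship, target))

    package_node = ("Package", package["id"])
    link(("Community", community), "HAS_COLLECTION", ("Collection", collection))
    link(("Collection", collection), "HAS_PACKAGE", package_node)

    # Observables attached directly to the package.
    for observable in package.get("observables", []):
        link(package_node, "HAS_OBSERVABLE", ("Observable", observable["value"]))

    # Indicators, and the observables nested under them.
    for indicator in package.get("indicators", []):
        indicator_node = ("Indicator", indicator["id"])
        link(package_node, "HAS_INDICATOR", indicator_node)
        for observable in indicator.get("observables", []):
            link(indicator_node, "HAS_OBSERVABLE", ("Observable", observable["value"]))

    return nodes, edges
```

Because nodes and edges are sets, an observable that shows up both at the top level and under an indicator ends up as one node with two incoming edges.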

Community view from the top down


Yellow circle is the community, green circle is the collection, small blue circle is the package (told you it could be hundreds), purple is the indicator and reddish is the observable.

Indicators and Observables


That’s about it!  Don’t forget to check out my last post on Suricata NSM fields to see how some of these observables can be found on a network.

Suricata NSM Fields

Please leave feedback if you have any questions!

Collections from Hail A TAXII:

  1. guest.dataForLast_7daysOnly
  2. guest.EmergingThreats_rules
  3. guest.phishtank_com
  4. system.Default
  5. guest.EmergineThreats_rules
  6. guest.dshield_BlockList
  7. guest.Abuse_ch
  8. guest.MalwareDomainList_Hostlist
  9. guest.Lehigh_edu
  10. guest.CyberCrime_Tracker
  11. guest.blutmagie_de_torExits

© 2017 De-Coder’s Ring
