Fauie Technology

eclectic blogging, technology and hobby farming

Category: neo4j

Neo4J Tutorial – Published Video!

You may have noticed that my stream of thought posts on Neo4J.  It’s pained my, cause you know, drawing the balls is fun.

Today, I get to announce a published video tutorial on Neo4J by Packt Publishing!

We developed an in depth course, covering a bunch of graph database and Neo4J problems, ranging from:

  • Installation
  • What is a graph database?
  • comparing to a relational database
  • Using Cypher Query Language (Cypher, CQL)
  • Looking at various functions in Neo4j
  • Query profiling

It was a ton of work, over a few months, but, the support from Packt was great.  I’m really looking forward to getting feedback on the course!


Packt Publishing -  Learning Neo4j Graphs and Cypher

Packt Publishing – Learning Neo4j Graphs and Cypher




The Cyber Pivot

Over the years, I’ve put a lot of thought into pivoting.   Not in the startup lingo, but in the data sense.   Pivot from one piece of data to another, in order to build a picture.

Data is all about pivoting.   When I’m investigating an alert, I very rarely have a good picture of all the events/correlating data surround an alert. This leads to some frustrating , repeated times for each alert triage.   Looking at external sources, internal sources, etc.

The ‘simplest’ challenge to solve would be to auto-pivot around the internal log data we have.   Since I’m a wanna-be SOC analyst, but a pretty good software engineer, I need to build some code that will auto pivot.  Basically, given a specific moment in time, or a specific net flow record, spider out until I get 6 degrees of separation.  Effectively, build my own graph database from log storage.

To the code!

I’ve started writing some code, and will probably post it here, as long as we don’t want to claim IP at Perch (https://perchsecurity.com), but , we may, so hang tight.  Keep an eye out here:


If this were Neo4J:

MATCH (a:Alert) – [] – (f:Flow) – [] – (n:NSM)

OPTIONAL MATCH (n) – [r:REFERS] – (n2:NSM {type:”HTTP”})

return a, n, f, n2

… or something like that.




Finding Relationships in Elasticsearch document data with Neo4j

I’m working on a new project, in all my spare time ( ha ha ha), that I thought was interesting.   We’ve been building relational data bases for a bajillion years (or almost as long as computers were storing data?).  DAta gets normalized, stored, related, queried, joined, etc.

These days, we have Elasticsearch, which is a No SQL database that stores documents.  These documents are semi-structured, and have almost no relationships that can be defined.  This is all well and good when you’re just looking at data in a full text index, but, when looking for insights or patterns, we need some tricks to find those needles in the giant haystacks that we’re tackling.

Finding Unknown Relationships

Elasticsearch is a staple in my tool belt.  I use it for almost any project that will collect data over time, or has a large ingestion rate.  Look at the ELK stack ( http://elastic.co ) and you’ll see some cool things demoed.  Using kibana, logstash and elasticsearch, you can build a very simple monitoring solution, log parser and collector, dashboarding tool, etc.

One of the ideas that I came up with today while working on the Perch Security platform was a way to discover relationships within the data.  Almost taking a denormalized data set, and normalizing it.   Well, that’s not entirely accurate.   What if we can find relationships in seemingly unrelated data?

Pulling a log file from Apache?   Also getting data from a usegroup on cyber attack victims?  Your data structures may be set to relate those data components, and you may not have thought about it..  That’s kind of a contrived example, because it’s pretty obvious you can search ‘bad’ data looking for indicators that are in your network, but my point is, what can we find out there?

Here’s a github repo for what I’m going to be working on.   It’s not really meant for public consumption, but I threw an MIT license on it if your’e interested in helping.



Oh!   I forgot to mention, I’m going to be using a lot of my recently learned Graph database skills (Thanks again Neo4j) to help discovery some relationships.    It will help with the ‘obvious’ relationships, and maybe even when I get into figuring out “significant” terms and whatnot.     Let’s see if I can get some cool hits!

Any data sets?





Neo4J: Driver Comparison


I’m not going to get into a detailed and analyzed discussion here, with key points on WHY there’s a huge performance difference, but, I want to illustrate why it’s good to attempt different approaches.  This ties into my Enough About Enough post from a while ago.  If I only knew python, I’d be stuck.  Thankfully I don’t, and I was able to whip out a Java program to do this comparison…. ok, I’m getting ahead of myself, let me start from scratch:

As I’ve written about before, I’m dealing with graph data in Neo4J.  Big shout out to the @Neo4j folks, as they’ve been instrumental in guiding me through some of my cypher queries, and whatnot.  Especially Michael H, who is apparently the Cypher Wizard to beat all Wizards.

Cyber threat intelligence data can be transmitted in TAXII format.  TAXII is an XML format that defines aspects of threat intelligence.  Read more in my blog post Stix, Taxii: Understanding Cybersecurity Intelligence

Since there are some nested and relations in this data that isn’t exactly ‘easy’ to model in an RDBMS or a document store, I decided to shove it into a graph database.  (Honestly, I’ve been looking for a Neo4j project for a while, and this time it worked!).   At Perch Security, our customers have access to threat intelligence data, paid and open source.   We want to give them access to that data in a specific way, so, I have to store and query it.  Storage is straight forward, I can get into that more later, but right now, I’m looking at querying this data.

After learning a few tricks to my cypher, again thanks @michael.neo on slack, I plugged it into my python implementation.  It took a while for me to figure out how to get access to the Cursor, so I could stream results ,and not actually pull the whole record set.  After all, I’m trying to create 100k records (which turns out to be approximately 500k nodes from Neo4j in order to do that.

My Data Flow

The gist of my data flow is simple, over time, I’m constantly polling an external data source and shoving the TAXII data into Neo4j.  My relationships are set, I try to consolidate objects where possible, and I’m off to the races.

When I query, I issue a big statement to execute my cypher with a max results of a lot… basically give me all the records until I stop reading them.  In other words, I use limit in my cypher that’s much  higher than what I will actually need.

My code starts streaming the results, and one by one shoves them into a ‘collector’ type object.  When the collector hits a batch size (5MB+), I add a part to my AWS S3 multi part upload.   When I’m done reading records (either I ran out of records, or hit my limit), I force an upload of the rest of the data I have in my collector, finalize the multi part upload and that’s it.

My python code, took about 15 minutes.  No lie, 15 minutes.   I tried optimizing, I tried smaller batches, etc, and my results were always linear.   I tried ‘paginating’ (using skip and limit), but that didn’t help… actually, I did skip/limit first, then I went to the big streaming block..

Ugh. 15 minutes.   I’m seeing visions of a MAJOR problem when we’re in full mode production.   Imagine.  I have 300 customers.  I have to run a daily (worst case) batch job for each customer.   Holy cow.   I’m going to have to scale like CRAZY if I’m going to match that.  … I’m sweating.  The doubt kicks in.   Neo4j wasn’t the right choice!  I wanted a pet project, and it’s going to kick my butt and get me fired!!!!!

Last ditch effort,  I rewrite in Java.

I’ve done a lot of Java in my life.. 10+ years full time.. but, it’s been over 6 years since I’ve REALLY written much Java code.. Man has it changed.

but I digress.   I download Eclipse.  Do some googling on Java/Maven/Docker

(  ended up using the Azul Docker instance, way smaller than the standard Java 8:   azul/zulu-openjdk:8)

and I’m off to the races.    I get to learn how to read from an SQS queue, query Neo4j, write to S3 all in 4 hours so I don’t get fired.

After a bunch of testing, getting Docker running and uploaded to AWS ECR… I run it..    It runs…. craps out after 15 seconds.   Shoot.. where did I break something.

I go to my logger output.. hmm.. no Exceptions.   No bad code blocks or stupid logic (I got rid of those in testing).. .

I run it again.

15 seconds.


15 seconds?  I check my output file.  It looks good. It matches the one created from python.    15 seconds?!

Something is wrong with py2neo.. that’s for darn sure.

Would anyone be interested in an example chunk of code to do this?

Email me if you do.

[contact-form-7 404 "Not Found"]

© 2023 Fauie Technology

Theme by Anders NorenUp ↑