I’m not going to get into a detailed analysis here of WHY there’s such a huge performance difference, but I do want to illustrate why it’s good to attempt different approaches. This ties into my Enough About Enough post from a while ago. If I only knew Python, I’d be stuck. Thankfully I don’t, and I was able to whip out a Java program to do this comparison… ok, I’m getting ahead of myself, let me start from scratch:
As I’ve written about before, I’m dealing with graph data in Neo4j. Big shout out to the @Neo4j folks, as they’ve been instrumental in guiding me through some of my Cypher queries and whatnot. Especially Michael H, who is apparently the Cypher Wizard to beat all Wizards.
Cyber threat intelligence data can be transmitted via TAXII, an XML-based protocol for exchanging threat intelligence (the intelligence content itself is typically expressed in STIX). Read more in my blog post Stix, Taxii: Understanding Cybersecurity Intelligence.
Since there are nested structures and relations in this data that aren’t exactly ‘easy’ to model in an RDBMS or a document store, I decided to shove it into a graph database. (Honestly, I’ve been looking for a Neo4j project for a while, and this time it worked!) At Perch Security, our customers have access to threat intelligence data, both paid and open source. We want to give them access to that data in a specific way, so I have to store and query it. Storage is straightforward, and I can get into that more later, but right now I’m looking at querying this data.
After learning a few tricks for my Cypher (again, thanks @michael.neo on Slack), I plugged it into my Python implementation. It took a while for me to figure out how to get access to the Cursor so I could stream results and not pull the whole record set at once. After all, I’m trying to create 100k records (which turns out to require reading approximately 500k nodes from Neo4j).
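The streaming pattern looks roughly like this — a minimal sketch, where the consuming function works on any cursor-like iterable; the py2neo connection details and the query in the comments are illustrative placeholders, not my actual setup:

```python
# Sketch of streaming results instead of materialising the full record set.
# With py2neo, graph.run(query) returns a Cursor that can be iterated
# record by record, so memory stays flat no matter how big the result is.

def consume(cursor, handle_record):
    """Pull records one at a time from a cursor-like iterable."""
    count = 0
    for record in cursor:          # streams; never builds the full list in memory
        handle_record(record)
        count += 1
    return count

# With py2neo it would look something like this (connection details hypothetical):
#   from py2neo import Graph
#   graph = Graph("bolt://localhost:7687", auth=("neo4j", "secret"))
#   cursor = graph.run("MATCH (i:Indicator) RETURN i LIMIT 1000000")
#   consume(cursor, collector.add)
```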
My Data Flow
The gist of my data flow is simple, over time, I’m constantly polling an external data source and shoving the TAXII data into Neo4j. My relationships are set, I try to consolidate objects where possible, and I’m off to the races.
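Consolidating objects on insert is the kind of thing Cypher’s MERGE is for — it matches an existing node or creates it, so repeated polls don’t duplicate data. A hedged sketch; the label and property names here are hypothetical, not my actual schema:

```python
# Hypothetical upsert for a TAXII indicator node. MERGE finds the node by
# its id if it already exists, otherwise creates it; SET then refreshes
# its properties either way.
UPSERT_INDICATOR = """
MERGE (i:Indicator {stix_id: $stix_id})
SET i.title = $title, i.last_seen = timestamp()
"""

# Executed with py2neo as something like (not run here):
#   graph.run(UPSERT_INDICATOR, stix_id="indicator--1234", title="Bad IP")
```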
When I query, I issue one big statement to execute my Cypher with max results of “a lot”… basically, give me all the records until I stop reading them. In other words, I use a LIMIT in my Cypher that’s much higher than what I will actually need.
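The query shape is roughly this — one match, one oversized LIMIT, and the client simply stops reading the stream when it has enough. The labels, relationships, and the number itself are illustrative:

```python
# The export query caps results with a LIMIT far above what any customer
# batch will actually need; we just stop consuming the cursor early.
BATCH_LIMIT = 1_000_000  # "a lot" -- higher than any real batch

# Hypothetical schema: customers subscribe to feeds that contain indicators.
EXPORT_QUERY = f"""
MATCH (c:Customer {{id: $customer_id}})-[:SUBSCRIBES]->(:Feed)-[:CONTAINS]->(i:Indicator)
RETURN i
LIMIT {BATCH_LIMIT}
"""
```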
My code starts streaming the results and, one by one, shoves them into a ‘collector’ type object. When the collector hits a batch size (5 MB+), I add a part to my AWS S3 multipart upload. When I’m done reading records (either I ran out of records or hit my limit), I force an upload of the rest of the data in my collector, finalize the multipart upload, and that’s it.
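The collector logic can be sketched like this; the 5 MB threshold matches S3’s minimum multipart part size, and `upload_part` is a stand-in for however you wrap the real boto3 multipart call:

```python
class Collector:
    """Buffers serialized records and flushes 5 MB+ chunks to a part uploader."""

    PART_SIZE = 5 * 1024 * 1024  # S3 requires every part except the last >= 5 MB

    def __init__(self, upload_part):
        # upload_part(part_number, data) -- e.g. a wrapper around
        # boto3's s3.upload_part(...) for an in-progress multipart upload.
        self.upload_part = upload_part
        self.buffer = bytearray()
        self.parts = 0

    def add(self, record_bytes):
        """Append one serialized record; flush when the batch is big enough."""
        self.buffer.extend(record_bytes)
        if len(self.buffer) >= self.PART_SIZE:
            self.flush()

    def flush(self):
        """Force-upload whatever is buffered (used for the final partial part)."""
        if self.buffer:
            self.parts += 1
            self.upload_part(self.parts, bytes(self.buffer))
            self.buffer.clear()
```

After the stream is exhausted, a final `flush()` pushes the remainder, and then the multipart upload gets completed.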
My Python code took about 15 minutes. No lie, 15 minutes. I tried optimizing, I tried smaller batches, etc., and my results were always linear. I tried ‘paginating’ (using SKIP and LIMIT), but that didn’t help… actually, I did SKIP/LIMIT first, then I went to the big streaming block.
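For reference, the pagination variant I tried first looked something like this (labels again hypothetical); in my tests it performed no better than one big streamed query:

```python
# Page through results with SKIP/LIMIT, one query per page. Note that SKIP
# still makes the server produce and discard the skipped rows, so later
# pages don't come for free.
PAGE_QUERY = """
MATCH (i:Indicator)
RETURN i
SKIP $offset LIMIT $page_size
"""

# Driven with py2neo as something like (not run here):
#   offset = 0
#   while True:
#       page = list(graph.run(PAGE_QUERY, offset=offset, page_size=10_000))
#       if not page:
#           break
#       offset += len(page)
```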
Ugh. 15 minutes. I’m seeing visions of a MAJOR problem when we’re in full production mode. Imagine: I have 300 customers. I have to run a daily (worst case) batch job for each customer. Holy cow. I’m going to have to scale like CRAZY if I’m going to match that… I’m sweating. The doubt kicks in. Neo4j wasn’t the right choice! I wanted a pet project, and it’s going to kick my butt and get me fired!!!!
Last ditch effort, I rewrite in Java.
I’ve done a lot of Java in my life… 10+ years full time… but it’s been over 6 years since I’ve REALLY written much Java code. Man, has it changed.
But I digress. I download Eclipse, do some googling on Java/Maven/Docker (I ended up using the Azul Docker image, way smaller than the standard Java 8 one: azul/zulu-openjdk:8), and I’m off to the races. I get to learn how to read from an SQS queue, query Neo4j, and write to S3, all in 4 hours, so I don’t get fired.
After a bunch of testing, getting Docker running and uploaded to AWS ECR… I run it. It runs… craps out after 15 seconds. Shoot… where did I break something?
I go to my logger output… hmm… no Exceptions. No bad code blocks or stupid logic (I got rid of those in testing).
I run it again.
15 seconds? I check my output file. It looks good. It matches the one created from Python. 15 seconds?!
Something is wrong with py2neo.. that’s for darn sure.
Would anyone be interested in an example chunk of code to do this?
Email me if you do.