Fauie Technology

eclectic blogging, technology and hobby farming

Category: api (page 2 of 2)

Finding Relationships in Elasticsearch document data with Neo4j

I’m working on a new project, in all my spare time ( ha ha ha), that I thought was interesting.   We’ve been building relational data bases for a bajillion years (or almost as long as computers were storing data?).  DAta gets normalized, stored, related, queried, joined, etc.

These days, we have Elasticsearch, which is a No SQL database that stores documents.  These documents are semi-structured, and have almost no relationships that can be defined.  This is all well and good when you’re just looking at data in a full text index, but, when looking for insights or patterns, we need some tricks to find those needles in the giant haystacks that we’re tackling.

Finding Unknown Relationships

Elasticsearch is a staple in my tool belt.  I use it for almost any project that will collect data over time, or has a large ingestion rate.  Look at the ELK stack ( http://elastic.co ) and you’ll see some cool things demoed.  Using kibana, logstash and elasticsearch, you can build a very simple monitoring solution, log parser and collector, dashboarding tool, etc.

One of the ideas that I came up with today while working on the Perch Security platform was a way to discover relationships within the data.  Almost taking a denormalized data set, and normalizing it.   Well, that’s not entirely accurate.   What if we can find relationships in seemingly unrelated data?

Pulling a log file from Apache?   Also getting data from a usegroup on cyber attack victims?  Your data structures may be set to relate those data components, and you may not have thought about it..  That’s kind of a contrived example, because it’s pretty obvious you can search ‘bad’ data looking for indicators that are in your network, but my point is, what can we find out there?

Here’s a github repo for what I’m going to be working on.   It’s not really meant for public consumption, but I threw an MIT license on it if your’e interested in helping.



Oh!   I forgot to mention, I’m going to be using a lot of my recently learned Graph database skills (Thanks again Neo4j) to help discovery some relationships.    It will help with the ‘obvious’ relationships, and maybe even when I get into figuring out “significant” terms and whatnot.     Let’s see if I can get some cool hits!

Any data sets?





Neo4J: Driver Comparison


I’m not going to get into a detailed and analyzed discussion here, with key points on WHY there’s a huge performance difference, but, I want to illustrate why it’s good to attempt different approaches.  This ties into my Enough About Enough post from a while ago.  If I only knew python, I’d be stuck.  Thankfully I don’t, and I was able to whip out a Java program to do this comparison…. ok, I’m getting ahead of myself, let me start from scratch:

As I’ve written about before, I’m dealing with graph data in Neo4J.  Big shout out to the @Neo4j folks, as they’ve been instrumental in guiding me through some of my cypher queries, and whatnot.  Especially Michael H, who is apparently the Cypher Wizard to beat all Wizards.

Cyber threat intelligence data can be transmitted in TAXII format.  TAXII is an XML format that defines aspects of threat intelligence.  Read more in my blog post Stix, Taxii: Understanding Cybersecurity Intelligence

Since there are some nested and relations in this data that isn’t exactly ‘easy’ to model in an RDBMS or a document store, I decided to shove it into a graph database.  (Honestly, I’ve been looking for a Neo4j project for a while, and this time it worked!).   At Perch Security, our customers have access to threat intelligence data, paid and open source.   We want to give them access to that data in a specific way, so, I have to store and query it.  Storage is straight forward, I can get into that more later, but right now, I’m looking at querying this data.

After learning a few tricks to my cypher, again thanks @michael.neo on slack, I plugged it into my python implementation.  It took a while for me to figure out how to get access to the Cursor, so I could stream results ,and not actually pull the whole record set.  After all, I’m trying to create 100k records (which turns out to be approximately 500k nodes from Neo4j in order to do that.

My Data Flow

The gist of my data flow is simple, over time, I’m constantly polling an external data source and shoving the TAXII data into Neo4j.  My relationships are set, I try to consolidate objects where possible, and I’m off to the races.

When I query, I issue a big statement to execute my cypher with a max results of a lot… basically give me all the records until I stop reading them.  In other words, I use limit in my cypher that’s much  higher than what I will actually need.

My code starts streaming the results, and one by one shoves them into a ‘collector’ type object.  When the collector hits a batch size (5MB+), I add a part to my AWS S3 multi part upload.   When I’m done reading records (either I ran out of records, or hit my limit), I force an upload of the rest of the data I have in my collector, finalize the multi part upload and that’s it.

My python code, took about 15 minutes.  No lie, 15 minutes.   I tried optimizing, I tried smaller batches, etc, and my results were always linear.   I tried ‘paginating’ (using skip and limit), but that didn’t help… actually, I did skip/limit first, then I went to the big streaming block..

Ugh. 15 minutes.   I’m seeing visions of a MAJOR problem when we’re in full mode production.   Imagine.  I have 300 customers.  I have to run a daily (worst case) batch job for each customer.   Holy cow.   I’m going to have to scale like CRAZY if I’m going to match that.  … I’m sweating.  The doubt kicks in.   Neo4j wasn’t the right choice!  I wanted a pet project, and it’s going to kick my butt and get me fired!!!!!

Last ditch effort,  I rewrite in Java.

I’ve done a lot of Java in my life.. 10+ years full time.. but, it’s been over 6 years since I’ve REALLY written much Java code.. Man has it changed.

but I digress.   I download Eclipse.  Do some googling on Java/Maven/Docker

(  ended up using the Azul Docker instance, way smaller than the standard Java 8:   azul/zulu-openjdk:8)

and I’m off to the races.    I get to learn how to read from an SQS queue, query Neo4j, write to S3 all in 4 hours so I don’t get fired.

After a bunch of testing, getting Docker running and uploaded to AWS ECR… I run it..    It runs…. craps out after 15 seconds.   Shoot.. where did I break something.

I go to my logger output.. hmm.. no Exceptions.   No bad code blocks or stupid logic (I got rid of those in testing).. .

I run it again.

15 seconds.


15 seconds?  I check my output file.  It looks good. It matches the one created from python.    15 seconds?!

Something is wrong with py2neo.. that’s for darn sure.

Would anyone be interested in an example chunk of code to do this?

Email me if you do.

[contact-form-7 404 "Not Found"]

Five Features of a Successful API Platform – PDF

'API Culture'

I've coined this phrase to help indicate a healthy technology organization that strives to build a solid set of capabilities that can be leveraged across a wider audience than an engineering team typically gets.

Building an API is more than just writing a web service.  It's more than using AJAX or using a REST Framework within your code.  Building an 'API Culture' is all about providing your development teams the structure and ability to be as effective as they can be.  Increase collaboration, increase code quality and increase the reuse of applications that they build.  We're not talking about a specific tool or methodology, instead, we're talking about an attitude that your teams will adopt in order to enjoy their day to day working life a lot more.

Read the rest after a free download –

    Newer posts »

    © 2023 Fauie Technology

    Theme by Anders NorenUp ↑