De-Coder’s Ring

Software doesn’t have to be hard

Tag: python

Neo4J: Driver Comparison

[Image: neo4j-perch-s3]

I’m not going to get into a detailed analysis here of WHY there’s a huge performance difference, but I want to illustrate why it’s good to attempt different approaches.  This ties into my Enough About Enough post from a while ago.  If I only knew Python, I’d be stuck.  Thankfully I don’t, and I was able to whip out a Java program to do this comparison… ok, I’m getting ahead of myself, let me start from scratch:

As I’ve written about before, I’m dealing with graph data in Neo4j.  Big shout out to the @Neo4j folks, as they’ve been instrumental in guiding me through some of my Cypher queries and whatnot.  Especially Michael H, who is apparently the Cypher Wizard to beat all Wizards.

Cyber threat intelligence data can be transmitted via TAXII, a transport protocol for exchanging threat intelligence (the intelligence itself is expressed in the XML-based STIX format).  Read more in my blog post Stix, Taxii: Understanding Cybersecurity Intelligence.

Since there are nested structures and relationships in this data that aren’t exactly ‘easy’ to model in an RDBMS or a document store, I decided to shove it into a graph database.  (Honestly, I’ve been looking for a Neo4j project for a while, and this time it worked!)   At Perch Security, our customers have access to threat intelligence data, both paid and open source.   We want to give them access to that data in a specific way, so I have to store and query it.  Storage is straightforward, and I can get into that more later, but right now I’m looking at querying this data.

After learning a few tricks for my Cypher (again, thanks @michael.neo on Slack), I plugged it into my Python implementation.  It took a while for me to figure out how to get access to the Cursor so I could stream results and not actually pull the whole record set.  After all, I’m trying to create 100k records, which turns out to require reading approximately 500k nodes from Neo4j.
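For what it’s worth, here’s a minimal sketch of that streaming approach with py2neo.  The connection details, the query, and the process() handler are all placeholders, not my actual code:

from py2neo import Graph

graph = Graph("bolt://localhost:7687", password="secret")  # placeholder connection

# run() hands back a Cursor; records are pulled as you iterate,
# rather than materializing the whole result set up front
cursor = graph.run("MATCH (a)-[r]->(b) RETURN a, r, b LIMIT 1000000")
for record in cursor:
    process(record)  # hypothetical per-record handler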

My Data Flow

The gist of my data flow is simple: over time, I’m constantly polling an external data source and shoving the TAXII data into Neo4j.  My relationships are set, I try to consolidate objects where possible, and I’m off to the races.
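To make “consolidate objects” concrete: in Cypher, MERGE is the upsert that keeps repeated polls from duplicating nodes.  A rough sketch, reusing the graph handle from the earlier snippet; the label, properties, and variables here are hypothetical, not my real schema:

# incoming_id / incoming_title come from the TAXII poll (placeholders)
graph.run(
    """
    MERGE (i:Indicator {stix_id: $stix_id})
    SET i.title = $title
    """,
    stix_id=incoming_id, title=incoming_title,
)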

When I query, I issue one big statement to execute my Cypher with a max result size of “a lot”… basically, give me all the records until I stop reading them.  In other words, I use a LIMIT in my Cypher that’s much higher than what I will actually need.

My code starts streaming the results and, one by one, shoves them into a ‘collector’ type object.  When the collector hits a batch size (5MB+), I add a part to my AWS S3 multipart upload.   When I’m done reading records (either I ran out of records or hit my limit), I force an upload of the rest of the data in my collector, finalize the multipart upload, and that’s it.
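Here’s a minimal sketch of that collector pattern with boto3.  The bucket, key, and serialize() helper are placeholders; the 5MB floor is there because S3 requires every part except the last to be at least 5MB:

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "export/records.json"  # placeholders
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for all parts but the last

mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts, part_number = [], 1
buffer, buffered = [], 0

def flush():
    global part_number, buffer, buffered
    resp = s3.upload_part(
        Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
        PartNumber=part_number, Body="".join(buffer).encode(),
    )
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
    part_number += 1
    buffer, buffered = [], 0

for record in cursor:            # the streaming Cursor from the earlier sketch
    chunk = serialize(record)    # hypothetical record-to-string helper
    buffer.append(chunk)
    buffered += len(chunk)
    if buffered >= MIN_PART_SIZE:
        flush()                  # ship a completed part to S3

if buffer:                       # whatever is left under 5MB goes as the last part
    flush()
s3.complete_multipart_upload(
    Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)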

My Python code took about 15 minutes.  No lie, 15 minutes.   I tried optimizing, I tried smaller batches, etc., and my results were always linear.   I tried ‘paginating’ (using SKIP and LIMIT), but that didn’t help… actually, I did SKIP/LIMIT first, then I went to the big streaming block.

Ugh. 15 minutes.   I’m seeing visions of a MAJOR problem when we’re in full production.   Imagine: I have 300 customers, and I have to run a daily (worst case) batch job for each customer.   Holy cow.   I’m going to have to scale like CRAZY if I’m going to match that… I’m sweating.  The doubt kicks in.   Neo4j wasn’t the right choice!  I wanted a pet project, and it’s going to kick my butt and get me fired!!!!
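Just to put numbers on that panic, using only the figures above:

# Hypothetical capacity math, assuming the 15-minute export holds
customers = 300
minutes_per_export = 15
machine_hours_per_day = customers * minutes_per_export / 60  # 75 hours of work per day
always_on_workers = machine_hours_per_day / 24               # ~3.1 machines running 24/7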

Last-ditch effort: I rewrite in Java.

I’ve done a lot of Java in my life (10+ years full time), but it’s been over 6 years since I’ve REALLY written much Java code.  Man, has it changed.

But I digress.   I download Eclipse, do some googling on Java/Maven/Docker (I ended up using the Azul Docker image azul/zulu-openjdk:8, way smaller than the standard Java 8 image), and I’m off to the races.    I get to learn how to read from an SQS queue, query Neo4j, and write to S3, all in 4 hours, so I don’t get fired.

After a bunch of testing, getting Docker running, and uploading the image to AWS ECR… I run it.    It runs… and craps out after 15 seconds.   Shoot… where did I break something?

I go to my logger output… hmm… no Exceptions.   No bad code blocks or stupid logic (I got rid of those in testing).

I run it again.

15 seconds.

HOLY COW!

15 seconds?  I check my output file.  It looks good.  It matches the one created from Python.    15 seconds?!

Something is wrong with py2neo… that’s for darn sure.

Would anyone be interested in an example chunk of code to do this?

Email me if you do.

LinkedIn Recommendations: How to Improve

We’ve seen countless articles on the inefficiency of LinkedIn’s recommendation engine.  I’ve even seen articles saying things like “No matter the noise in my recommendations, the trend is what matters!”

As a data nerd, this doesn’t sit well with me.  I like valid, verified, relational data that is accurate.  This seems like such a simple thing to fix, and it probably shouldn’t bother me, but it does.  I’ve wanted to write this up for a while, but always thought: “Maybe it’s not that big of a deal. Who reads these recommendations anyway?”  The endorsement I received today was just funny.

My second job out of college was with SAIC (www.saic.com) on a government contract.  We were working on a regulatory project with lots of document management and workflow stuff.  The tool set was Documentum, some scanning/OCR software, Java 1.4, a custom-built MVC framework, etc.  My toolkit consisted of Java.  I didn’t know better… that was the only hammer I had.

Python is my new favorite.  I’ve been almost 100% Python (and web) for the past two years.  I haven’t written more than 10 lines of Java code (except for Esper, http://esper.codehaus.org/, complex event processing… that’s a cool tech.  Think of what you can do with network security data in that space!).

Ok, back to the point.  I left that SAIC position in 2003.  Fast forward 10 years, and someone from there is recommending me for Python!  Now, I know I write my Python like Java code… not very ‘pythonic’… but it’s not nearly the same thing.

[Screenshot: crap referral on LinkedIn]

Don’t get me wrong here: it’s not their fault for endorsing me for Python.  LinkedIn is REALLY obtrusive when it wants you to endorse someone.  It’s almost like a freaking popup ad.  You just click it to get it to go away.

[Screenshot: the endorsement prompt — essentially a popup… just click stuff to make it go away!]

Out of that list, two of the people are family members.  I have no idea how they are at work, so, in good conscience, I can’t recommend their skills.   Another one I knew from a previous employer, but I didn’t really work with them, so I can ignore that.  The fourth one was good.  I can click there and feel good about endorsing them.

How do we fix it?

My biggest complaint is people who know me and know what I can do, but not necessarily the skills that LinkedIn shows them.  I think the easiest way to avoid this, and to really get better data, is to have each person link their skills to a specific position.  If a skill isn’t linked there and someone wants to endorse it, ‘add it’ to that position.  The screenshot below is a new one for me; I didn’t know they let you add new skills on the fly, but I like it!   Just take it a small step further and relate the skill to the position.  Then, prompt people you’ve worked with, based on the position/company you shared with them.   It’s an extra step, but I think you could get rid of half of the crap endorsements.

[Screenshot: adding a new skill on the fly]
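To make the idea concrete, here’s a rough sketch of the data model I’m describing.  Every name here is hypothetical; the point is just that skills hang off a position rather than the profile, and the endorsement prompt is gated on an actual overlap at that company:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Position:
    company: str
    title: str
    start_year: int
    end_year: int
    skills: List[str] = field(default_factory=list)  # skills tied to THIS position

def can_endorse(endorser_positions: List[Position], target: Position, skill: str) -> bool:
    # Only surface the prompt if the endorser overlapped with the target
    # at the same company, and the skill is one the target claims there
    overlapped = any(
        p.company == target.company
        and p.start_year <= target.end_year
        and target.start_year <= p.end_year
        for p in endorser_positions
    )
    return overlapped and skill in target.skills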


redis-py: Connection Pool with unix sockets

This is a note to myself for the future.
When using redis-py (https://github.com/andymccurdy/redis-py), connection pools help save resources.
The constructor call for connection pools with unix sockets is a little different.

Normal connection with a unix socket:

import redis
r_server = redis.Redis(unix_socket_path='/tmp/my_redis.sock')

Normal connection pooling:

import redis
pool = redis.ConnectionPool()
r_server = redis.Redis(connection_pool=pool)

The logical guess for a connection pool with a unix socket (this one doesn’t work):

import redis
pool = redis.ConnectionPool(unix_socket_path='/tmp/my_redis.sock')
r_server = redis.Redis(connection_pool=pool)

Actual code to use unix sockets and connection pools:

import redis
from redis.connection import UnixDomainSocketConnection

# Tell the pool explicitly which connection class to use;
# note the keyword is 'path' here, not 'unix_socket_path'
pool = redis.ConnectionPool(connection_class=UnixDomainSocketConnection, path='/tmp/my_redis.sock')
r_server = redis.Redis(connection_pool=pool)

Of note: you have to specify the connection class yourself instead of letting the redis code figure it out, and unix_socket_path is renamed to path.
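Once the pool is wired up, usage is identical to a plain TCP connection; for example:

r_server.set('hello', 'world')
r_server.get('hello')  # b'world' on Python 3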
Forking now, sending back shortly.

© 2017 De-Coder’s Ring
