Planet Neo4j

Neo4j home | news | blog | Neo Technology

Neo4j News

Neo4j visualization with Gephi

During the summer Martin Škurla has developed support for the Neo4j graph database in the Gephi visualization and exploration platform. This Google Summer of Code project is now approaching its finish.

The basic idea of the project is twofold:

  1. Users of Neo4j get better visualization support
  2. Users of Gephi can work with larger graphs

The graph visualization can give you output like this:

Martin just published an article summarizing the project, head over and read it! For the future development of the Neo4j plugin, there's a survey where everyone has the chance to give feedback on what features they want to use.

To try it out, download NetBeans 6.9 and use it to build and execute the source code (Gephi depends on the NetBeans platform version 6.9). To check out the source, make sure to have Bazaar installed, and then do:

bzr checkout lp:~bujacik/gephi/support-for-neo4j

Finally, you're welcome to discuss the plugin on the Neo4j mailing list and in the Gephi forums.

Update: See the comments for information on configuring NetBeans to build Gephi!

by Anders Nawroth (noreply@blogger.com) at August 17, 2010 01:54 PM

Neo4j Blog

The top seven news in Neo4j 1.1

The Neo4j graph database release 1.1 has just arrived, so here's some information on the new things that have been included. The main points are the additions of monitoring support, an event framework and a new traversal framework to the kernel. Then two useful components have been added to the default distribution (called "Apoc"): graph algorithms and online backup.

1. Graph algorithms

Since the previous release, the graph algorithm component has been promoted to the default Neo4j distribution package. Here you'll find implementations of algorithms that will help you find the shortest paths, all simple paths or, if you want, all paths between two nodes. Or you can use the Dijkstra algorithm to handle weighted paths, or why not the cool A* algorithm wich is useful in a geospatial context among other uses:

astar node space

The above image is from an A* routing example project with code in Ruby or Java.

Since the previous release of the graph algorithm component, much work has been done to improve memory efficiency and speed. It now also uses the new traversal framework under the hood. As a developer, your starting point is the GraphAlgoFactory, wich provides access to the algorithms.

2. Event framework

The Neo4j 1.1 kernel includes support for a simple but powerful event framework, which allows you to hook into and react to any substantial change of the graph. For example, let's say that you have a UI widget that displays a specific property on a node. In previous releases, you'd have to manually add code to refresh the widget wherever you modify that specific node. With the 1.1 release, you instead write a simple listener that detects changes to that node and repaints the widget.

You can listen to the following events:

  • beforeCommit
  • afterCommit
  • afterRollback

This will allow you to perform actions such as:

  • read changes before commit
  • modify the transaction
  • decide if the transaction should be committed or not

The most powerful use of the event framework is likely in framework-like components that implement horizontal concerns such as validation and integration of data. You can imagine writing a simple component that automatically keeps a Neo4j IndexService up to date with the graph, so you won't have to manually maintain indices.

This is where you can hook into the database life cycle:

  • shutdown event
  • kernel panic - for instance when the disk is full

In this case, a typical use is to make sure layers on top of Neo4j are properly shut down before the database shuts down.

3. Traversal framework

To extend the possibilities for how to traverse a graph, a new traversal framework has been introduced. It's still in an early stage and has been included in the 1.1 release to gather feedback from users. Even if it's indeed already very useful, be preperad for the API to change somewhat!

One main design goal of the new traversal framework has been to increase the flexibility in how a traverser can be controlled. Examples of improvements compared to the old traversal framework are:

  • The user can select in which order relationships will be followed, which opens up for best-first traversals and fine grained traversal control, e.g. weighted traversals. Breadth-first and Depth-first traversals are just trivial examples of a global static branch selection policy implemented for convenience.
  • Paths now play a central role: the current path during traversal is exposed in the traversal context and paths can be returned as the traversal result.
  • For convenience, traversal results can be returned as nodes, relationships or paths.
  • The uniqueness constraints in a traversal now have a wide range of possibilities, like visiting nodes only once, visiting relationships only once (but possibly revisiting nodes), visiting the same nodes and relationships but in different paths and so on.
  • To reduce the memory footprint of the traversal, uniqueness constraints on nodes or relationships can be set to only guarantee uniqueness among the most recent visited nodes, with a configurable count.

There's more to it, but the above list should suffice for this blog post! Let's look at a code example to get a view of how the new traversal framework used.

As seen from this example the framework uses a fluent API:

for ( Path position : Traversal.description()
.depthFirst()
.relationships( KNOWS )
.relationships( LIKES, Direction.INCOMING )
.prune( Traversal.pruneAfterDepth( 5 ) )
.traverse( myStartNode ) )
{
System.out.println( "Path from start node to current position is " + position );
}

The traversal descriptions are immutable and can be reused to create new traversal descriptions. Here's an example of how this is done:

static final TraversalDescription FRIENDS_TRAVERSAL = Traversal.description()
.relationships( KNOWS )
.depthFirst()
.uniqueness( Uniqueness.RELATIONSHIP_GLOBAL );
// ...
// Don't go further than depth 3
for ( Path position : FRIENDS_TRAVERSAL
.prune( Traversal.pruneAfterDepth( 3 ) )
.traverse( myNode ) ) {}
// Don't go further than depth 4
for ( Path position : FRIENDS_TRAVERSAL
.prune( Traversal.pruneAfterDepth( 4 ) )
.traverse( myNode ) ) {}

Please note again that the API is likely to change before going final in the next release. We would love feedback on the new traverser framework! Just head over to the mailing list (which is a great place to hang out if you want to learn more about graph databases!) or say something on twitter.

4. Monitoring

Neo4j now supports monitoring over JMX. For example you can use a tool like JConsole to inspect what's happening in a live Neo4j instance. For example, say that your Neo4j-backed web site has been up and running for two days since the last restart. You can then go in and get statistics on how many transactions have been commited, the number of transactions open right now, the total number of transactions that have been rolled back, and LOTS more. Here's an image of the transaction information that is available:

jmx.transactions

Find out more on the wiki!

If you're using the Neo4j rest server (also see this blog post) there's a lot to look forward to with regard to monitoring. Namely, there's the new Neo4j webadmin project. At the moment it includes:

  • Lifecycle management
  • Monitoring of memory usage, disk usage, cache status and database primitives (nodes, relationships and properties)
  • JMX overview
  • Data browsing
  • Advanced data manipulation via Gremlin console
  • Server configuration
  • Online backups

Here's how the webadmin tool looks at the moment (click for bigger version):

neo4j webadmin dashboard

Currently, the webadmin tool builds on the Neo4j rest layer. In subsequent releases, it will be adapted to also work with embedded Neo4j.

5. Kernel

Much of the news in the Neo4j kernel has already been mentioned, but there's still some points to be adressed:

  • Read operations are not required to be performed inside a transaction any more. This can for example help a lot when doing traversals and gives you more options on the architecture side of things. Note however that in order to read uncommitted data, the read operations still have to be carried out in the corresponding transactional context.
  • At startup, the Neo4j kernel will look at the available amount of RAM and heap space and configure itself accordingly. In most cases there should be no more need to use a detailed configuration. If you have added a lot of data, a restart of Neo4j will let the automatic configuration catch up on what's happened and optimize the configuration.
  • At the creation of a new database, the block sizes for strings and arrays on the storage level can be configured. This settings can't be changed after the database has been created. If you know your data very well, this could be useful if the default settings doesn't cut it for you.
  • The GraphDatabaseService can now be accessed from every node and relationship, so you don't need to pass the instance around or inject it or whatevery you did before.
  • For your conveneince, a helpers package has been added to the kernel (previously it was a separate component named "commons"). The most interesting part is the collection helpers. By using them, creating a traverser that returns you domain objects instead of nodes/relationships is a breeze. There's other goodies in there as well, take a look!

6. Index

Other than bug fixes and performance optimizations, the integrated Lucene index has got some new features:

  • Improved support for removing indexes.
  • Index lookups can be performed without being in a transaction.
  • Exact lookups can be carried out (even when using a fulltext service).
  • Indexing of array values. If a value is an array it's split up and each value in that array is indexed separately.

7. Online backup

Of course you want to backup your Neo4j database while it's running, and in this release the online backup component is included in the default distribution package. This is an example of how to use it from your code:

EmbeddedGraphDatabase graphDb = getTheGraphDbFromApp();
String location = "/var/backup/neo4j-db";
// this will include the integrated lucene indexes as well
Backup backup = Neo4jBackup.allDataSources( graphDb, location );
backup.doBackup();

Conclusion

If you haven't already played with Neo4j there's now some more reasons to do so! And if you have, the fun will be even greater now!

The main starting points are:

by Anders Nawroth (noreply@blogger.com) at July 30, 2010 05:19 PM

Neo4j Blog

Neo4j development news

The development of the Neo4j graph database speeds up more and more and it's time to track some of the news! This post will focus on some of the contributions from the core team, while I leave the wealth of contributions from other community members for a later post.

Videos

We've created a page containing videos. At the moment there's:

  • Robert Scoble interviewing Emil Eifrem
  • Getting started with Neo4j screencast
  • Getting started with Neoclipse screencast

Head over to watch the videos: neo4j.org/doc/video/

Screenshots

It's great fun to visualize graph data models, so the team collected some images on a web page. At the moment there are social networks, road networks, product/category hierarchies and some more examples. Please comment on what you'd like to see in there!

Take a look at the screenshots here: neo4j.org/doc/screenshots/

road network

Event feed and other feeds

To make it easier to keep track on upcoming Neo4j events, we've set up a feed (web version). This is intended for all Neo4j events, not only those initiated by the Neo4j team members. Just get in touch if you're going to do a Neo4j presentation/workshop/whatever.

While at it, we also created a few other feeds, found at: neo4j.org/community/feeds/. At the moment you'll find for example Neo4j questions and commits from contributing projects. Expect more feeds to be added there soon!

The new traversal framework

The new traversal framwork has been integrated into the kernel, and is ready to try out. The best starting point at the moment is the apidocs, but there's also some information on the wiki. You can read up on the mailing list discussions on this topic, and this thread too.

The new event framework

Neo4j is getting a new event framework as well, which is documented on the wiki and in the apidocs. There's been quite some mailing list discussions on this topic.

REST API news

General information on the REST API is found at the component site and in the wiki.

Monitoring: JMX support

Neo4j now has JMX support, so that you can connect to it using JConsole:

jconsole

It exposes for example cache sizes:

cache sizes

More information in the Monitoring and Deployment wiki page.

Graph algorithms

Design and performance of the graph-algo component has been improved. Read up on it on its component site.

The component contains implementations of common graph algoritms like shortest paths, all paths, all simple paths, Dijkstra and A* etc.

Indexing

You can now index array properties as well, where every item in the array will be indexed.

Configurable block size

In special cases it may be useful to create a Neo4j database with other block size settings than the defaults. Head over to the configuration wiki page for the details. Note that these settings can only be applied at database creation.

by Anders Nawroth (noreply@blogger.com) at June 30, 2010 06:07 PM

Neo4j News

NOSQL summer in Malmö!

Hi all,
the summer is finally here, and with it soooo much spare time. Why not meet up with some peeps and discuss some of the cool new paradigms of database technology popping up in the NOSQL space?

Tim Anglade has been taking up the flag and is pulling off the NOSQL summer. If you are up for some cool discussions, join one of the places! NeoTechnology is providing space and beer for a first meetup 16th of June in Malmö. We are starting off the the Amazon Dynamo paper, a classic in new data thinking. If you are in Sweden or Denmark, ping us on the list and come along!

by Peter Neubauer (noreply@blogger.com) at June 03, 2010 02:18 PM

Neo4j Blog

Mashups with the Facebook Graph API and Neo4j

In case you didn't notice already, graph databases like Neo4j are hot nowadays. People ask questions, write about them, also in the contexts of NOSQL and RDF. Recently Twitter open sourced their graphdb implementation, targeted at shallow, distributed graphs. And then Facebook revealed their new Graph API using the Open Graph Protocol.

Today, we're going to show you how easy it is to use the Facebook Graph API to mash up data from Facebook with data in a locally hosted graph database!

It's movie time!

Let's say you want to see a movie with one of your friends. Wouldn't it be neat with a service that uses the Facebook social graph to collect movies your friend liked, and combines this with IMDB data to produce a movie suggestion? Turns out that an app like that is pretty straight forward with a graph database.

The first step is to connect to Facebook to fetch a list of your friends, so that's where the app will start out:


Next a list of your friends will show up:

Now, just click one of your friends and a movie suggestion will be generated:

Under the Hood

What we need to do is simply to let our mashup talk to both the Facebook Graph API and the IMDB API. Uh-oh - IMDB doesn't have a public API that you can throw requests at. Well, that's simple enough: we'll just import the data into a local Neo4j graph database and then access it through the Facebook Graph API!

So, let's see how to solve this. Here's the basic structure of our app:

MovieNight.js is the mashup itself, embedded in the web page. It uses the Facebook Graph API to get information about the friends of the visitor and the movies that your friends like. SuggestionEngine.js uses the Graph API to talk to a Neo4j database containing movie information (a small example data set from IMDB). The movie suggestion is based on what movies your friend has liked in the past. It simply tries to find other movies starring some actor from the liked ones.

Using the same Graph API to connect to both Facebook and the Neo4j graph database backend makes for convenience: it means that you can use tools written for Facebook for locally hosted data as well - and that's what we're doing here. To download the source, go to the download page.

Facebook data

To get your friends from Facebook, just use the common Facebook graph API:

FB.api('/me/friends', function(response) {
friends = response.data;
// Load friends into UI
friend_list.empty();
for ( var i = 0; i < friends.length; i ++ ) {
add_friend( friends[i] ); // write to UI
}
});

Getting the movies a friend likes is very similar to getting the friends list:

FB.api("/" + friend.id + "/movies", function(result)
{
/* handle the response here */
}

For more information, see the Graph API documentation.

Neo4j data

To connect to the Neo4j graph server we had to hack the connect-js library slightly, as it's hard coded to send requests to facebook.com. What we added is the possibility to add prefixes for different data sources. It still defaults to graph.facebook.com etc., but makes a "fb:" prefix available to make your code easier to read. To hook in a data source, we modify the FB.init() call like this:

FB.init({
appId : '', // NOTE: create an appid and add it here
status : true, cookie : true, xfbml : true,
// time to add our IMDB backend to the mix
external_domains : {
imdb : 'http://localhost:4567/'
}
});

Now we're able to send reqests to our own server as well, using code similar to the following:

FB.api("imdb:/path/to/data/in/graph", function(data) {
// data is available here :)
});

So now that we can send requests, what can we do with the Neo4j backend here? Here's a comprehensive list showing precisely that in some detail (all requests are GET from http://localhost:4567):

Get Actor (or Movie) by Id
RequestResponse
/56
{
"name": "Bacon, Kevin",
"id": 56
}
Extended information about Actor(/Movie)
RequestResponse
/56?metadata=1
{
"name": "Bacon, Kevin",
"id": 56,
"metadata": {
"connections": "http://localhost:4567/56/acted_in"
},
"type": "actor"
}
All the Movies an Actor had a Role in
RequestResponse
/56/acted_in
{
"data": [
{
"id": 57,
"title": "Woodsman, The (2004)"
},
{
"id": 59,
"title": "Wild Things (1998)"
}
// tons of movies here ...
]
}
Get (Actor or) Movie by Id
RequestResponse
/59
{
"title": "Wild Things (1998)",
"year": "1998",
"id": 59
}
Extended information about (Actor/)Movie
RequestResponse
/59?metadata=1
{
"title": "Wild Things (1998)",
"year": "1998",
"id": 59,
"metadata": {
"connections": "http://localhost:4567/59/actors"
},
"type": "movie"
}
All the Actors that have a Role in this Movie
RequestResponse
/59/actors
{
"data": [
{
"id": 56,
"name": "Bacon, Kevin"
},
{
"id": 528,
"name": "Dillon, Matt (I)"
}
// loads of actors here ...
]
}
Search for Actors with "bacon" in their name
RequestResponse
/search?q=bacon&type=actor
[
{
"name": "Bacon, Kevin",
"id": 56
},
{
"name": "Bacon, Travis",
"id": 14242
}
// more bacons here ...
]
Search for Movies with "wild things" in their title
RequestResponse
/search?q=wild%20things&type=movie
[
{
"title": "Wild Things (1998)",
"year": "1998",
"id": 59
},
{
"title": "River Wild, The (1994)",
"year": "1994",
"id": 74
}
// more wild movies here ...
]

Ok, but how do we use this stuff then?! Well, that's what we're going to look into right away, to see the Facebook Graph API used from JavaScript with a Neo4j/IMDB backend. To get started, here's how to perform a search:

self.movie_info = function( movie_name, callback ) {
// The search API uses commas for AND-type searches, spaces become OR, so for
// the movie names, we switch spaces out for commas.
movie_name = movie_name.replace(/ /g, ",");
FB.api("imdb:/search", {type:'movie', q:movie_name }, callback );
};

The request to get the movies an actor has acted in goes like this:

FB.api("imdb:/" + actor.id + "/acted_in", function( result ) {
for (var i = 0; i < result.data.length; i++)
{
movie = result.data[i];
// do something with the movie here!
}
});

To get all actors in a movie, simply use the following request:

FB.api("imdb:/" + movie.id + "/actors", function(result) {
for (var i = 0; i < result.data.length; i++)
{
actor = result.data[i];
// do something with the actor here!
}
});

Actually, these three different requests are all our small suggestion engine needs to fullfill it's task. Have a look at SuggestionEngine.js to see the full code.

How to create a Graph API service on top of Neo4j

Let's take a closer look at the movie backend now. It's built using the Neo4j Ruby bindings. In our example data set we have Actors and Movies connected through Roles, here's how these look in Ruby code:

class Movie; end

class Role
include Neo4j::RelationshipMixin
property :title, :character
end

class Actor
include Neo4j::NodeMixin
property :name
has_n(:acted_in).to(Movie).relationship(Role)
index :name, :tokenized => true
end

class Movie
include Neo4j::NodeMixin
property :title
property :year
index :title, :tokenized => true

# defines a method for traversing incoming acted_in relationships from Actor
has_n(:actors).from(Actor, :acted_in)
end

The code above is from the backend/model.rb file. On the Neo4j level, this is the kind of structure we'll have:

By defining indexes on Actor and Movie we can later use the find method on the classes to perform searches.

Our next step is to expose this model over the Graph API, where we'll use Sinatra and WEBrick to do the heavy lifting. The application is defined in the backend/neo4j_app.rb file - we'll dive into portions of that code right here. To begin with, how to return data for an Actor or Movie by Id?

get '/:id' do # show a node
content_type 'text/javascript'
node = node_by_id(params[:id])
props = external_props_for(node)
props.merge! metadata_for(node) if params[:metadata] == "1"
json = JSON.pretty_generate(props)
json = callback_wrapper(json, params[:callback])
json
end

The Sinatra route above uses a few small utility functions, let's look into them as well. The first one is very simple, but useful if we want to extend the URIs to allow for requesting for example /{moviename}/actors and not only numeric IDs.

def node_by_id(id)
node = Neo4j.load_node(id) if id =~ /^(\d+)$/
halt 404 if node.nil?
node
end

The next function returns the properties of a node, while filtering out those that have a name starting with a "_" character. It also adds the node id to the result.

def external_props_for(node)
ext_props = node.props.delete_if{|key, value| key =~ /^_/}
ext_props[:id] = node.neo_id
ext_props
end

Then there's a function that gathers metadata for a node, including a link to the list of connections to other nodes, and the type of the node.

def metadata_for(node)
if node.kind_of? Actor
connections = url_for(node, "acted_in")
elsif node.kind_of? Movie
connections = url_for(node, "actors")
end
metadata = { :metadata => { :connections => connections }, :type => node.class.name.downcase }
end

There's a couple more utility functions, but we'll skip them here as they are unrelated to Neo4j.

Next up is getting the relationships from an Actor or Movie. The code will only care about valid paths, that is, paths having /acted_in or /actors in the end. In other cases, an empty data set is returned. Other than that, it simply delegates the work to the domain classes, by doing node.send(relationship) to get the relationships. Using the send method in Ruby will here equal the statements node.acted_in or node.actors.

get '/:id/:relation' do # show a relationship
content_type 'text/javascript'
node = node_by_id(params[:id])
data = []
[ :acted_in, :actors ].each do |relationship|
if params[:relation] == relationship.to_s and node.respond_to? relationship
data = node.send(relationship)
end
end
data = data.map{|node| node_data(node)}
json = JSON.pretty_generate({:data => data})
json = callback_wrapper(json, params[:callback])
json
end

When viewing the relationships, we only want to show the most basic node info, so there's a utility function to do that as well:

def node_data(node)
data = { :id => node.neo_id }
[ :name, :title ].each do |property|
data.merge!({ property => node[property] }) unless node[property].nil?
end
data
end

Performing the searches are basically handled by adding indexes to the model (see the code further above). So what's left to do in the application is some sanity checks, delegating the search to the model and finally to format the output properly. Here goes:

get '/search' do
content_type 'text/javascript'
q = params[:q]
type = params[:type]
halt 400 unless q && type
result = case type
when 'actor'
Actor.find(to_lucene(:name, q))
when 'movie'
Movie.find(to_lucene(:title, q))
else
[]
end
json = JSON.pretty_generate(result.map{|node| external_props_for(node)})
json = callback_wrapper(json, params[:callback])
json
end

Wrap up

Here's some major takeaways from this post:

  • Graphs are going mainstream, as evidenced by initiatives like the Facebook Graph API.
  • It's often convenient to look at your data in the form of a graph, and with recent support in graph databases like Neo4j, it's easy to use different data sources in tandem through the Graph API.
  • Exposing data through the Graph API is simple if you have a graphdb backend.

And once you put your data in a graphdb, you can of course do more advanced graphy things too, like finding shortest paths, routing with A*, modeling of complex domains and whatnot. Just get started!

Example source code

To get the source code of the example, go to the download page.

Credits

Here's the guys who wrote the code of the example:

by Anders Nawroth (noreply@blogger.com) at May 07, 2010 04:31 PM

Neo4j Blog

Yay! The Graph Processing Infrastructure is starting to emerge!





Hi all,
in the last months, the Tinkerpop team has been starting to venture into the big task of starting a unified ecosystem for the world of graphs and related projects and products. Now, I am proud to say that it seems things are starting to get some traction and see increasing contributions from outside the core team, mainly the awesome Marko Rodriguez:



Logo contributed by Ketrina Yim
  • The JUNG graph library got adapted to Gremlin
  • HyperGraphDB is being adapted to work with Gremlin
  • a REST API based on the awesome work of Jim Webber and the Neo4j team is in the making by Michael Hunger and Pavel Yaskevich
So, here is the current project ecosystem - great work of everyone involved!

Gremlin:
  • mainly driven by Marko Rodriguez
  • a library and standalone, single-user Java project, defining a
  • number of data models - to start with the Property Graph Model (PGM) and the
  • General Document Model (GDM) , soon to be broken out of the core Gremlin code.
  • Adapters to different underlying graph implementations, from Neo4j to SAIL, integrating anything from Sesame to a live LinkedData SAIL
  • Adapters to other interesting graph frameworks like JUNG, suggested by Seth@Automenta
  • A Turing complete scripting language for querying, modification and transformation of PGM and GDM compliant data structures
  • All selectors are XPath-based in syntax
  • Pluggable external path elements and function implementation.

RESTling:
Webling:
  • Driven mainly by Pavel Yaskevich via financing from Neo Technology
  • A web based visual end-user interface to Restling
  • A web based terminal supporting execution of Gremlin operations and logic
  • Visualization support with graph libraries
  • Multi-user support
  • Via REST support to connect to remote Restling instances
Gargamel:
  • Driven mainly by Marko Rodriguez
  • a execution framework primarily targeted at Bulk Synchronous Parallel graph algos
  • A number of highly parallel base graph algos integrated into Gremlin to use this framework
  • A communication framework for execution of gremlin tasks on different (partitioned or replicated) graph instances, firstly using LinkedProcess (financed by the LANL) and XMPP, but replaceable with e.g. an Erlang-implementation (kudos to Ingo Schramm for suggesting it) or RESTling- based communication for optimization of different aspects like inter-process communication during execution

All in all, I just wanted to express my excitement over the whole emerging community around Gremlin, Neo4j and graphs in general! It is thrilling to see that the easy use of graphs and the internet-scale processing of complex data structures is starting to take shape in an open world, getting the different views on graphy data onto one page and providing a broader audience the possibility to use graph structures in the real world.

/peter neubauer

by Peter Neubauer (noreply@blogger.com) at May 04, 2010 03:56 PM

Emil Eifrem

On The Facebook Open Graph and Graph Databases

Last week Facebook announced their vision of the Open Graph -- what they call "a more social web." Now that the dust has settled, it stands clear that if Facebook can pull this off it will be a transformative moment for the web.

There are many perspectives on Facebook's move[1] but what makes me most excited is that the announcement was ALL about graphs. From the Facebook Graph API to the Open Graph Protocol, it seems like you can't touch any part of Facebook without getting infected by graphs.

That's ok by me! And since I happen to dabble in graphs for a living, I thought I'd spend some time today picking apart the announcement and talk about how it relates to our corner of the world: graph databases.

Ait, let's get started: what is this Open Graph anyway?

The Open Graph is Facebook's name for all content on the web that "hooks into" the social graph. Examples include people, places, businesses and events. This is nothing more than the regular web content we've always used and loved (for example, a movie listing on IMDB) but it's annotated with structured information so that it can be related to other things in the social graph.

Ok, so what is the Facebook Graph API?

The Facebook Graph API is the interface through which an external software developer interacts with the Facebook service. It's an HTTP API[2] that produces and consumes Open Graph data, i.e. people, places, businesses and the connections between them.

Whenever you as a web developer want to have access to data inside Facebook (say, if you'd like to figure out the friend graph of a visitor), you access that data through the Graph API. Facebook is basically a huge, highly connected graph and the Graph API is the lens through which we mere mortals can view and manipulate it.

Fair enough, then finally what is the Open Graph Protocol?

The Open Graph Protocol is a standard way for website authors to tag their content as part of the Open Graph. Those web pages can then be consumed not only by web browsers (for rendering) but also by services like the social plugins and Facebook (for processing).

Example: Let's say you have a website for tracking events such as concerts and book readings. The front page of the site is a chronological list of events. The markup for the front page consists of two things: the content (text saying "Dave Matthews Band") and then the markup (HTML tags and CSS elements saying "blue" and "font size x" or what not).

But there's a Band type in the current iteration of the Open Graph Protocol[3]. So what if you in addition to markup for presentation also added markup to say that the label "Dave Matthews Band" actually refers to a specific band in the Open Graph. Then you can add a social plugin, for example the "Like" button, to your front page, and that plugin can all of the sudden make sense of the content on your site.

When a visitor X to your site clicks "Like," the plugin knows to find the Open Graph content and throw that over the wire to the Facebook service. Suddenly all the friends of X see that X recommends checking out the upcoming Dave Matthews Band concert. Additionally, since your site has access to both the identity and location of X, you of course choose to highlight the events that are local to X as well as rank the events by popularity in amongst X's social graph.

If the Open Graph Protocol catches on it will create an unparalleled platform of machine-processable web content. The vehicle for inducing this change is an ecosystem of social plugins that will give website developers a shitload of cool functionality -- if only they sprinkle some Open Graph markup in their pages. Facebook's bet is that these plugins are appealing enough to push site owners over the (admittedly fairly low) threshold to tag their content with Open Graph markup.

Ok, so what's the bigger story here?

After taking all this in, two high-order bits stand out:

  1. Facebook is expanding the notion of the social graph to include not only people and their friends but everything of value in the digital world.
  2. Facebook is determined to capture how things are connected in that graph. Who is friends with whomHow are they friends? Who likes whatHow much do they like it? All those questions are about relationships in the graph. Facebook believes they can derive great value from owning the answers to those questions.

This view of the web and of the world is important because now that we enter the age of Facebook, interacting with the Graph API and the Open Graph will be a common task. It will be as natural to the modern developer as storing shopping carts in MySQL tables was to the developer of the previous decade.

That doesn't mean MySQL will be displaced or that the Graph API even solves the same problem. It doesn't. But an entire generation of software developers will grow up and think natively in graphs. And that's a big deal.

How do graph databases and Neo4j relate to all this?

Graph databases are increasingly relevant in a world that's increasingly connected. Facebook's recent move is undeniably a significant step in that direction. 

But that's obvious.

So let me dig a little deeper and take this perspective: Facebook seems to think (and I agree with) that in this decade a lot of the value in the web ecosystem will reside in the Open Graph layer. A substantial amount of that value will be captured by Facebook, much in the same way that Google through pagerank captured the value of the link graph of the previous decade.

Image by Chris Messina (factoryjoe.com) based on original work by http://www.flickr.com/photos/martiniko/

Image shamelessly stolen from Chris Messina who wrote an excellent post on the Open Graph Protocol, image based on original work by martiniko.

Now, some people will be satisfied with letting Facebook store the data they add on top of the Open Graph. They trade the value of that data for the convenience of social plugins and for not having to build and run large data centers and write their own large-scale graph databases. That's a painful investment Facebook has already been forced to make.

But others clearly won't want to donate that precious information to Facebook. They want to store parts of the Open Graph locally and they want to augment it with their own data, because that's the value of their service.

Of course, that data -- either being directly graph-oriented or operating on graphs -- fits extremely well in a graph database. Why take the pain of squeezing a graph into a table? Especially when the reward for all your hard graph/table mapping work is that you can barely even analyze it! Generally speaking, a relational database breaks down (due to joins) when you try to traverse more than 2-3 hops in any non-trivial connected data set.

Instead, I think the trend will continue where developers migrate towards a data model that fits the natural form of the data they're working with. In this case a graph database. Especially if that graph database is a robust and mature technology like Neo4j with years of solid 24/7 production track record and performance numbers on a level where it can traverse 1-2 MILLION hops per second in a graph.

Oh and wouldn't it also be neat if you could interact with all that data in your local graph and the Facebook graph in a uniform manner? Say, through the same API? Why, that'd be a neat addition to a graph database...

Conclusion

In conclusion, this announcement reinforces the fact that we're moving to a highly connected world. A world where the value lies in how things are connected. This megatrend has so far been mostly visible in the social web space.

But Facebook's gauntlet has been thrown: the Open Graph connects everything of value in the digital world. As Facebook's march towards dominance begins, developers around the world take notice that the natural and most powerful data model for anything web, really anything at all, is a graph.

It's a good time to be in the graph database space.

1] Not the least which is the implications it has for privacy and centralization of power, of course. But I'll try to stand clear of that or it'll take another ten pages before I get to how this all relates to graph databases.

2] The API uses HTTP but it's not fully RESTful. For example, it chooses to expose the identifiers of the connections and then the client use a predefined URI scheme to figure out the final URI of the connection. A RESTful API would use hypermedia to directly link the connection URI.

3] Astute readers will observe that there's a gap here. Who defines these types? Top-down or bottom-up? If bottom-up (as indicated in the spec), how do you define semantic equivalence? There's a lot unsolved (or unrevealed) parts about how all this will be governed.

oOo

by Emil Eifrem at April 30, 2010 07:38 PM

Alex Averbuch

Thesis - Sharding the Neo4j Graph DB

Me
My name's Alex, I'm currently a distributed systems masters student at KTH, Stockholm, and part way through my thesis at Neo Technology and SICS.
The project is a dual thesis that another student - Martin Neumann - and I are working on together.

What this project is
Graph databases are unique in that they're optimized to model highly connected data, and query it efficiently using graph traversal patterns. The benefits of this approach are numerous [4], but won't be discussed here. Our focus is on understanding the implications that highly connected data has when sharding (partitioning). We intend to investigate the requirements of and start developing an auto sharding framework for the Neo4j graph database. Although it's still early days, this post will (try to) summarize our progress so far and outline ideas for future work.

What this project isn't
There are problems we won't have time to tackle during the project.
These are assumed to be solvable and will be considered in design decisions, but won't be solved.

Replication: although essential to any fault tolerant distributed system, will not be implemented.

Graph partitioning algorithm design: 100's of graph partitioning algorithms exist already. We intend to do a broad sweep of the field then identify and evaluate those that appear most relevant, but designing them is not the goal. Graph partitioning algorithms are modern day black magic and their designers are experts in the dark arts. We'll leave the trickery to them and try to create a modular design that supports "pluggable" partitioning algorithms.

Approaching the problem from multiple angles...

Sharding at insert-time
Maybe no links exist between data (E.g. key-value stores), or the graph topology can be guesstimated ahead of time (E.g. GIS applications). In these cases data can be allocated to shards at insert time.
Your favorite DHT can already do this by sharding along the key space, but coupling sharding logic to a particular data structure is limiting.
I haven't looked into Twitter's Gizzard framework in detail but it seems to be powerful in this respect, allowing users to define custom "hash" functions.

Decoupled Components
As in Gizzard, it's desirable to abstract the insert time sharding logic into a pluggable component.
We define the Insert-Sharding component for this purpose.

Insert-Sharding
Inputs: Insert-Sharding-Function , Record.
Outputs: Shard-Mapping (mapping between data and shards).

Sharding at run-time
Thanks to the world wide interweb many applications try to model dynamic, constantly evolving domains. Social networks, for example, likely change every time a user starts/ends a relationship, moves to another city, etc.
Insert-Sharding decisions are cheap and effective in many situations, but also have a shelf life.
Not only may the Insert-Sharding scheme need to be updated periodically, but as data and/or access patterns evolve the already sharded data may need to be reallocated to new shards.

Decoupled Components
As with Insert-Sharding, it's desirable to abstract the runtime sharding logic into pluggable components.

Sharding at runtime relies on accurate, current information to make intelligent decisions. User access patterns, shard sizes, links between data, traffic... these are examples of metrics that can be used as input parameters to a runtime sharding algorithm. Applications have differing requirements regarding metrics they log, and metrics they need for runtime sharding. We define a Runtime-Logging component that encapsulates the logging function.

Runtime-Logging
Inputs: Runtime-Logging-Function.
Outputs: Runtime-Metrics.

Implementation of the runtime sharding algorithm is domain specific. To fulfil the goal of being pluggable we loosely define the Runtime-Sharding component.

Runtime-Sharding
Inputs: Runtime-Sharding-Function, Runtime-Metrics, Change-Log (recent CRUD operations).
Outputs: Shard-Mapping.

Note, graph partitioning algorithms may be a natural fit here, but are not essential. Refer to "Partitioning algorithms in a bit more detail..." for more on graph partitioning algorithms and their application to graph databases.

Reallocation of data to different shards may be beneficial, but if migration occurs during peak load performance will suffer and the process becomes counter productive. Shard-Mapping produced by the Runtime-Sharding component are instructions on where to migrate data, they say nothing about when to migrate.
For that purpose we need a module that is responsible for deciding when data migration should occur, and then issues commands to the shard servers to perform the migrations.
To perform these tasks we define the Migration-Scheduler component.
Migration-Scheduler
Inputs: Migration-Scheduler-Function, Shard-Mapping(, Runtime-Metrics).
Outputs: Migration-Commands.

A loosely coupled framework
This results in the first incarnation of our vision for a loosely coupled, extensible sharding framework.
Note, we've purposely emitted finer details. Partly for readability... and partly because we don't have all the details defined yet.



Partitioning algorithms in a bit more detail...

Graph partitioning algorithms
To massively over simplify, graph partitioning algorithms partition a graph into a number of disjoint subgraphs, while attempting to minimize edge cut, minimize the size difference between partitions (shards), minimize conductance of individual partitions, and/or maximize modularity of the partitioned graph.
The goal is to partition the graph such that traversal operations execute locally as much as possible, and are required to perform network hops (to different shards) infrequently.

Difficulties in partitioning graphs
Optimal graph partitioning is known to be NP-hard. O(n²) complexity is not uncommon, n can be >> 1,000,000,000, and with graphs of this size it becomes unrealistic to fit all state in RAM.
Each algorithm often performs well on a set of graph topologies, then is ineffective on others.
They often don't cope with dynamism (CRUD), require access to the entire graph, or are inherently sequential.
Moreover, many assume the graph has unlabeled edges, undirected edges, and/or unweighted vertices/edges. Unfortunately, every "un" results in some loss of the underlying graph semantics.

The point isn't that these algorithms are impractical, but limitations should be considered and accounted for. There is no one best algorithm, but in most cases a capable algorithm for a given application exists.

Importance of access patterns
When access patterns exist (and are recognizable) taking them into consideration is likely to be beneficial.
To over simplify again, take the - completely hypothetical - example of a social network where the most common operation is, "find my ex-lovers' current relationship status".
Social networks are graphs and this is a very simple graph traversal operation, making this potentially suitable for a graph partitioning algorithm. An algorithm that, during partitioning, places greater importance on not cutting the is_an_ex_lover edge type may do well in this case. Conversely, one that places greater importance on the is_an_old_school_colleague_that_I_never_speak_with may miss the point.
Algorithms we've looked at
As mentioned, we've tried to identify and evaluate a number of graph partitioning algorithms that appear most relevant.

To gauge relevance we first defined "desirable" characteristics of a graph partitioning algorithm, they are:
  • Dynamic: Algorithm can update an existing partitioning when CRUD operations are performed, without completely restarting.
  • Iterative: Rather than computing the partitioning in one operation, it continually improves over time.
  • Smoothness: When updating a partitioning, guarantees are given that the partitioning will not drastically change.
  • Distributed: Algorithm can be distributed across multiple machines, where each machine only stores and processes a subgraph of the global graph. Especially useful for massive graphs and for parallelizing the Runtime-Sharding process.
  • Local View: At no time does the algorithm need to have knowledge of the entire graph.
  • Weighted: Algorithm makes use of edge and/or vertex weights. This allows us to map certain metrics or graph properties to weights, to capture more of the graph's underlying semantics.
Using these characteristics we tracked down and began evaluation on the following algorithms:

DiDiC [1]:
"A Distributed Diffusive Heuristic for Clustering a Virtual P2P Supercomputer"
Dynamic, iterative, distributed, local view, smooth, weighted edges.
Reliant on the existence of clusters. May be slow to converge to a good partitioning, but it's iterative nature partly makes up for this.
Similar in nature to p2p gossip algorithms.

Evolving Set Process [2]:
"Finding sparse cuts locally using evolving sets"
Fast, but extremely reliant on the existence of clusters.
Local view, weighted edges.

Min-cut Tree [3]:
"Dynamic Graph Clustering Using Minimum-Cut Trees "
Dynamic, local view (mostly), smooth (for certain operations).

Non-graph partitioning algorithms
Not all partitioning algorithms are graph partitioning algorithms. Graph partitioning algorithms are suitable when data can be modelled as a graph, and operations as graph traversal patterns [4].
Other methods become more suitable when the above is not the case, or even when the above holds but the graph topology tends toward being random.
Graph partitioning algorithms can be a powerful sharding tool, but only when applicable. Use the best tool for the job.

If it's easy to identify a sharding policy for a given domain, avoid expensive graph partitioning algorithms.
For example, you run a GIS service for two cities. Users of your system (citizens of each of the cities) frequently want travel plans for within their own city, but rarely to the other. In this case storing all data for a city on the same server/cluster is simple, and may be a good solution.

When to what (identifying the best approach)
One long term vision for this project is to create a repository that makes configuration of the sharding framework as painless as possible.

All social networks share commonalities, as do GIS systems, language learning tools, etc.
Identifying these commonalities is not trivial, but we're working with graphs so graph metrics may be useful when attempting to classify these datasets. Metrics such as clustering coefficient, number of vertices, number of edges, graph diameter, etc all tell us something about the topology of the dataset.

It may be feasible to create a public repository containing mappings between these metrics and matching Insert-Sharding-Functions and Runtime-Sharding-Functions. A dataset analysis tool then produces metrics for a given dataset, and these are used to lookup the best fitting Insert-Sharding-Functions and Runtime-Sharding-Functions from the public repository.

For example, if an Insert-Sharding-Function and Runtime-Sharding-Function work well for Facebook, they can likely be applied Myspace too.

Shiny results
As mentioned, we're only part way into our thesis, and even by the end of it we clearly won't have everything complete.
We have implemented, played around with, and plotted preliminary results for a few algorithms already though.

Below is a visualization (using igraph) we created on a sample dataset (~2,500 vertices, ~7,500 egdes) from an online graph archive. This was run for 500 iterations using the following configuration:
  • Insert-Sharding-Function: Random allocation
  • Runtime-Sharding-Function: DiDiC [1]
  • Migration-Scheduler: Instant migration

Thanks too...
The Neo4j crew, our supervisor Sarunas Girdzijauskas, and Marko Rodriguez, f
or their help so far in providing Martin and I with a continual stream of suggestions, deadlines, and ideas!



by Alex Averbuch (noreply@blogger.com) at April 23, 2010 03:57 AM

Jayway Team Blog

Neo4j .NET Client over HTTP using REST and json

Here it is; a Proof of Concept of the world’s first Neo4j .NET Client. In other words: Here follows a discussion on how to create a client library for communicating with a graph database over REST.

UPDATE: There is now a live CodePlex project for the realization of this concept; A .NET Client Library for Neo4j over HTTP using REST and JSON; http://neo4jrestsharp.codeplex.com/

Intro

Neo Technologies have come out with a Neo4j REST API for their popular Neo4j graph database. Since I am employed at Jayway and a couple of colleagues working with Neo are too a little bird by the name of Peter Neubauer tweeted in my ear that there was a free T-Shirt involved in making a working sample for .NET that communicated with the Neo4j server. Not only was that a handsome offer – I wanted to check out what the Architecture of such a client library for a graph database would look like.

I’ve recently been involved in client library abstractions and good architecture with the RESTful interface to Windows Azure Storage. (More on Azure and storage on my blog here, here, here, here and here.) Enough with the side tracking (sorry)!

Said and done; We set out to code up a working sample for .NET. I hacked the client code. The Neo-guys cheered me on and offered very helpful assistance when I ran into trouble or was just plane stumped on how to proceed with the API. A very nice team effort, including a remote team debugging session, later and we have a small and running POC

Tweeting on this effort it is notable how a bunch of people answer back with an interest in this idea for potential applications in the .NET world. I’ve been approached both on twitter (@noopman) and on LinkedIn with questions on this.

This blog post is to submit our findings to the world and hope that someone will be inspired to get going with a real .NET client library for Neo4j.

Architecture

First let’s look at the architecture for creating a client library for a RESTful service. This is a more general discussion that leads to our client library POC below. If your main reason for reading is the actual sample; scroll down.

When doing a client library abstractions and domain thinking comes into play. What we want is for the forgotten users (developers) of your application to work with their normal domain objects in the business code and for them to use an API that makes sense to their domain while talking to a highly specific graph database backend store. This means you have to keep all the RESTfulness, HTTP and json away from the domain code. Let me show you a sketch and see if I can make this explanation a bit less wordy:

image

At the other side of the Cloud from our .NET code is a Java based graph database. It is friendly enough to expose a RESTful API so that we may speak to it.

Doing so we want to create a client library that is capable of a couple of things:1) Taking care of serializing objects to json (and deserializing them back again). 2) Knowing about HTTP requests and how to handle them. In our sample we employed the power of the HttpWebRequest class to handle things like HTTP Headers, Content Types, URIs etc. 3) Finally this library must understand to expose an interface to your code that you can use from a specialized Data Access Component.

To protect the business layer from knowing anything about the specifics of the Neo4j API you really should use a specialized Data Access Component. There is now law being broken if you don’t do that. However a good rule of thumb is to add abstraction layers in your system to shield your domain code in your business components from any specifics of an underlying storage structure. The details of serialization of objects to json and making requests over HTTP are not a natural fit to the business problem you are solving. The code that does this separation is code you write yourself in your application. Only you know how to adapt a foreign library and technology to your domain. The sole purpose of this code, in other words, is to abstract away from your business logic the underlying technologies used to persist and query data used in the application. This component translates between the two worlds of your business logic and the techniques used to store data in a graph database. Most of the specific technical details of storage are wrapped up in the Client Lib mind you but the Data Access Layer DAL knows about this problem domain. It understands that the persistence indeed is a REST based service over HTTP with all that this entails. This component in your code can handle synchonisity and take care of specific errors or faults that may occur when using the storage. All of this is done in a language that is fairly close to the API that Neo4j exposes. In the end this component translates into terms useful to the domain in which your business operates. Instead of saying .PostNode() in your code perhaps you instead expose a .StorePerson() action.

What we achieve is a good and sound (and classic) layering model: Business Logic – DAL – Client Lib.

There is one more observation to make in the picture and that is the matter of IoC - Inversion of Control. In order to keep your code testable, maintainable and untangled from the implementations at lower layers you should always employ the method of Inversion of Control – making your business code for instance not reference the DAL direct but instead use a common contract to do so. The Depencency Inversion Principle states briefly that a higher layer should never depend direct on a lower layer but instead depend on an abstraction. If this is true you are able to test your business logic without using the actual underlying storage. This is for instance key in a TDD – Test Driven Development development style.

Code that depends on abstractions can be tested very quickly and does not depend on specific data in a data store to be correct in order to test the intended behavior. The Business Logic uses a contract for the DAL and the DAL implements it. Then you use something called an IoC container (read on at Wikipedia in the link above) to create the DAL instances and hand them to the Business Logic employing the technique of DI -Dependency Injection. This separation enables you to modify, re-deploy or even exchange DAL components without having to touch your Business Layer at all. If separating the Business Logic from the DAL using contracts is important it is almost event more important to do so in the lower layer that in run time speaks direct to the data store! You never want to end up in the situation where in order to test your DAL code you have to have access to a specific data store with specific data in it. Not only is that an inconvenience; it also makes executing your tests run as slow as transferring data over the Internet compared to as fast as your CPU can execute code. From experience it is an obvious conclusion that tests that are slow to run are not in fact run. The conclusion from all of these IoC discussions can only be that it is imperative that your Client Library comes with an abstraction layer.

What did we do with the Architecture in our POC?

All of these architectural considerations are of course premature in a POC. This is why in our sample we did not really separate the DAL code from the Client Lib. However to prove the point of IoC and DI I did however separate out the Business Logic from the Data Access logic using a simple contract and an equally simple Factory Method Pattern. Instead of new:ing up a DAL component in my code I called a factory method to access an abstraction of the DAL.

Our POC sample

Our code sample creates a small node graph with a few relationships. Then it makes a traversal of the graph to find all who “knows” someone according to the relationships. Finally it deletes the graph again.

Here is the graph we create in our sample: http://wiki.neo4j.org/content/The_Matrix We only create names on nodes and types on relationships in our simple sample.

And here is also the code so that you may download and follow along with our explanation your self:

First, in accordance with architectural considerations (above) we get hold of our DAL using a factory method. We also create a new Person object and store him (in this case Thomas Anderson – Neo):

INeo4jStorageClient neo4JStorageClient =
Neo4jStorageFactory.GetNodeStorageClient(neo4jServerUri);

neo = new Person("Thomas Anderson");
neo4JStorageClient.StoreNode(neo);

As you can see in our business code this is just standard .NET classes being instantiated and a “Neo” Person is stored in a very normal way. In the business logic code there is no real details on how this storage happens. There is of course a “node” concept and you can consider for your self if that should be hidden away from business code too.

Since this is a POC the implementation of .StoreNode(neo) is very very trivial. If this were a real client library we would have included things like a library to transform (serialize) the neo-Person node into a json string. Now we just do it statically:

public void StoreNode(Person person)
{
    var name = person.Name;

    string jsonString = @"{""Name"":""" + name + @"""}";

    var nodeProtocol = new NodeProtocol(uri);

    var nodeId = nodeProtocol.PostNode(jsonString);
    person.Id = nodeId;
}

The interesting part here is the existence of a NodeProtocol implementation. I’ve found it useful working with a remote API like Windows Azure Storage and also with this Neo4j REST API to implement methods in C# code that correlate very closely to the API methods available. In our case we have a PostNode operation on the REST API and consequently we have a .PostNode() method in the C# implementation of that HTTP method call.

public int PostNode(string nodeData)
{
    var uri = HttpHelper.ConcatUri(baseUri, "node");

    HttpWebRequest request =
        Request.CreateWebRequest(HttpWebRequestMethod.POST, uri);

    HttpHelper.AddBody(request, nodeData);

    var response = requestMaker.MakeRequest(request);

    return HttpHelper.FindNodeId(response.Item3);
}

I won’t bore you with the exact details of all of the helper classes. To dig in to those it is better if you just download the code and look at the details your self.

As you can see we create a HttpWebRequest, add some headers to it (in the CreateWebRequest method) specifying things like request.Accept = "application/json"; meaning we accept a json formatted response. Also we specify that the body we send out is formatted as json with request.ContentType = "application/json";. Finally we make the call and return the id of the node we just created.

Again this is a solution bordering on triviality but still it is to the point of making a REST request and receiving a response. Also there is no code to handle any errors present here and the request is done synchronously.

What does the actual HTTP REST and json request and response look like for this simple sample?

Request:


POST http://try.neo4j.org:9999/node HTTP/1.1
Content-Type: application/json
Host: try.neo4j.org:9999
Content-Length: 26
Expect: 100-continue
Connection: Keep-Alive


{"Name":"Thomas Anderson"}

Response (note that the added line breaks in the response body were added by me):


HTTP/1.1 201 Created
server: grizzly/1.8.1
Content-Encoding: UTF-8
Location: http://try.neo4j.org:9999/node/1

Content-Type: application/json
Content-Length: 860
Date: Fri, 16 Apr 2010 08:36:45 GMT

{"incoming typed relationships":

http://try.neo4j.org:9999/node/1/relationships/in/{-list|&|types},

"incoming relationships":http://try.neo4j.org:9999/node/1/relationships/in,

"all relationships":http://try.neo4j.org:9999/node/1/relationships/all,

"create relationship":http://try.neo4j.org:9999/node/1/relationships,

"data":{"Name":"Thomas Anderson"},

"traverse":http://try.neo4j.org:9999/node/1/traverse/{returnType},

"property":http://try.neo4j.org:9999/node/1/properties/{key},

"self":http://try.neo4j.org:9999/node/1,

"properties":http://try.neo4j.org:9999/node/1/properties,

"all typed relationships":

http://try.neo4j.org:9999/node/1/relationships/all/{-list|&|types},

"outgoing typed relationships":

http://try.neo4j.org:9999/node/1/relationships/out/{-list|&|types},

"outgoing relationships":http://try.neo4j.org:9999/node/1/relationships/out}

As you can see the Neo4j REST API is self explanatory. Creating a node gives you back a response with all the things you can now do with this new node over the API. Since this quite a lot of data when the only thing that really matters is in the header (Location: http://try.neo4j.org:9999/node/1) I’ve already given feedback to the Neo team that I’d like a simple Uri switch (like APIDocumentation=0) to shut off the detailed response. But in fact I do like that you get so many Uris back that you may use direct. Depending on your needs this is either good or to verbose a payload to get back in your response. As you can see the node we created got the Id set to 1.

Now we proceed to create all the other nodes and in a very similar fashion the relationships between nodes. Here is the relationship that says that “Neo knows Trinity” (who has a node id of 2). This is a POST operation to /node/1/relationships:


POST http://try.neo4j.org:9999/node/1/relationships HTTP/1.1
Content-Type: application/json
Host: try.neo4j.org:9999
Content-Length: 56
Expect: 100-continue

{"type":"Knows","to":http://try.neo4j.org:9999/node/2}

Once the node graph for the Matrix is created we proceed to make a traversal of the graph. We want to find anyone who knows someone throughout the graph.

The request looks like this in REST (after a similar path through the code as above with a call from business logic to DAL and through to the client lib that makes the actual call using REST):


POST http://try.neo4j.org:9999/node/1/traverse/node HTTP/1.1
Accept: application/json
Content-Type: application/json
Host: try.neo4j.org:9999
Content-Length: 57
Expect: 100-continue
Connection: Keep-Alive

{ "relationships": { "type": "Knows" }, "max depth": 10 }

Again we simplified to a “max depth” of 10 relationships deep into the graph which works fine for our sample.

The result is a list of all the nodes that match. Because of the self explanatory API mentioned above the result also gives you actions to perform on each node that comes back in the response thus making the response body quite long. Rather than pasting the whole thing here into the blog post you can try it out your self (by downloading the application) and you will get a result exactly like this from the app:

image

Neo knows Trinity and Morpheus, who knows Cypher, who in turn knows Agent Smith.

Difficulties

We had only two real challenges in coding this POC.

The first one was an issue with streams and the optional so called Byte Order Mark (BOM) that you can put at the head of a stream. I’ve posted on this here: Would you like a Byte Order Mark to go with that?.

The other challenge was that I wrote a small app for Windows to show our results. But the guys on the Neo team are certainly not Windows users. The wanted to show the code on Mono. Just to be sure of compatibility I scaled back to a good old Windows Forms demo app in .NET Framework 3.0. This worked great on Mono which was our intent. But the drawback was that the library I’d found to handle object serialization didn’t respond well to that old version of the framework. That’s why I replaced my lib with static string creations (as you can see in the implementation of the .StoreNode() method above).

What have we learned from this?

First of all creating a client library for .NET to speak REST and json over HTTP to Neo4j is very doable; quite easy and straight forward.

Second (and that’s just a word of warning); keep your layers separated. This goes for all applications I guess but if you start to put code to create instances of web requests with web methods and content types and such in your business code you will end up in a very tangled and messy situation very fast which will be a pain and a high cost to maintain.

Finally we have considered looking into using WCF data services (and OData – the Open Data Protocol) and also possibly Linq – Language Integrated Query on the client side to make implementations in the library even more powerful and useful. And we do need a object to json serializer library.

Oh and here is another post on using the Neo4j REST API by the team itself: http://blog.neo4j.org/2010/04/neo4j-rest-server-part1-get-it-going.html

That’s it!

We hope this is inspirational enough for you to get going with your own samples or even client libraries in .NET to speak to Neo4j. If you do; please get back in touch with us!

Cheers,

M.

by Magnus Mårtensson at April 16, 2010 09:49 AM

Neo4j Blog

The Neo4j REST Server - Part1: Get it going!

Introduction

As requested and wished by many, finally Neo4j got its own standalone server mode, based on interaction via REST. The code is still very fresh and not thoroughly tested, but I thought I might write up some first documentation on it, based on the Getting Started with REST Wiki page

Installation

The first version of the distribution can be downloaded from here: zip, tar.gz. After unpacking, you just go to the unpacked directory and run (on OSX/Linux - see the wiki entry for details on Windows)
$ ./bin/neo4j-rest start
which will start the Neo4j REST server at port 9999 and put the database files under a directory neo4j-rest-db/ (lazily with the first request). Now, let's point our browser (not Internet Explorer since it doesn't send any useful Accept-headers and will get JSON back, this will be fixed later) to http://localhost:9999 and we will see the following:



Things seem to be running! The reason for the HTML interface is the Browser sending Accept: text/html. Now, setting the Accept to application/json will produce
peterneubauer$ curl -H Accept:application/json -H Content-Type:application/json -v http://localhost:9999
* About to connect() to localhost port 9999 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 9999 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.19.7 (i386-apple-darwin10.2.0) libcurl/7.19.7 zlib/1.2.3
> Host: localhost:9999
> Accept:application/json
> Content-Type:application/json

* Connection #0 to host localhost left intact
* Closing connection #0
{
"index":"http://localhost:9999/index",
"node":"http://localhost:9999/node",
"reference node":"http://localhost:9999/node/0"
}

Now, with "200 OK" this is a good starting point. We can see full references to the interesting starting points -the reference node and the index subsystem. Let's check out the reference node:
peterneubauer$ curl -H Accept:application/json -H Content-Type:application/json -v http://localhost:9999/node/0
* About to connect() to localhost port 9999 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 9999 (#0)
> GET /node/0 HTTP/1.1
> User-Agent: curl/7.19.7 (i386-apple-darwin10.2.0) libcurl/7.19.7 zlib/1.2.3
> Host: localhost:9999
> Accept:application/json
> Content-Type:application/json
>
{
"incoming typed relationships":"http://localhost:9999/node/0/relationships/in/{-list|&|types}",
"incoming relationships":"http://localhost:9999/node/0/relationships/in",
"all relationships":"http://localhost:9999/node/0/relationships/all",
"create relationship":"http://localhost:9999/node/0/relationships",
"data":{},
"traverse":"http://localhost:9999/node/0/traverse/{returnType}",
"property":"http://localhost:9999/node/0/properties/{key}",
"self":"http://localhost:9999/node/0",
"properties":"http://localhost:9999/node/0/properties",
"all typed relationships":"http://localhost:9999/node/0/relationships/all/{-list|&|types}",
"outgoing typed relationships":"http://localhost:9999/node/0/relationships/out/{-list|&|types}",
"outgoing relationships":"http://localhost:9999/node/0/relationships/out"
}
Which gives us some info about what the Node 0 can do, how to get its relationships and properties and the syntax of how to construct queries for getting properties, creating relationships etc.

Insert some data

According to RESTful thinking, data creation is handled be POST, updates by PUT. Let's insert a node:
peterneubauer$ curl -X POST -H Accept:application/json -v localhost:9999/node
* About to connect() to localhost port 9999 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 9999 (#0)
> POST /node HTTP/1.1
> User-Agent: curl/7.19.7 (i386-apple-darwin10.2.0) libcurl/7.19.7 zlib/1.2.3
> Host: localhost:9999
> Accept:application/json
>
{
...
"self":"http://localhost:9999/node/1",
"data":{},
...
}
Resulting in a new node with the URL localhost:9999/node/1 (described by the "self" property in the JSON representation) and no properties set ("data":{}). The Neo4j REST API is really trying to be explicit about possible further destinations, making it self-describing even for new users, and of course abstracting away the server instance in the future. This makes dealing with multiple Neo4j servers easier in the future. We can see the URIs for traversing, listing properties and relationships. The PUT semantics on properties work like for nodes.
We delete the node again with
curl -X DELETE  -v localhost:9999/node/1

and get 204 - No Content back. The Node is gone and will give a 404 - Not Found if we try to GET it again.

The Matrix

Now with properties encoded in JSON we can easily start to create our little Matrix example:



In order to create relationships, we do a POST on the originating Node and post the relationship data along with the request (escaping the whitespaces and others special characters):
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"Mr. Andersson"}' -v localhost:9999/node
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"Morpheus"}' -v localhost:9999/node
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"Trinity"}' -v localhost:9999/node
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"Cypher"}' -v localhost:9999/node
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"Agent Smith"}' -v localhost:9999/node
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"The Architect"}' -v localhost:9999/node

Getting http://localhost:9999/node/1, http://localhost:9999/node/2, http://localhost:9999/node/3 as the new URIs back. Now, we can connect the persons (escaping ruining readability a bit ...):
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/1","type":"ROOT"}' -v http://localhost:9999/node/0/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/2","type":"KNOWS"}' -v http://localhost:9999/node/1/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/3","type":"KNOWS"}' -v http://localhost:9999/node/2/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/4","type":"KNOWS"}' -v http://localhost:9999/node/2/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/5","type":"KNOWS"}' -v http://localhost:9999/node/4/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/6","type":"CODED BY"}' -v http://localhost:9999/node/5/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/1","type":"LOVES"}' -v http://localhost:9999/node/3/relationships

Now, pointing our browser at http://localhost:9999/node/3/relationships/all will list all relationships of Trinity:



Our first traversal

To start with, the Neo4j default Traverser framework (updated to be more powerful than the current) is supported in REST, and other implementations like Gremlin and Pipes to follow. The documentation on the traversals is in the making here. There are a number of different parameters:
http://localhost:9999/node/3/traverse/node specifies a return type of "node", returning node references. There are other return types such as relationship, position and path returning other interesting info respective. The Traverser description is pluggable and has default values - a full description looks like
{
"order": "depth first",
"uniqueness": "node path",
"relationships": [
{ "type": "KNOWS", "direction": "out" },
{ "type": "LOVES" }
],
"prune evaluator": {
"language", "javascript",
"body", "position.node().getProperty('date')>1234567;"
},
"return filter": {
"language": "builtin",
"name", "all"
},
"max depth": 2
}

To note here is the pluggable description of the "return filter" (what to include in the return) and "prune evaluator" (where to stop traversing). Right now only JavaScript is supported for writing these more complicated constructs up, but other languages are coming. Very cool. To finish, let's get all the nodes at depth 1 from Trinity via trivial traversal:
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"order":"breadth first"}' -v http://localhost:9999/node/3/traverse/node

Which just returns all nodes of all relationships types at depth one (default) as a JSON Array of node descriptions as above, in this case http://localhost:9999/node/1 and http://localhost:9999/node/2.

Summary

Having the Neo4j REST API and with it the Neo4j REST Server coming along is great news for all that want to use a graph database over the network, especially PHP or .NET clients that have no good Java bindings. Already a first client wrapper for .NET by Magnus Mårtensson from Jayway is underway, and a first PHP client is on Al James' GIThub.
This will even pave the way for higher-level sharding and distribution scenarios and can be used in many other ways. Stay tuned for a deeper explanation of the different traversal possibilities with Neo4j and REST in a next post!

by Peter Neubauer (noreply@blogger.com) at April 16, 2010 10:45 AM

Neo4j News

Neo4j Standalone REST server 0.8 available

The Neo4j team is proud to announce the availability of a REST wrapper that can be installed both on Windows, Linux and Mac OS X as a standalone server. Please follow the Getting Started guide on the Neo4j wiki where you'll find download links, the Matrix-REST example is available over here.

This makes the interaction with the Neo4j graph database from PHP, .NET and other languages like Perl a lot easier. Let us know how things work out via the Neo4j mailing list!

by Peter Neubauer (noreply@blogger.com) at April 15, 2010 04:43 PM

Jayway Team Blog

Would you like a Byte Order Mark to go with that?

It is possible to encode a little bit of metadata at the beginning of your byte streams to let the stream itself carry information on how it has been encoded. This is known as a Byte Order Mark (BOM) and it is as far as we know completely optional. Some .NET Framework implementations add this BOM to the start of streams given that they are to have a specific encoding. Here is how to find the BOM, and if necessary, remove it from your stream.

Context

Recently Neo4j graph database announced a REST interface. A couple of colleagues at Jayway are working on this. This was (very much) a too intriguing temptation to resist. I naturally had to write a small POC .NET based REST client to be able to talk to that server. (Blog post on this is pending within a few days.) While trying to send REST data in HTTP requests to the server we came across something interesting that caused compatibility issues between the Java and .NET worlds. The culprit was that optional byte order mark that I unintentionally embedded in my streams of bytes. The server did not recognize them. In a joint effort from both sides of the world (.NET client and Java server) we were able to find and identify the problem. Not that this is a new problem - It is very well documented in Wikipedia (link above). However I feel that one more post on this topic with a good test to show what’s going on is warranted.

A very good tool was also used to nail down the problem; Fiddler2. This little baby of a program is an http proxy that sits on your machine and listens to your http traffic. Using the tool you can inspect exactly what you are sending over the wire in your http requests and also view the responses coming back. Sweet stuff indeed. This is a MUST tool for you if you develop anything that sends data over HTTP.

Using Fiddler2 we were able to observe that when sending a stream of bytes as a body for an http POST the byte length, when the request came from my code, was 108 bytes. However when I used the built in request builder in Fiddler2 to reproduce and resend the exact same request it was reported that the length was 105. Why was there a difference in three (3) bytes?

Problem was that these tree bytes, as I stated above, caused the server to throw a fit. It could not recognize the body of the message to be the expected json string in UTF8 encoding. Json, I’ve learned, is always in UTF8 encoding.

The java guys googled and I binged ;~) and we found that these three bytes is an optional “byte order mark” or BOM. It is not wrong to send these three bytes and it is not wrong not to send them. Guess what? .The .NET world tends to send these extra bytes and everyone else tends not to. Why not have a bit of more incompatibility, right? It’s not like we want to talk to each other anyway!

What’s up with the BOM? What is it’s purpose?

The BOM is a way to add a kind of metadata inside of a stream of bytes instead of sending some actual metadata along next to your stream. Option one is saying “Here is my byte stream and btw it is in UTF8 encoding”. The other way to do it is saying “Here is my byte stream and if you look at the first three bytes you can read the encoding of it”. Which is better? I can’t say I care all that much other than the fact that it caused us a problem when I tried to send requests from .NET code to the Neo4j server.

How can you then – finally – handle these three bytes?

Well that depends… on what you intend to do. What I can show you is a piece of code that tells you exactly what this is and then you can copy that behavior into your code and modify it to serve your purpose. The code below should be pretty self explanatory but just to be sure. I take the string "foo" and encode it into a MemoryStream using a StreamWriter. The issue is that I tell the stream writer to use Encoding.UTF8 and this is where .NET Framework adds the BOM. The resulting stream is not 3 bytes as perhaps expected. Instead it is 6 bytes long. What you have to do if you read this stream ‘raw’ – by hand – in some library, is skip over the first three bytes. A better way to do it is to use a reader that handles the BOM. Finally if you don’t want to add the BOM in the first place you can write your bytes yourself in a more ‘raw’ fashion byte by byte to the stream. As you can see .NET Framework is good enough to have a way to find an actual BOM for different encodings. The UTF8 encoding does it this way: Encoding.UTF8.GetPreamble() (msdn library link to .GetPremable()).

[TestMethod]
public void StreamWriter_with_encoding_writes_ByteOrderMark()
{
    string str = "foo";

    var memStream = new MemoryStream();

    // Test
    var writer = new StreamWriter(memStream, Encoding.UTF8);
    writer.Write(str);
    writer.Flush();

    Assert.AreEqual(6, memStream.Length);

    var buffer = new byte[3];

    memStream.Position = 0;
    memStream.Read(buffer, 0, 3);
    string result = Encoding.UTF8.GetString(buffer);

    var preamble = Encoding.UTF8.GetPreamble();
    var byteOrderMarkUtf8 = Encoding.UTF8.GetString(preamble);

    Assert.AreEqual(byteOrderMarkUtf8, result);

    memStream.Position = 3;
    memStream.Read(buffer, 0, 3);
    result = Encoding.UTF8.GetString(buffer);

    Assert.AreEqual(str, result);

    // Alternative: Write the bytes yourself to skip the BOM

    memStream = new MemoryStream();

    foreach (var @byte in Encoding.UTF8.GetBytes(str))
    {
        memStream.WriteByte(@byte);
    }

    Assert.AreEqual(3, memStream.Length);
}

You can also if you like compare bytes for the premable one by one rather than converting to a string comparison.

As you can see this BOM can be handled easily if you like. The thing is I did not have a clue that it was there.

Oh – and btw now the Neo4j server accepts REST json bodys that both can have and skip the BOM! Good ‘bug’ to solve or ‘feature’ to have.

Cheers,

M.

by Magnus Mårtensson at April 14, 2010 06:45 AM

Peter Neubauer

Cool spatial algos with Neo4j: Part 1 - Routing with A* in Ruby

Introduction

The GIS and Spatial domain is one of the hottest areas of complex data right now, and an interesting case for using a graph database like Neo4j as well. There are several trends that come into play here:
  • The accessibility of geographic datasets with Google Mapsthe Open Street Map Project , Nokias Ovi Maps providing increasingly accurate and crowd sourced base data for geographic information
  • Startups like GeoAPI and SimpleGeo providing interesting APIs for commonly needed services in an easy-to-use manner via REST and JSON
  • Geographical standards and APIs like Web Map Service and others like KML, WFS etc, defined by bodies like the Open Geospatial Consortium (OpenGIS) and others catching on and becoming a good base to develop applications upon
  • Most new mobile devices have now GPS sensors and APIs built in, augmenting more data with location info.
As a result of the increasing availability of geospatial information it's becoming part of the core domain model for an expanding range of applications. Events, photos, running routes, logistics systems and location games all have the geographical position of the domain data as a core aspect of their data.

The traditional approach has been to create spatial indexes as a separate system, typically in PostGIS, Oracle Spatial or other specialized systems in order to then ask them certain types of queries. Afterwards, these results are intersected in set operations with the domain objects in question.

As spatial attributes and queries are now becoming intertwined with the domain models, so are the datasets. The domain objects like friends, buildings or trucks ARE the spatial "points of interest", so a separate spatial system just to do these queries means duplication of your data, resulting in synchronization problems. Queries like
  • "Give me all my friends that are listening to the same music genre, are between 20 and 40 years old, live within 50km of my current position and are online in GTalk"
span a lot of different aspects of a dataset, spatial queries just being one of them. In a graph, all of these data aspects can be modeled in the same data structure, resulting in a traversal-based approach to these queries as opposed to a set-based crunching of different index-results.

Neo4j and routing

Neo4j as a Graph Database is modeling all data as Nodes, Relationships and Properties on them. Since this is a very flexible model, arbitrary data can be added to any domain model in order to build for instance spatial indexes and logic right into the data itself. In a typical application, a subgraph could contain some road network like this:

Small road network.


The advantage of having the spatial data in the graph is that at any time the different points can be related and incorporated into different indexes, e.g. K-trees, QuadTrees and other interesting approaches.

Since Neo4j has very fast traversal performances, we can employ classic deep graph algos like the A* algorithm that is effectively a cost minimizing deep traversal through a network.


Original from Wikipedia, see here


Now, in Neo4j an A* implementation in Java looks like this. Basically, nodes are evaluated regarding their accumulated cost from the start node, and the "as the crow flies" remaining distance to the end node. For this, we need a EstimateEvaluator which can calculate the remaining distance (as the geodetic distance) in meters (in Ruby):
#implements a Java Interface
class GeoCostEvaluator
  include EstimateEvaluator
  def getCost(node, goal)
    distance(node.getProperty("lat"), node.getProperty("lon"), goal.getProperty("lat"),goal.getProperty("lon"))
  end
end

#calculates a geodetic distance between two spatial points
def distance(start_lat, start_lon, other_lat, other_lon)
  latitude1 = start_lat.to_f * Math::PI/180 #in radian
  longitude1 = start_lon.to_f * Math::PI/180 #in radian
  latitude2 = other_lat.to_f * Math::PI/180 #in radian
  longitude2 = other_lon.to_f * Math::PI/180 #in radian
  cLa1 = Math.cos( latitude1 );
  x_A = RADIUS_EARTH * cLa1 * Math.cos( longitude1 )
  y_A = RADIUS_EARTH * cLa1 * Math.sin( longitude1 )
  z_A = RADIUS_EARTH * Math.sin( latitude1 );

  cLa2 = Math.cos( latitude2 );
  x_B = RADIUS_EARTH * cLa2 * Math.cos( longitude2 )
  y_B = RADIUS_EARTH * cLa2 * Math.sin( longitude2 )
  z_B = RADIUS_EARTH * Math.sin( latitude2 )
  
  #in meters
  distance = Math.sqrt( ( x_A - x_B ) * ( x_A - x_B ) + ( y_A - y_B ) * ( y_A - y_B ) + ( z_A - z_B ) * ( z_A - z_B ) )
end
We also need a CostEvaluator which calculates the cost associated with choosing a certain path using the stored information in the graph (here the cost is attached to the relationships representing roads as meters). In our case, we'll model this in our domain as a "cost" property attached to the edges representing roads:
#domain model
class Road
  include Neo4j::RelationshipMixin
  
  property :cost
end

class Waypoint
  include Neo4j::NodeMixin
  
  # neo4j node properties
  property :lat, :lon, :name
  
  # lucene indexed node properties
  index :name
  
  # relationships to other waypoints
  has_n(:road).to(Waypoint).relationship(Road)
  
  # for this example just calculate the straight distance between points as cost
  def connect(other)
    self.road.new(other).update(:cost => distance(self.lat, self.lon, other.lat, other.lon))
  end
  def to_s
    "Waypoint, #{name}, lon=#{lon}, lat=#{lat}"
  end
end
Then, let's look up the coordinates of cities using Yahoo!'s geocoding webservice with the name and state of the city. It makes the creation of a sample a lot easier:
# create point, and look up the location via Yahoo! MapsService
def create_waypoint(city, state)
  url = "http://local.yahooapis.com/MapsService/V1/geocode?appid=#{APP_ID}"
  res = Net::HTTP.get(URI.parse( URI.escape(url + "&state=#{state}&city=#{city}") ))
  
  lat = res.slice(/Latitude\>(.*)\\/Latitude/,1)
  lon = res.slice(/Longitude\>(.*)\\/Longitude/,1)
  point = Waypoint.new :name=>city, :lon=>lon, :lat=>lat
  puts point
  point
end
Now, populating the system with a few cities and connect them is as easy as:

# populating the routing test
Neo4j::Transaction.run do
  NYC = create_waypoint('New York', 'New York')
  KAN = create_waypoint('Kansas City', 'Kansas')
  SFE = create_waypoint('Santa Fe', 'New Mexico')
  SEA = create_waypoint('Seattle', 'Washington')
  SF = create_waypoint('San Francisco', 'CA')
  NYC.connect(KAN)
  NYC.connect(SEA)
  SEA.connect(SF)
  KAN.connect(SFE)
  SFE.connect(SF)
end

The resulting routing network looks something like this:

Simplerouting-2
Now finding the optimal route is simply a matter of calling the Java A* algo form Ruby:

# Finding the route
sp = AStar.new( Neo4j::instance, RelationshipExpander.forTypes( DynamicRelationshipType.withName('Waypoint#road'), Direction::BOTH),
				        DoubleEvaluator.new("cost") , GeoCostEvaluator.new)
path = sp.findSinglePath(NYC._java_node, SF._java_node)
nodes = path.getNodes.iterator
until !nodes.hasNext() do 
  puts nodes.next.getProperty('name')
end

In our case, populating the network and finding a route from New York to San Francisco will look like this:

Waypoint, New York, lon=-74.007124, lat=40.714550
Waypoint, Kansas City, lon=-94.626824, lat=39.113380
Waypoint, Santa Fe, lon=-105.937406, lat=35.691543
Waypoint, Seattle, lon=-122.329439, lat=47.603560
Waypoint, San Francisco, lon=-122.420139, lat=37.779600
New York
Kansas City
Santa Fe
San Francisco
Which is depicted here:

Simplerouting-3

This is a very trivial example, but the A* algorithm works well in Neo4j with routes hundreds and thousands of hops deep, on millions of roads and waypoints, imported for instance from OpenStreetMaps. A Neo4j plugin to the Osmosis project to export that data is underway and will be part of another spatial blog entry.

Summary

The A* algorithm is not the best of solutions for advanced routing in real life conditions, but it provides a first measure of the mechanics and performance characteristics involved when operating on big graphs with deep traversals. Of course, obvious improvements are:
  • Introduce a bi-directional approach (two A* starting from start and end node, meeting in the middle), thus cutting time
  • Introduce "via" waypoints like cities or big traffic hubs, cutting the total distance into shorter chunks that can be calculated in parallel
  • introduce higher-order graphs that describe the logistics backbone like Highways and Autobahns and first find the nearest connection to that network, go near the end point and return to the local road network for the final distance
  • Introduce better cost functions that take into account not only distance but also allowed speed, road conditions, one-way lanes etc.
  • Factor in customer context like the current used car, family preferences and so on.

Still being able to do these calculations in sub-second speeds on graphs of millions of roads and waypoints makes it possible in many cases to abandon the normal approach of precomputing indexes with K/V stores and be able to put routing into the critical path with the possibility to adapt to the live conditions and build highly personalized and dynamic spatial services.

The full source code and project setup is found at github.com/neo4j-examples/ruby-astar-routing. Play with the code, add cities, connections and have some fun with the routing! There is even a Java version at http://github.com/neo4j-examples/java-astar-routing.

Feel free to comment or contact me at peter at neotechnology dot com. You're welcome to the Neo4j mailing lists as well.

by Peter Neubauer at April 06, 2010 09:32 PM

Neo4j News

Google Summer of Code Projects with Neo4j

Great news for all students wanting to spend the summer hacking on Neo4j and getting paid for it! The hackers behind Neo4j have teamed up with a few mentoring organizations on some interesting Summer of Code projects!

So far there are two proposed projects involving Neo4j, but more projects could be added in the next couple of days. You could of course also come up with a cool project on your own and propose it to one of the mentoring organizations.

The first project we have proposed is a collaboration with the uDIG project. This project is part of our greater effort of making Neo4j capable of being a fully featured spatial database. The sweet spots are complex spatial algorithms such as routing, spatial analytics and N-dimensional indexing. This summer of code project will aim at implementing the OpenGIS and GeoTools standard interfaces. This will make Neo4j a drop-in alternative to other GIS backends, such as PostGIS, MySQL Spatial and Oracle Spatial.

The second project we have proposed is a collaboration with the Gephi project. What we aim for here is to make Gephi able to visualize graphs straight out of a Neo4j graph (not just exporting data from Neo4j and importing it in Gephi). This will enable Gephi to work with larger graphs than today, which is a direct benefit for Gephi. It will also enable Gephi to be used as a visualization tool for Neo4j based applications, for introspection, data debugging and graphical reporting.

This will be a truly great summer for Neo4j and the lucky students who get selected for these projects. There's a lot of interest in both GIS and visualization in the Neo4j community, so we look forward to great discussions around these projects! Mark March 29 in your calendar and submit your proposal early!

by Tobias Ivarsson (noreply@blogger.com) at March 26, 2010 12:53 PM

Neo4j Blog

Neo4j meetup at Twitter HQ Wed Mar 31 (NEW DATE & TIME)

Update: NEW DATE & TIME! The meetup has been moved to Wed Mar 31 because the speaker (Emil) is unable to speak. (And what good is a meetup if you're unable to speak?) For more info, see below. Sorry for the late notice and hope you can make it on Wednesday instead!

If you are in the San Francisco bay area, you should mark down next Thu Mar 25th Wed Mar 31st in your calender. Why? Because Twitter is hosting a Neo4j meetup! So if you're interested in NOSQL, graph databases, Neo4j or beer (or all of the above) then please register and join us at the Twitter HQ in San Francisco.

The meetup will start with pizza at 6.30pm and then we get going with a Neo4j presentation at 7pm, followed by beer, discussions and code in random order.

When: Thursday, Mar 25 2010 at 6.45pm Wednesday, Mar 31 2010 at 6.30pm
Where: Twitter HQ at 795 Folsom ave, 6th floor, San Francisco

Please register here, then bring your laptop and an apetite for graphs. Be there or be [ ]!

by Emil Eifrem (noreply@blogger.com) at March 26, 2010 12:38 AM

Neo4j Blog

CTO of Amazon: "Neo4j absolutely ROCKS!"

There was a NOSQL smackdown at the South by Southwest conference hosted by the changelog. It's a fun 45 minute clip with lots of good-humored banter between representatives from Cassandra, MongoDB, CouchDB and SimpleDB.

The latter is represented by none other than Werner Vogels, the CTO of Amazon, godfather of Dynamo (from which all current key-value stores are inspired) and probably the top expert in the world at building large-scale distributed systems.

The best part is at the end (around minute 45), when the participants were asked to name their favorite nosql systems, except their own. Werner Vogels concluded with saying:

"For anything with multiple relationships, multiple connections, Neo4j absolutely ROCKS!"

A statement we wholeheartedly agree with. :) Listen to the entire clip here.

by Emil Eifrem (noreply@blogger.com) at March 24, 2010 07:51 AM

Neo4j Blog

Modeling categories in a graph database

Storing hierarchical data can be a pain when using the wrong tools. However, the Neo4j open source graph database is a good fit to this kind of problems, and this post will show you an example of how it can be used. To top it off, today it's time to have a look at the Neo4j Python language bindings as well.

Introduction

A little background info for newcomers: Neo4j stores data as nodes and relationships, with key/value style properties on both. Relationships connect two different nodes to each other, and are typed and directed. Relationships can be traversed in both directions (the direction can also be ignored when traversing if you like). You can create any relationship types; they are identified by their name.

For a quick introduction to the Neo4j Python bindings, have a look at the Neo4j.py component site. There's also slides and video from a PyCon 2010 presentation by Tobias Ivarsson of the Neo4j team, who also contributed the Python code for this blog post.

This blog post only contains simplified snippets of code, to get full working source code - which exposes a domain layer on top of the underlying graph data - go to:

If you take a look at a site like stackoverflow.com you will find many questions on how to store categories or, generally speaking, hierarchies in a database. In this blog post, we're going to look at how to implement something like what's asked for here using Neo4j. However, using a graph database will allow us to bring the concept a bit further.

Data model

It may come as a surprise to some readers, but even though we're using a graph database here, we'll use a common Entity-Relationship Diagram. The entities we want to handle in this case are categories and products. The products holds attribute values, and we want to be able to define types and constraints on these attributes. The attributes that products can hold are defined on categories and inherited to all descendants. Products, categories and attribute types are modeled as entites, while the attributes have been modeled as relationships in this case. Categories may contain subcategories and products. So this is the data model we end up with:

What can't be expressed nicely in the ER-Diagram are the attribute values, as the actual names of those attributes are defined as data elsewhere in the model. This mix of metadata and data may be a problem when using other underlying data models, but for a graph database, this is actually how it's supposed to be used. When using an RDBMS with it's underlying tabular model, the Entity-Attribute-Value model is a commonly suggested way of dealing with the data/metadata split. However, this solution comes with some downsides and hurts performance a lot.

That was it for the theoretical part, let's get on to the practical stuff!

Node space

What we want to do is to transfer the data model to the node space - that's Neo4j lingo for a graph database instance, as it consists of nodes and relationship between nodes. What we'll do now is to simply convert some of the terminology from the Entity-Relationship model to the Neo4j API:

ER-modelNeo4j
EntityNode
RelationshipRelationship
AttributeProperty

That wasn't too hard, was it?! Let's put some example data in the model and have a look at it (click for big image):

The image above gives an overview; the rest of the post will get into implementation details and good practices that can be useful.

Getting to the details

When a new Neo4j database is created, it already contains one single node, known as the reference node. This node can be used as a main entry point to the graph, and next we'll show a useful pattern for this.

In most real applications you'll want multiple entry points to the graph, and this can be done by creating subreference nodes. A subreference node is a node that is connected to the reference node with a special relationship type, indicating it's role. In this case, we're interested in having a relationship to the category root and one to the attribute types. So this is how the subreference structure looks in the node space:

Now someone may ask: Hey, shouldn't the products have a subreference node as well?! But, for two reasons, I don't think so:

  1. It's redundant as we can find them by traversing from the category root.
  2. If we want to find a single product, it's more useful to index them on a property, like their name. We'll save that one for another blog post, though.
Note that when using a graph database, the graph structure lends itself well to indexing.

As the subreference node pattern is such a nice thing, we added it to the utilities. The node is lazily created the first time it's requested. Here's whats needed to create an ATTRIBUTE_ROOT typed subreference node:

import neo4j
from neo4j.util import Subreference
attribute_subref_node = Subreference.Node.ATTRIBUTE_ROOT(graphdb)

... where graphdb is the current Neo4j instance. Note that the subreference node itself doesn't have a "node type", but is implicitly given a type by the ATTRIBUTE_ROOT typed relationship leading to the node.

The next thing we need to take care of, is connecting all attribute type nodes properly with the subreference node. This is simply done like this:

attribute_subref_node.ATTRIBUTE_TYPE(new_attribute_type_node)

Always doing like this when adding a new attribute type makes the nodes easily discoverable from the ATTRIBUTE_ROOT subreference node:

Similarly, we want to have a subreference node for categories, and in this case we also want to add a property to the subreference node. Here's how this looks in Python code:

category_subref_node = Subreference.Node.CATEGORY_ROOT(graphdb, Name="Products")

This is how it will look after we added the first actual category, namely the "Electronics" one:

No let's see how to add subcategories. Basically, this is what's needed to create a subcategory in the node space, using the SUBCATEGORY relationship type:

computers_node = graphdb.node(Name="Computers")
electronics_node.SUBCATEGORY(computers_node)

To fetch all the direct subcategories under a category and print their names, all we have to do is to fetch the relationships of the corresponding type and use the node at the end of the relationship, just like this:

for rel in category_node.SUBCATEGORY.outgoing:
print rel.end['Name']

There's not much to say regarding products, the product nodes are simply connected to one category node using a PRODUCT relationship:

But how to get all products in a category, including all it's subcategories? Here it's time to use a traverser, defined by the following code

class SubCategoryProducts(neo4j.Traversal):
types = [neo4j.Outgoing.SUBCATEGORY, neo4j.Outgoing.PRODUCT]
def isReturnable(self, pos):
if pos.is_start: return False
return pos.last_relationship.type == 'PRODUCT'

This traverser will follow outgoing relationships for both SUBCATEGORY and PRODUCT type relationships. It will filter out the starting node and only return nodes reached over a PRODUCT relationship. This is then how to use it:

for prod in SubCategoryProducts(category_node):
print prod['Name']

At the core of our example is the way it adds attribute definitions to the categories. Attributes are modeled as relationships between a category and an attribute type node. The attribute type node holds information on the type - in our case only a name and a unit - while the relationship holds the name, a "required" flag and in some cases a default value as well. From the viewpoint of a single category, this is how it is connected to attribute types, thus defining the attributes that can be used by products down that path in the category tree:

Our last code sample will show how to fetch all attribute definitions which apply to a product. Here we'll define a traverser named categories which will find all categories for a product. The traverser is used by the attributes function, which will yield all the ATTRIBUTE relationship. A simple example of usage is also included in the code:

def attributes(product_node):
"""Usage:
for attr in attributes(product):
print attr['Name'], " of type ", attr.end['Name']
"""
for category in categories(product_node):
for attr in category.ATTRIBUTE:
yield attr

class categories(neo4j.Traversal):
types = [neo4j.Incoming.PRODUCT, neo4j.Incoming.SUBCATEGORY]
def isReturnable(self, pos):
return not pos.is_start

Let's have a final look at the attribute types. Seen from the viewpoint of an attribute type node things look this way:

As the image above shows, it's really simple to find out which attributes (or categories) are using a specific attribute type. This is typical when working with a graph database: connect the nodes according to your data model, and you'll be fine.

Wrap up

Hopefully you had some fun diving into a bit of graph database thinking! These should probably be your next stops on the way forward:

by Anders Nawroth (noreply@blogger.com) at March 23, 2010 05:43 PM

Neo4j News

Update on Neo4j Ruby bindings

During 2010 there's been two releases of the Neo4j.rb JRuby bindings for the Neo4j graph database so far. Time to catch up with what's new!

Version 0.4.0 of Neo4j.rb came with improved traversal performance, more options on how to use relationships, for instance relationships can now be indexed. Version 0.4.1 gave us migrations and access to a batch inserter for big import-once data volumes.

In the last days of 2009, neo4jr-simple was first released. It's a simple ready to go wrapper for Neo4j and currently in version 0.2.1. Make sure to check out the neo4jr-social example application as well!

Great thanks to Andreas Ronge, Matthew Deiters and all the other contributors for the awesome stuff!

by Anders Nawroth (noreply@blogger.com) at March 12, 2010 01:32 PM

Neo4j Blog

Access control lists the graph database way

In many contexts you need to handle user permissions to access, create or change some kind of resources. A common example is a file system, and that's what we are going to dive into in this blog post. We're going to use Ruby bindings for the Neo4j graph database to create a small - but working - example application.

Preparation

To set up the environment for this example on Ubuntu, I used the following commands:

sudo apt-get install jruby
sudo jruby -S gem install neo4j

To import the libraries, the following code was used:

require 'rubygems'
require 'neo4j'
require 'neo4j/extensions/find_path'

Heading for the node space

So user permissions, what are they all about? Obviously it's about users, and usually user groups as well. We'll abstract this away a bit and use the term principals, which can be single users or groups.

The other side of user permissions are the resources which are to be protected. In our case we'll have a file system, so there will be folders and files. Here we'll use the term content.

Let's start out building a graph to support the application from what we have gathered so far! When working with a graph it's beneficial to think in a graphy manner, so that's where we'll begin. Graphs are presumably about connecting things, so our first step is to create some relationships. Neo4j comes with a built-in reference node, which is easily accessible at all times. We use this to create our own "subreference nodes", one for principals and one for content. This is how our graph looks so far:

To create (and get) the subreference nodes, we use this function:

def get_or_create_sub_ref( name )
result = Neo4j.ref_node.rels.outgoing( name ).nodes.first
if ( result.nil? )
result = Neo4j::Node.new :name => name.to_s.capitalize.gsub("_", " ")
Neo4j.ref_node.rels.outgoing( name ) result
end
return result
end

This function is then called whenever we need to use a subreference node. The important parts here are:

  • ref_node: the built-in reference node
  • rels: relationships connected to a node
  • outgoing: the direction of the relationship (the relationships are always directed, but you can choose to ignore the direction in traversals)
  • ( name ): the type of relationships to follow (the type can be ignored in traversals as well, but in our case we want to use it)
  • nodes: the nodes in the other end of the relationships
  • first: the first node found - there sould only be one subreference node of each type

If the subreference node isn't found, it will be created and connected to the reference node. As you can see, we're adding a property with the key name to the nodes as well, which is there solely for the purpose of visualization (the images in this post are created using Neoclipse).

Basic structure

For the principals part, we are going to connect the top-level ones to the corresponding subreference node using a PRINCIPAL type of relationship. Other than that, there's just users and groups, so let's use a IS_MEMBER_OF_GROUP relationship type to encode that. This is how that looks in the graph:

And here's the code to create it:

def new_principal( name, member_of_groups = [] )
principal = Neo4j::Node.new
principal[ :name ] = name
if member_of_groups.empty?
get_or_create_sub_ref( :PRINCIPALS ).rels.outgoing( :PRINCIPAL ) principal
else
for group in member_of_groups
principal.rels.outgoing( :IS_MEMBER_OF_GROUP ) group
end
end
return principal
end

If a new principal isn't member of any groups, it's added as a top-level principal, connected to the principals subrefererence node. In other case, it's simply added to the groups.

With Neo4j all operations on the graph have to be encapsulated in a transaction, so this is how we'll call the above function:

Neo4j::Transaction.run do
all_principals = new_principal( "All principals" )
root = new_principal( "root", [ all_principals ] )
regular_users = new_principal( "Regular users", [ all_principals ] )
user1 = new_principal( "user1", [ regular_users ] )
user2 = new_principal( "user2", [ regular_users ] )
end

For the content part, things are very similar to the principals part. The main difference is that in this case, an item can have only a single parent item. Here's the graphical view on that:

And this is the code to create the structure:

def new_content( name, parent = nil )
content = Neo4j::Node.new
content[ :name ] = name
if ( parent.nil? )
get_or_create_sub_ref( :CONTENT_ROOTS ).rels.outgoing( :CONTENT_ROOT ) content
else
parent.rels.outgoing( :HAS_CHILD_CONTENT ) content
end
return content
end

Similar to how the principals were created, this is the code to create the content data:

Neo4j::Transaction.run do
root_folder = new_content( "Root folder" )
temp_folder = new_content( "Temp", root_folder )
home_folder = new_content( "Home", root_folder )
user1_home_folder = new_content( "user1 home", home_folder )
user2_home_folder = new_content( "user2 home", home_folder )
a_file = new_content( "MyFile.pdf", user1_home_folder )
end

At the core

Now that we have the basic structure in place, what's left regarding our data is a small but crucial part: the permissions information! We're using a simple scheme: adding security relationships with optional boolean flags for read and write permission. Not much to say here, this is what we want the full graph to look like (click for a bigger version):

A small function will help us add the security information:

def apply_security( content, principal, map_with_flags )
security_relationship = Neo4j::Relationship.new( :SECURITY, principal, content )
map_with_flags.each_pair {|key, value| security_relationship[ key ] = value}
end

It's time to add the security data:

Neo4j::Transaction.run do
apply_security( root_folder, root, { "w" => true } )
apply_security( root_folder, all_principals, { "r" => true } )
apply_security( temp_folder, all_principals, { "w" => true } )
apply_security( user1_home_folder, regular_users, { "r" => false, "w" => false } )
apply_security( user1_home_folder, user1, { "r" => true, "w" => true } )
apply_security( user2_home_folder, user2, { "r" => true, "w" => true } )
end

To check the permission for some action by an actual principal for some content, there's some work to do. This is the algorithm we use to retrieve a permission flag:

  1. Move from the content node and upwards through the file system structure and investigate each level for permission information.
  2. On each level, see if there are any principals related to or identical with the principal concerned.
  3. Make sure to use the permission information from the principal closest to the principal concerned.
  4. If permission information was found, return it; otherwise, continue traversing to the next level in the file system.

In the code for this, we'll use a function named depth_of_principal() to calculate the distance between the principal we have traversed to and the principal concerned. More on that later, here's the code to check the permissions:

def has_access( content, principal, flag )
for current_content in content.incoming( :HAS_CHILD_CONTENT ).depth( :all )
lowest_score = nil
lowest_modifier = nil
for rel in current_content.rels.incoming( :SECURITY )
rel_principal = rel.start_node
if !rel[ flag ].nil?
score = depth_of_principal( rel_principal, principal )
if !score.nil?
modifier = rel[ flag ]
if lowest_score.nil? || score lowest_score ||
( score == lowest_score && modifier )
lowest_score = score
lowest_modifier = modifier
end
end
end
end
if !lowest_modifier.nil?
return lowest_modifier
end
end
return false
end

Here's our function to check the distance between principals (and to see if they're on the same path at all).

def depth_of_principal( principal, reference_principal )
result = reference_principal.outgoing( :IS_MEMBER_OF_GROUP ).depth( :all ).path_to( principal )
return result.nil? ? nil : result.size
end

Finally, we want to see that everything works, so here's a utility function to print permission information:


def print_has_access( content, principal, flag )
print principal[ :name ] + " +" + flag.upcase + " access to " + content[ :name ] + "? " +
has_access( content, principal, flag ).to_s + "\n"
end

And here's how to use the function:

Neo4j::Transaction.run do
print_has_access( home_folder, root, "w" )
print_has_access( home_folder, user1, "w" )
print_has_access( a_file, root, "r" )
print_has_access( a_file, user2, "r" )
print_has_access( a_file, user1, "w" )
end

Next steps

The full source code is found here

Here's a few useful resources to help you on your way:

Thanks for reading - any feedback is welcome!

by Anders Nawroth (noreply@blogger.com) at February 25, 2010 04:55 PM

Neo4j News

Neo4j 1.0 released

Recently version 1.0 of Neo4j was released. There has been a Neo Technology news post regarding this event, as well as a blog post on how to get to know Neo4j. The distribution is available as binary and source packages from the downloads page.

For more information, read the list mail announcement and check out the details in the changelog.

A few pointers to stuff that happened around and after the release:
As always, feedback to the mailing list, on Twitter or directly to us.

by Anders Nawroth (noreply@blogger.com) at February 23, 2010 03:41 PM