Planet Neo4j


Neo4j Blog

Neo4j 2.0.2 Maintenance Release

Today we released the 2.0.2 maintenance release of Neo4j. This release comes with some critical stability improvements as well as a few small but handy Cypher type conversion functions. All Neo4j users are strongly recommended to upgrade to this release. Head on over to http://www.neo4j.org/download to upgrade to Neo4j 2.0.2. Neo4j 2.0.2 does not require any store-level upgrades from the

by Kenny Bastani (noreply@blogger.com) at April 15, 2014 05:58 AM

Neo4j Blog

Neo4j 2.0.1 Community Released on Windows Azure VM Depot

We have released a Linux distribution of Neo4j 2.0.1 community on Windows Azure's VM Depot website. Users of Windows Azure are now able to copy a platform image of Neo4j 2.0.1 directly from the VM Depot. Once provisioned, a fresh Neo4j database instance is made available via HTTP through port 7474. Check out the slides below for instructions on how to setup and provision the virtual machine

by Kenny Bastani (noreply@blogger.com) at March 19, 2014 12:03 AM

Neo4j Blog

Spring Data Neo4j Progress Update: SDN 3 & Neo4j 2

The 3.0.1 Release Spring Data Neo4j 3.0 was recently rolled out by the Spring Data team as part of the "Codd" release train of the Spring Data projects. We’re happy to announce this milestone, which can be used effective immediately to start developing Neo4j 2.0 applications with Spring. Today, Spring Data Neo4j 3.0.1 containing some necessary updates was released. The Spring Data Neo4j 3

by Michael Hunger (noreply@blogger.com) at March 14, 2014 01:29 PM

Neo4j Blog

Blog Post: Community Support

First of all, I want to say how happy I am to be part of such a great community around Neo4j. What makes the Neo4j community so impressive is not just the fact that many users apply graphs in so different contexts, but especially the supporters of the community who help you and us by jumping in, answering questions, offering advice and solving problems. People like Wes Freeman, Luanne

by Michael Hunger (noreply@blogger.com) at March 07, 2014 01:56 AM

Neo4j Blog

Graph Gist Winter Challenge Winners

To be honest, we were blown away. When starting this challenge we were really excited and curious about the results. But what YOU created and submitted is just impressive. We received 65 submissions in the 10+ categories. Well done! Make sure to check them out, each one is a jewel on its own and there are many surprises hidden in these submissions. And if you get started with Neo4j

by Michael Hunger (noreply@blogger.com) at March 01, 2014 03:54 AM

Neo4j Blog

RECAP: DeveloperWeek 2014 and GraphPUB SF

We had a fantastic time at DeveloperWeek in San Francisco last week! It was great to see so many graphistas attend events around the city, which resulted in an awesome week connecting with the local Neo4j ecosystem. We kicked things off with our (Neo4j)-[:POWERS]->(love) meetup on Wednesday, Feb. 12 at EngineYard with presentations by Andreas Kollegger, Amanda Laucher and Felienne

by AdamH (noreply@blogger.com) at February 25, 2014 01:11 PM

The Neo4j 2.1.0 Milestone 1 Release - Import and Dense Nodes

We're pleased to announce the release of Neo4j 2.1 Milestone 1, the first drop of the 2.1 release schedule whose dual goals are productivity and performance. In this release we've improved the experience at both ends of the Neo4j learning curve. On the data import side, we now support CSV import directly in the Cypher query language. For large, densely connected graphs we've changed the way

by Michael Hunger (noreply@blogger.com) at February 25, 2014 12:01 AM

Neo4j Blog

Neo4j 2.0.1 Maintenance Release

Mark Needham Today we’re releasing the latest version of the 2.0 series of Neo4j, version 2.0.1. For more details on Neo4j 2.0.0 see the December release blog post. This is a maintenance release and has no new features although it contains significant stability and performance improvements. We’ve made some improvements to the way updates are propagated around an HA cluster and have

by Michael Hunger (noreply@blogger.com) at February 04, 2014 08:39 PM

Neo4j Blog

The first GraphGist Challenge completed

We're happy to announce the results of the first GraphGist challenge. Anders Nawroth First of all, we want to thank all participants for their great contributions. We were blown away by the high quality of the contributions. Everyone has put in a lot of time and effort, providing thoughtful, interesting and well explained data models and Cypher queries. There was also great use of graphics,

by Anders Nawroth (noreply@blogger.com) at February 02, 2014 01:25 AM

Neo4j Blog

Importing data to Neo4j the spreadsheet way in Neo4j 2.0!

Hi all graphistas out there, And happy new year! I hope you had an excellent start, let's keep this year rocking with a spirit of graph-love! Our Rik Van Bruggen did a lovely blog post on how to import data into Neo4j using spreadsheets in March last year.  Simple and easy to understand but only for Neo4j version 1.9.3. Now  it’s a new year and in December we launched a shiny new

by Pernilla Lindh (noreply@blogger.com) at January 23, 2014 06:17 PM

Neo4j Blog

The Winter GraphGist Challenge

Happy New Year! We’re happy to announce that we’ve extended December's GraphGist Challenge until January 31, 2014! This gives you a few more weeks to submit or improve your entries to maximize your chances to WIN a $300 Amazon.com gift card and more prizes for any of the 10 categories. You can win in multiple categories, so feel free to create as many submissions as you like. All

by AdamH (noreply@blogger.com) at January 06, 2014 10:52 PM

Neo4j Blog

Neo4j 2.0 GA - Graphs for Everyone

A dozen years ago, we created a graph database because we needed it. We focused on performance, reliability and scalability, cementing a foundation for graph databases with the 0.x series, then expanding the features with the 1.x series. Today, we announce the first of the 2.x series of Neo4j and a commitment to take graph databases further to the mainstream. Neo4j 2.0 has been brewing

by Andreas Kollegger (noreply@blogger.com) at December 16, 2013 11:41 PM

Chris Gioran

Software sympathy, or how I learned to stop worrying and love the Garbage Collector

Mechanical sympathy is a term originating from Formula 1 where it was used to describe a driving style that takes into consideration the mechanical properties of the engine and the car as a whole, leading to better performance of the vehicle-driver combo. Martin Thompson has taken this term and applied it in software engineering, demonstrating how understanding the architecture of the various components of a computing machine and the way they interact can lead to writing vastly more efficient code.
Starting from this idea, I wondered how it could be applied in a purely software world - how to write code that takes advantage of the way other software works. Being a Java programmer, the answer is actually quite obvious: look at how the JVM performs a task on behalf of your code and optimize things so they operate in harmony.

Which component to choose, though? My focus ended up being on the Garbage Collector, and in particular on the Garbage First (G1) implementation that is available with Java 7. Specifically, I wanted to understand the performance impact of setting references in Java objects, an effect associated with the write barrier and the way it is implemented in G1.

 

The Write Barrier you say?

The write barrier is a piece of code executed by garbage collectors whenever a reference to an object is set. It is part of the bookkeeping done by the collector and allows the garbage in the heap to be traced and collected when the time comes. The implementation and use are of course specific to each garbage collector, and here we'll look at the case of G1.
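To make the idea concrete, here is a toy simulation of the region check such a barrier performs. This is purely illustrative - the real G1 barrier is emitted by the JIT compiler inside HotSpot, not written in Java - and the addresses, region size and method name below are all hypothetical.

```java
// Illustrative simulation only: models the "same region?" test that G1's
// write barrier performs on a reference store. All names are hypothetical.
public class WriteBarrierSketch {
    static final int REGION_SIZE = 1 << 19; // 512 KB, as for a 1 GB G1 heap

    // Conceptually, G1 compares the regions of the holder and the target
    // and records the card as dirty only for cross-region stores.
    static boolean needsCardMark(long holderAddr, long targetAddr) {
        return (holderAddr / REGION_SIZE) != (targetAddr / REGION_SIZE);
    }

    public static void main(String[] args) {
        // Two objects in the same region: the barrier degenerates to a no-op.
        System.out.println(needsCardMark(0, 100));          // false
        // Objects in different regions: the card must be marked dirty.
        System.out.println(needsCardMark(0, REGION_SIZE));  // true
    }
}
```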

Garbage First

Roughly, G1 works by slicing the heap into equal-sized pieces, or regions, and satisfies requests for memory from one region at a time, moving through regions as they fill up. When heap memory gets low, G1 mostly-concurrently collects regions that are mostly filled with garbage, until it reaches a target size for available heap (or a time limit is reached, allowing for soft real-time behaviour). The regions are useful because, since they are collected as a whole, it is not necessary to keep information about objects pointing to other objects in the same region - the scan during the mark phase will go through everything anyway. This in turn means that when setting a reference to an object in the same region as the reference holder, the write barrier is a no-op, while setting cross-region references costs somewhat extra.

The above is something we can test for. We can devise an experiment to benchmark the difference between setting intra- vs extra-region references. If the difference turns out to be significant, we can prefer, when writing code, to allocate together objects that are expected to point to each other, gaining much better throughput at the cost of structuring our allocation strategy accordingly.


Moving on to measuring things

In this case the benchmark is quite simple - we need to do both same-region and cross-region reference setting and see how they compare. To do that we'll need to lay out our objects in memory in a predictable way, so that the only difference between our comparative runs is the relative position of the referenced objects. As it turns out, that is not that hard to do.
If we allocate a fixed heap size (by setting -Xms and -Xmx to be the same) and we know the size of the Java objects we'll be allocating, we can, given the fixed number of regions created, calculate how many objects fit in each region and in turn figure out which object needs to point to which in order to get extra- or intra-region references.

A region's size is the average heap size ((min size + max size) divided by 2) divided by 2048. That means that for a heap of 1GB, each region is 2^30/2^11 = 2^19 bytes (512KB). If Ballast objects are 32 (2^5) bytes, then we require 2^19 / 2^5 = 16384 objects of that class to fill a region.
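The arithmetic above can be checked with a few lines of Java (the helper name is mine, not from the original benchmark):

```java
public class RegionMath {
    // G1 aims for roughly 2048 regions: region size = average heap size / 2048.
    // Here min == max == 1 GB, so the average is just the heap size.
    static long regionSize(long minHeap, long maxHeap) {
        return ((minHeap + maxHeap) / 2) / 2048;
    }

    public static void main(String[] args) {
        long heap = 1L << 30;                   // 1 GB: -Xms1024m -Xmx1024m
        long region = regionSize(heap, heap);   // 2^30 / 2^11 = 2^19 bytes
        long objectsPerRegion = region / 32;    // 32-byte Ballast objects
        System.out.println(region);             // 524288
        System.out.println(objectsPerRegion);   // 16384
    }
}
```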

We'll do two runs. Both will allocate 32768 Ballast objects. One run will set references from the first quarter to the second quarter and from the third quarter to the fourth (and back, for each pair) - always within a single region. The other run will set references from the first half to the second half (and back) - always crossing a region boundary. We expect the first run to have much better throughput even though the number of allocations and reference sets will be exactly the same. All is much better explained if you read the code.

A note about the Ballast objects: besides the references, they also contain a long field. The reason for that is twofold - one is padding to get the object to an even 32 bytes, and the other is the plot twist at the end of the blog post.
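Since the original source is only linked, not reproduced here, the following is a hypothetical reconstruction of the Ballast class and the two wiring patterns. The class name and the two fields come from the post; the method names and details are assumptions, and the layout relies on allocation order mapping to region placement as the benchmark does.

```java
// Hypothetical reconstruction of the benchmark's object layout.
public class Ballast {
    Ballast ref;   // the reference whose store triggers the write barrier
    long payload;  // pads the object to 32 bytes; reused in the later test

    // Same-region wiring: pair quarter 1 with quarter 2, and quarter 3 with
    // quarter 4, so every reference stays inside one region.
    static void wireSameRegion(Ballast[] objs) {
        int q = objs.length / 4;
        for (int i = 0; i < q; i++) {
            objs[i].ref = objs[i + q];
            objs[i + q].ref = objs[i];
            objs[i + 2 * q].ref = objs[i + 3 * q];
            objs[i + 3 * q].ref = objs[i + 2 * q];
        }
    }

    // Cross-region wiring: pair the first half with the second half, so
    // every reference crosses the region boundary.
    static void wireCrossRegion(Ballast[] objs) {
        int h = objs.length / 2;
        for (int i = 0; i < h; i++) {
            objs[i].ref = objs[i + h];
            objs[i + h].ref = objs[i];
        }
    }

    public static void main(String[] args) {
        Ballast[] objs = new Ballast[32768]; // two regions' worth of objects
        for (int i = 0; i < objs.length; i++) objs[i] = new Ballast();
        wireSameRegion(objs);  // fast case: write barrier is a no-op
        wireCrossRegion(objs); // slow case: cards must be dirtied
    }
}
```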

That's it really. On my computer, a MacBook Pro with a 2.3 GHz Intel Core i7 CPU and 8GB of RAM, running the above code on an Oracle HotSpot JVM version 7u45 with the command line options

-Xms1024m -Xmx1024m -XX:+UseG1GC

I get the following results:


Setting same region references took 80640ms

Setting cross region references took 84140ms

The time is how long it took to do 5000 repetitions of allocating 2 new regions' worth of objects and setting the references accordingly.

Results

What this shows is that there is little difference between the two methods of setting references. Surprising, no? On to the source code then, to try and understand why that happens. As it turns out, the cost of marking each card as dirty is there, but it is actually pretty small - it amounts to an enqueue operation when the card is dirtied, and that queue is processed later, during collection. Both operations cost relatively little, and this cost is also split between reference-set time and collection time. Alternatively, one might say that the cost is dominated by the write barrier check itself rather than the work it triggers.

An extra step and a twist

The write barrier cost - I wonder how much that is. What should we compare it against? Well, the minimal cost I could think of was that of setting an integer or a long. Since we already have a long field in our Ballast objects, we can use that. So we'll alter the test: instead of setting references in two different ways, one run will set the reference to a fixed, known object, and the other will set the long field to a given value. The new source code is here.

On the same setup as the previous experiment (changing the GC implementation every time), I get the following numbers:

G1 (-XX:+UseG1GC)

Setting the long value took 79601ms
Setting the reference took 90197ms


ParNew (no arguments)

Setting the long value took 85628ms 
Setting the reference took 104894ms

CMS (-XX:+UseConcMarkSweepGC)
Setting the long value took 91407ms
Setting the reference took 108954ms

The difference in cost lies strictly with the write barrier, which is of course not triggered when storing a long. An interesting note here: if you set the long field and the reference field in Ballast to be volatile, both actions take about the same time - roughly twice the time of setting the plain long value as shown above - demonstrating the cost of volatile variables.
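The volatile variant presumably amounts to no more than changing the field declarations; the sketch below is my assumption of that change, showing the two kinds of stores being compared (a volatile store adds a memory fence on top of any GC write barrier).

```java
// Assumed volatile variant of the Ballast class; names are hypothetical.
public class VolatileBallast {
    volatile long payload;         // volatile long store: fence, no card mark
    volatile VolatileBallast ref;  // volatile ref store: fence + write barrier

    public static void main(String[] args) {
        VolatileBallast a = new VolatileBallast();
        VolatileBallast target = new VolatileBallast();
        a.payload = 42L;  // the "setting the long value" case
        a.ref = target;   // the "setting the reference" case
        System.out.println(a.payload);        // 42
        System.out.println(a.ref == target);  // true
    }
}
```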

Closing remarks and future work

While the first result is a negative one when it comes to the driving assumption, I still wanted to discuss it, for two main reasons. One is the educational content, demonstrating in a hands-on, high-level way how G1 operates. The other is that negative results are still results, and it's noteworthy that we can ignore the placement of objects when it comes to single-threaded programs. This last part is quite important, and it's going to be the next piece of work I'll undertake in this track. In particular, how does the card marking affect multithreaded programs, and how does the no-op barrier for same-region assignment in the case of G1 fare under such conditions?
As for the long vs reference setting comparison, its explanation is quite easy, but a way of taking direct advantage of the fact is not obvious. If, however, you are familiar with off-heap memory management, you will immediately see a reason why performance in such scenarios is substantially better - being away from the control of the garbage collector does not only improve collection, it also removes the bookkeeping overhead during program runtime. But that is a discussion for another time.



A parting word

As you can see, all source code is available, as well as the setup I used. If you think that information is not enough to recreate the results I got, please say so and I'll improve the article. And if you think I got the results wrong, say so and I'll review the methodology. There is no reason why results such as these should not be peer reviewed.


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

by Chris Gioran (noreply@blogger.com) at December 11, 2013 09:04 PM

Neo4j Blog

Recap: GraphConnect New York and London 2013

Andreas Kollegger teaching Intro to Neo4j course in NYC GraphClinician Kenny Bastani diagnosing an attendee's graph problem Peter Olson presenting how Marvel uses Neo4j to graph the Marvel Universe Dreams do come true: Jim Webber and Ian Robinson thoroughly enjoyed having a Tardis at GraphConnect London Ian Robinson teaching Data Modelling and Import in London Sebastian Verhueghe of

by AdamH (noreply@blogger.com) at November 26, 2013 08:41 PM

Neo4j Blog

Neo4j 2.0.0-RC1 – Final preparations

WARNING: This release is not compatible with earlier 2.0.0 milestones. See details below. The next major version of Neo4j has been under development for almost a year now, methodically elaborated and refined into a solid foundation. Neo4j 2.0 is now feature-complete. We're pleased to announce the first Release Candidate build is available today. With that feature-completeness in mind,

by Andreas Kollegger (noreply@blogger.com) at November 21, 2013 11:58 AM

Neo4j Blog

Why Graph Databases are the best tool for handling connected data like in Diaspora

Handling connected domains with the “right tool for the job” Michael Hunger Sarah Mei recently wrote a great blog post describing the problems she and her colleagues ran into when managing highly connected data using document databases. Document databases (like other aggregate-oriented databases) do a good job at storing a single representation of an aggregate entity but struggle to

by Michael Hunger (noreply@blogger.com) at November 13, 2013 03:26 PM

Neo4j Blog

Musicbrainz in Neo4j - Part 1

What is MusicBrainz? Paul Tremberth Quoting Wikipedia, MusicBrainz is an “open content music database [that] was founded in response to the restrictions placed on the CDDB.(...) MusicBrainz captures information about artists, their recorded works, and the relationships between them.”  http://en.wikipedia.org/wiki/MusicBrainz Anyone can browse the database at http://musicbrainz.org/

by Peter Neubauer (noreply@blogger.com) at November 05, 2013 03:15 PM

Neo4j Blog

Recap: GraphConnect SF 2013

The Driver Writers Our fearless leader Emil Eifrem presents the new Neo4j web browser Sessions packed with attendees GraphClinicians assist attendees with their graphDB problems GraphClinician Amanda Laucher helps with a proof of concept BBQ lunch while enjoying SF's Indian summer (node_3)<-[:CONNECT]-(node_2)<-[:CONNECT]-(node_1) Wow, GraphConnect SF 2013 was awesome! We had

by AdamH (noreply@blogger.com) at October 22, 2013 08:58 PM

Neo4j Blog

Neo4j 2.0.0-M06 - Introducing Neo4j's Browser

Type in a Cypher query, hit <enter>, then watch a graph visualization unfold. Want some data? Switch to the table view and download as CSV. Neo4j's new Browser interface is a fluid developer experience, with iterative query authoring and graph visualization. Available today in Neo4j 2.0.0 Milestone 6, download now to try out this shiny new user interface. Cypher Authoring Neo4j

by Andreas Kollegger (noreply@blogger.com) at October 15, 2013 10:45 PM

Neo4j Blog

The Neo4j Driver Hackathon at GraphConnect San Francisco

Hi all, we are all still totally stoked after 2 days of intense development work at the inaugural Neo4j driver authors hackathon. Thank you all for joining us in San Francisco and the San Mateo Neo Technology office and sorry Michael Klishin that it didn't work for you to come!  Many of us saw each other for the first time, and it was a perfect kick-off to GraphConnect San Francisco on

by Peter Neubauer (noreply@blogger.com) at October 09, 2013 01:53 AM