John Wetherill, September 17, 2013
Neo4j is a powerful NoSQL database product from Neo Technology that offers significant advantages over both traditional RDBMS offerings and many newer big data products as well. This blog describes Neo4j, and discusses a handful of possible deployment scenarios for Neo4j in Stackato.
Pivotal's awesome CF Platform Cloud Foundry conference just finished, and featured many energetic and page-turning (so to speak) sessions by powerful industry thought leaders including Warner Music CTO Jonathan Murray. As one can imagine, Warner Music keeps a good amount of data (for example, their entire catalog at high fidelity), and as a result is subject to many performance, scalability, security, and reliability constraints. Given these requirements, a natural, almost instinctual reaction is to reach for an RDBMS. But in his talk Jonathan made it emphatically clear that for all of their recent and greenfield app development, they could not come up with a single use-case that required a relational database.
Not one. Hmmmm.
He also had some strong words to say about stored procedures.
These are powerful statements, and a few years ago they would have seemed ludicrous. But not any more. Database technology has advanced fast, resulting in a vast palette of mind-numbingly powerful big data or NoSQL databases and datastore products, many of which are instantly and freely available. Several of these surpass traditional RDBMS systems in performance, and have other considerable advantages.
This article will describe Neo4j, a first-class big data “graph” database by Neo Technology. Neo4j has several exciting features that lend credence to Jonathan's claims. The main focus here will be on how Neo4j fits into the cloud, using Stackato as the foundation.
Graph Databases and BigData
Relations are an important feature of RDBMS systems that give them much of their power. In contrast, several big data offerings have little explicit support for relations, including most document stores and key-value stores such as MongoDB and CouchDB.
But Neo4j is an exception: it is as relational as any RDBMS I’ve ever used, and arguably considerably more so. Graph databases place importance on the relationships between the data, not just the data itself. This statement can be justified by the fact that while RDBMS systems can certainly represent and efficiently traverse relationships, the mechanisms that do so (foreign keys, joins, join tables, highly optimized indexes, etc.) are not really “primitive” to these systems in the way that tables, rows and columns are.
But graph databases are made up of nodes (aka vertices) and relationships. In other words relationships are a fundamental part of the structure of the data, not a bolt-on. In addition, each node and relation can have properties, which are effectively key value pairs that can store almost anything, of any size and format.
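To make the model concrete, here is a minimal sketch of nodes, relationships, and properties as a toy in-memory structure. It is plain Python and not Neo4j's API; the class and method names (`PropertyGraph`, `relate`, `outgoing`) are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """A vertex with free-form key-value properties."""
    node_id: int
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed edge that carries its own properties."""
    start: Node
    end: Node
    rel_type: str
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory graph; relationships are stored as first-class records."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self.relationships: List[Relationship] = []

    def add_node(self, **properties: Any) -> Node:
        node = Node(node_id=len(self.nodes), properties=dict(properties))
        self.nodes[node.node_id] = node
        return node

    def relate(self, start: Node, rel_type: str, end: Node, **properties: Any) -> Relationship:
        rel = Relationship(start, end, rel_type, dict(properties))
        self.relationships.append(rel)
        return rel

    def outgoing(self, node: Node, rel_type: str) -> List[Node]:
        """Follow relationships of one type directly, with no join table."""
        return [r.end for r in self.relationships
                if r.start is node and r.rel_type == rel_type]


graph = PropertyGraph()
alice = graph.add_node(name="Alice")
bob = graph.add_node(name="Bob")
graph.relate(alice, "KNOWS", bob, since=2010)
friends = [n.properties["name"] for n in graph.outgoing(alice, "KNOWS")]
```

Note that the `KNOWS` relationship carries its own `since` property, which in a relational schema would typically live in a separate join table.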
Neo Technology provides an informative description of graphs (recommended reading). The following diagram, represented as a graph, shows the components of a graph database, and their relations.
It turns out graphs provide a powerful and natural representation for large classes of data, i.e. data that has relations - which is pretty much all data.
Neo4j by Neo Technology
I tip my hat to the visionary founders at Neo Technology who not only imagined all of this years ago (in the heyday of RDBMS), but also had the foresight to architect, build, and bring to market a database product based on their vision. Neo4j is gaining momentum and is rapidly becoming a formidable contender in the NoSQL landscape. Here are some of Neo4j's features and capabilities:
- Neo4j provides a fully equipped, well designed and documented REST interface with automated and predictable entity access
- It includes extensive and extensible libraries for powerful graph operations such as traversals, shortest path determination, conversions, transformations
- Neo4j is fully transactional with ACID properties
- It includes triggers, which are referred to as transaction event handlers
- Neo4j can be configured as a multi-node HA cluster with ease
- An embedded version of Neo4j is available, which runs directly in the application context
- Neo4j is fast - some of the performance stats are jaw-dropping, like in the Manning book
Neo4j has lots of other great features but I'll leave it to its creators to describe these in their very fine online documentation. I've found the Neo4j docs to be of the highest quality: very readable, informative, thoughtfully organized, and a pleasure to browse.
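As a rough illustration of the REST interface, the sketch below builds (but does not send) a request that runs one Cypher statement in a single transaction. It assumes the transactional endpoint path `/db/data/transaction/commit` from the Neo4j 2.x REST API and a server on the default local port 7474; other versions may expose a different path, and the helper name `build_cypher_request` is made up.

```python
import json
import urllib.request


def build_cypher_request(base_url: str, statement: str) -> urllib.request.Request:
    """Build (but do not send) a POST that runs one Cypher statement in one transaction."""
    payload = {"statements": [{"statement": statement}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/db/data/transaction/commit",  # assumed 2.x path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


request = build_cypher_request(
    "http://localhost:7474",  # assumed local server address
    "CREATE (n:City {name: 'Denver'}) RETURN id(n)",
)
# urllib.request.urlopen(request) would send it to a running server.
```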
Neo4j and Stackato
My main goal here is to consider how Neo4j fits in with Stackato. I've never built a production application using Neo4j so it would be a little presumptuous of me to prescribe any best practices incorporating Neo4j into cloud apps. But what I am equipped to do is consider and discuss various deployment options for Neo4j. I've come up with five possible deployment scenarios for Neo4j in Stackato. Some of these are not available today, but are still worth discussing.
Some Deployment Scenarios for Neo4j in Stackato
- External Neo4j Server: a network-accessible database server is made available directly to deployed applications. There's nothing PaaS or cloud-specific about this familiar setup and it works equally well for cloud and legacy application deployments.
- PaaS-managed External Neo4j Server: Like the previous scenario, an external database server is made available to Stackato applications. The difference here is that Stackato would manage the database instance lifecycle instead of it being managed by an external entity (usually a human or group of humans) responsible for the external db server.
- PaaS-managed Internal Neo4j Server: Like MongoDB, PostgreSQL, and MySQL today, Neo4j could be bundled with Stackato as a first-class service.
- Internal Neo4j Server as Application: Neo4j Server is a Java application that happens to provide a service. Like most Java apps it can easily be pushed to the cloud and made available to other PaaS-deployed apps via its exposed APIs.
- Embedding Neo4j in Cloud-deployed Apps: The embedded Neo4j runs in the same context as the application that's using it. Any Java or JVM-based app can embed Neo4j and have its own personal instance.
The Scenarios In Detail
External Neo4j Server
This is a common setup that was in widespread use long before the advent of BigData and PaaS.
In this scenario, an application that wants to access the external Neo4j server simply needs an access URL and credentials to authenticate to it. The specific method chosen to configure these is application-specific, but if the app is deployed in the PaaS, Stackato does offer conveniences that simplify this, including hooks, buildpacks, and access to environment variables. Frameworks like Spring also provide mechanisms to inject credentials.
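One way an app might pick up the access URL and credentials from environment variables is sketched below. The variable names (`NEO4J_URL`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`) are hypothetical, not names that Stackato or Neo4j define, and the example values are placeholders.

```python
import os
from typing import Mapping, NamedTuple


class Neo4jSettings(NamedTuple):
    url: str
    username: str
    password: str


def load_neo4j_settings(env: Mapping[str, str] = os.environ) -> Neo4jSettings:
    """Read connection details from the environment; variable names are hypothetical."""
    try:
        return Neo4jSettings(
            url=env["NEO4J_URL"],
            username=env["NEO4J_USERNAME"],
            password=env["NEO4J_PASSWORD"],
        )
    except KeyError as missing:
        # Fail early with the name of the first missing variable.
        raise RuntimeError(f"missing required setting: {missing.args[0]}") from None


# A plain dict stands in for the process environment here.
settings = load_neo4j_settings({
    "NEO4J_URL": "http://neo4j.example.com:7474",
    "NEO4J_USERNAME": "app",
    "NEO4J_PASSWORD": "secret",
})
```

Failing at startup when a setting is missing tends to be easier to diagnose than a connection error on the first query.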
An independently managed external database service like this works fine, but it brings with it several challenges and costs that can be reduced if the PaaS is allowed to manage the external service.
PaaS-managed External Neo4j Server
ActiveState recently incorporated an Oracle adapter, or wrapper, into Stackato. This wrapper provides Stackato users a single, consistent and self-service way to rapidly create and populate an Oracle database instance, wire it up to the app, migrate schema, manage its lifecycle, and provide external dbshell access. In addition this adapter allows Stackato to deal with the accounting and rationing of database resources, tasks which otherwise require considerable effort and cost.
We've had customers and partners write their own Stackato adapters for MySQL cluster and other external services. A similar adapter could be written for Neo4j.
While we at ActiveState don't currently have plans to build a Neo4j adapter, we could be motivated to do so. Meantime the CF architecture is designed so others may write and incorporate external service adapters, with a little bit of effort. If you're interested in building something like this let me know and I'll get you started and make sure you have what you need.
Bundle Neo4j as a First Class Stackato Service
The Stackato image includes several enterprise-class database products such as MongoDB, PostgreSQL, and MySQL. These are core Stackato components and as such allow rapid and consistent database provisioning and management. Stackato handles all aspects of these services, freeing the developer to focus on coding instead of mucking around with database configuration and management, or being forced to get someone else to do it. Bundling Neo4j with Stackato would provide the same benefits.
As in the previous scenario, a bundled Neo4j instance is not on our short-term roadmap. However my bacon-gobbling cohort Matthew recently went through the process of bundling up ElasticSearch, another big-data product that's built on Java. He blogged about his experiences, which would likely resemble the efforts involved to build a similar adapter for Neo4j.
Internal Neo4j Server as Application
It's trivial to spin up a Neo4j server: simply download and expand a zip file and run a script. The Neo4j server is just an application that serves things, so why not run it as a Stackato application?
It's easy to do: just download Neo4j, create a stackato.yml file, and push to Stackato. This will provision Neo4j as a Stackato application. Since Neo4j's REST API is served over HTTP, no additional harbor port services are needed to expose the Neo4j APIs.
This raises an important question. As my manager might put it, "What even does it mean to run Neo4j server as a Stackato application?" Scaling is a basic foundation of cloud apps. Stackato is a platform for cloud apps so naturally it's highly adept at rapidly scaling apps in the cloud. But how does this work with Neo4j?
In the above configuration, each Neo4j server/application is an isolated instance. There's no clustering here, so each instance would have its own data files. Additionally, according to the Neo4j docs, the data files cannot be shared across server instances. Thus each scaled-out server instance would have its own datastore. Or more succinctly, each instance holds state, a practice that can seriously interfere with scaling.
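The problem with per-instance state can be shown with a toy simulation. This sketch uses plain dictionaries in place of Neo4j datastores and a round-robin router in place of a real load balancer; all names in it are illustrative.

```python
from itertools import cycle
from typing import Dict, List


class IsolatedInstance:
    """Stands in for one scaled-out server that keeps its own private datastore."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, str] = {}


def route_writes(instances: List[IsolatedInstance], writes: Dict[str, str]) -> None:
    """Round-robin each write to the next instance, as a simple load balancer might."""
    targets = cycle(instances)
    for key, value in writes.items():
        next(targets).store[key] = value


instances = [IsolatedInstance("neo4j-0"), IsolatedInstance("neo4j-1")]
route_writes(instances, {"SFO": "San Francisco", "DEN": "Denver", "JFK": "New York"})

# Each instance saw only part of the data, so a read can miss depending on routing.
keys_per_instance = {inst.name: sorted(inst.store) for inst in instances}
```

After the three writes, neither instance holds the full data set, so the answer to a query depends on which instance happens to serve it.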
But don't let this be a deterrent: running Neo4j as a server is extremely useful during most of the SDLC where teams need to deploy to the cloud but not to production. This would include, for example, all dev and QA teams.
Embedding Neo4j in Cloud-deployed Apps
The embedded scenario seems to have similar issues. If an app that embeds Neo4j is scaled out, multiple Neo4j instances will also be created. By definition this means that the application instances are holding state, and as discussed above, scalability and statefulness should not be uttered in the same sentence. Oops!
I was close to discarding the idea of using embedded Neo4j in cloud apps when I was struck by the idea that an embedded database need not be considered just a database (with all the overhead and agony usually associated with these); it could instead be thought of as a graph data structure on steroids.
Here's a use case. Say I have a MySQL table that represents airline flights and costs between US cities, and say I want to determine the least expensive route between two given cities. There are lots of ways to do this with a relational database, but this problem is ideal for a graph, and Neo4j implements graph semantics and provides many powerful libraries for graph traversal and manipulation.
Why not load my table into a Neo4j graph (effectively an ETL) and use Neo4j's libraries to calculate the cheapest path? Used in this way the graph is just a data structure, not a database. It would not be persisted. The embedded Neo4j is lightweight enough, and fast enough to spin up, that this concept can be entertained. And given that the Java collections framework doesn't include a graph type, Neo4j with its powerful libraries seems like a good option.
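To show the shape of the computation, here is a sketch of the cheapest-route problem using Dijkstra's algorithm in plain Python. It does not use Neo4j or its graph libraries, and the flight rows and fares are invented sample data standing in for the MySQL table.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical rows as they might come back from the flights table:
# (origin, destination, cost).
FLIGHT_ROWS: List[Tuple[str, str, int]] = [
    ("SFO", "DEN", 120),
    ("SFO", "ORD", 300),
    ("DEN", "ORD", 90),
    ("ORD", "JFK", 110),
    ("DEN", "JFK", 260),
]


def build_graph(rows: List[Tuple[str, str, int]]) -> Dict[str, List[Tuple[str, int]]]:
    """The 'ETL' step: turn table rows into an adjacency list of directed flights."""
    graph: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for origin, destination, cost in rows:
        graph[origin].append((destination, cost))
    return graph


def cheapest_route(graph: Dict[str, List[Tuple[str, int]]],
                   start: str, goal: str) -> Optional[Tuple[int, List[str]]]:
    """Dijkstra's algorithm; returns (total cost, list of cities) or None if unreachable."""
    queue: List[Tuple[int, str, List[str]]] = [(0, start, [start])]
    settled = set()
    while queue:
        cost, city, path = heapq.heappop(queue)
        if city == goal:
            return cost, path
        if city in settled:
            continue  # a cheaper route to this city was already processed
        settled.add(city)
        for neighbour, fare in graph.get(city, []):
            if neighbour not in settled:
                heapq.heappush(queue, (cost + fare, neighbour, path + [neighbour]))
    return None


result = cheapest_route(build_graph(FLIGHT_ROWS), "SFO", "JFK")
```

With this sample data the cheapest route goes through Denver and Chicago, beating both of the routes with fewer stops. A graph library would replace `cheapest_route` with a built-in weighted shortest-path call.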
But there are some obvious drawbacks to this. For example if the embedded Neo4j uses the filesystem as the backing store, there will be performance issues. Also, a powerful third-party in-memory graph datastructure would presumably provide similar features without needing to include a whole database. I asked the Neo4j engineers about this and they informed me that because of this they consider embedding Neo4j in multiple app instances in the cloud as an antipattern.
However, like “best” practices, antipatterns are a general rule, and with any rule there are exceptions. There may be use-cases where embedding Neo4j in a cloud app does make sense. If you know of any I’d love to hear about them.
Cloud Foundry V2 and Neo4j
Cloud Foundry V2 will support a completely new service architecture based on buildpacks that will significantly reduce the complexity of service provisioning, and will streamline the process of adding new services to Cloud Foundry. ActiveState is committed to providing full v2 compatibility in Stackato, which will allow services like Neo4j to be easily provisioned with Stackato. Watch this space for more information as the new architecture is implemented.
Stackato and Neo4j are ideal companions for cloud application developers. I've identified five scenarios for deploying Neo4j with Stackato.
After playing with Neo4j for a week I'm now convinced that it's a first-class enterprise-ready product that's ideal for working with almost any kind of data, delivering the best of both the old-school relational world, and the newer NoSQL world. It exposes a powerful REST API, and works with all mainstream enterprise programming languages. Oh, did I mention: the Neo4j REST endpoint is transactional? Sweet!
In short, Neo4j is powerful, innovative, and fast. And no surprise, it works great with Stackato.
Try it out; you'll be happy you did.