Big Data is big these days, as more and more companies dig into their servers to find out what makes their market tick.
There is “big”, and then there is “BIG”, however. Really big data–multi-terabyte-scale–is still fairly rare. If you’re working at that scale then Hadoop MapReduce or possibly Spark is required.
Hadoop MapReduce gets its enormous power from being a fairly restrictive computing environment. Today we have alternatives like Apache Spark, which aims to give Hadoop-like distributed scalability along with a more flexible programming model. It’s a mature technology (version 2.0 was released mid-way through 2016) and it scales well, but it’s a lot more than you need for a lot of “fairly big data” problems, and those are pretty common (in fact, they are the most common).
Why Use Python Tables
Enterprises face many problems involving datasets up to a terabyte or so, which, while just “fairly big”, still require some specialized handling. Fortunately there is tables (the PyTables package), which is built on top of the HDF5 library and allows efficient processing of datasets that are far too big to fit in memory. HDF5 is a system for managing large complex datasets, and tables is a friendly interface to it. (HDF5 is also accessible from Python through the separate h5py package.)
The description of tables as a system for dealing with hierarchical data is best understood in terms of how it is tied to the underlying filesystem: the hierarchy can be viewed as a directory tree, as the examples discussed below illustrate.
Unsurprisingly, table objects are the fundamental data type for the package. They are always associated with a file, which has a pseudo-file-system within it.
Create a New Table
To create a new table, you must first open a new or existing file using tables.open_file(), and then call create_table on the returned file object to get a table object. Assume we have run import tables and from tables import * before the following example code:
newFile = tables.open_file("filename.h5", "w") # new file for writing
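The second argument to open_file is the file mode. As a minimal sketch (the scratch path and group names are made up for illustration), here is how the "w", "a" and "r" modes might be used in sequence, with a with block closing the file each time:

```python
import os
import tempfile

import tables

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")

# "w" creates the file, replacing any existing file of the same name.
with tables.open_file(path, "w") as h5:
    h5.create_group("/", "animal")

# "a" opens for reading and writing, keeping what is already there.
with tables.open_file(path, "a") as h5:
    h5.create_group("/animal", "mammal")

# "r" opens the file read-only.
with tables.open_file(path, "r") as h5:
    opened_mode = h5.mode
    has_mammal = "/animal/mammal" in h5  # path lookup inside the file

print(opened_mode, has_mammal)
```

Using the file object as a context manager means the file is closed even if something in the block raises.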
The table object has to have data added to it, and to do this we need to have at least one “group” defined. Tables is designed to work on hierarchical data which is represented by a tree of nodes. The abstract Node class has two concrete types: Group and Leaf, for nodes that have children and those that have content, respectively. Group and Leaf might also be thought of as “directory” and “file”.
newGroup = newFile.create_group("/", "animal") # top-level animal group
newerGroup = newFile.create_group(newGroup, "mammal") # mammals
... and so on…
Note the parent of a group (the first argument to create_group) can be either a path string or a group object. The same is generally true wherever the interface takes a group as an argument.
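To make the path-or-object point concrete, here is a small sketch that builds a three-group hierarchy both ways and then lists what is in the file. The file name and the extra "bird" group are invented for the example.

```python
import os
import tempfile

import tables

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "groups_demo.h5")

with tables.open_file(path, "w") as h5:
    # Parent given as a path string.
    animal = h5.create_group("/", "animal")
    # Parent given as a Group object.
    h5.create_group(animal, "mammal")
    # Parent given as a path string again, this time a nested one.
    h5.create_group("/animal", "bird")

    # Collect the full path of every group in the file, root included.
    group_paths = sorted(g._v_pathname for g in h5.walk_groups("/"))

print(group_paths)
```

The paths read exactly like directories on a filesystem, which is the analogy the package leans on.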
Before we can create a table we need to have a description of what it will contain:
class Mammal(IsDescription):
    name = StringCol(64) # 64-character String
    legs = UInt16Col() # Unsigned short integer, probably overkill
    arms = UInt8Col() # unsigned byte, still probably overkill
    temp = Float32Col() # 32-bit float for body temperature
Once a group and a description have been created, they can be used to create (finally!) a table:
mammals = newFile.create_table(newerGroup, "people", Mammal)
The intimate connection between table objects and files is the key to tables’ power. We as programmers don’t need to know–most of the time–where the boundary between in-memory and on-disk lies, because the tables framework takes care of it for us. It will even deal with zlib compressed data on disk, seamlessly compressing and decompressing as required.
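Putting the pieces so far together, the following is a minimal sketch of creating a zlib-compressed table. The Filters settings (compression level 5) and the file name are my own choices for illustration, not something the package requires.

```python
import os
import tempfile

import tables


class Mammal(tables.IsDescription):
    name = tables.StringCol(64)   # 64-byte string
    legs = tables.UInt16Col()     # unsigned 16-bit integer
    arms = tables.UInt8Col()      # unsigned 8-bit integer
    temp = tables.Float32Col()    # 32-bit float for body temperature


# Hypothetical scratch file.
path = os.path.join(tempfile.mkdtemp(), "compressed_demo.h5")

# Assumed settings: zlib at a middling compression level.
zlib_filters = tables.Filters(complevel=5, complib="zlib")

with tables.open_file(path, "w") as h5:
    animal = h5.create_group("/", "animal")
    mammal = h5.create_group(animal, "mammal")
    table = h5.create_table(mammal, "people", Mammal, filters=zlib_filters)

    column_names = sorted(table.colnames)
    complib = table.filters.complib
    n_rows = table.nrows  # nothing appended yet

print(column_names, complib, n_rows)
```

Once the filters are attached to the table, reads and writes look the same as for an uncompressed table.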
Sometimes it matters that there is a disk drive underneath our table object. We have to care about this when we have modified a table object. For efficiency reasons, changes are not automatically flushed to the disk, so we have to flush them manually by calling flush() on the table, as the next example does after appending a row.
To add data to the table we get the table’s row object (row is an attribute, not a method), set values on the row, and then call append on the row to append the values to the table. This is maybe a bit odd, but convenient: the row knows what table it belongs to, so there is no reason not to use it rather than dragging the table object around:
newPerson = mammals.row # the table's row object
newPerson['name'] = "Tom"
newPerson['legs'] = 2
newPerson['arms'] = 2
newPerson['temp'] = 37.8
newPerson.append() # add to table
mammals.flush() # write to disk
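The same row object can be reused for every record. Here is a sketch that appends a few made-up records in a loop, flushes once at the end, and then reopens the file to confirm the rows really are on disk. The records and file name are invented for the example.

```python
import os
import tempfile

import tables


class Mammal(tables.IsDescription):
    name = tables.StringCol(64)
    legs = tables.UInt16Col()
    arms = tables.UInt8Col()
    temp = tables.Float32Col()


# Made-up sample records: (name, legs, arms, temp).
records = [("Tom", 2, 2, 37.8), ("Rex", 4, 0, 38.6), ("Flipper", 0, 0, 36.9)]

path = os.path.join(tempfile.mkdtemp(), "append_demo.h5")  # hypothetical path

with tables.open_file(path, "w") as h5:
    animal = h5.create_group("/", "animal")
    mammal = h5.create_group(animal, "mammal")
    table = h5.create_table(mammal, "people", Mammal)

    row = table.row  # one row object, reused for each record
    for name, legs, arms, temp in records:
        row["name"] = name
        row["legs"] = legs
        row["arms"] = arms
        row["temp"] = temp
        row.append()
    table.flush()  # push buffered rows to disk
    rows_on_disk = table.nrows

# Reopen read-only and reach the table by attribute-style path.
with tables.open_file(path, "r") as h5:
    rows_after_reopen = h5.root.animal.mammal.people.nrows

print(rows_on_disk, rows_after_reopen)
```

Flushing once after the loop, rather than after every append, lets the framework batch the writes.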
Once a table is populated it can be iterated over with iterrows, which acts like any normal iterator:
for row in mammals.iterrows():
    ... whatever…
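As one possible stand-in for the "whatever" above, this sketch totals a numeric column and collects the names while iterating. The two sample rows are invented. Under Python 3 a StringCol value comes back as bytes, hence the decode.

```python
import os
import tempfile

import tables


class Mammal(tables.IsDescription):
    name = tables.StringCol(64)
    legs = tables.UInt16Col()
    arms = tables.UInt8Col()
    temp = tables.Float32Col()


path = os.path.join(tempfile.mkdtemp(), "iterate_demo.h5")  # hypothetical path

with tables.open_file(path, "w") as h5:
    table = h5.create_table("/", "people", Mammal)

    # Two made-up rows so there is something to iterate over.
    row = table.row
    for name, legs, arms, temp in [("Tom", 2, 2, 37.8), ("Rex", 4, 0, 38.6)]:
        row["name"], row["legs"], row["arms"], row["temp"] = name, legs, arms, temp
        row.append()
    table.flush()

    names = []
    total_legs = 0
    for r in table.iterrows():
        # Copy values out inside the loop; the row object is reused.
        names.append(r["name"].decode("ascii"))
        total_legs += int(r["legs"])

print(names, total_legs)
```

Only a buffer's worth of rows is held in memory at any moment, so the same loop works on a table far larger than RAM.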
The important thing is that because the boundary between the disk and memory is almost completely hidden behind the interface, data scientists and analysts can focus on algorithm development, and for the most part leave the data management to the framework.
While the basic interface is simple, there are a lot more advanced features for doing search and selection that allow for efficient filtering and processing of data. I’ll talk about some of those things in a future post.
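As a small taste of selection, here is a minimal sketch using the table's where method, which takes a condition string and yields only the matching rows. The sample data and the condition are made up for illustration.

```python
import os
import tempfile

import tables


class Mammal(tables.IsDescription):
    name = tables.StringCol(64)
    legs = tables.UInt16Col()
    arms = tables.UInt8Col()
    temp = tables.Float32Col()


path = os.path.join(tempfile.mkdtemp(), "where_demo.h5")  # hypothetical path

with tables.open_file(path, "w") as h5:
    table = h5.create_table("/", "people", Mammal)

    # Invented sample rows.
    row = table.row
    for name, legs, arms, temp in [
        ("Tom", 2, 2, 37.8),
        ("Rex", 4, 0, 38.6),
        ("Ann", 2, 2, 36.9),
    ]:
        row["name"], row["legs"], row["arms"], row["temp"] = name, legs, arms, temp
        row.append()
    table.flush()

    # The condition is a string over column names; only matches are yielded.
    two_legged = [
        r["name"].decode("ascii") for r in table.where("(legs == 2) & (arms == 2)")
    ]

print(two_legged)
```

The condition is evaluated by the framework as it reads the data, so the Python loop only sees the rows that match.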
The tables package and HDF5 might not scale to multi-terabyte datasets–at least on today’s hardware–but for the very large number of cases where we are dealing with Fairly Big Data they do the job extremely well.
Title image courtesy of Nick Youngson