The ZeroDB Database Hides Your Data Even From Itself
Even the server it's stored on can't see what's stashed in a ZeroBD database.
Image: Tristan Schmurr/Flickr
A fundamental problem with database security is that in order to make use of database software that software has to be able to actually access the data. Databases do more than just blindly dump things into anonymous lockboxes to be later accessed in bulk by the database owner. Their utility comes in organization, and organization requires doing things to actual data: sorting, querying, and generally making sense of complex relationships among many different data entities storing many different sorts of data. The relational algebra powering SQL-style databases is worthless without real, unencrypted data to do stuff with.
If I sound like a database fanboy, it's because I am. Data is amazing, and the seemingly simple, intuitive systems we've come up with to deal with that data are amazing too (and underappreciated). The tradeoffs are clear, however: When we put our data onto some database server—in someone else's hands—it's a target.
Enter ZeroDB. ZeroDB is a new database system described in a white-paper posted to the arXiv pre-print server on Wednesday, promising no less than "an end-to-end encrypted database that enables clients to operate on (search, sort, query, and share) encrypted data without exposing encryption keys or cleartext data to the database server." As of Thursday, it can be downloaded and installed using the Python package manager.
One might wonder how this seemingly paradoxical feat is accomplished. Well, it's a bit complicated, but there's some intuition to it.
The encrypted data is stored in its encrypted form in a type of binary search tree called a B-tree, which lives on the server. A B-tree is a way of structuring ordered data to minimize search cost/time. It looks like this:
The idea is that someone searching the tree for some element begins at the top, or root, and travels downward. Every time they hit a new node, where two or more new branches emerge, the searcher picks a branch by making a comparison: Is the desired element greater than or less than the current node? Every new node, a new decision is made. The result is that the searcher only has to cover half or so of the tree in the worst case, vs. the entire thing had the data been stored in a sequential list.
Anyhow, ZeroDB data is stored in "buckets" or containers corresponding to the tree's different nodes. You can imagine these as just segments in memory. To get at some data element, the client (whoever owns the data) asks for a particular bucket (rather than the specific data item) and the server searches through the tree for that bucket, returning it to the client who then decrypts it and does stuff with it. That's the high-level picture, anyhow.
"As a result, the server doesn't know how individual objects are organized within the B-Tree or whether they belong to an index at all," the paper explains. "Since ZeroDB is encryption agnostic, probabilistic encryption can ensure that the server cannot even compare objects for equality."
The database can use some other tricks to ensure that it's impossible or at least really, really hard to infer information about the data being stored from access patterns or the size of data records being requested by the client. For example, when a client requests a records, the server might return a bunch of other records with it, ensuring that the requests stay roughly equal in size no matter what the client is actually after.
"Since the server has no insight into the nature of the data, the risk of a server-side data breach is eliminated," the paper continues. "Even if attackers successfully infiltrate the server, they won't have access to the cleartext data."
This sort of security isn't free, of course. Each query made to the database winds up requiring several queries in total, but for applications involving not-so-big data, it winds up being not completely unreasonable. More performance-minded databases might be better off with MIT's CryptDB, which is based on a similar idea, but allows the server to have some awareness of the data relationships within, thus making data access a more simple undertaking.