Absolute security has always been easy. You just pull the disk drive, bury it in a secret location, and then burn the map. Laughing maniacally is optional. The data is secure.
Working with data without letting it leak out has always been much harder. Over the decades, mathematicians and computer scientists have been creating better, faster, and more elaborate algorithms that balance the need to keep data secure while making it available to answer questions. Some of the newest algorithms are changing the field by making it possible to accomplish jobs and deliver reports that were once impossible.
There are limits, of course. Attackers can still exploit holes you never imagined in layers of the operating system that you don’t control. The only truly secure approach is still that hole in the ground. But we’re reaching a point where some savvy math, mixed with good basic patch maintenance, can make it possible to keep the data online without letting it leak out.
Here are three techniques that are gaining traction and making it possible for the enterprise to keep its cake in a digital vault while still sneaking a byte now and then.
The traditional approach to maintaining secrecy is to keep a perfect copy of the data in a highly secure location. The idea behind differential privacy is that imperfect copies can be just as useful for computing aggregated statistics such as average age. But anyone who pokes around inside the imperfect copies won’t find personal information, because it’s all been tweaked, morphed, or otherwise altered in just the right way.
There are several algorithms that deliver on this promise by adding noise to the numbers. An age might be shifted by a random few years. A gender might be flipped 25% of the time. The theory explains just how to do this so the overall statistics remain close to the same. The US Census Bureau is exploring using this approach so it can release detailed information immediately instead of making the world wait the 72 years required by statute.
An important part of the theory is the idea of a “privacy budget,” which is sometimes assigned to the Greek letter epsilon. The algorithm designers might add more noise to some fields and less noise to others. The theory helps explain how to split up the budget so that the overall data remains secure.
The tricky part is that there’s no obvious value for epsilon. One Census Bureau scientist punted on this question and tossed it over to the political leadership, calling it a “policy question.” Researchers are still trying to understand the best way to choose such a value for businesses, agencies, and other enterprises.
Microsoft and Google are just two of the organizations that are sharing open-source tool kits that you can apply to your data today. There are also larger academic projects such as the OpenDP that are building larger collections of algorithms designed to work together.
Best for: Sharing large datasets where a little bit of imprecision isn’t catastrophic.
Not for: Datasets where the effect of noise is unknown or sensitive. Some AI learning algorithms, for instance, can pick up on small changes in unpredictable ways.
The weak point with encrypted data comes in that moment when you decode the data to analyze it. Any hole in the operating system can be exploited. Homomorphic encryption algorithms are designed to make it possible to run regular computer algorithms without ever decrypting the data. That’s ideal for situations that involve renting hardware from people whom you don’t trust.
These algorithms have been evolving rapidly over the last decade. The earliest solutions are just modified versions of the earliest public key algorithms, such as Rivest–Shamir–Adleman (RSA) and Diffie–Hellman (DH). They limit the range of operations that you might do. Some, for instance, make it possible to add value for accounting, but you can’t multiply them. Others do the reverse. (See chapters 14 and 16 of my book, Translucent Databases, for simple examples.)
The newest algorithms are much more capable. Some use the term “fully homomorphic encryption” to mean algorithms that allow arbitrary Boolean circuits and flowcharts. In general, the complexity of the process increases rapidly with the computation you want to do to the encrypted data. More complicated software needs more complicated encryption. The simplest solutions may be feasible, but more elaborate flowcharts may require renting half of the cloud.
IBM, one of the pioneers of the approach, is distributing tool kits for Linux, iOS, and macOS developers.
Best for: Very sensitive information that must be analyzed with an algorithm.
Not for: Some complex algorithms that are much too inefficient and slow with current techniques.
The standard digital signature is pretty simple. It just offers one bit of information, namely an assurance that someone stands behind some block of data and says it’s valid. SNARKs are just one of the techniques that offer more complex answers. There are dozens of similar approaches with names such as “STARK,” “Hyrax,” “ZKBoo,” “Bulletproof,” and “Pinocchio,” but SNARKs may have the best name, at least for the modern, social media-addled zeitgeist.
The name is short for “succinct non-interactive argument of knowledge.” The “succinct” part means that the package of bits is small, or at least practical in size. Some may only be several hundred bytes. The “non-interactive” part means that anyone can double-check the accuracy without setting up an Internet connection to the original creator (using a public key and the supporting infrastructure). The “argument of knowledge” part means the bundle of bits carries a longer message than a digital signature.
SNARKs and their cousins are finding homes in a number of cryptographic protocols. They can guard a certain bit of knowledge and prove it’s accurate without revealing it. A digital driver’s license SNARK may prove someone is old enough to drink alcohol without revealing the birth date or the age. These digital, non-interactive proofs are a big part of many of the blockchains used to track the ownership of some cryptocurrencies.
The proof protocols have not been studied long enough to lead to a government-blessed standard such as AES. Some libraries include jsnark (Java), libsnark (C++), and Ginger-lib.
Best for: Complex forms of authentication and assurance. Think of any kind of licensing or digital contracts. These are often found in many cryptocurrency blockchains.
Not for: Deep sharing or interactive algorithms.
Extend the lock and key ecosystem
The key contribution of the technologies, so to speak, is that they extend the old lock and key metaphor. They add functionality and semantics with new features. Differential privacy doesn’t even lock up data at all, but just uses noise to obscure the important details. Homomorphic encryption does more than just lock away data with a key; it locks it away so it can be blindly manipulated.
These three techniques are powerful in their own right, but they can also be combined with other options or with each other to create more privacy and defend against more attacks. SNARKs, for instance, are being added to some blockchains to add more anonymity and security to some cryptocurrencies and tokens. ZCash uses a version of SNARKs that makes it possible for users to spend without revealing more about their identity.
Thinking about the algorithms a bit differently and escaping old metaphors of lock and key make it possible to solve new problems. The architecture isn’t just a bunch of lockboxes and safes. It’s a pipeline for answering the right kind of queries without leaking information.