The Need for Safe and Efficient Sensitive Data Storage
In 1990, the Human Genome Project formally began. Taking over 10 years to complete and costing almost 3 billion US dollars, this international effort was deemed a colossal accomplishment. Now, in 2010, a company charges $300,000 to sequence an individual’s entire genome, a mere fraction of the original cost. “In five to ten years,” speculated Professor Perry Miller, Director of the Yale Center for Medical Informatics, “it will [cost] less than $1,000 to sequence a person’s genome.”
If Miller’s prediction is accurate, it might soon become feasible and beneficial for society to have everyone’s genome sequenced and made public. Doing so would make it far easier to study the genetic foundations of disease and to develop a crime database much more accurate than fingerprinting. There would also, however, be major ethical and privacy concerns. Such sensitive data would have to be stored efficiently and securely, yet storing a single human genome on a computer without compression takes about 750 megabytes, roughly one CD. A database covering the United States alone would hold well over 200 million gigabytes of information, even without the metadata (name, address, etc.) that would have to accompany it in order to make it useful.
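The storage figures above can be checked with a back-of-the-envelope calculation. The sketch below is only illustrative and rests on assumptions that are not stated in the article: a genome of roughly 3 billion base pairs, 2 bits per base (enough to distinguish A, C, G, and T), and a US population of roughly 300 million.

```python
# Rough storage estimate; every constant here is an assumption for illustration.
BASE_PAIRS_PER_GENOME = 3_000_000_000  # approx. 3 billion base pairs (assumed)
BITS_PER_BASE = 2                      # A, C, G, T fit in 2 bits (assumed encoding)
US_POPULATION = 300_000_000            # rough 2010 population (assumed)


def genome_megabytes(base_pairs=BASE_PAIRS_PER_GENOME, bits_per_base=BITS_PER_BASE):
    """Uncompressed size of one genome in megabytes (1 MB = 10**6 bytes)."""
    return base_pairs * bits_per_base / 8 / 10**6


def population_gigabytes(population=US_POPULATION):
    """Uncompressed size for a whole population in gigabytes (1 GB = 10**3 MB)."""
    return population * genome_megabytes() / 10**3


if __name__ == "__main__":
    print(f"One genome: {genome_megabytes():.0f} MB")
    print(f"Whole population: {population_gigabytes() / 10**6:.0f} million GB")
```

Under these assumptions one genome comes to 750 MB and the national total to about 225 million GB, which is consistent with the figures quoted above.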
This is an example of the kind of challenge that the Privacy, Obligations, and Rights in Technologies of Information Assessment (PORTIA) project, conducted by Yale, Stanford, NYU, Rutgers, and the University of New Mexico, is trying to address. The PORTIA team looks at how large amounts of sensitive data can be stored using minimal space, searched with the greatest possible efficiency, and handled ethically.
PORTIA and Sensitive Data
The project started in 2003 as a 5-year collaboration among 5 universities, funded by a National Science Foundation grant with a total budget of $12.5M (of which Yale’s share was almost $4M). The project was ultimately extended to 7 years, ending this fall. As the project progressed, it became evident that “legal and social questions are pushed to [the] fore by the internet and the central role of computers and networks in today’s daily life,” as described by Joan Feigenbaum in a recent interview. While computer science plays a key role in answering such questions, it isn’t enough. Thus, PORTIA incorporates lawyers and professors from other fields who also produce and use large amounts of sensitive data.
PORTIA predominantly focuses on the emerging use of networks to transmit sensitive information. For instance, PORTIA sought to differentiate “sensitive” information from “private” information. “Sensitive” data was defined as any information that would be injurious to a certain party if revealed; for example, improper use, censorship, or corruption of copyrighted works intended for distribution is harmful to the products’ creator. With increasing frequency, people are willing to share such information over networks with the third parties that need to collect it. Because it is increasingly difficult to construct client-side defenses that prevent information from being stolen, however, tactics such as private browsing, encryption, and password protection are needed. As access to sensitive information is getting easier in general, Miller predicts that DNA sequencing, for example, “will be so easy to do that it will be hard to prevent it from being done.” Developing mechanisms to keep people accountable will be essential.
Transmission of Anonymous Data
One increasingly important issue addressed by PORTIA is the transmission of anonymous data for use in joint computations. The easiest way to do this is to transfer all of the data to a trusted party, but that party could then misuse it. Protocols for Secure Multiparty Function Evaluation (SMFE) have been created to bypass the concern of data misuse by a single party. For example, if a salary survey needs to be conducted in which individuals want to keep their salaries anonymous, an SMFE protocol can be used. Unfortunately, SMFE is impractical when large numbers of people contribute to a data set, because too many simultaneous communications would be necessary.
The new approach is to divide the participants into input providers and computational agents. Each input provider (P_i) splits its input (x_i) into shares and sends each share to one of the computational agents. Once all of the data arrive, the computational agents run the same communication-intensive strategy as the previous SMFE among themselves, creating a hybrid system that is both feasible and private. This reduces not only the need for all participants to take part simultaneously but also the number of computers on which SMFE protocols have to be installed and maintained, both limiting cost and increasing feasibility. The scheme can run with as few as two computational agents. As long as semi-honest adversaries do not control all of the computational agents, no information about the inputs can be divulged.
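The idea of splitting an input into shares can be illustrated with additive secret sharing, a common technique in which the shares look random on their own but sum to the original value. The sketch below is an assumption about how such a scheme might look, not PORTIA's actual protocol: the function names and the modulus are hypothetical, and the joint computation is simplified to a plain sum (as in a salary survey) so that the agents only need to add up the shares they receive.

```python
import secrets

# Hypothetical modulus; it only needs to exceed any possible total.
MODULUS = 2**61 - 1


def split_into_shares(x, num_agents=2, modulus=MODULUS):
    """Split x into num_agents random-looking shares that sum to x mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(num_agents - 1)]
    last = (x - sum(shares)) % modulus  # chosen so that all shares sum to x
    return shares + [last]


def secure_sum(inputs, num_agents=2, modulus=MODULUS):
    """Sum the inputs while each agent sees only one share of each input."""
    agent_totals = [0] * num_agents
    for x in inputs:  # each input provider shares its own value
        shares = split_into_shares(x, num_agents, modulus)
        for agent, share in enumerate(shares):
            agent_totals[agent] = (agent_totals[agent] + share) % modulus
    # Only the per-agent partial totals are combined to reveal the result.
    return sum(agent_totals) % modulus


if __name__ == "__main__":
    salaries = [52_000, 61_500, 48_250]  # made-up example inputs
    total = secure_sum(salaries)
    print("Total:", total, "Average:", total / len(salaries))
```

In this simplified setting, an agent holding only its own shares learns nothing about any individual salary; the inputs could be reconstructed only if all of the agents pooled their shares, which matches the condition that adversaries must not control every computational agent.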
PORTIA designed an interesting application of the sorting network technique. A sorting network consists of wires, each carrying a value, connected by comparators. A comparator receives two input values and emits them on two wires so that the higher value is on the bottom wire and the lower value is on the upper wire. The efficiency of a sorting network can be measured either by its size (the total number of comparators used) or by its depth (the maximum number of comparators a value passes through on its way from input to output). In this particular project, the sorting network was used to generate an anonymous salary list.
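A small example may make the comparator, size, and depth definitions concrete. The sketch below uses a standard 4-wire sorting network; it operates on plain values and is not the privacy-preserving version used in the project, and all function names are chosen here for illustration.

```python
import itertools

# A standard 4-wire sorting network, given as (upper wire, lower wire) pairs.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def apply_network(values, network=NETWORK_4):
    """Run the values through each comparator in order and return the wires."""
    wires = list(values)
    for upper, lower in network:
        # A comparator puts the lower value on the upper wire.
        if wires[upper] > wires[lower]:
            wires[upper], wires[lower] = wires[lower], wires[upper]
    return wires


def size(network):
    """Total number of comparators."""
    return len(network)


def depth(network, num_wires):
    """Length of the longest chain of comparators a value can pass through."""
    level = [0] * num_wires
    for upper, lower in network:
        level[upper] = level[lower] = max(level[upper], level[lower]) + 1
    return max(level, default=0)


def sorts_everything(network, num_wires):
    """Zero-one principle: a network that sorts all 0/1 inputs sorts all inputs."""
    return all(
        apply_network(bits, network) == sorted(bits)
        for bits in itertools.product([0, 1], repeat=num_wires)
    )


if __name__ == "__main__":
    print(apply_network([61_500, 48_250, 75_000, 52_000]))
    print("size:", size(NETWORK_4), "depth:", depth(NETWORK_4, 4))
```

Because the sequence of comparisons is fixed in advance and does not depend on the data, the same network can in principle be evaluated comparator by comparator on hidden values, which is presumably what makes the technique attractive for producing an anonymous sorted list.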
The Privacy Issue
One of the primary issues with creating anonymous systems is the inability to perform data cleaning. If problems are identified within the data, there is no way for a central authority to notify the supplier of the data. And as no one sees the entire data set, it is difficult for humans to judge the plausibility of the data by sight. The proposed solution is to provide the computational agents with a program that essentially runs a sanity check on the data. Can this be done? Can it be trusted?
Privacy is really the heart of the issue. Hardly anything is a true secret anymore. As Feigenbaum made clear, there is “less and less inaccessible information. The question is what is the cost of access. And the truth is the cost is decreasing.” To counter that, computer science techniques must be coupled with the establishment of legal standards. Even though the project is ending later this year, the work that PORTIA has done will be fundamental in the years to come. These issues will become ever more pressing as we increasingly expose ourselves via networks and look to store more information on the Internet.
About the Author:
Matthew Chalkley is a freshman in Davenport College. He is originally from South Jersey and is planning on majoring in chemistry. He also volunteers at YNNH and is involved with the Roosevelt Institute.
The author would like to thank Professor Perry Miller for his insight into the PORTIA project. He would also like to extend particular thanks to Professor Feigenbaum for her enthusiasm and guidance in researching the project.
A. C. Yao, How to generate and exchange secrets, Proc. of the 27th Symposium on Foundations of Computer Science (FOCS), IEEE, 1986, pp. 162–167.
A. Stubblefield and D. S. Wallach. “Dagster: Censorship-resistant publishing without replication.” Technical Report TR01-380, Rice University, 2001.
D. Malkhi, N. Nisan, B. Pinkas, and Y. Sella, Fairplay: A Secure Two-Party Computation System, Proc. of the 13th Symposium on Security, Usenix, 2004, pp. 287–302.
D. Mazieres and M. Waldman. “Tangler: A censorship-resistant publishing system based on document entanglements.” In Proceedings of the 8th ACM Conference on Computer and Communications Security, pp. 126–135, 2001.
M. Naor, B. Pinkas and R. Sumner, Privacy Preserving Auctions and Mechanism Design, Proc. of the 1st Conference on Electronic Commerce (EC), ACM, 1999, pp. 129–139.