Reinventing cybersecurity
“If a service is free, you are the product,” according to conventional wisdom about websites, apps and online services. That is, your personal data is the product. The truth is, whether or not we’re paying for services, these cyber-businesses are tracking what we buy, where we go, what we’re looking at and how long we look. And depending on the language behind that “I agree” button most of us click to get on with our tasks, they are likely selling our data to other businesses. Numerous large-scale hacks — affecting millions and even billions of users — have shown that we cannot trust these companies to keep our data safe.
The good news is that, as of January 1, 2020, the California Consumer Privacy Act joined other recent data protection laws in granting consumers new rights to know how our data is used, to have it deleted and to opt out of having it sold. Businesses are also now legally bound to protect our data throughout its life cycle, though many lack the technical capability to do so effectively.
But all these new responsibilities, and liabilities, add to the likelihood that companies will simply lock up their private data, storing away potential value. Analyzing customer data can support business growth, for example, and collaborations can address societal problems: travel data from ride-sharing companies can inform systemwide transportation improvements, and medical data can be studied to compare treatment outcomes across large populations. So while the new legal landscape brings much-needed privacy protections, it also walls off a lot of potential.
“Data has fueled the modern-day economy,” says electrical engineering and computer sciences professor Dawn Song (Ph.D.’02 EECS). Song made MIT Technology Review’s “35 Innovators Under 35” list in 2009 for her advanced approaches to securing private data from attacks, and a year later was named a MacArthur Fellow. She says data informs business decisions that help grow the economy and can support discoveries that make society safer and healthier. She can articulate better than most of us why protecting sensitive data matters. “But if we silo the data to keep it secure,” she says, “we don’t get any benefit from it.”
To access those benefits while protecting users’ personal data, Song thinks we need to move beyond the ad hoc approach of simply adding security patches. Instead, she is reimagining what online security looks like, proposing a new paradigm she calls a “platform for a responsible data economy.”
What does that mean, exactly? She ticks off several foundational ideas: We own our data; we get to specify how our data may be used; and our data is tracked throughout its life cycle, so how it’s used is transparent.
Her research lab is developing security and privacy solutions based on these ideas. In 2018, she co-founded Oasis Labs, where she is chief executive, to commercialize this work. Oasis is building a secure platform to store, track and compute on private data, what Song calls “controlled use.” The publicity materials for Oasis Labs put her goals more simply: She is building a “better internet” — one that can access data’s value without compromising privacy. The work has earned her spots on Wired and Inc. magazines’ top innovator lists.
A multifaceted defense
Song recently identified one particularly alarming privacy vulnerability: leaky programs. Language-prediction technology is everywhere now, deploying artificial intelligence to suggest our likely next words in texts, emails and search terms. These machine-learning models work by gobbling up huge quantities of data — some of it private, like emails and texts — to learn common speech patterns, then using that growing intelligence to predict what we’re likely to say next.
The technology has brought some gentle relief to busy people and tired thumbs, but Song found that these language models have a nasty habit: They memorize the source data they learned from.
“We show that unintended memorization is a persistent, hard-to-avoid issue that can have serious consequences,” reads the provocatively named 2018 study, “The Secret Sharer,” which she co-authored with researchers at Google Brain. Those consequences? Savvy attackers can prompt the model with partial sequences, such as the first digits of a credit card or Social Security number, then use its predictions to recover the memorized remainder.
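To make the attack concrete, here is a minimal sketch of the probe pattern “The Secret Sharer” describes, written against a hypothetical sequence_log_prob scoring function rather than any real model’s API: score candidate completions of a known prefix and look for the one the model finds anomalously likely.

```python
# Sketch of a memorization probe in the spirit of "The Secret Sharer."
# sequence_log_prob is a hypothetical stand-in for a real language
# model's scoring function; the attack pattern, not the model, is the point.
import itertools

def sequence_log_prob(text: str) -> float:
    """Hypothetical: the model's log-probability of generating text."""
    raise NotImplementedError("plug in a real language model here")

def rank_candidate_secrets(prefix: str, digit_len: int, top_k: int = 5):
    """Score every digit completion of a known prefix, best first.

    If the model memorized a training sentence like "my PIN is 8352",
    the true digits score far above chance; that anomaly leaks the secret.
    (For long secrets, the paper prunes this brute-force enumeration
    with a beam-style search.)
    """
    candidates = (
        "".join(digits)
        for digits in itertools.product("0123456789", repeat=digit_len)
    )
    scored = [(sequence_log_prob(prefix + c), c) for c in candidates]
    return sorted(scored, reverse=True)[:top_k]

# An attacker might probe: rank_candidate_secrets("my PIN is ", digit_len=4)
```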
Protecting data from this and other security vulnerabilities throughout its life cycle requires multiple lines of defense. Song’s platform delivers a one-two-three punch.
To train language programs, the platform uses machine-learning models that inject carefully calibrated statistical noise into their training, an approach called “differential privacy” that keeps nefarious queries from extracting sensitive information. In addition, it runs data computations in a secured environment that combines trusted hardware with cryptographic techniques. And for record-keeping, it uses a distributed ledger built on blockchain, the same decentralized technology behind cryptocurrency, in which no single individual has control.
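A minimal sketch of the first of those ingredients may help. The code below assumes nothing about Oasis Labs’ actual implementation; it shows the core differential-privacy move: answer a counting query with calibrated random noise added, so the presence or absence of any one person barely changes what a querier sees.

```python
# Minimal differential-privacy sketch (illustrative, not Oasis Labs code):
# answer a counting query with Laplace noise calibrated to the query's
# sensitivity, so no single record can be inferred from the answer.
import random

def dp_count(records, predicate, epsilon: float = 0.5) -> float:
    """Noisy count of records matching predicate.

    A count changes by at most 1 when one person's record is added or
    removed (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy; smaller epsilon means more noise and
    stronger privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two Exponential(epsilon) draws is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# e.g. dp_count(patients, lambda p: p["diagnosis"] == "diabetes")
```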
Together, these innovations create a platform where organizations — like hospitals or ride-share companies — can share data to gain meaningful insights, without revealing an individual’s personal information. This “blindfolded” secure-computing process, called “secure collaborative learning,” can safely tap into data’s potential.
Much of Song’s work is open source, a free-access approach to programming that she says will speed the larger paradigm shift and help make the system more robust.
Joining the data economy
A cornerstone of Song’s vision is establishing data rights that align with basic property rights, allowing individuals to garner monetary value from their data. Does this mean you’ll be able to rent out your 23andMe results like Airbnb and Uber let you unlock value from your home and vehicle? It’s not so simple, she explains.
“People talk about data like it’s the new oil. But data is actually crude oil. You need to process it to turn it into something that’s useful and valuable.”
That processing will require a whole new set of rules. She and colleagues are developing them, drawing on game theory to determine exactly how much a given piece of data is worth and what transactions in it will look like. To train health-related models, for example, data pertaining to a rare disease would likely be worth more than a ubiquitous data point like a common blood type.
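The article doesn’t name the game-theory principle, but the standard choice in data-valuation research is the Shapley value, which pays each data point its average marginal contribution to model quality. Here is a brute-force sketch, with a hypothetical utility function standing in for “accuracy of a model trained on this subset”:

```python
# Exact Shapley-value data valuation (illustrative sketch).
import itertools
from math import factorial

def shapley_values(data_ids, utility):
    """Shapley value of each data point.

    Exponential-time, so only feasible for a handful of points; research
    systems approximate it by sampling. utility(subset) is assumed to
    return, e.g., the accuracy of a model trained on that subset.
    """
    n = len(data_ids)
    values = {d: 0.0 for d in data_ids}
    for d in data_ids:
        others = [x for x in data_ids if x != d]
        for k in range(n):
            for subset in itertools.combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                marginal = utility(set(subset) | {d}) - utility(set(subset))
                values[d] += weight * marginal
    return values

# e.g. shapley_values(["alice", "bob", "carol"], utility=my_model_accuracy)
# where my_model_accuracy is a hypothetical train-and-score routine.
```

Under a scheme like this, a rare-disease record shows a large marginal contribution, and hence a high price, while a common blood-type record adds little a model couldn’t learn elsewhere.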
Song’s vision for a better internet is gaining traction. Last year, she advised the U.S. chief technology officer, and her research team’s new approach to verifying differential privacy won the 2019 Distinguished Paper Award at OOPSLA, an influential programming languages conference.
“Both users and companies are suffering, but with our platform for a responsible data economy, we want to bring them into a win-win situation,” Song says. Users can be assured that their data is safe and potentially even monetize it, and businesses can comply with their security and transparency obligations while realizing benefits from their data. “But,” she adds, “in a privacy-preserving and responsible way.”
‘Sharing without showing’
Electrical engineering and computer sciences assistant professor Raluca Ada Popa also believes that society can reap great benefits from private data, if only we can learn how to use it securely. Popa, whose many honors for her work in secure cloud computing include a spot on MIT Technology Review’s 2019 “35 Innovators Under 35” list, focuses on secure collaborative learning, helping organizations that hold a lot of sensitive data unlock its potential to address big, important questions.
“We would love to learn from all this data,” she says. What’s the best cancer treatment across all hospitals, she wonders, and what are the indicators for a particular diagnosis? Illegal operations like drug dealing and human trafficking launder money by making small deposits across numerous banks and geographical locations, but banking data is, necessarily, private. “If these institutions could only put all this information together, they would learn a lot from it. They could produce much better cancer treatments, much better models for predicting money laundering and so on.”
Encryption has been effective in securing data while it’s in transit to the cloud or resting in cloud storage, but actually computing on sensitive cloud-based data has proven challenging — simply building firewalls around the cloud has not kept hackers out. Popa’s work overcomes this hurdle by providing organizations a way to share and compute on their sensitive data without ever decrypting it.
“Essentially it’s sharing without showing,” Popa says. Secure collaborative learning has been around for three decades, she adds, but it’s been too slow for practical applications. “My work makes it practical. For example, in our paper on Helen [Popa’s encrypted machine-learning training system], we made training 1,000 times faster than existing technology. So instead of training a model in three months, it takes us under three hours.”
Popa’s technology is already helping organizations learn from shared data. An anti-money laundering pilot project with Canadian banks is just one of several trials that have proven the technology’s potential and moved it a step closer to adoption. Secure auditing is another use case she’s studying. Regulators can audit specific, agreed-upon activities at a country’s nuclear power plants, for example, without violating private activities outside the scope of international agreements.
Collaborating securely
To begin a collaboration, organizations agree to run an algorithm on their collective data based on a specific shared goal. For example, to detect potential money laundering, banks may look for a single individual making deposits at different geographic locations at the same time. They agree on the data categories they need to include, such as customer IDs, deposit amounts and locations. They encrypt their data with Popa’s systems, and then what she calls “the magic” happens: The banks run the algorithm on their joint data without ever decrypting it.
Finally, only the output is decrypted — in this case, suspicious accounts. “This is very powerful because banks don’t have to share all their customer data with each other. They only share the results of their search,” Popa says.
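In the clear, the logic the banks jointly run might look like the sketch below (the field names and one-hour window are illustrative); the point of Popa’s systems is that this same computation happens over encrypted inputs, with only the returned set ever decrypted.

```python
# What the banks' joint algorithm computes, sketched in plaintext for
# readability. In Popa's systems the equivalent logic runs over
# encrypted data; this version only illustrates the query itself.
from collections import defaultdict
from datetime import timedelta

def flag_structuring(deposits, window=timedelta(hours=1)):
    """Flag customers who deposit at different locations within window.

    deposits pools records from all participating banks as tuples of
    (customer_id, location, timestamp).
    """
    by_customer = defaultdict(list)
    for customer_id, location, ts in deposits:
        by_customer[customer_id].append((ts, location))

    suspicious = set()
    for customer_id, events in by_customer.items():
        events.sort()  # chronological order
        for (t1, loc1), (t2, loc2) in zip(events, events[1:]):
            if loc1 != loc2 and (t2 - t1) <= window:
                # Same person, two places, within the window: flag it.
                suspicious.add(customer_id)
    return suspicious
```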
Decryption in her software works like nuclear launch codes, adding another layer of security against inadvertent leaks and human frailties. Using blockchain’s decentralized authorization process, the system gives each collaborator just one piece of the key. To decrypt a computation result, all the pieces must come together, so even several corrupted or careless parties cannot unlock the data on their own.
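The key-splitting idea can be illustrated with the simplest possible scheme, XOR secret sharing, in which every share is required and each share looks like random noise on its own. (Production systems typically use threshold schemes such as Shamir’s, which also tolerate absent parties; this sketch is not Popa’s actual protocol.)

```python
# Minimal split-key illustration: XOR secret sharing. Any subset of
# shares short of all of them reveals nothing about the key.
import secrets
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_key(key: bytes, n_parties: int) -> list[bytes]:
    """Split key into n shares whose XOR reconstructs it."""
    shares = [secrets.token_bytes(len(key)) for _ in range(n_parties - 1)]
    final = reduce(xor_bytes, shares, key)  # key ^ s1 ^ ... ^ s_{n-1}
    return shares + [final]

def recombine(shares: list[bytes]) -> bytes:
    """All shares must come together to recover the key."""
    return reduce(xor_bytes, shares)

# e.g. give each of three banks one element of split_key(aes_key, 3);
# recombine() only yields aes_key when all three contribute.
```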
Popa’s company, PreVeil, makes this Berkeley-developed technology available for email and file sharing — applications that perform like Gmail and Dropbox. She has customers in security-critical fields like aerospace, defense and biotechnology.
But Popa’s ambitions go beyond what her research and company can do alone. She is working on MC2, an open-source version of her lab’s secure collaborative learning platform that is accessible to non-technical users. Potential applications appear to be vast.
“People come to us from financial institutions, big internet companies, nuclear physics, the government, medical institutions.… We don’t have the bandwidth to deploy our technology for every one of them, so we are creating a platform that anyone can use,” she says. “You don’t have to know cryptography or fancy engineering.”
While Popa’s lab works on about three use cases at a time, her students track the many inquiries they receive and integrate those questions into the open-source project. “Our goal is to bring secure collaborative learning to the masses — to the non-experts, to everyone,” she says. “That’s the vision we’re working toward.”