Popular science | Worried about privacy protection? Encrypted data vaults show their strengths

This article is an excerpt from the paper "Encrypted Data Vaults", produced at Rebooting the Web of Trust (RWOT) IX in Prague, 2019.

Original: https://github.com/WebOfTrustInfo/rwot9-prague/blob/master/final-documents/encrypted->

Authors (in alphabetical order): Amy Guy, David Lamers, Tobias Looker, Manu Sporny, and Dmitri Zagidulin

Contributors (in alphabetical order): Daniel Bluhm and Kim Hamilton Duffy

We store a large amount of sensitive data online, such as personally identifiable information (PII), trade secrets, family photos, and customer information, and this data is often not protected as well as it should be.

Legislation such as the General Data Protection Regulation (GDPR) pushes service providers to better protect individuals' privacy, primarily by holding them liable for data breaches. This liability pressure has exposed a technological gap: service providers often lack the technology to adequately protect their customers' privacy. Encrypted data vaults fill this gap and offer many other advantages.

This article describes current methods and architectures, derived requirements, design goals, and the risks developers should be aware of when implementing data storage. It also explores the basic assumptions of such systems, such as providing privacy protection mechanisms for storing, indexing, and retrieving encrypted data, as well as data portability.

Current ecosystem and existing work

The problem of decentralized data storage has been approached from many different perspectives, and decentralized or otherwise personal data stores (PDS) have a long history in both commercial and academic contexts. Different approaches have led to differences in terminology and architecture. The figure below shows the existing component types and their roles. As it shows, the encrypted data vault mainly serves a storage role.

Next we briefly describe the commonalities and differences between some existing implementations. The list is not comprehensive, but we have tried to select projects that are most familiar to the authors and that represent the different types of approach.

Architecture and deployment

Many architectures are designed around the idea of separating data storage from the application layer that uses the stored data. We can think of these applications as clients of varying complexity and treat the data store as a server. Some projects aim to build ecosystems of diverse applications and design their protocols accordingly. NextCloud, Solid, and DIF's Identity Hubs all describe an architecture that decouples end-user applications from data storage. Such an application can be a general-purpose file management interface for browsing or sharing data, or a specialized tool designed for a specific task, such as a calendar. Datashards, Tahoe-LAFS, and IPFS are concerned only with data storage and retrieval.

For Solid, NextCloud, and Identity Hubs, end users can choose to install and run the server portion of the data store on a device they control, or to sign up for an already configured instance hosted by a trusted third party (such as a commercial provider, affiliated institution, or friend). For Datashards and Tahoe-LAFS, end users install an application on one or more devices they control, and the data is stored locally on those devices. IPFS is peer-to-peer, so end users only install a read/write client, and data is stored across a public network.

In addition to data storage, Identity Hubs take on other responsibilities, such as management of end-user data, transmission of human-readable or machine-readable messages, and pointers to external services.

Encryption strategy

An important consideration for encrypted data storage is which components in the architecture can access the (unencrypted) data, or who controls the private keys. There are roughly three approaches: storage-side encryption, client-side (edge) encryption, and gateway-side encryption (a hybrid of the first two).

Any data storage system that allows users to store arbitrary data inherently supports client-side encryption: users can encrypt the data themselves before storing it. However, this does not mean that these systems are optimized for encrypted data, and querying and accessing encrypted data can be difficult (as is the case with Solid, NextCloud, Identity Hubs, and IPFS).

Storage-side encryption usually takes the form of full-disk encryption or file-system-level encryption. It is widely supported and well understood, and it is possible with any type of managed cloud storage. In this case, the private keys are managed by the service provider or the controller of the storage server, who may be several entities and may be different from the users who store the data. Storage-side encryption is an effective security measure, especially against direct access to the storage hardware. However, it does not guarantee that only the user who originally stored the data can access it.

In contrast, client-side encryption (as in Datashards) provides a high degree of security and privacy, especially if metadata is also encrypted. In this approach, encryption is usually done at the level of individual data objects with the assistance of a keychain or wallet client, so the user has direct access to the private keys. The price is that responsibility for key management and recovery falls directly on the end user. In addition, key management becomes more complicated when data needs to be shared.

A gateway-side encryption system like Tahoe-LAFS combines storage-side and client-side encryption. These storage systems are often encountered in multi-server clusters or on the platforms of some encrypted cloud service providers. These providers recognize that client-side key management can be too difficult for some users and use cases, and are willing to provide encryption and decryption services in a way that is transparent to client applications. At the same time, they try to minimize the number of components (storage servers) that have access to the private keys. For this reason, the keys are usually located on a "gateway" server that encrypts the data before passing it to the storage servers. Encryption and decryption are transparent to the client, and the data is opaque to the storage servers, so the storage servers can be modular or pluggable. Gateway-side encryption has some advantages over storage-side encryption, but it also has a disadvantage: the keys are held by the gateway's administrator, not by the user.
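The gateway arrangement described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Tahoe-LAFS or any other product is implemented. The names `StorageServer`, `Gateway`, and `toy_stream_xor` are hypothetical, and the cipher is a stand-in built from SHA-256 only so that the example runs with the standard library. It provides no integrity protection, and a real system would use a vetted authenticated cipher instead.

```python
import hashlib
import os


def toy_stream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). For illustration only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class StorageServer:
    """Pluggable backend: it only ever sees opaque bytes."""

    def __init__(self) -> None:
        self.blobs = {}

    def put(self, name: str, blob: bytes) -> None:
        self.blobs[name] = blob

    def get(self, name: str) -> bytes:
        return self.blobs[name]


class Gateway:
    """Holds the key; encrypts before storing and decrypts after fetching."""

    def __init__(self, storage: StorageServer) -> None:
        self._key = os.urandom(32)  # held by the gateway admin, not the user
        self._storage = storage

    def write(self, name: str, plaintext: bytes) -> None:
        nonce = os.urandom(16)  # fresh nonce per object
        self._storage.put(name, nonce + toy_stream_xor(self._key, nonce, plaintext))

    def read(self, name: str) -> bytes:
        blob = self._storage.get(name)
        nonce, ciphertext = blob[:16], blob[16:]
        return toy_stream_xor(self._key, nonce, ciphertext)


# The client talks only to the gateway and never handles keys.
storage = StorageServer()
gateway = Gateway(storage)
gateway.write("notes.txt", b"meet at noon")
recovered = gateway.read("notes.txt")
```

The client code at the bottom never touches a key, which is the convenience this design buys. The same fact is also its weakness, since whoever operates `Gateway` can read everything.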

Encrypted metadata

"We kill people based on metadata." - General Michael Hayden, former director of the NSA and the CIA

Whether metadata can be (or is required to be) encrypted has an impact on the privacy, security, and usability of the system.

Some systems, including Solid, NextCloud, and Identity Hubs, support attaching arbitrary metadata to binary data, while IPFS, Datashards, and Tahoe-LAFS do not. In Solid, clients write metadata for each resource using RDF. Identity Hubs use JWTs for per-object metadata and JSON documents for additional metadata in Collections (writing these is also the client's responsibility). NextCloud clients can add metadata to documents using WebDAV custom properties. None of these systems encrypt metadata.

Access interface and control

Whether accessing data over a network or on a local device, data objects tend to require globally unique identifiers. In different implementations, the storage interface for reading and writing data, and the mechanisms that restrict or authorize doing so, will vary.

Both NextCloud and Solid leverage existing web standards. NextCloud uses WebDAV to let client applications read, write, and search data in the server's file system through a directory structure, and supports custom login flows for authentication. Solid combines the Linked Data Platform (LDP) with OpenID Connect authentication and Web Access Control so that users can read or write data after logging in to a client application. Resources (data objects) on a Solid server are identified by HTTP URIs. Clients send HTTP requests containing RDF payloads to these URIs, and the server creates or modifies the target resources accordingly.

Identity Hubs use JSON Web Tokens (JWTs) sent to specified service endpoints. Retrieving data takes multiple requests: first to retrieve the metadata for the required data object, and then to retrieve the sequence of commits that make up the actual data. The authentication mechanism is still under development, and access control is performed through the "permissions" interface.

Tahoe-LAFS uses a client-gateway-storage-server architecture. The client passes data to the gateway server for encryption and chunking, and the gateway in turn stores the chunks on a cluster of storage servers. Multiple copies of the data are stored to increase availability and help with data recovery. The service is identified using a Foolscap URI, and the client can be configured to use HTTP, (S)FTP, or a watched local directory ("magic folder") to create, update, and delete data. Data is organized in a directory structure similar to a file system, and access control builds on this structure.

IPFS is a distributed content-addressable storage mechanism that breaks data up into a Merkle DAG. IPFS uses IPLD to generate URIs for data from its content and to link content together on the network, and uses a DHT to discover content on the network.
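Content addressing and hash links can be illustrated with a minimal sketch. It assumes plain SHA-256 hex digests as addresses and a JSON root node, which is not IPFS's actual chunking, identifier format, or IPLD encoding. The names `store`, `put_block`, `put_file`, and `get_file` are hypothetical, and the chunk size is deliberately tiny.

```python
import hashlib
import json

CHUNK_SIZE = 4  # deliberately tiny so a short string splits into several chunks

store = {}  # address -> block; stands in for the peer-to-peer network


def put_block(block: bytes) -> str:
    """Store a block under the hash of its own content and return that address."""
    address = hashlib.sha256(block).hexdigest()
    store[address] = block
    return address


def put_file(data: bytes) -> str:
    """Split data into chunks and store a root node that links to them by hash."""
    links = [put_block(data[i:i + CHUNK_SIZE])
             for i in range(0, len(data), CHUNK_SIZE)]
    root = json.dumps({"links": links}).encode()
    return put_block(root)


def get_file(root_address: str) -> bytes:
    """Follow the links from the root node, verifying every block on the way."""
    def fetch(address: str) -> bytes:
        block = store[address]
        if hashlib.sha256(block).hexdigest() != address:
            raise ValueError("block does not match its address")
        return block

    links = json.loads(fetch(root_address))["links"]
    return b"".join(fetch(link) for link in links)


root = put_file(b"hello, vault")
restored = get_file(root)
same_root = put_file(b"hello, vault")     # identical content, identical address
other_root = put_file(b"hello, vaulT")    # one byte changed, different address
```

Because an address is derived from the content, a reader can verify any block it receives from an untrusted peer, and changing a single byte changes the root address.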

Index and query

Because encrypted data is opaque to the storage server, indexing and searching it is a challenge. Some systems solve this problem by attaching a certain amount of unencrypted metadata to the data object. Another possibility is to use a list of unencrypted pointers to a filtered subset of the data.

Solid is designed to provide a web-accessible interface to a file system. Resources (RDF documents or arbitrary files) are organized into folder-like containers. How the data is actually stored (for example, in a file system or a database), and at what granularity, is left to the implementation. Solid does not specify a search interface, but some implementations may use SPARQL or Triple Pattern Fragments (TPF).

In Identity Hubs, indexing is handled through the Collections interface. The client is responsible for writing the appropriate metadata to its own unencrypted collection, which allows the Hub to respond to queries.
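The first approach, unencrypted metadata attached to an encrypted object, can be sketched as follows. This is a minimal illustration rather than the design of any of the systems discussed here. The `OpaqueStore` class and its methods are hypothetical, and `os.urandom` merely stands in for ciphertext that a client would have produced beforehand.

```python
import os
from collections import defaultdict


class OpaqueStore:
    """Server-side view: encrypted payloads plus client-chosen plaintext tags."""

    def __init__(self) -> None:
        self._objects = {}              # object id -> ciphertext
        self._index = defaultdict(set)  # plaintext tag -> object ids

    def put(self, object_id: str, ciphertext: bytes, tags: set) -> None:
        self._objects[object_id] = ciphertext
        for tag in tags:  # these tags are the only thing the server can read
            self._index[tag].add(object_id)

    def find(self, tag: str) -> set:
        """Answer a query from the plaintext index without decrypting anything."""
        return set(self._index.get(tag, set()))

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


server = OpaqueStore()
# os.urandom stands in for ciphertext produced by the client.
server.put("obj-1", os.urandom(32), {"photo", "2019"})
server.put("obj-2", os.urandom(32), {"invoice", "2019"})
server.put("obj-3", os.urandom(32), {"photo"})

matches = server.find("2019")
```

Every tag the client chooses to expose makes queries easier and also tells the server something about the data, so the choice of tags is itself a privacy decision.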

NextCloud categorizes data objects into directories, and clients can query data and metadata using WebDAV's SEARCH and PROPFIND methods.

Tahoe-LAFS, Datashards, and IPFS are low-level storage protocols that do not provide indexing or search.

Availability, replication and conflict resolution

Replicating data across multiple storage locations makes the system more resilient, allowing quick recovery, and improves its security. Systems that support peer-to-peer replication must provide conflict resolution mechanisms such as CRDTs, or require end-user intervention to merge files that have gone out of sync. NextCloud, as a commercial enterprise product, supports distributing data across multiple instances for scalability. Different NextCloud servers do not communicate directly with each other, but can communicate through user-installed applications. Similarly, different instances of Solid servers do not communicate with each other. Client applications can perform any necessary communication between storage servers, but these servers typically belong to different users rather than storing copies of the same data.

IPFS, Tahoe-LAFS, and Datashards chunk data, use content-addressable links, and store many copies of the chunks for high availability. Since they are low-level protocols and the data is opaque to the servers, they do not handle conflict resolution.

In Identity Hubs, change synchronization and conflict resolution between Hub instances are still under development.
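To make the CRDT option concrete, here is a minimal sketch of one of the simplest CRDTs, a last-writer-wins map. It is an illustration under assumptions, not the mechanism of any system named above. The class `LWWMap` and the replica names are hypothetical, and a logical counter is used in place of wall-clock time.

```python
from dataclasses import dataclass, field


@dataclass
class LWWMap:
    """Last-writer-wins map: each key keeps the value with the highest stamp."""

    replica_id: str
    entries: dict = field(default_factory=dict)  # key -> (clock, replica_id, value)
    clock: int = 0

    def put(self, key: str, value: str) -> None:
        self.clock += 1
        self.entries[key] = (self.clock, self.replica_id, value)

    def get(self, key: str) -> str:
        return self.entries[key][2]

    def merge(self, other: "LWWMap") -> None:
        """Commutative, associative, and idempotent, so merge order is irrelevant."""
        for key, theirs in other.entries.items():
            mine = self.entries.get(key)
            # Compare (clock, replica_id); the replica id breaks clock ties.
            if mine is None or theirs[:2] > mine[:2]:
                self.entries[key] = theirs
        self.clock = max(self.clock, other.clock)


laptop = LWWMap("laptop")
phone = LWWMap("phone")

laptop.put("todo.txt", "buy milk")
phone.put("todo.txt", "buy oat milk")  # concurrent edit with the same clock value
phone.put("notes.txt", "call Alice")

# Replicas exchange state in both directions and end up identical.
laptop.merge(phone)
phone.merge(laptop)
```

Both replicas converge without user intervention, but the price of last-writer-wins is that one of the two concurrent edits to `todo.txt` is silently discarded. Richer CRDTs, or a manual merge step, are needed when that is not acceptable.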
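The availability benefit of keeping several copies of each chunk can be sketched as follows. This is a simplified illustration that assumes whole-chunk replication with a made-up placement rule, and it does not reflect the actual placement or encoding scheme of any of the three protocols. The names `servers`, `placements`, `put_chunk`, and `get_chunk` are hypothetical.

```python
import hashlib

REPLICAS = 2
servers = [dict() for _ in range(4)]  # each dict stands in for one storage node


def placements(address: str) -> list:
    """Pick REPLICAS distinct servers deterministically from the address."""
    start = int(address, 16) % len(servers)
    return [(start + i) % len(servers) for i in range(REPLICAS)]


def put_chunk(chunk: bytes) -> str:
    """Store the (already encrypted) chunk on every server chosen for it."""
    address = hashlib.sha256(chunk).hexdigest()
    for index in placements(address):
        servers[index][address] = chunk
    return address


def get_chunk(address: str) -> bytes:
    """Return the first surviving copy whose hash still matches its address."""
    for index in placements(address):
        chunk = servers[index].get(address)
        if chunk is not None and hashlib.sha256(chunk).hexdigest() == address:
            return chunk
    raise KeyError(address)


address = put_chunk(b"encrypted chunk bytes")
first = placements(address)[0]
servers[first].clear()  # simulate losing one storage node
survivor = get_chunk(address)
```

Because the chunk is addressed by its hash, the reader can accept a copy from any node and still detect corruption, and the nodes never need to understand the content they hold.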

Summary

The above outlines the characteristics of some of the active projects in the personal data storage space. These projects are designed to give end users control over their data without the need for centralized authorities or proprietary technology.

At present, it is difficult for a single system to offer client-side (edge) encryption of all data and metadata while also letting users store data on multiple devices, share it with other entities, and search or query it. The survey shows that there is usually a trade-off, such as sacrificing privacy to improve usability or vice versa.

As many technologies and standards are now maturing, we hope that such trade-offs will no longer be necessary, and that it is possible to design a broadly applicable privacy-preserving protocol for encrypted decentralized data storage.

Ontology has done extensive exploratory work in decentralized identity and personal data privacy protection, and we welcome all technical partners to join the discussion. Later, we will publish the second half of the paper, which explores the basic assumptions of such systems, such as providing privacy protection mechanisms for storing, indexing, and retrieving encrypted data, as well as data portability.