Abstract

The rapid development of wearable sensors and the 5G network empowers traditional medical treatment with the ability to collect patients’ information remotely for monitoring and diagnosing purposes. Meanwhile, the health-related mobile apps and devices also generate a large amount of medical data, which is critical for promoting disease research and diagnosis. However, medical data is too sensitive to share, which is also a common issue for IoT (Internet of Things) data. The traditional centralized cloud-based medical data sharing schemes have to rely on a single trusted third party. Therefore, the schemes suffer from single-point failure and lack of privacy protection and access control for the data. Blockchain is an emerging technique to provide an approach for managing data in a decentralized manner. Especially, the blockchain-based smart contract technique enables the programmability for participants to access the data. All the interactions are authenticated and recorded by the other participants of the blockchain network, which is tamper resistant. In this paper, we leverage the K-anonymity and searchable encryption techniques and propose a blockchain-based privacy-preserving scheme for medical data sharing among medical institutions and data users. To be specific, the consortium blockchain, Hyperledger Fabric, is adopted to allow data users to search for encrypted medical data records. The smart contract, i.e., the chaincode, implements the attribute-based access control mechanisms to guarantee that the data can only be accessed by the user with proper attributes. The K-anonymity and searchable encryption ensure that the medical data is shared without privacy leaking, i.e., figuring out an individual patient from queries. We implement a prototype system using the chaincode of Hyperledger Fabric. From the functional perspective, security analysis shows that the proposed scheme satisfies security goals and precedes others. From the performance perspective, we conduct experiments by simulating different numbers of medical institutions. The experimental results demonstrate that the scalability and performance of our scheme are practical.

1. Introduction

Data sharing is crucial for promoting the research of disease tracking and treatment. For instance, the sharing of substantial medical data can help government agencies make correct decisions in public health or help medical research institutions conduct scientific research to promote the progress of medical science. The fight against to current epidemic of COVID-19 also proves that efficient medical information sharing among different institutions can effectively trace the disease and accelerate the vaccine development. The application of wearable medical devices and health-related mobile apps alleviate the difficulty of personal medical data collection and retrieving, but for the following stages of data management, the security and privacy of the data have become the top concern at the same time.

Traditional approaches are based on clouds to store data remotely and different institutions share the data in a centralized manner. The top two challenges for these methods are privacy and security.

For the privacy issue, any data user should not infer a specific patient private information, including the name, address, and phone number, from querying the stored medical records. If the privacy of the patient is not well protected during data sharing, the patient would be reluctant to share their medical data [1]. In addition, for sharing purposes, the data is stored remotely on clouds. The patients and medical institutions have to trust that the cloud provider would not leak out the data, if the data is not encrypted. However, the encrypted data would hinder the sharing process.

K-anonymity [2] and differential privacy [3] are two commonly adopted techniques to protect data privacy. Comparing both, differential privacy is relatively new and got more attention in recent years due to its strong privacy guarantee. Differential privacy is usually achieved through adding noise to attributes or values [4], i.e., through simply adding or deleting some less important data in the original dataset. Though differential privacy defines an extremely strict attack model for guaranteeing privacy, the noise data it introduces may affect the statistical characteristics of the data. However, for the medical research purpose, the statistical characteristics are more important to study a disease instead of focusing on some individual patients. Therefore, the K-anonymity [2] technique is more suitable for preprocessing the medical data, which obscures some sensitive data fields without affecting the original data. The accuracy of the statistical results on the data, therefore, can be preserved. On the other hand, the medical data are encrypted and stored on untrusted clouds. Searchable encryption is a potential solution which allows the server to search on encrypted data without knowing the content of the data [5].

For the security issue, cloud-based data platforms have problems of a single point of failure, vulnerability, and inefficiency. Besides, cloud-based data sharing relies on third-party services and there may exist stealing, leakage, tampering, or misusing of data. Although existing cryptography-related solutions have solved some problems of the cloud, the single point of failure problem cannot be solved. On the other hand, the access control for medical data management is also centralized and usually based on roles. However, the RBAC (role-based access control) model requires configuring complex rules to restrict the accessibility of different types of data users. Especially, the process of configuring and updating rules is vulnerable that the attack may leverage the loophole and easily elevate privileges to obtain the grant for the entire dataset, due to the centralized data storage.

Blockchain is a distributed ledger, which has the characteristics of decentralization, tamper resistance, and reliability. Thus, blockchain is a potential solution to replace centralized cloud storage. The blockchain-based smart contract provides a decentralized manner to authenticate the data access request among participants of different institutions. The ABAC (attribute-based access control) model can be further leveraged to simplify configurations for restricting the data accessibility according to the users’ assigned attributes.

In this paper, we propose a privacy-preserving scheme based on the blockchain for medical data sharing. To tackle the two challenges mentioned above, the contributions of our developed system are as follows: (i)We adopt the K-anonymity technique to preprocess the data for privacy preserving(ii)We design the scheme based on searchable encryption for storing the encrypted medical data on clouds and enable the keyword search in a privacy-preserving manner(iii)We develop smart contracts based on Hyperledger Fabric, and realize the secure keyword search and the attribute-based access control model(iv)We implement a prototype system with smart contracts based on a chaincode of Hyperledger Fabric (the URL of source code: https://github.com/mythsand/privacy-preserving-medical-data) and conduct experiments with simulating different numbers of institutions(v)We analyze the security properties and evaluate the computational overhead

The rest of the paper is organized as follows. Section 2 explains the preliminaries related to this paper, including K-anonymity, bilinear pairings, blockchain, and access control. Then, we formulate the problem with explaining the security goal and notions. The system design and technique details are introduced in Section 4. Security analysis and experimental studies are demonstrated in Section 5 and Section 6, respectively. Section 7 presents the related work and Section 8 finally concludes the paper.

2. Preliminaries

2.1. K-Anonymity

K-anonymity was the first model proposed to protect data privacy through syntax [2]. Its main idea is to make reidentification infeasible by hiding objects in the same group. That is to say, K-anonymity requires that each record in the anonymized data cannot be distinguished from other at least -1 records. Quasi-identifier attributes, e.g., social security numbers, state liquor identification cards, drivers’ licences, and even passports or national identity cards, are concerned. Therefore, no identity in the K-anonymous dataset will be linked to fewer than records. That is, the probability of correct reidentification is at most 1/. Several definitions of K-anonymity related to this paper are explained below. Here, we assume that all the information is in a table , which contains multiple tuples.

Definition 1. (quasi-identifier attribute set). A quasi-identifier is a minimal set of attributes in table that can be combined with external information records to reidentify personal information. This paper assumes that quasi-identifiers are known based on empirical data and epistemology.

Definition 2. (equivalent class). The equivalent class in table indicates that each tuple is the same as several other tuples.

Definition 3. (K-anonymity property). Table is a set of determined values of the attribute group in that is anonymized to appear at least times in , i.e., each of the equivalent classes is at least in size.

2.2. Bilinear Pairings

The scheme of our encrypted search is constructed based on bilinear maps, the definition of which is described as follows:

2.2.1. Bilinear Map

For two cyclic groups and of order , there is a bilinear map between them: . The map relation satisfies following three properties: (1)Computability: given , algorithms to compute can finish within a polynomial time(2)Bilinearity: for any integers (3)Nondegeneracy: if is a generator of , then is a generator of . In other words, this can be simplified as

The size of is determined by the security parameter.

2.2.2. Decisional Bilinear Diffie-Hellman (DBDH) Assumption

Suppose an adversary chooses random , , , , the DBDH assumption [6] means that there is no adversary, who can distinguish the tuple (, , , from the tuple (, , , , within a probabilistic polynomial time with a nonnegligible advantage.

2.3. Blockchain and Smart Contract

The consortium blockchain provides a permissioned design, which is different from the private blockchain and the public blockchain. It does not require the same level of strict control and restriction as the private blockchain. Meanwhile, it is not completely decentralized as the public blockchain. The consortium blockchain is maintained and operated by several trusted nodes. Besides, the consortium blockchain has the advantages of promoting openness and collaboration, marvelous data control, and node management and speeding up the operation of the system. In this paper, we build a medical data sharing system based on Hyperledger Fabric. Hyperledger is an open source project under the Linux Foundation and is supported by companies such as IBM, Intel, and Sap. Hyperledger Fabric is one of the implemented blockchains. This consortium blockchain has the advantages of high throughput, low latency, and scalability. It is a popular choice in the industrial blockchain scenario.

2.4. Access Control

Attribute-based access control (ABAC) is a type of access control technology that considers attributes, objects, permission, and environment as input. It determines whether to grant authorization by examining whether the object contains the proper attributes. ABAC can provide fine-grained access control, which can support a large number of input decision sets, define many possible rules, and express many strategies, but with only limited computing consumption and attributes. Such flexibility can decouple the relationship between the subject and the object. For instance, an employee is given attributes in the subject’s attribute set, e.g., a nurse in a certain hospital. Objects are given attributes when they are created, e.g., medical data of patients with diabetes. Object owners will define attributes to set access control rules at the beginning of the creation, e.g., all nurses in the hospital can access the medical data of patients with diabetes. Under attribute-based access control, access decisions can be modified by simply changing the attribute value without affecting the relationship between a single subject and an object. Hence, ABAC can provide more dynamic and flexible access control management capabilities, reducing long-term maintenance costs. In addition, ABAC can be performed without any knowledge of new subjects, which means that there is no need to modify the existing rules.

3. Problem Formulation

3.1. Problem Scenario

There are four main roles in our problem scenario, data owner, medical institution, and data user. The respective responsibility and operations of these roles are as follows. (i)Data owners, i.e., patients, share their medical data with medical institutions, including personally identifiable information and medical data(ii)Medical institutions extract medical data keywords, perform keyword search, and conduct access control(iii)Data users are entities that need to be authorized to perform keyword searches and obtain the required data, such as research institutions, insurance companies, or government departments. Data users need to be authorized first by access control and then can perform a keyword search on medical data

3.2. Security Goal

Here, we address the security goals as follows in our scheme.

3.2.1. Privacy Preserving

With all these patient personal information and medical data involved in this scheme, the major challenge is to preserve the privacy of all patients. No matter who wants to approach the patients’ health data, the identity information must be controlled or limited.

3.2.2. Security Search

If data users want to obtain patients’ health data, they have to get through the access control mechanism. Meanwhile, the searching and query process should not leak information related to the keywords.

3.2.3. Data Integrity and Reliability

In addition to data privacy issues, there are data security issues. The patient’s data should not be tampered and deleted after uploading, i.e., the integrity and reliability of the data should be guaranteed.

3.3. Notions

(i)SID: the patient identifier, which can be used to denote one specific patient, such as the social security number(ii)QID: quasi-identifier, such as zip code, address, age, gender, and birthday(iii): the medical institution collection, denoted as a set of medical institutions (iv): the plain medical data text of , denoted as a set of data (v): the ciphertext of medical data; medical data can be encrypted by a medical institution, denoted as (vi): the keywords of medical data that can represent a medical data, denoted as a set of keywords (vii): encrypted keyword of , medical institutions’ encrypted keyword collection, denoted as (viii) represents the queried keywords and the subset of the keywords , denoted as a set of keywords (ix): the trapdoor for , denoted as (x): the public key(xi): the private key(xii): the security parameter

4. System Design and Technique Details

To tackle the issue of privacy protection and access control for sensitive medical data, we propose and develop a system based on the consortium blockchain platform with K-anonymity and searchable encryption techniques. The technologies adopted in this paper are designed and selected for medical data and medical data application scenarios. Medical data mainly has two kinds of privacy characteristics: data privacy and identity privacy. And the application scenarios of medical data are the coexistence of multiple medical institutions, similar to the P2P network in the computer network. The following is a detailed analysis and explanation. (1)For data privacy, this paper extracts keywords from medical data and further encrypts and searches the keywords with a searchable encryption technology, which not only protects the privacy of data but also ensures the availability of data(2)As for identity privacy, the current technologies to protect identity privacy include differential privacy and K-anonymity. Based on the characteristics of medical data, K-anonymity is chosen in this paper. The main consideration is that differential privacy will change the original statistical data characteristics of medical data, and that is more important to medical data. K-anonymity can retain the statistical characteristics of data. Therefore, we choose K-anonymity(3)For the application scenario of medical data sharing, this paper adopts the consortium blockchain. Each medical institution is acting as one node of the consortium blockchain. The consortium blockchain is in line with the medical data sharing scenario of a peer-to-peer network and authorized access. On the contrary to the public blockchain, the data stored and managed by the consortium blockchain can keep being secured and private, instead of being totally transparent

In this section, we first introduce the system overview and then zoom into the detailed techniques adopted.

4.1. System Overview

Figure 1 shows the architecture overview of our system. We introduce the consortium blockchain as the middleware to perform as the trust layer. After patients upload data, i.e., medical data, to medical institutions. The medical institutions need to preprocess the data with a K-anonymity technique to blur some sensitive data fields which can probably reflect the patient’s identity. Then, the keywords are extracted from the dataset and form the index. We design a scheme based on searchable encryption to encrypt the dataset and the index, i.e., the keyword. The entire encrypted dataset is uploaded to the cloud and managed by the medical institution itself. The encrypted data are uploaded to the consortium blockchain platform, which is constructed by different medical institutions. Each medical institution acts as a participant node in the platform. We develop smart contracts and deploy them on the blockchain platform. The smart contracts provide all the related interfaces for medical institutions and data users to leverage. One of the interfaces is designed for data users to query with the keyword index. According to our searchable encryption scheme, the corresponding results can be fetched back to the data user from clouds without decrypting the secured data. The searching process is also implemented by the smart contract. Besides, the smart contract also provides the interface of access control based on the attributes of data users.

In this paper, K-anonymity and searchable encryption are different from traditional application scenarios. The implementation of K-anonymity in this paper is different from the traditional way of centralized processing and centralized storage. The process of the K-anonymity algorithm adopted in this paper is processed by distributed nodes. The results of processing are stored on the consortium blockchain, and all nodes in the blockchain network can access them. As for searchable encryption, in traditional application scenarios, the processing and calculation modes are centralized. In this paper, the searchable encryption scheme based on the blockchain platform is implemented with smart contracts, including the calculation process of trapdoor generation and keyword matching, which are also in a decentralized manner.

The consortium blockchain network in this paper is based on blockchain nodes, and each medical institution acts as one node of the blockchain network. For the newly joined nodes, i.e., medical institutions, they need to apply for certificates from the authority node and then join the blockchain network with certificates. Among them, the authority node is the root of trustworthiness in the consortium blockchain, which can be played by the authoritative management department of the medical institution.

Figure 1 shows the entire architecture with multiple medical institutions. For the following parts in this section, we zoom into the details and explain how the techniques are leveraged. Hence, for a specific medical institution, Figure 2 demonstrates how the medical institution preprocesses the data and uploads the data to a remote cloud. The personal SID is firstly removed and then the K-anonymity is adopted to blur the QID. Afterwards, the preprocessed data is encrypted and uploaded to the remote cloud. Finally, data users can perform queries without decrypting the data. The related technique is described as follows.

4.2. K-Anonymity

Mondrian multidimensional partitioning [7] is a K-anonymous multidimensional partitioning algorithm with K-anonymous processing in two steps. In the first step, the multidimensional regions covering all the domain space attributes are defined, i.e., the partition stage constructs kd-trees [8]. The second step is to construct functions for data recoding.

The partitioning algorithm is described in Algorithm 1. In the algorithm, each dimension selects the dimension and the value of the partition. In the literature of kd-trees, one approach is to use the median as the value of the partition. The partition is completed to get k-groups. and each k-group spontaneously includes at least k records. Each k-group is then generalized. Thereby, the QID of each group is the same.

Input: Table S to be partitioned.
Output: Partitioning result.
1: Anonymize(partition)
2: if (no allowable multidimensional cut for partition) then
3: return : partition summary
4: else
5: dim choose_dimension()
6: fs frequencySet(partition, dim)
7: splitVal find.median(fs)
8: lhs t partition: t.dim splitVal
9: rhs t partition: t.dim splitVal
10: return Anonymize(rhs) Anonymize(lhs)
11: end if

For the selection of parameter in the algorithm, the principle we follow is to get the value of the K-anonymity parameter after the practical test in the practical application scenarios, so as to protect the patient’s identity privacy and not make the system query results too redundant.

4.2.1. Before Processing

For example, supposing that the patient data structure is shown in Table 1, QID includes the age, sex, and zip code. These attributes that appear in private personal data may also appear in public datasets. If the two sets of data are linked together, the patient’s private data may be leaked. Therefore, these attributes need to be processed by the K-anonymous algorithm to avoid privacy leakage.

4.2.2. After Process

The multidimensional anonymization of patients is shown in Table 2. It shows that with the condition of 2-anonymity, each record has another one record, whose QID attributes are exactly the same.

4.3. Searchable Encryption

In this paper, we assume that there are several medical institutions with different keys that participate. Then, we are faced with the problem of how to search the different key encrypted keywords in multiple medical institutions. To achieve a secure search for multiple medical institutions, we have adopted a secure search scheme that satisfies the following three conditions: (i)Different medical institutions encrypt keywords with their own keys(ii)The data user does not need to know the key when generating the trapdoor(iii)After retrieving the trapdoor, the cloud server can search for the corresponding data content through the keywords without knowing the specific value, as the keywords are encrypted by multiple medical institutions

The cloud server mentioned in this section refers to the computing and storage resources provided by the public cloud service provider (CSP), to form a running unit and provide services to clients. In addition, the middle server is provided by a specific trusted organization to provide the system with completely credible and reliable services, such as secret keys and crucial information storage.

Figure 2 shows the detailed process of using searchable encryption to manage the data. It is worth mentioning that the encrypted index is stored in the consortium blockchain and the encrypted dataset is uploaded to a remote cloud managed by a third party. In order to explain the procedure shown in Figure 2, we use an example to illustrate the details of this scheme. Medical institution , i.e., , needs to encrypt into with its own key before sharing medical data . At the same time, in order for the data user to be able to perform search, the medical institution needs to extract the keyword from the document and send the encrypted keyword to the middle server. The middle server is further encrypted, to , and obtains ; then, the result is sent to the cloud server. Next, assume data user wants to search for a document related to the keyword . Basically, he needs to generate a trapdoor and upload it to the middle server. The middle server then reencrypts the trapdoor to obtain , while generating secret data . Then, and are uploaded to the cloud server. The cloud server finally calculates for keyword search.

4.3.1. Encryption Construction

The construction is based on a bilinear map. We define to be the generators of the cyclic groups, and , whose orders both are . is a bilinear map . In the process of encryption construction, the random key generation algorithm generates different keys for different inputs. , , , . and are the private keys of the middle server; and are the private keys used to encrypt keywords and data of medical institution , respectively. , located in , is a hash function.

4.3.2. Keyword Encryption

The keys of different medical institutions are different in this system, and the ciphertext generated each time for the same keyword is different. Therefore, even if the key is lost, the data cannot be leaked, since the cloud server cannot obtain any information about the keyword. For the th keyword of the medical institution , i.e., ,. the process of encryption calculation is as follows. where is a number generated randomly each time, which is leveraged to calculate and .

The medical institution submits to the middle server, and the middle server reencrypts with its own keys, and , to obtain , as follows.

Finally, the middle server submits the to the cloud server. In the entire process, the middle server is always unable to know the specific value of the keyword.

4.3.3. Trapdoor Generation

In the scheme that we propose, data users do not need to know the key of the medical institution and the trapdoors generated for the same keyword each time are different. The trapdoor is generated in two steps. First, the data user generates a trapdoor based on the search key and the random number and then submits the trapdoor to the middle server. Second, the middle server reencrypts the trapdoor. Here, we assume that the data user wants to search for the keyword and the encryption is calculated as follows. where is a random number generated randomly each time. After receiving the , the middle server generates a random number and reencrypts as follows.

Let us make , , , i.e., . Finally, the middle server submits to the cloud server.

4.3.4. Keyword Matching

In the scheme that we proposed in this paper, the cloud server stores encrypted data and keywords for all medical institutions. The middle server needs to transfer a secret data to the cloud server. After receiving the search request, the cloud server performs a global search to match all stored keywords, in order to obtain the corresponding medical data. The searching process is described as follows. First of all, the cloud server performs the following calculations, after getting trapdoors, and .

Then, the cloud server judges whether equals to , according to the following equation.

4.4. Consortium Blockchain and Attribute-Based Access Control

The consortium blockchain is the crucial part of the entire scheme. All new nodes must get the certification from the fabric-CA before participating in the consortium blockchain. The consortium blockchain implements a chaincode of searchable encryption and ABAC. At the same time, the blockchain exposes interfaces for users to access blockchain data, including access control interfaces and data interfaces. The functions implemented by the consortium blockchain are as follows: (i)The chaincode implements the function for uploading encrypted data to the ledger of the blockchain(ii)The chaincode provides the function of searchable encryption(iii)The chaincode provides ABAC to manage the permissions for user access

For the ABAC part, policy management and access control are separated. Policies may vary with the actual scenario, for instance, by increasing or decreasing the number of attributes to accommodate larger or smaller scenarios. The key part of ABAC is the attribute which can be defined as . The definition of each field is as follows: (i) indicates the attribute. Each attribute has an identifier(ii) indicates the subject’s attributes, i.e., the identity and characteristics of the subject which can perform the access request, for instance, the entity’s name, age, and occupation(iii) indicates the object’s attributes, i.e., the information related to the accessed resource, for instance, resource type, service location, and protocol(iv) indicates the permission, i.e., the operation of the subject which can be performed on the object, for instance, reading, writing, and executing(v) indicates the environment, i.e., the environment information when the access request is initiated, for instance, time and location

The attribute-based access control policy can be defined as, which indicates the access control rules of the subject to access the object. It expresses the required attribute set for accessing the protected resources. The above expression means that the XOR relationship among the subject, object, permission, and environment constitutes the access grants of the access control mechanism.

The attribute-based access control request can be defined as , as mentioned above, which is a set of attributes. It indicates the operation of the subject on the object under the environment. When users access the ABAC system, the attribute set constitutes the access request operation, which is used as the input parameter to initiate the access request of the access control mechanism in the system.

5. Security Analysis

In this section, we discuss that our scheme satisfies the following security goals.

5.1. Privacy Preserving

Patient’s personal information and medical data are protected by searchable encryption and attribute-based access control. Medical institutions encrypt medical data, and the consortium blockchain provides access control, which is a proper way to avoid privacy disclosure to malicious entities.

5.2. Security Search

Data users need to obtain authorization of access control before they can search on the consortium blockchain. Therefore, malicious entities cannot initiate searches after access control, which guarantees the security of the data.

5.3. Data Integrity and Reliability

The patients’ medical data with encrypted keywords are uploaded by the medical institutions to the consortium blockchain. Thus, during the process of data storage and transmission, no one can modify or read data without the authorization of the medical institutions.

5.4. Scheme Properties

As for scheme properties, we compare the properties of our scheme with [9, 10] of the cloud-based one and [11] of the blockchain-based one. As shown in Table 3, our scheme can meet all the scheme properties which are vital properties of medical data sharing schemes.

6. Experimental Study

In this section, we implement the proposed scheme on the Hyperledger Fabric platform and evaluate its performance. Especially, we simulate different numbers of medical institutions to construct the blockchain platform for testing the performance and scalability of our system. For testing purposes, we leverage the docker to build a consortium blockchain platform with Hyperledger Fabric, which is contrusted using CloudsStorm [12] in the Cloud environment. The programming language is Python3.7 and Go1.12; the Fabric edition is 1.4. For the underlying server, we set up the environment with a virtual machine from the ExoGENI [13] cloud. The entire virtual machine is only for this purpose. The configuration of the virtual machine is the type of “XOXLarge,” which contains 4 CPU cores and a 12 GB RAM.

6.1. Computational Cost

In order to evaluate this scheme quantificationally, we have also conducted some experiments based on our scheme. We provide the details of the computational cost of functions in Table 4.

In our scheme, the function is responsible for generating the public key and private key. And, the function is used to encrypt the keywords of medical data. The trapdoor can be generated by the function of . Finally, the function of is for searching the expected keyword and the result can be “True” or “False.”

Due to the fact that the computational cost of these algorithms is related to keyword numbers, we test our algorithms by setting keyword numbers as , 50, and 100. As shown in Table 4, we can find out that the time cost of all functions, including , , , and , increases with the size of keyword amounts linearly.

Among these four main operations, three of them are implemented in the smart contracts of the consortium blockchain, including , , and . Hence, to test the system performance for these on-chain operations, we measure the computational cost of these operations varying with different frequencies of operation requests, i.e., the medical data accessing rate in TPS (transactions per second). The number of keywords to be processed is set to once in each transaction, i.e., for each data accessing request. Five medical institutions are simulated here to construct the 5-node blockchain network. The experimental results are shown in Figure 3. It shows that the computational cost of all the on-chain operations is linear with the medical data accessing rate, which demonstrates that our system is scalable.

6.2. Performance of the Smart Contract

In this subsection, we mainly test the performance of two types of chaincodes, i.e., smart contracts. One is designed for storing and searching the dataset, and the other is for attribute-based access control (ABAC). For this experiment, we simulate that there are five medical institutions, i.e., five nodes construct the experimental blockchain platform. Figure 4 shows the chaincode processing performance of the searching operation with searchable encryption when increasing the medical data accessing rate. In this scenario, the processing rate of the chaincode for searching operations increases linearly according to the increased medical data accessing rate. The system is able to handle all the requests when the accessing rate is not high. However, when the rate of accessing requests comes to 70 TPS, the processing rate cannot increase anymore. It demonstrates that the throughput of our system with 5 medical institutions is around 70 TPS. For the chaincode of the ABAC model, the performance is similar as shown in Figure 5. But the throughput for the ABAC chaincode is better than the chaincode with searching operations, which is around 200 TPS. As the medical information sharing is mainly for research purposes, the amount of requests is not at a large scale. Hence, the performance of our system is acceptable.

On the other hand, the memory consumption of the chaincode for realizing searchable encryption is also measured. The measurements are performed within the node where the chaincode for encryption is invoked. For testing, we input the system with different rates of medical data accessing requests. The experimental results shown in Figure 6 also demonstrate that the system is able to handle the accessing rate blow of approximately 80 TPS. When the accessing rate further increases, the memory consumption stays steady afterwards. The reason is that the system has reached its capacity. Meanwhile, it is worth mentioning that the maximum memory consumption for running searchable encryption chaincodes is around 170 MB. It is, therefore, practical for each medical institution to operate a server for running the chaincode and participant of the blockchain network.

6.3. Scalability

In this section, we test the scalability of our implemented prototype. We still mainly test two types of chaincodes, i.e., for searchable encryption and ABAC. Since these two parts are crucial to our system, the scalability of these chaincodes can determine the scalability of the entire system. Hence, we assume various scenarios, under which there are different numbers of medical institutions. To check the trend, we simulate 5, 10, 15, and 20 medical institutions. It means that the scales of the blockchain platform are 5, 10, 15, and 20. Then, we increase the medical data accessing rate as system inputs until the chaincode processing rate is getting steady. The procedure is similar to the experiment conducted in Section 6.2. In this way, the capacity of system throughput is achieved.

Figures 7 and 8 show the experimental results of chaincodes for searchable encryption and ABAC, respectively. For both of these two types of chaincodes, the descending rates of their performance are not even linear to the increased number of blockchain participants, i.e., medical institutions. However, with the configurations of our experiments, the system capacity has not decreased much for each type of chaincode when the number of simulated medical institutions increases from 5 to 20 (70 TPS to 60 TPS and 200 TPS to 165 TPS, respectively). In practice, not all the medical institutions construct a single federation to share medical data; the scalability of our system is still feasible. However, for a large-scale federation with many institutions, the system still needs to be further optimized.

The data privacy and security are hot topics in recent years. Some papers consider the issue related with different stages of the data management, including data uploading [14], auditing [15], and sharing [16]. Other papers consider these issues in a special scenario, e.g., in a wireless environment [17]. We focus on the same issue in the field of medical data [18, 19].

Medical data usually includes the types of health reports, medical records, examination results, etc. These types of data not only have great research value but also have privacy preserving requirements. Meanwhile, the integrity of medical data requires assurance. We also need to guarantee that these data are not to be tampered with, destroyed, or deleted by anyone without grants. The accessibility of medical data must be controlled by the patients, but it cannot be modified by the patient. In addition to this, patient data should also be able to be circulated smoothly among medical institutions [20, 21].

7.1. Medical Data Privacy Preserving Based on Confusion and Anonymity

The anonymized data or differential privacy was introduced to protect the privacy of medical data [22]. Beaulieu-Jones et al. [23] used a deep neural network that introduced differential privacy to generate artificial data when sharing medical data, which solved the contradiction between data sharing and data privacy to a certain extent. However, in the context of multiple data types, the limitation of this method is that the data would be modified. Cai et al. [24] developed a differential-private framework to preserve the sensitive information for taxi companies when sharing taxi data. Sun et al. [25] proposed a method similar to the above, using differential privacy to process medical data, training a machine learning model, and publishing the training model instead of directly publishing private data.

7.2. Privacy Preserving of Medical Data Based on Encryption

Searchable encryption is usually leveraged in the edge environment for mobile computing [26], through being introduced into data privacy-preserving schemes. For example, Xu et al. [27] leverage a multikeyword searchable encryption technology in medical data sharing to protect data privacy. At the same time, they utilize searchable encryption to achieve the goal of sharing. In the scenario of a body equipment, the body sensor device is leveraged to encrypt the sensor data and upload to the cloud in the stage of collecting the data segment, which is similar to the scenario described by the paper [28]. Meanwhile, the searchable encryption technology is also used to share the data. Besides, this solution only requires little resource consumption, which is suitable for the mobile devices. However, the problem of sensor device credibility should also be considered.

7.3. Privacy Preserving Based on the Blockchain

Blockchain is an emerging technology to enhance the trust for the legacy systems, which has been applied in the cloud service level agreement [29], crowdsourcing [30], etc. It is also applied into aspects of data storage with privacy preserving, since the blockchain has storage characteristics of high reliability, high availability, low cost, and strong disaster tolerance. Do and Ng [31] combine searchable encryption and blockchain to ensure that data cannot be tampered with and deleted. It also adopts smart contracts to implement access control and searchable encryption, which constructs the distributed system and solves the trust problem among nodes. Similarly, the consortium blockchain combines searchable encryption [32] that uses agents to achieve searchable encryption. The consortium blockchain stores keywords and ciphertext. To a certain extent, this solution guarantees the privacy-preserving requirements of medical data, which can realize the goal that the distributed medical data cannot be tampered with. In the scheme proposed by Chen et al. [33], medical data is stored on a public cloud and the index of medical data is stored on the blockchain. Data users firstly need to be authorized when they want to obtain data, so that data owners can fully control their own data. However, it is a critical problem that the public cloud is not safe. Tian et al. [34] propose the scheme of SIFF based on the fabric blockchain; SIFF guarantees the granularity of data search. This scheme ensures the privacy, availability, and integrity of medical data. Zhang and Lin [35] propose a medical data sharing architecture combining a private blockchain, which stores full data, and a consortium blockchain, which stores keywords. At the same time, searchable encryption is leveraged to search for keywords. Encrypted medical data is obtained by grants of access control. Likewise, Azaria et al. [11] use blockchain to store medical data and access control is combined with smart contracts. Utilizing blockchain ensures that medical data cannot be tampered with and provides medical data interconnection, interoperability, and data sharing features. MedChain [36] is also a medical data sharing architecture that combines blockchain and P2P networks to tackle efficiency issues. There are two types of nodes in MedChain; one is a super node, which performs massive calculation and storage, such as large hospitals. The other is an edge node, such as clinics and community hospitals. Although this system is not yet perfect, the scheme of MedChain still improves efficiency, privacy, and security. In addition, MedBlock [37] is also a system that uses blockchain to manage medical data. It adopts a hybrid consensus mechanism, combined with Delegated Proof of Stake (DPoS) and Practical Byzantine Fault Tolerance (PBFT), which improves efficiency. It is worth mentioning that the symmetric encryption algorithm and access control are leveraged to improve privacy and security.

8. Conclusions

In our proposed work, we have presented a consortium blockchain-based medical data sharing system using K-anonymity, keyword searchable encryption, and ABAC to achieve data privacy-preserving and security among different medical institutions. Firstly, we leverage the K-anonymity technique to preprocess the medical data for blurring the identity information. Secondly, we present a scheme for medical data sharing using searchable encryption on keywords to ensure data security and privacy-preserving. The consortium blockchain is leveraged among different medical institutions to provide the trust layer and host the smart contract. Hence, the consortium blockchain stores the encrypted keywords which are linked to the medical data of medical institutions. The encrypted medical data can then be stored safely on remote clouds. Thirdly, we design and implement attribute-based access control with a smart contract. The blockchain-based ABAC model simplifies the configurations and secures the medical data. Furthermore, we conduct a security analysis of the proposed scheme and protocol and compare them with other related work. The security analysis demonstrates that our scheme can meet the security goals of our original design. Finally, we also implement the scheme on the Hyperledger Fabric platform. Through simulating different medical data accessing rates and different numbers of medical institutions, we evaluate the computational overhead of encryption operations, the performance of implemented chaincodes, and the scalability of our prototype. The experimental studies demonstrate that our scheme and system design are feasible and practical to encourage medical data sharing among medical institutions.

Data Availability

Data are available using the following: https://github.com/mythsand/privacy-preserving-medical-data.

Disclosure

The initial version of this paper [38] was presented at the conference of “WASA: International Conference on Wireless Algorithms, Systems, and Applications.”

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The work is supported by the National Key Research and Development Program of China under grant 2018YFB0204301, the National Natural Science Foundation (NSF) under grant 62072306, and Open Fund of Science and Technology on Parallel and Distributed Processing Laboratory under grant 6142110200407.