Source: National Natural Science Foundation of China. Original title: "Blockchain and Data Governance", selected from the special topic "Blockchain Technology and Application" in "China Science Foundation", Volume 34, Issue 1, 2020.
Authors: Meng Xiaofeng*, Liu Lixin, School of Information, Renmin University of China
A "barrier lake" of big data has now formed, and the problem of data governance is pressing. Traditional governance concepts come from the government, enterprise, and IT fields, and data governance shares their general features while also having its own particular ones. This paper proposes that the fundamental guarantee of data governance is to increase the transparency of the process by which the value of big data is realized. Because blockchain is decentralized, open, transparent, and tamper-resistant, it can meet this need for transparency, overcome current problems of data governance, and provide new solutions for it. At the same time, implementing data governance on a blockchain faces many challenges.
Keywords: data governance; blockchain; privacy protection; traceability and accountability; credible decision-making
In the era of big data, data is generated continuously and aggregated by multiple data collectors. Data has become key to competition among enterprises and an important factor in national competitiveness, so data governance has become a key area and an important instrument of corporate and national governance [1,2]. However, large-scale data collection also brings serious privacy leaks, data misuse, and untrustworthy data-driven decisions, which pose new challenges to traditional data governance. The "Facebook-Cambridge Analytica" incident [3] is a typical case of all three. Furthermore, the large-scale autonomous aggregation of data leads to data monopolies, so that data is distributed and enjoyed unfairly [4]. A "barrier lake" of big data has formed. Solving these problems effectively, so that data is used correctly and in a standardized way, is key to sustaining the value of big data and is the problem that data governance most urgently needs to solve.
The main cause of these problems is the opacity of the process by which the value of big data is realized. Because collection and sharing are opaque, privacy leakage and data abuse are difficult to trace and to hold anyone accountable for, and data monopolies form quietly with no basis for evaluating or resolving them. Because storage, processing, sharing, and circulation are opaque, problems such as data tampering are hard to detect, which degrades the quality of decision data and ultimately leads to unreliable data-driven decisions. It follows that the fundamental guarantee of data governance today is to increase the transparency of this process. Transparently recording data flows during collection, sharing, and circulation protects privacy through traceability and accountability [5] and provides a basis for addressing data monopolies; making decision data auditable during storage, processing, sharing, and circulation promotes credible data-driven decisions. Data governance can be implemented in many ways. In addition to laws, regulations, and policy standards, technical methods are needed to support it. Blockchain originated in digital currency and is open, transparent, decentralized, and tamper-resistant. The ongoing development of this technology brings new opportunities for solving the problems that data governance currently faces [6-10].
This paper proposes that the fundamental guarantee of data governance is to increase transparency in the realization of big data value. It summarizes the development history of data governance and the key elements of achieving data governance technically, analyzes and summarizes the state of research on blockchain-based data governance, and finally sets out the challenges that data governance faces.
1 Introduction to Data Governance
The term "governance" derives from a Latin word meaning "to steer" and was originally used for "government governance", whose goal is to coordinate the interests of the government and other social actors. It was later recognized and valued by enterprises, and "corporate governance" emerged, with the goal of coordinating the interests of stakeholders within the enterprise. As IT resources and data resources grew richer, "IT governance" and "data governance" appeared in turn [1, 2]. Later, because big data applications involve data circulation, multi-source data fusion, and multiple parties, "data governance" was extended further and "big data governance" emerged. "Big data governance" focuses on the parties that take part in the big data life cycle: data producers, data collectors, data users, data processors, and data regulators. Its goal is to bring the value of data into play while taking the rights, responsibilities, and interests of all parties into account, that is, to realize the value of big data while avoiding risk.
Since "big data governance" is an extension of "data governance", in order to avoid confusion, the following content of this article uses the concept of "data governance" to discuss data governance in the era of big data. The development process of data governance and involved participants are shown in Figure 1.
The application characteristics of big data and the goals of data governance determine the key content of current data governance. At present, the key content and challenges of data governance focus on the following three aspects:
(1) Improve the quality of decision-making data. Realizing the value of big data requires integrating multi-source data. However, big data comes from a wide range of sources and involves multiple parties across its life cycle. Whether the data was genuinely generated, whether it has been tampered with, and whether the standards and types of multi-source data are consistent all affect the decisions that data users make. Therefore, data governance needs to support the traceability of big data throughout its life cycle.
(2) Evaluate and supervise the use of personal privacy data. Because data circulates in big data applications, data producers lack the right to know about and control how their data is acquired and shared. As data producers, users do not know what data is collected, by whom, where it flows, or how it is used. At the same time, data collection and aggregation lead to data monopolies. Data monopolies may hinder market competition, impair consumer welfare, hinder technological innovation in an industry, and bring more serious risks of personal privacy leakage, yet data regulators are unable to evaluate and supervise data applications. In addition, the multi-source data fusion typical of big data applications may lead to even more serious privacy leakage. Therefore, data governance needs to evaluate and supervise the use of personal privacy data.
(3) Promote data sharing. Data sharing can promote the realization of the value of big data and ease data monopolies, but it must also solve privacy protection issues. On the one hand, when data flows between sharing parties, the personal privacy of data producers needs to be protected effectively. On the other hand, because of legal and practical constraints, statistical analysis and machine learning often have to be carried out over the distributed data sets of multiple data holders without transmitting the original data directly. Since the parties do not fully trust one another, data users should be able to verify the sharing process. Therefore, data governance needs to promote data sharing while balancing the interests of data producers and data users.
Data governance needs to be implemented through a combination of laws, regulations, policy standards, and technical methods. On the one hand, international organizations and national departments have issued corresponding laws, regulations, and policy standards. For example, the international Data Governance Institute summarizes the elements of data governance from three aspects: organization, rules, and processes [11]; and a national standard on applying ISO/IEC 38500 to data governance provides principles, definitions, and models to help the participants in data governance assess, guide, and supervise their use of data [12]. On the other hand, data governance urgently needs safe and reliable technical methods to support data privacy protection, better decision-making data quality, data sharing, and the evaluation and supervision of data application compliance in big data applications.
Figure 1 Data governance development process and involved participants
2 Data governance based on blockchain
Blockchain is essentially a decentralized distributed database. It has a natural advantage in increasing the transparency of the big data value realization process, and it makes it feasible to solve the key problems of current data governance.
2.1 Data storage and processing that supports auditing
Data-driven decision-making permeates all aspects of production and daily life. Because multiple stakeholders are involved, data storage, processing, sharing, and circulation suffer from problems such as data tampering, data falsification, and differences in the data types and standard rules used by different sources. These problems affect the quality of decision data, so data users need to audit it. As a decentralized distributed database, blockchain can store and process data in a way that supports auditing. In addition, a blockchain-based decentralized distributed database system can be built among different stakeholders. Data is broadcast quickly across the whole network to all stakeholders, which also helps ensure the authenticity and timeliness of data sharing and circulation.
Every node in the blockchain network stores the data. Once data is stored in the blockchain, it is very hard to tamper with or lose. Even under communication failures and deliberate attacks, the correctness of the stored data can still be guaranteed, and data users can audit it. Storing data in the blockchain also makes the processing steps and processing results auditable. A traditional database management system stores and maintains only the current state of the data; information such as the processing history is kept in the database log for failure recovery, and queries over the historical state of the data are not supported. A blockchain, as a decentralized distributed database, supports querying the historical state of data to confirm that the current state is correct. Blockchain-based data storage and processing is therefore significant in areas with high data integrity requirements such as insurance [13], medicine [14-17], and supply chains [18-21]. As a result, data users can audit decision-making data and base their analysis and decisions on trusted data [22-25].
To address the inconsistency of data types and standard rules across sources, unified data types and standard rules can be defined with blockchain and smart contracts. The smart contract is stored and synchronized on every node of the blockchain, and the blockchain automatically performs verification according to the contract code. Because the execution of smart contracts is open and transparent, both the execution process and its results are auditable. This can improve the efficiency of multi-source data sharing, and there is no single point of failure.
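The difference between keeping only the current state and keeping an auditable history can be illustrated with a small sketch. The code below is not taken from any of the cited systems; it is a minimal single-node hash chain, and the names `AuditableLedger`, `history`, and `audit` as well as the sample key `claim-17` are assumptions made for illustration. A real blockchain would add replication and consensus across nodes.

```python
import hashlib
import json


def block_hash(index, prev_hash, record):
    """Hash a block's contents together with the hash of the previous block."""
    payload = json.dumps(
        {"index": index, "prev": prev_hash, "record": record}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AuditableLedger:
    """Hypothetical append-only store: every update becomes a new block."""

    GENESIS = "0" * 64

    def __init__(self):
        self.blocks = []

    def append(self, key, value):
        prev_hash = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        index = len(self.blocks)
        record = {"key": key, "value": value}
        self.blocks.append({
            "index": index,
            "prev": prev_hash,
            "record": record,
            "hash": block_hash(index, prev_hash, record),
        })

    def history(self, key):
        """Return every value the key has ever had, oldest first."""
        return [b["record"]["value"] for b in self.blocks
                if b["record"]["key"] == key]

    def audit(self):
        """Recompute every hash; any edit to an earlier block breaks the chain."""
        prev_hash = self.GENESIS
        for b in self.blocks:
            expected = block_hash(b["index"], b["prev"], b["record"])
            if b["prev"] != prev_hash or b["hash"] != expected:
                return False
            prev_hash = b["hash"]
        return True


ledger = AuditableLedger()
ledger.append("claim-17", "filed")
ledger.append("claim-17", "approved")
```

Here `ledger.history("claim-17")` returns both states of the record rather than only the latest one, and `ledger.audit()` fails as soon as any stored block is modified after the fact.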
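As a rough illustration of such contract-enforced rules, the sketch below shows the kind of deterministic check that every node could run before accepting a record. It is written in Python rather than a smart contract language, and the schema fields (`patient_id`, `systolic_mmhg`, `measured_at`) and the function name `validate_record` are hypothetical.

```python
# Hypothetical unified schema agreed on by all data sources.
SCHEMA = {
    "patient_id": str,
    "systolic_mmhg": int,
    "measured_at": str,
}


def validate_record(record):
    """Return a list of rule violations; an empty list means the record conforms.

    The check is deterministic, so every node that runs it on the same
    record reaches the same verdict, which is what makes the result auditable.
    """
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    for field in record:
        if field not in SCHEMA:
            errors.append(f"unexpected field: {field}")
    return errors


good = {"patient_id": "p1", "systolic_mmhg": 120, "measured_at": "2020-01-01"}
bad = {"patient_id": "p1", "systolic_mmhg": "120"}
```

A source that submits `bad` is rejected with an explicit list of violations, so the reason for the rejection can itself be recorded and audited.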
2.2 Data acquisition and sharing supporting traceability and accountability
In the traditional data acquisition and sharing process, data collectors draw up data usage agreements and inform users about data collection, sharing, and usage accordingly. As data producers, users' right to know about and control their data still rests only on legal constraints and third-party credit endorsement. Because the process of data acquisition and sharing is not visible from outside, it cannot be verified whether the agreement is actually honored. A 2014 Pew Research Center report on the state of privacy in the United States stated that 91% of respondents believed they had lost control over how data collectors collect and use their personal data, and that 61% of respondents were frustrated that they did not understand how data collectors use personal data [26]; the 2016 "Report on the Protection of Rights and Interests of Chinese Netizens" shows that 84% of netizens had personally felt the adverse effects of personal privacy leakage [27]. The opacity of data acquisition and sharing makes privacy leakage more serious. Although traditional privacy protection technologies such as encryption and differential privacy protect data privacy to some extent, they are currently not enough to deal with the risk of privacy leakage caused by large-scale data collection. The decentralization and immutability of blockchain can be used to record the acquisition and sharing of data and thereby achieve traceability. Combined with policy compliance, violation detection, and privacy audit, this means that when privacy protection technology fails, privacy can still be protected through traceability and accountability, and technical support is available for evaluating and supervising data use and for addressing data monopolies.
At present, blockchain has been studied as a way to increase the transparency of data acquisition and sharing in mobile applications [28], medicine [29, 30], and the Internet of Things [31-33]. A blockchain-based framework for data acquisition and sharing can be divided into four layers: the data acquisition layer, the storage layer, the blockchain layer, and the sharing layer. At the data acquisition layer, data producers have the right to know the content, form, and purpose of data collection. At the storage layer, data is stored using traditional database management systems, cloud storage, or distributed storage systems, and is encrypted to protect its security and privacy. At the blockchain layer, the blockchain performs decentralized access control, so that every data access is recorded on the blockchain as a transaction. At the sharing layer, data is shared and the sharing relationships are protected. Through these four layers, the blockchain increases the transparency of data acquisition, sharing, and circulation.
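The interplay of the storage layer and the blockchain layer can be sketched as follows. This is a simplified illustration, not the design of any cited system: the class `AccessLedger`, the helpers `store` and `fetch`, and the party names are assumptions. The storage layer is reduced to a dictionary of opaque bytes that are assumed to be already encrypted, and the blockchain layer to an append-only list of transactions.

```python
import hashlib


class AccessLedger:
    """Stand-in for the blockchain layer: permissions plus an append-only log."""

    def __init__(self):
        self.transactions = []    # (kind, actor, requester, data_id)
        self.permissions = set()  # (requester, data_id) pairs currently allowed

    def grant(self, owner, requester, data_id):
        self.permissions.add((requester, data_id))
        self.transactions.append(("grant", owner, requester, data_id))

    def revoke(self, owner, requester, data_id):
        self.permissions.discard((requester, data_id))
        self.transactions.append(("revoke", owner, requester, data_id))

    def request(self, requester, data_id):
        """Check permission and record the attempt, whether allowed or not."""
        allowed = (requester, data_id) in self.permissions
        kind = "access" if allowed else "denied"
        self.transactions.append((kind, requester, requester, data_id))
        return allowed

    def trace(self, data_id):
        """List every recorded event for one data item, in order."""
        return [(kind, requester)
                for kind, _, requester, d in self.transactions if d == data_id]


def store(off_chain, ciphertext):
    """Storage layer: keep the (already encrypted) bytes off-chain, keyed by hash."""
    data_id = hashlib.sha256(ciphertext).hexdigest()
    off_chain[data_id] = ciphertext
    return data_id


def fetch(ledger, off_chain, requester, data_id):
    """Sharing layer: release data only if the ledger allows the request."""
    return off_chain[data_id] if ledger.request(requester, data_id) else None


off_chain = {}
ledger = AccessLedger()
data_id = store(off_chain, b"<ciphertext of a user's location history>")
ledger.grant("alice", "app-1", data_id)
first = fetch(ledger, off_chain, "app-1", data_id)
ledger.revoke("alice", "app-1", data_id)
second = fetch(ledger, off_chain, "app-1", data_id)
```

Because denied requests are logged as well as successful ones, `ledger.trace(data_id)` gives the data producer and a regulator the same view of who asked for the data and when access was withdrawn.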
2.3 Statistical data analysis and machine learning supporting distributed verification
In application areas such as medical research, public safety, and business cooperation, statistical analysis [34-36] and machine learning tasks [37-41] need to be performed on large-scale distributed data sets. Because of constraints such as laws and regulations, this analysis and learning must be carried out without revealing private data. For statistical analysis of distributed data sets, existing solutions are based on technologies such as secure multi-party computation, secret sharing, local differential privacy, and homomorphic encryption. However, secure multi-party computation does not scale to a large number of data providers; secret sharing makes data providers lose control of their data; local differential privacy has to trade data utility against privacy loss; and homomorphic encryption lets data providers keep control of their data without privacy loss, but only on the premise that data providers supply genuine data and that the computing nodes compute honestly. For distributed machine learning, there is no complete trust between data providers and data demanders: a data provider may supply unreliable data or parameters to disrupt the final result, or may withdraw early for economic or other reasons. Therefore, data users need to be able to verify statistical analysis over distributed data sets and distributed machine learning, and reasonable economic incentives are needed to make them run smoothly.
Blockchain-based verifiable statistical analysis of distributed data sets typically involves data providers, multiple computing nodes, multiple verification nodes, and data queriers. The data providers supply encrypted data, the computing nodes compute on the ciphertexts, and the verification nodes form the blockchain and verify the results produced by the computing nodes. In addition, statistical analysis of distributed data sets must address security and privacy concerns such as data confidentiality, unlinkability between data providers and their data, confidentiality of query results, and robustness of the computed results. For this reason, shuffling and homomorphic encryption are usually used for protection.
In blockchain-based verifiable and fair distributed machine learning, data providers upload their local machine learning parameters to the blockchain, the blockchain performs cross-validation, and every step of the distributed learning process is recorded on the blockchain. This can also be combined with zero-knowledge proofs and cryptographic commitments to impose economic penalties on malicious parties and to promote fairness through economic incentives. In addition, distributed machine learning must protect the local parameters of data providers, because local parameters may themselves leak data or the machine learning model. To this end, technologies such as differential privacy, secret sharing, and homomorphic encryption are usually used.
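To make the roles concrete, the sketch below uses a toy version of the Paillier cryptosystem, an additively homomorphic scheme, to compute a sum over encrypted inputs. It is only an illustration under strong assumptions: the primes are far too small to be secure, the function names are hypothetical, and the verification step merely recomputes the aggregate from the published ciphertexts, whereas the cited systems rely on stronger proofs. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets

# Toy parameters: real deployments use primes of 1024 bits or more.
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1)
MU = pow(LAM, -1, N)                                # inverse of LAM modulo N


def encrypt(m):
    """Data provider: encrypt an integer 0 <= m < N under the public key N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return ((1 + m * N) * pow(r, N, N2)) % N2


def decrypt(c):
    """Data querier: decrypt with the private values LAM and MU."""
    return ((pow(c, LAM, N2) - 1) // N * MU) % N


def add_ciphertexts(ciphertexts):
    """Computing node: multiplying ciphertexts adds the hidden plaintexts."""
    total = 1
    for c in ciphertexts:
        total = (total * c) % N2
    return total


def verify_aggregate(ciphertexts, claimed):
    """Verification node: recompute the aggregate from the published inputs."""
    return add_ciphertexts(ciphertexts) == claimed


values = [12, 30, 7]                              # private inputs of three providers
ciphertexts = [encrypt(v) for v in values]
secrets.SystemRandom().shuffle(ciphertexts)       # hide which provider sent which
aggregate = add_ciphertexts(ciphertexts)
```

The computing node never sees the individual values, the shuffle breaks the link between a provider and its ciphertext, and the result only stays correct as long as the true sum is smaller than `N`.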
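The role of cryptographic commitments can be sketched with a simple hash-based commit-and-reveal round. This is a simplification chosen for illustration, not the protocol of any cited work: the function names, the provider names, and the averaging rule are assumptions, and a hash commitment alone provides no zero-knowledge property. Each provider first publishes a commitment to its local parameters, and later reveals the parameters and the nonce; anyone can then check that the revealed values match what was committed.

```python
import hashlib
import json


def commit(params, nonce):
    """Bind a provider to its parameters without revealing them yet."""
    data = nonce + json.dumps(params).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def verify(commitment, params, nonce):
    """Check that revealed parameters match the earlier on-chain commitment."""
    return commit(params, nonce) == commitment


def aggregate(commitments, reveals):
    """Average the parameters of honest providers and list the others.

    Providers whose reveal does not match their commitment are returned
    separately so that an incentive mechanism could penalize them.
    """
    accepted, penalized = [], []
    for provider, (params, nonce) in reveals.items():
        if verify(commitments.get(provider, ""), params, nonce):
            accepted.append(params)
        else:
            penalized.append(provider)
    if not accepted:
        return None, penalized
    count = len(accepted)
    average = [sum(column) / count for column in zip(*accepted)]
    return average, penalized


# Fixed nonces keep the example reproducible; real nonces must be random.
nonce_a, nonce_b, nonce_c = b"a" * 16, b"b" * 16, b"c" * 16
commitments = {
    "hospital-a": commit([1.0, 2.0], nonce_a),
    "hospital-b": commit([3.0, 4.0], nonce_b),
    "hospital-c": commit([5.0, 6.0], nonce_c),
}
reveals = {
    "hospital-a": ([1.0, 2.0], nonce_a),
    "hospital-b": ([3.0, 4.0], nonce_b),
    "hospital-c": ([9.0, 9.0], nonce_c),  # differs from what was committed
}
average, penalized = aggregate(commitments, reveals)
```

In this run the third provider reveals parameters that differ from its commitment, so its update is excluded from the average and its name appears in `penalized`.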
3 Challenges and problems
Blockchain provides new ideas for data governance, but implementing data governance in practice faces many challenges and places higher demands on blockchain technology itself. In addition, blockchain-based data governance will lead to major changes in the management and control mechanisms and business processes of governments and enterprises, which brings new challenges to government and enterprise management. At present, the challenges and problems in implementing data governance fall mainly into the following three areas:
(1) Challenges in realizing data governance. On the one hand, although recording data sharing and circulation information on the blockchain enables traceability and accountability, achieving this across platforms and across domains is a challenging problem when data is collected, shared, and circulated at large scale. At the same time, traceability and accountability may themselves leak private information, so privacy protection during this process is also crucial. On the other hand, although storing data in the blockchain can prevent tampering to a certain extent and ensures that the data can be traced, ensuring the authenticity and reliability of data before it is stored in the blockchain remains a challenge.
(2) New challenges for blockchain technology itself. Many problems remain to be solved in the storage requirements, privacy and security, scalability, and interoperability of blockchains. Existing mainstream blockchains such as Bitcoin, Ethereum, and Hyperledger cannot meet the needs of data governance. Therefore, the design of a lightweight, highly scalable, and highly interoperable blockchain suited to data governance should be considered. At the same time, as more types of blockchain systems emerge, evaluation standards and specifications for blockchain systems have also become an urgent need.
(3) Challenges to government management and enterprise management. The decentralized nature of blockchain breaks with traditional centralized management and challenges the management authority of governments and enterprises. Decentralization also spreads responsibility for data security and confidentiality across multiple parties, which brings new challenges to areas such as enterprise data management. In addition, implementing blockchain-based data governance and the corresponding regulatory measures takes time, and the rapid development of blockchain technology will place new demands on traditional regulatory systems, laws, regulations, and policies.
Data governance has become a key area and an important factor in national governance and corporate governance. As data in various fields is increasingly opened and shared, data governance faces higher requirements for data sharing, data supervision, and privacy protection. Combining data governance with blockchain can help meet these requirements, improve the efficiency and transparency of data governance, and help build a new era of data and information. At the same time, it brings many new challenges. Opening a new chapter in data governance will require the joint efforts of multiple disciplines, fields, and departments.
[1] Wu Xindong, Dong Bingbing, Du Xinzheng, et al. Data Governance Technology. Journal of Software, 2019, 30 (9): 2830-2856.
[2] An Xiaomi, Guo Mingjun, Wei Wei, et al. Big Data Governance System: Core Concepts, Proposals and Analysis of Their Implementation Paths. Information and Documentation Work, 2018, (1): 5-11.
[3] Jennifer Zhu Scott. Facebook and Cambridge Analytica: what you need to know as fallout widens. https://www.nytimes.com/2018/03/19/technology/facebook-cambridge-analytica-explained.html. [2018-03-19] / [2020-01-01].
[4] Meng Xiaofeng, Zhu Minjie. Research on Data Monopoly and Its Governance Model. Information Security Research, 2019, 1 (9): 789-797.
[5] Meng Xiaofeng, Zhang Xiaojian. Big Data Privacy Management. Computer Research and Development, 2015, 52 (2): 265-281.
[6] Zhu Liehuang, Gao Feng, Shen Meng, et al. A review of blockchain privacy protection research. Computer Research and Development, 2017, 54 (10): 2170-2185.
[7] Yuan Yong, Ni Xiaochun, Zeng Shuai, et al. Development status and prospect of blockchain consensus algorithm. Journal of Automation, 2018, 44 (11): 93-104.
[8] Shao Qifeng, Jin Cheqing, Zhang Zhao, et al. Blockchain technology: architecture and progress. Chinese Journal of Computers, 2018, 41 (5): 3-22.
[9] Han Xuan, Yuan Yong, Wang Feiyue. Blockchain Security Issues: Research Status and Prospects. Journal of Automation, 2019, 45 (1): 208-227.
[10] Li Fang, Li Zhuoran, Zhao He. Research on the progress of blockchain cross-chain technology. Journal of Software, 2019, (6): 1649-1660.
[11] The Data Governance Institute. Data governance institute framework. http://www.datagovernance.com/wp-ontent/uploads/2014/11/dgi_framework.pdf. [2014-11-15] / [2020-02-13].
[12] National Standardization Management Committee. "Information Technology-IT Governance-Data Governance-Part 1: Application of ISO/IEC 38500 in Data Governance." http://www.sac.gov.cn/sgybzeb/gzdt_2132/201705/t20170515_238441.htm. [2017-05-15] / [2020-02-13].
[13] Vo H. Blockchain-based data management and analytics for micro-insurance applications // Proc of the ACM Int Conf on Information and Knowledge Management. New York: ACM, 2017: 2539-2542.
[14] Vo H. Research directions in blockchain data management and analytics // Proc of Int Conf on Extending Database Technology. Bordeaux: Springer LNCS, 2018: 445-448.
[15] Vo H. Blockchain-powered big data analytics platform // Proc of the Int Conf on Big Data Analytics. Berlin: Springer, 2018: 15-32.
[16] Shae Z, Tsai J P. On the design of a blockchain platform for clinical trial and precision medicine // Proc of the Int Conf on Distributed Computing Systems. Washington: IEEE, 2017: 1972-1980.
[17] Tsai J. Transform blockchain into distributed parallel computing architecture for precision medicine // Proc of the Int Conf on Distributed Computing Systems. Washington: IEEE, 2018: 1290-1299.
[18] Xu XW, Lu QH, Liu Y. Designing blockchain-based applications a case study for imported product traceability. Future Generation Computer Systems, 2019, 92: 399-406.
[19] Swan M. Blockchain: Blueprint for a new economy. O'Reilly Media Inc, 2015: 1-18.
[20] Vasco L, Luís A. An overview of blockchain integration with robotics and artificial intelligence [EB/OL]. arXiv preprint, arXiv:1810.00329, 2018 [2018-09-30]. https://arxiv.org/abs/1810.00329.
[21] Salah K, Rehman MHU, Nizamuddin N, et al. Blockchain for AI: review and open research challenges. IEEE Access, 2019, 7: 10127-10149.
[22] Li Y, Zheng K, Yan Y. EtherQL: A query layer for blockchain system // Proc of the Int Conf on Database Systems for Advanced Applications. Berlin: Springer, 2017: 556-567.
[23] Xu C, Zhang C, Xu J. vChain: Enabling verifiable boolean range queries over blockchain databases [EB/OL]. arXiv preprint, arXiv:1812.02386, 2018 [2018-12-06]. https://arxiv.org/abs/1812.02386.
[24] Zhang C, Xu C, Xu J, et al. GEM^2-Tree: A gas-efficient structure for authenticated range queries in blockchain // Proc of the 35th Int Conf on Data Engineering. Washington: IEEE, 2019: 842-853.
[25] Ruan P, Chen G, Dinh TTA. Fine-grained, secure and efficient data provenance on blockchain systems // Proc of the Very Large Data Bases. California: ACM, 2019: 975-988.
[26] Pew Research Center. Public perceptions of privacy and security in the post-Snowden era. https://www.pewinternet.org/2014/11/12/public-privacy-perceptions/. [2019-01-30] / [2020-01-01].
[27] China Internet Association. "Report on the Investigation of the Protection of the Rights and Interests of Chinese Internet Users 2016". http://www.isc.org.cn/zxzx/xhdt/listinfo-33759.html. [2016-06-26] / [2020-01-01].
[28] Zyskind G, Nathan O. Decentralizing privacy: using blockchain to protect personal data // Proc of IEEE Security and Privacy Workshops. Washington: IEEE, 2015: 180-184.
[29] Azaria A, Ekblaw A, Vieira T. MedRec: using blockchain for medical data access and permission management // Proc of the Int Conf on Open & Big Data. Washington: IEEE, 2016: 25-30.
[30] Dubovitskaya A, Xu Z, Ryu S. Secure and trustable electronic medical records sharing using blockchain. American Medical Informatics Association, 2017: 650-659.
[31] Ouaddah A, Abou Elkalam A, Ait Ouahman A. FairAccess: a new blockchain-based access control framework for the Internet of Things. Security and Communication Networks, 2016, 9 (18): 5943-5964.
[32] Hossein S, Lukas B. Droplet: Decentralized authorization for IoT Data Streams [EB/OL]. arXiv preprint, arXiv:1806.02057, 2018 [2018-11-14]. https://arxiv.org/abs/1806.02057.
[33] Li R, Song T, Mei B. Blockchain for large-scale internet of things data storage and protection. IEEE Transactions on Services Computing, 2018: 1-8.
[34] Henry C, Dan B. Prio: Private, robust, and scalable computation of aggregate statistics // Proc of the 14th USENIX Symposium on Networked Systems Design and Implementation. Berkeley CA: USENIX, 2017: 259-282.
[35] Froelicher D, Egger P. UnLynx: a decentralized system for privacy-conscious data sharing // Proc on Privacy Enhancing Technologies. NJ: IEEE, 2017: 232-250.
[36] Froelicher D, Juan R. Drynx: Decentralized, secure, verifiable system for statistical queries and machine learning on distributed datasets [EB/OL]. arXiv preprint, arXiv:1902.03785, 2019 [2019-02-11]. https://arxiv.org/abs/1902.03785.
[37] Nelson Kibichi Bore, Ravi Kiran Raman. Promoting distributed trust in machine learning and computational simulation via a blockchain network. http://arxiv.org/abs/1810.11126.
[38] Ravi K, Roman V, Michael H. Trusted multi-party computation and verifiable simulations: a scalable blockchain approach [EB/OL]. arXiv preprint, arXiv:1809.08438, 2018 [2018-09-22]. https://arxiv.org/abs/1809.08438.
[39] Tsung T, Lucila O. ModelChain: decentralized privacy-preserving healthcare predictive modeling framework on private blockchain networks [EB/OL]. arXiv preprint, arXiv:1802.01746, 2018 [2018-02-06]. https://arxiv.org/abs/1802.01746.
[40] Weng J, Zhang J. Deepchain: auditable and privacy-preserving deep learning with blockchain-based incentive. Cryptology ePrint Archive, Report 2018/679.
[41] Kuo Tsung-Ting, Gabriel Rodney A, et al. Fair compute loads enabled by blockchain: sharing models by alternating client and server roles. Journal of the American Medical Informatics Association, 2019, 26 (5): 392-403.