Open data and privacy. Should I bother?

Privacy is often mentioned as an obstacle when implementing an open data policy, but never really elaborated on. Should you really bother about privacy when opening up your data? My answer: yes you should.

Alan Westin laid the foundation of our modern conception of information privacy, which focuses on the individual’s right to control what is known about him. The modern European right to information privacy still leans on the notion of privacy as a right to control one’s personal information. Article 8 of the Charter of Fundamental Rights of the European Union gives everyone the right “to the protection of personal data concerning him or her”. This fundamental right to information privacy is further elaborated by the EU Data Protection Directive. The concept of ‘processing personal data’ is the touchstone of this directive. Personal data should be processed fairly and for legitimate and specified purposes.

EU data protection is all about the protection of ‘personal data’. Personal data is “information relating to an identified or identifiable natural person” and an identifiable person is “one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity” (Article 2 of the EU Data Protection Directive). Personal data can thus be both directly and indirectly identifying.

Train times, the location of public toilets and the number of car accidents could all be open data. Hopefully, no open data provider will offer names, addresses, social security numbers, or other data that directly or indirectly identifies natural persons as open data. At most, open data is anonymized or aggregated data that cannot be related to individuals. The Open Knowledge Foundation visualizes open data and “private data” as two non-overlapping sets. Unfortunately, in reality this distinction is not so easy to draw.

Even when data has been anonymized or aggregated, data analysis techniques now allow us to re-identify individuals in such data (see Paul Ohm for an overview). For instance, when Netflix released anonymized data for a contest to find the best method of improving its movie recommendations, Arvind Narayanan and Vitaly Shmatikov showed that this data could in fact be used to identify Netflix subscribers.

With regard to open data in particular, Andrew Simpson demonstrated that it is relatively easy to link statistical open data to individuals. In one case, the names and addresses of councillors, and the names, posts and salaries of senior public servants, were uncovered by combining data from the British open data portal with other publicly available data. Because data already in the public domain was not taken into account before the statistical open data was published, individuals could be identified.

Combining datasets is at the core of de-anonymizing and de-aggregating data. Data that is non-identifiable today may turn out to be indirectly identifiable tomorrow. The more computing power and publicly available data there is, the easier it becomes to identify individuals in data. And when data can be related to individuals, data protection law kicks in.
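
To make the linkage mechanism concrete, here is a minimal sketch in Python using pandas. The datasets, column names and values are all hypothetical, and this is not the method used in the Netflix or UK cases; it only illustrates how a release stripped of names can be joined to another public dataset on shared quasi-identifiers such as postcode and birth year.

```python
import pandas as pd

# Hypothetical "anonymized" release: direct identifiers removed,
# but quasi-identifiers (postcode, birth year) are left in.
released = pd.DataFrame({
    "postcode":   ["3511", "3511", "1012"],
    "birth_year": [1975, 1982, 1975],
    "salary":     [41000, 67000, 52000],
})

# Hypothetical public dataset that does contain names,
# e.g. an electoral roll or a register of office holders.
public = pd.DataFrame({
    "name":       ["A. Jansen", "B. de Vries", "C. Bakker"],
    "postcode":   ["3511", "3511", "1012"],
    "birth_year": [1975, 1982, 1975],
})

# Joining on the shared quasi-identifiers re-attaches names
# to the supposedly anonymous salary records.
reidentified = released.merge(public, on=["postcode", "birth_year"])
print(reidentified[["name", "salary"]])
```

The point is not the toy numbers but the mechanism: the released table never contained a name, yet identifiability comes entirely from the combination with data that is already public.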

What does this mean for open data providers? Open data providers should not just consider the identifiability of their open data in isolation. They should also take other publicly available data into account when selecting data that they want to offer as open data. That is a difficult task. Maybe open data is not such a great idea after all?
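
As a rough sketch of what such a check could look like, a provider might at least measure how many records share each combination of quasi-identifiers before publication, in the spirit of k-anonymity. The column names, threshold and helper function below are purely illustrative assumptions; deciding which columns count as quasi-identifiers, given what else is already public, remains the hard part.

```python
import pandas as pd

def smallest_group_size(df, quasi_identifiers):
    """Size of the smallest group of records sharing the same
    combination of quasi-identifier values (the 'k' in k-anonymity)."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical dataset a provider is about to publish.
dataset = pd.DataFrame({
    "postcode":   ["3511", "3511", "1012", "1012", "1012"],
    "birth_year": [1975, 1975, 1982, 1982, 1982],
    "accidents":  [1, 0, 2, 1, 3],
})

k = smallest_group_size(dataset, ["postcode", "birth_year"])
if k < 5:  # the threshold is a policy choice, not a legal rule
    print(f"Smallest group has only {k} record(s); re-identification risk is high.")
```

A check like this is at best a partial safeguard: it says nothing about auxiliary datasets the provider does not know about, which is precisely the problem described above.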

Check out Opendatarecht.nl, a Dutch weblog on open data.

5 comments

  1. Nice article, Stefan, some thoughts…

    Open data & “privacy as a right to control one’s personal information”: I believe such control should only happen for legislative purposes. In all other cases, people and businesses should be given a clear choice about whether they want to share the personal data needed for analyses. Big-data analytics organisations should advise users, present them with the possibilities and let them choose, rather than quietly pushing interests that are not in the user’s best interest.

    “Even when data has been anonymized or aggregated, data analysis techniques now allow us to re-identify individuals in such data.” I believe this is because even open data is still decentralized: there are only small players with limited security systems and funds. While open data calls for more openness, there has been little initiative (as far as I know) to open it up as one big transparent open data bank with combined access to all the data-security skills.

    “Should you really bother about privacy when opening up your data?” Your answer and mine: yes, we should. We should bother, and check that we are doing it right as soon as possible. Opening up data from and for the people gives major analytical opportunities to form sustainable solutions.

  2. Thank you for your thoughts and comments, Tjarco!

    “Opening up data from and for the people gives major analytical opportunities to form sustainable solutions.” I couldn’t agree more! Don’t get me wrong: I’m an open data enthusiast!

    I also agree that most open data initiatives are small and lack serious funding. Open data enthusiasts have their hands full with just getting the data out there. A deep analysis of privacy and security issues is not the first thing on the priority list. Although privacy in general is a hot topic, I have not come across much serious thinking about privacy and open data yet. I think it’s an awareness problem too.

    “In all other cases, people and businesses should be given a clear choice about whether they want to share the personal data needed for analyses.” I personally think privacy is a bit more than choice. Call me paternalistic, but I think privacy is something that needs protection. People are not always aware of the consequences of sharing data that relates to them. And, as I tried to argue in my blog, some data may seem harmless but can become very personal when datasets are combined.

  3. You paternalist! Haha.

    No, I’m aware of the need to protect privacy, but maybe you can explain: what are we afraid of, what are these consequences?

    If this is about the misuse of data, when something else is being done with the data than the user has approved of, then yes. We need to monitor the ‘fair use’ of open private data, good point!

    But if you state “privacy should be protected”, it could also imply that people or businesses should not be open about their data. This might create a fear of sharing data and thus limit the possibilities for analysis.

    It’s good to warn against possible misuse, but there is nothing wrong with sharing data if there is a very good initiative, maybe with an independent fair-data-use policy that ensures only what is promised is delivered.

  4. Thanks for your reply! Yes, I had misuse of data in mind. Monitoring is a great idea. Maybe people in data semantics can help us out.

    But I also had the protection of every individual’s private life in mind. Every individual has a right to control the information that relates to him. However, open data and new technology are blurring the lines between personal data and non-personal data. The determination of what’s personal information and what is not is therefore becoming more difficult. And thus, in my view, this right to control your personal information is under pressure.

    I don’t know (yet) if this means that open data providers should be stopped. I find it a very hard nut to crack. I think in essence the question is about who is responsible for what information use, now and in the future. If, hypothetically, ten years from now I can deduce your address, how much energy you use and all your travel patterns by analyzing and combining public data, do open data providers bear responsibility for that? I know this scenario may be a bit over the top, but it makes the question clearer. And again, I don’t know the answer to it.

  5. […] they read and how fast they read it. Although all companies say they aggregate and anonymize data, Stefan Kulk posted on his blog his concerns that there are ways to find information about individuals. Ed Felten voiced similar […]