The Secret Life of Data

Some people claim that re-identification attacks don’t matter, the reasoning being: “I’m not important enough for anyone to want to invest time in learning private facts about me.” At first sight that seems like a reasonable argument, at least in the context of the re-identification algorithms I have worked on, which require considerable human and machine effort to implement.

The argument is nonetheless fallacious, because re-identification typically doesn’t happen at the level of the individual. Rather, the investment of effort yields results over the entire database of millions of people (hence the emphasis on “large-scale” or “en masse”). The harm from re-identification, on the other hand, is felt by individuals. This asymmetry exists because the party interested in re-identifying you and the party carrying out the re-identification are not the same.

In today’s world, the entities most interested in acquiring and de-anonymizing large databases might be data aggregation companies like ChoicePoint that sell intelligence on individuals, whereas the party interested in using the re-identified information about you would be their clients/customers: law enforcement, an employer, an insurance company, or even a former friend out to slander you.

Data passes through multiple companies or entities before reaching its destination, making it hard to prove or even detect that it originated from a de-anonymized database. There are lots of companies known to sell “anonymized” customer data: for example Practice Fusion “subsidizes its free EMRs by selling de-identified data to insurance groups, clinical researchers and pharmaceutical companies.” On the other hand, companies carrying out data aggregation/de-anonymization are a lot more secretive about it.

Another piece of the puzzle is what happens when a company goes bankrupt. deCODE Genetics recently did, which is particularly interesting because they are sitting on a ton of genetic data. There are privacy assurances in place in their original Terms of Service with their customers, but will they bind the new owner of the assets? These are legal gray areas, and are frequently exploited by companies looking to acquire data.

At the recent FTC privacy roundtable, Scott Taylor of Hewlett Packard said his company regularly had the problem of not being able to determine where data is being shared downstream after the first point of contact. I’m sure the same is true of other companies as well. (How then could we possibly expect third-party oversight of this process?)  Since data fuels the modern Web economy, I suspect that the process of moving data around will continue to become more common as well as more complex, with more steps in the chain. We could use a good name for it — “data laundering,” perhaps?

February 6, 2010 at 8:48 pm

In which I come out: Notes from the FTC Privacy Roundtable

I was on a panel at the second FTC privacy roundtable in Berkeley on Thursday. Meeting a new community of people is always a fascinating experience. As a computer scientist, I’m used to showing up to conferences in jeans and a T-shirt; instead I found myself dressing formally and saying things like “oh, not at all, the honor is all mine!”

This post will also be the start of a new direction for this blog. So far, I’ve mostly confined myself to “doing the math” and limiting myself to factual exposition. That’s going to change, for two reasons:

  • The central theme of this blog and of my Ph.D. dissertation — the failure of data anonymization — now seems to be widely accepted in policy circles. This is due in large part to Paul Ohm’s excellent paper, which is a must-read for anyone interested in this topic. I no longer have to worry about the acceptance of the technical idea being “tainted” by my opinions.
  • I’ve been learning about the various facets of privacy — legal, economic, etc. — for long enough to feel confident in my views. I have something to contribute to the larger discussion of where technological society is heading with respect to privacy.

Underrepresentation of scientists

Living up to the stereotype

As it turned out, I was the only academic computer scientist among the 35 panelists. I found this very surprising. The underrepresentation is not because computer scientists have nothing to contribute — after all, there were other CS Ph.Ds from industry groups like Mozilla. Rather, I believe it is a consequence of the general attitude of academic scientists towards policy issues. Most researchers consider it not worth their time, and a few actively disdain it.

The problem is even deeper: academics have the same disdainful attitude towards the popular exposition of science. The underlying reason is that the goal in academia is to impress one’s peers; making the world better is merely a side-effect, albeit a common one. The incentive structure in academia needs to change. I will pick up this topic in future posts.

The FTC has an admirable approach to regulation

As I found out in the course of the day’s panels, the FTC is not about prescribing or mandating what to do. Pushing a specific privacy-enhancing technology isn’t the kind of thing they are interested in doing at all. Rather, they see their role as getting the market to function better and the industry to self-regulate. The need to avoid harming innovation was repeatedly emphasized, and there was a lot of talk about not throwing the baby out with the bathwater.

The following were the most-discussed potential (non-baby-hurting) initiatives:

  • Market transparency. Markets can only work well when there is full information, and when it comes to privacy the market has failed horribly. Users have no idea what happens to their data once it’s collected, and no one reads privacy policies. Regulation that promotes transparency can help the market fix itself.
  • Consumer education. This is a counterpart to the previous point. Education about privacy dangers as well as privacy technologies can help.
  • Enforcement. A few bad apples have been responsible for the most egregious privacy SNAFUs. The larger players are by and large self-regulating. The FTC needs to work with law enforcement to punish the offenders.
  • Carrots and sticks. Even the specter of regulation, corporate representatives said, is enough to get the industry to self-regulate. Many would disagree, but I think a carrots-and-sticks approach can be made to work.
  • Incentivizing adoption of PETs (privacy enhancing technologies) in general. The question of how the FTC can spur the adoption of PETs was brought up on almost every panel, but I don’t think there were any halfway convincing answers. Someone mentioned that the government in general could go into the market for PETs, which seems reasonable.

As a libertarian, I think the overall non-interventionist approach here is exactly right. I’m told that the FTC is rather unusual among US regulatory agencies in this regard (which makes sense, considering that the FCC, for example, spends its time protecting children from breasts when it is not making up lists of words.)

Facebook’s two faces

Facebook public policy director Tim Sparapani, who was previously with the ACLU, made a variety of comments on the second panel that were bizarre, to put it mildly. Take a look (my comments are in sub-bullets):

  • “We absolutely compete on privacy.”
    • That’s a weird definition of “compete.” Facebook has a history of rolling out privacy-infringing updates, such as Beacon, the ToS changes, and the recent update that made the social graph public. Then they wait to see if there’s an outcry and roll back some of the changes. It is hard to think of another company that has had such a cavalier approach.
  • “There are absolutely no barriers to entry to create a new social network.”
    • Except for that little thing called the network effect, which is the mother of all barriers to entry. In a later post I will analyze why Facebook has reached a critical level of penetration in most markets, which makes it nearly unassailable as a general-purpose social network.
  • “Our users have learned to trust us.”
    • I don’t even know what to say about this one.
  • “We are a walled garden.”
    • Sparapani is confusing two different senses of “walled garden” here. This was said in response to a statement by the Google rep about Google’s features to let users migrate their data to other services (which I find very commendable). In this sense, Facebook is indeed a walled garden, and doesn’t allow migration, which is a bad thing.  But Sparapani said he meant it in the sense that Facebook doesn’t sell user data wholesale to other companies. That sounds like good news, except that third party app developers end up sharing user data with other entities, because enforcement of the application developer Terms of Service is virtually non-existent.
  • “If you delete the data it’s gone.” (in the context of deleting your account)
    • That might be true in a strict sense, but it is misleading. Deleting all your data is actually impossible to achieve because most pieces of data belong to more than one user. Each of your messages will live on in the other person’s inbox (and it would be improper to delete it from theirs). Similarly, photos in which you appear, which you would probably like gone when you delete your account, still live on in the album of whoever took the picture. The same goes for your pokes, likes and other multi-user interactions. These are the very things that make a social network social.
  • “We now have controls on privacy at the moment you share data. This is an extraordinary innovation and our engineers are really proud of it.”
    • The first part of that statement is true: you can now change the privacy controls on each of your Facebook status messages independently. The second part is downright absurd. It is completely trivial to implement from an engineering perspective (and LiveJournal for instance has had it for a decade).

There were more absurd statements, but you get the picture. It’s not just the fact that Sparapani’s comments were unhinged from reality that bothers me — the general tone was belligerent and disturbing. I missed a few minutes of the panel, during which he apparently responded to a criticism from Chris Conley of the ACLU by saying “I was at the ACLU longer than you’ve been there.” This is unprofessional, undignified and a non-answer. Amusingly, he claimed that Facebook was “very proud” of various aspects of their privacy track record at least half a dozen times in the course of the panel.

Contrast all this with Mark Zuckerberg’s comments in an interview with Michael Arrington, which can be summed up as “the age of privacy is over.” That article goes on to say that Facebook’s actions caused the shift in social norms (to the extent that they have shifted at all) rather than merely responding to them. Either way, it is unquestionable that Facebook’s actual behavior at present pays no more than lip service to privacy, and Zuckerberg’s statement is a more-or-less honest reflection of that. On the other hand, as I have shown, the company sings a completely different tune when the FTC is listening.

Engaging privacy skeptics

Aside from Facebook’s shenanigans, I feel that there are two groups in the privacy debate who are talking past each other. One side is represented by consumer advocates, and is largely echoed by the official position of the FTC. The other side’s position can be summed up as “yeah, whatever.” When expressed coherently, this position has three tenets (with the caveat that not all privacy skeptics adhere to all three):

  • Users don’t care about privacy any more.
  • Even if they do, privacy is impossible to achieve in the digital age, so get over it.
  • There are no real harms arising from privacy breaches.


One illustrative example: a mainstream-media representative at the workshop covered it on Twitter through the lens of his preconceived prejudices.

Privacy scholars never engage with the skeptics because the skeptical viewpoint appears obviously false to anyone who has done some serious thinking about privacy. However, it is crucial to engage the opponents, because (1) the skeptical view is extremely common, and (2) many of the startups coming out of the Valley fall into this group, and they are going to have control over increasing amounts of user data in the years to come.

The “privacy is dead” view was most famously voiced by Scott McNealy. In its extreme form it is easy to argue against: “start streaming yourself live on the Internet 24/7, and then we’ll talk.” (To be sure, a few people did this 10 years ago as a publicity stunt, but it is obvious that the vast majority of people aren’t ready for this level of invasiveness of monitoring/data collection.) But engaging with skeptics isn’t about refutation, it’s about dealing with a different way of thinking and getting the message across to the other side. Unfortunately real engagement hasn’t really been happening.

I have a double life in academia and the startup world, and I think this puts me in a somewhat unusual position of being able to appreciate both sides of the argument. My own viewpoint is somewhere in the middle; I will expand on this theme in future blog posts.

January 31, 2010 at 3:49 am

The Entropy of a DNA profile

I’m often asked how much entropy there is in the DNA profiles used in forensic investigations. Specifically, is it more than 33 bits, i.e., can it uniquely identify individuals? The short answer is: yes in theory, but there are many caveats in practice, and false matches are fairly common.

To explain the details, let’s start by looking at what is actually stored in a DNA profile. Your entire genome consists of billions of base pairs, but for profiling purposes only a tiny portion of it is looked at — specifically, 13 locations, or loci (in the U.S. system, which I will focus on; the U.K. version uses 10 loci). Each of these loci yields a pair of integers which varies from person to person. You can see an example DNA profile on this page.

The degree of variation in the pairs of numbers — genotypes — at each locus has been empirically measured by many studies. Since the loci were chosen so that genotypes at different loci are statistically independent, we can calculate entropy by simply adding up the entropy at individual loci. I analyzed (source code) the raw data on variation at each locus from a sample of U.S. Caucasians, and arrived at a figure of between 3.0 and 5.6 bits of entropy per locus and 54 bits of entropy for the whole 13-locus DNA profile. In addition, there is 1 sex-determining bit.
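The per-locus arithmetic is easy to reproduce. Below is a sketch with made-up genotype frequencies; the real figures come from published population studies, and the toy total here is illustrative only (the actual 13-locus figure is the 54 bits quoted above):

```python
import math

def locus_entropy(genotype_freqs):
    """Shannon entropy, in bits, of the genotype distribution at one locus."""
    return -sum(p * math.log2(p) for p in genotype_freqs if p > 0)

# Made-up genotype frequencies at two hypothetical loci, for illustration.
# Each list must sum to 1; real values come from allele-frequency studies.
loci = [
    [0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05],
    [0.30, 0.25, 0.20, 0.15, 0.10],
]

# Independent loci, so the per-locus entropies simply add up.
total = sum(locus_entropy(freqs) for freqs in loci)
print(round(total, 2))
```

A locus with a nearly uniform genotype distribution contributes close to log₂ of the number of genotypes; a skewed one contributes less.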

Since that number is well over 33 bits, with a high probability there is no one else who shares your DNA profile. However, there are many complications to this rosy picture:

Non-uniform genotype probabilities. The entropy calculation doesn’t quite tell the whole story, because some genotypes at each locus are much more common than others. If you happen to end up with a common genotype in all (or most) of the 13 loci, then there might be a significant chance that someone else in the world shares your DNA profile.
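The relevant quantity here is the random match probability rather than the entropy: under the same independence assumption, it is the product, across loci, of the chance of drawing the same genotype twice. A sketch, again with made-up frequencies:

```python
def match_probability(loci_freqs):
    """Probability that two unrelated individuals share the genotype at
    every locus, assuming independence across loci."""
    prob = 1.0
    for freqs in loci_freqs:
        # Chance that both individuals carry the same genotype at this
        # locus: the sum of squared genotype frequencies.
        prob *= sum(p * p for p in freqs)
    return prob

# Thirteen copies of a made-up locus distribution, for illustration only.
loci = [[0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05]] * 13
print(match_probability(loci))
```

For a person who happens to carry the most common genotype at every locus, the relevant product is of the largest frequencies, which is substantially bigger than this population-wide average.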

Population structure. The calculation above assumes the Hardy-Weinberg equilibrium, which is only true if mating is random, among other things. In reality, due to the non-random population structure, there is a slight deviation from the theoretical value. This manifests in two ways: first, the allele frequencies for different population groups (ethnic groups) need to be calculated separately. Second, there is a deviation from the expected genotype frequencies even within population groups, which is more difficult to account for (a correction factor called “theta” is applied in forensic calculations).

Familial relationships. Since we share half of our DNA with each parent and sibling, there is a much higher chance of a profile match between close relatives than between unrelated individuals. Therefore DNA database matches often turn up a relative of the perpetrator even if the perpetrator is not in the database (especially with partial matches; see below).

In recent years, law enforcement has sometimes adopted the strategy of turning this problem on its head and using these familial leads as starting points of investigation as a way to get to the true perpetrator. This is a controversial practice.

Each of the above factors results in an increase in the probability of a match between different individuals. But the effect is small; even after taking them into account, as long as we’re talking about the full 13-locus profile, most individuals do in fact have a unique DNA profile, albeit fewer than would be predicted by the simple entropy calculation.

Unfortunately, crime-scene sample collection is far from perfect, and profiles are often not extracted accurately from the physical samples due to limitations of technology and the quality of the sample. These inaccuracies in a crime-scene profile introduce errors into the matching process, which are the primary reason for false matches in investigations.

Partial and mixed profiles. Sometimes only a “partial profile” can be extracted from a crime-scene DNA sample. This means that only a subset of the 13 genotypes can be measured. This could be because the quantity of DNA available is too small (interfering with the “amplification” process at some of the loci), because the DNA has degraded, or because it is contaminated with chemicals called PCR inhibitors that interfere with the decoding process.

The other type of inaccuracy occurs when the DNA sample collected is in fact a mixture from multiple individuals. If this happens, multiple values for some genotypes might be measured. There is no foolproof way of separating the genotypes of each individual in the mixture.
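To make these two failure modes concrete, here is a toy sketch of the resulting “cannot exclude” comparison. The encoding is my own invention for illustration: None marks a locus that could not be measured, and a set of genotypes marks a mixture locus:

```python
def consistent(evidence, suspect):
    """True if the suspect's profile cannot be excluded by the evidence."""
    for e, s in zip(evidence, suspect):
        if e is None:
            continue                  # locus dropped out: no information
        if isinstance(e, set):
            if s not in e:            # mixture: suspect must appear in it
                return False
        elif e != s:                  # single genotype: must match exactly
            return False
    return True

# Genotypes as (allele, allele) pairs; three loci instead of 13 for brevity.
suspect = [(12, 14), (8, 8), (15, 17)]
evidence = [(12, 14), None, {(15, 17), (16, 18)}]  # partial and mixed
print(consistent(evidence, suspect))
```

The looser the evidence (more dropped-out loci, larger mixtures), the more of the population is “consistent” with it, which is exactly why likelihood ratios are needed.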

These are very common occurrences, particularly partial profiles. There are no standards on the quality or quantity of the profile data for the evidence to be admissible in court. Instead, an expert witness computes a “likelihood ratio” based on the specific partial or mixed profile, and presents this to the court. Juries are often left not knowing how to interpret the number they are presented with, and are vulnerable to the prosecutor’s fallacy.

The birthday paradox. The history of DNA testing is littered with false matches and leads; one reason is the birthday paradox. The number of pairs of individuals in a database of size N grows proportionally to N². The FBI database, for instance, has about 200,000 crime-scene profiles and 5 million offender profiles, for a total of 1 trillion pairs of profiles. Due to the use of partial profiles to find matches, the probability of a match between two random profiles is much higher than one in a trillion.
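The arithmetic is straightforward; the per-pair probability below is purely illustrative, not an official figure:

```python
# Expected number of coincidental cross-matches when every crime-scene
# profile is compared against every offender profile.
def expected_matches(n_scene, n_offender, p_match):
    return n_scene * n_offender * p_match

# The FBI numbers from above give ~1e12 pairs. Even a one-in-a-billion
# per-pair match probability (illustrative; plausible for a loose partial
# profile) then yields coincidental hits by the thousand.
print(expected_matches(200_000, 5_000_000, 1e-9))
```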

This long but fascinating paper has many hilarious stories of false DNA matches. Laboratory errors such as mixing up labels on the samples and contamination of the sample with the technician’s DNA appear to be depressingly common as well. Here is another story of lab contamination that cost $14 million.

Why only 13 loci? A question all this raises: if the use of a small number of loci causes problems when only a partial profile is available, why not use more of the genome, or even all of it? Research on mini-STRs shows how to better utilize degraded DNA to recover genotypes from beyond the 13 CODIS loci. The cost of whole-genome genotyping has been falling dramatically, and enables even individuals contributing trace amounts of DNA to a mixture to be identified!

One stumbling block seems to be the small quantity of DNA available from crime scenes; whole genome amplification is being developed to address that. But I suspect that the main reason is inertia: forensic protocols, procedures and expertise in DNA profiling have evolved over the last two decades, and it would be costly to make any changes at all. Whatever the reasons, I’m certain that things are going to be very different in a decade or two, because there are millions of bits of entropy in the entire genome, and forensic science currently uses about 54 of them.


December 2, 2009 at 4:26 pm

The Internet has no Delete Button: Limits of the Legal System in Protecting Anonymity

It is futile to try to stay anonymous by getting your name or data purged from the Internet, once it is already out there. Attempts at such censorship have backfired repeatedly and spectacularly, giving rise to the term Streisand effect. A recent lawsuit provides the latest demonstration: two convicted German killers (who have completed their prison sentences) are attempting to prevent Wikipedia from identifying them.

The law in Germany tries to “protect the name and likenesses of private persons from unwanted publicity.” Of course, the Wikimedia foundation is based in the United States, and this attempt runs head-on into the First Amendment, the right to Free Speech. European countries have a variety of restrictions on speech—Holocaust denial is illegal, for instance. But there is little doubt about how U.S. courts will see the issue; Jennifer Granick of the EFF has a nice write-up.

The aspect that interests me is that even if there weren’t a Free Speech issue, it would be utterly impossible for the court system to keep the names of these men from the Internet. I wonder if the German judge who awarded a judgment against the Wikimedia foundation was aware that it would achieve exactly the “unwanted publicity” that the law was intended to avoid. He would probably have ruled as he did in any case, but it is interesting to speculate.

Legislators, on the other hand, would do well to be aware of the limitations of censorship, and the need to update laws to reflect the rules of the information age. There are always alternatives, although they usually involve trade-offs. In this instance, perhaps one option is a state-supplied alternate identity, analogous to the Witness Protection Program?

Returning to the issue of enforceability, the European doctrine apparently falls under “rights of the personality,” specifically the “right to be forgotten,” according to this paper that discusses the transatlantic clash. I find the very name rather absurd; it reminds me of attempting not to think of an elephant (try it!).

The above paper, written from the European perspective, laments the irreconcilable differences between the two viewpoints on the issue of Free Speech vs. Privacy. However, there is no discussion of enforceability. The author does suspect, in the final paragraph, that the European doctrine will become rather meaningless due to the Internet, but he believes this to be purely a consequence of the fact that the U.S. courts have put Free Speech first.

I don’t buy it—even if the U.S. courts joined Europe in recognizing a “right to be forgotten,” it would still be essentially unenforceable. Copyright-based rather than privacy-based censorship attempts offer us a lesson here. Copyright law has international scope, due to being standardized by the WIPO, and yet the attempt to take down the AACS encryption key was pitifully unsuccessful.

Taking down a repeat offender (such as a torrent tracker) or a large file (the Windows 2000 source code leak) might be easier. But if we’re talking about a small piece of data, the only factor that seems to matter is the level of public interest in the sensitive information. The only times when censorship of individual facts has been (somewhat) successful in the face of public sentiment is within oppressive regimes with centralized Internet filters.

There are many laws, particularly privacy laws, that need to be revamped for the digital age. What might appear obvious to technologists might be much less apparent to law scholars, lawmakers and the courts. I’ve said it before on this blog, but it bears repeating: there is an acute need for greater interdisciplinary collaboration between technology and the law.

November 28, 2009 at 5:22 am

De-anonymization is not X: The Need for Re-identification Science

In an abstract sense, re-identifying a record in an anonymized collection using a piece of auxiliary information is nothing more than identifying which of N vectors best matches a given vector. As such, it is related to many well-studied problems from other areas of information science: the record linkage problem in statistics and census studies, the search problem in information retrieval, the classification problem in machine learning, and finally, biometric identification. Noticing inter-disciplinary connections is often very illuminating and sometimes leads to breakthroughs, but I fear that in the case of re-identification, these connections have done more harm than good.

Record linkage and k-anonymity. Sweeney’s well-known experiment with health records was essentially an exercise in record linkage. The re-identification technique used was the simplest possible — a database JOIN. The unfortunate consequence was that for many years, the anonymization problem was overgeneralized based on that single experiment. In particular, it led to the development of two related and heavily flawed notions: k-anonymity and quasi-identifier.
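For concreteness, that JOIN can be sketched in a few lines of Python: join an “anonymized” medical table to a public voter roll on the shared quasi-identifier (ZIP, date of birth, sex). Every record below is fabricated:

```python
# Fabricated "anonymized" medical records and a fabricated voter roll.
medical = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "..."},
    {"zip": "02139", "dob": "1962-01-15", "sex": "M", "diagnosis": "..."},
]
voters = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "name": "Jane Roe"},
    {"zip": "02144", "dob": "1971-03-02", "sex": "M", "name": "John Doe"},
]

def link(medical, voters):
    """Equi-join on the quasi-identifier (zip, dob, sex)."""
    index = {(v["zip"], v["dob"], v["sex"]): v["name"] for v in voters}
    matches = []
    for m in medical:
        key = (m["zip"], m["dob"], m["sex"])
        if key in index:
            matches.append((index[key], m["diagnosis"]))
    return matches

print(link(medical, voters))
```

Note that the join relies on exact equality of the demographic attributes, an assumption that, as discussed below, fails badly for behavioral data.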

The main problem with k-anonymity is that it attempts to avoid privacy breaches via purely syntactic manipulations of the data, without any model for reasoning about the ‘adversary’ or attacker. A future post will analyze the limitations of k-anonymity in more detail. ‘Quasi-identifier’ is a notion that arises from treating some attributes (such as ZIP code) as contributing to re-identifiability, but not others (such as tastes and behavior). However, the major lesson from the re-identification papers of the last few years has been that any information at all about a person can potentially be used to aid re-identification.

Movie ratings and noise. Let’s move on to other connections that turned out to be red herrings. Prior to our Netflix paper, Frankowski et al. studied de-anonymization of users via movie ratings collected as part of the GroupLens research project. Their algorithm achieved some success, but failed when noise was added to the auxiliary information. I believe this to be because the authors modeled re-identification as a search problem (I have no way to know if that was their mental model, but the algorithms they came up with seem inspired by the search literature.)

What does it mean to view re-identification as a search problem? A user’s anonymized movie preference record is treated as the collection of words on a web page, and the auxiliary information (another record of movie preferences, from a different database) is treated as a list of search terms. The reason this approach fails is that in the movie context, users typically enter distinct, albeit overlapping, sets of information into different sites or sources. This leads to a great deal of ‘noise’ that the algorithm must deal with. While noise in web pages is of course an issue for web search, noise in the search terms themselves is not. That explains why search algorithms come up short when applied to re-identification.

The robustness against noise was the key distinguishing element that made the re-identification attack in the Netflix paper stand out from most previous work. Any re-identification attack that goes beyond Sweeney-style demographic attributes must incorporate this as a key feature. ‘Fuzzy’ matching is tricky, and there is no universal algorithm that can be used. Rather, it needs to be tailored to the type of dataset based on an understanding of human behavior.
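As a sketch of what such tailored fuzzy matching looks like in the movie-rating setting, here is a crude scoring function inspired by (but much simpler than) the scoring in the Netflix paper; the weighting scheme, tolerance, and all the data are illustrative:

```python
import math

def similarity(aux, record, popularity, tolerance=1):
    """Score how well an auxiliary rating record matches an anonymized one.
    Overlapping movies count even when the ratings differ slightly (noise),
    and rare movies count for more than popular ones."""
    score = 0.0
    for movie, rating in aux.items():
        if movie in record and abs(record[movie] - rating) <= tolerance:
            # Rare movies are more identifying: weight inversely by
            # log-popularity (number of people who rated the movie).
            score += 1.0 / math.log(1 + popularity[movie])
    return score

popularity = {"Blockbuster": 500_000, "ObscureFilm": 120}
anon = {"Blockbuster": 4, "ObscureFilm": 5}     # the anonymized record
aux = {"Blockbuster": 5, "ObscureFilm": 5}      # noisy auxiliary data
print(similarity(aux, anon, popularity))
```

Matching the obscure film contributes far more to the score than matching the blockbuster, and the one-point disagreement on the blockbuster’s rating does not disqualify the match.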

Hope for authorship recognition. Now for my final example. I’m collaborating with other researchers, including John Bethencourt and Emil Stefanov, on some (currently exploratory) investigations into authorship recognition (see my post on De-anonymizing the Internet). We’ve been wondering why progress in existing papers seems to hit a wall at around 100 authors, and how we can break past this limit and carry out de-anonymization on a truly Internet scale. My conjecture is that most previous papers hit the wall because they framed authorship recognition as a classification problem, which is probably the right model for forensics applications. For breaking Internet anonymity, however, this model is not appropriate.

In a de-anonymization problem, if you only succeed for some fraction of the authors, but you do so in a verifiable way — i.e., your algorithm either says “Here is the identity of X” or “I am unable to de-anonymize X” — that’s great. In a classification problem, that’s not acceptable. Further, in de-anonymization, if we can reduce the set of candidate identities for X from a million to (say) 10, that’s fantastic. In a classification problem, that’s a 90% error rate.

These may seem like minor differences, but they radically affect the variety of features that we are able to use. We can throw in a whole lot of features that only work for some authors but not for others. This is why I believe that Internet-scale text de-anonymization is fundamentally possible, although it will only work for a subset of users that cannot be predicted beforehand.
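The abstention option can be captured with a best-versus-runner-up test: emit an identity only when the top candidate clearly separates from the rest. The fixed margin below is a toy stand-in for a proper statistical eccentricity measure:

```python
def deanonymize(scores, margin=1.5):
    """Return the best-matching candidate only if its score beats the
    runner-up by a clear margin; otherwise abstain."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) >= 2 and ranked[0][1] >= margin * ranked[1][1]:
        return ranked[0][0]   # "Here is the identity of X"
    return None               # "I am unable to de-anonymize X"

print(deanonymize({"alice": 9.0, "bob": 2.0, "carol": 1.5}))
print(deanonymize({"alice": 3.0, "bob": 2.9}))
```

A classifier is forced to output a label either way; the freedom to say “no confident match” is what lets a de-anonymizer exploit weak features that only work for some users.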

Re-identification science. Paul Ohm refers to what I and other researchers do as “re-identification science.” While this is flattering, I don’t think we’ve done enough to deserve the badge. But we need to change that, because efforts to understand re-identification algorithms by reducing them to known paradigms have been unsuccessful, as I have shown in this post. Among other things, we need to better understand the theoretical limits of anonymization and to extract the common principles underlying the more complex re-identification techniques developed in recent years.

Thanks to Vitaly Shmatikov for reviewing an earlier draft of this post.

October 14, 2009 at 9:42 pm

Oklahoma Abortion Law: Bloggers get it Wrong

The State of Oklahoma just passed legislation requiring that detailed information about every abortion performed in the state be submitted to the State Department of Health. Reports based on this data are to be made publicly available. The controversy around the law gained steam rapidly after bloggers revealed that even though names and addresses of mothers obtaining abortions were not collected, the women could nevertheless be re-identified from the published data based on a variety of other required attributes such as the date of abortion, age and race, county, etc.

Since I am a computer scientist studying re-identification, this was quickly brought to my attention. I was as indignant on hearing about it as the next smug Californian, and I promptly wrote up a blog post analyzing the serious risk of re-identification based on the answers to the 37 questions that each mother must anonymously report. Just before posting it, however, I decided to give the text of the law a more careful reading, and realized that the bloggers had been misinterpreting the law all along.

While it is true that the law requires submitting a detailed form to the Department of Health, the only information made public is annual reports with statistical tallies of the number of abortions performed, under very broad categories, which presents a negligible to non-existent re-identification risk.

I’m not defending the law; that is outside my sphere of competence. There do appear to be other serious problems with it, outlined in a lawsuit aimed at stopping the law from going into effect. The text of this complaint, as Paul Ohm notes, does not raise the “public posting” claim. Besides, the wording of the law is very ambiguous, and I can certainly see why it might have been misinterpreted.

But I do want to lament the fact that bloggers and special interest groups can start a controversy based on a careless (or, less often, deliberate) misunderstanding, and have it amplified by an emerging category of news outlets like the Huffington Post, which have the credibility of blogs but a readership approaching traditional media. At this point the outrage becomes self-sustaining, and the factual inaccuracies become impossible to combat. I’m reminded of the affair of the gay sheep.

October 9, 2009 at 6:24 pm 10 comments

Livejournal Done Right: The Case for a Social Network with Built-in Privacy

Is it time to give up on privacy in social networking? I argue that the exact opposite is true. Impatient readers can skip to the bullet-point summary at the end.

Based on my work on de-anonymizing social networks with Shmatikov, and other research such as Bonneau & Preibusch’s survey of the dismal state of privacy in social networks, many people have concluded that it is time to give up on social networking privacy. In my opinion, this couldn’t be farther from the truth.

Being a hard-headed pragmatist (at least by the lax standards of academia (-:), I will make the case that there is a market for a social networking site designed from the ground up with privacy in mind, as opposed to privacy being tacked on piecemeal in reaction to PR incidents.

It would seem that a good place to start would be to look at existing social networks with designed-in privacy, and see how they have fared. Unfortunately, researchers are still hammering out exactly what that would look like, and there are no real examples in the wild. In fact, part of the reason for this post is to flesh out some principles for designed-in privacy. So I will use a definition based on privacy outcomes instead:

The privacy strength of a social network is the extent to which its users share sensitive information with one another.

Viewed from this perspective, there is only one widely-used social network (at least in the U.S.) that has strong privacy, one that stands out from all the rest: LiveJournal.

While Facebook’s privacy controls are more technologically sophisticated, there is little doubt that far more revelations of a private nature are made on LiveJournal. This discrepancy is central to the point I want to make: achieving privacy is not just about technological decisions.

There is one overarching reason for LiveJournal’s privacy success: They make it (relatively) easy for users to communicate their mental access control rules to the system. In my opinion, this should be the single most important privacy goal of a social network; the technical problem of implementing those access control rules is secondary and much easier.

On Livejournal, the goal is achieved largely due to two normative user behaviors:

  • Friending is not indiscriminate (see below).
  • Users actually use friend lists for access control.

Herding users into these behaviors is far from easy, and LiveJournal stumbled there through a variety of disparate design decisions, some wise, some not so wise, some that worked against their interest in the long run, and some downright bizarre.

  • Friendship is not mutual. While in practice over 90% of friendships are reciprocated, the difference crucially captures the asymmetric nature of trust.
  • The site is insular — it plays poorly with search engines; RSS support has been way behind other blog platforms.
  • Privacy settings are highly visible, rather than being tucked away in a configuration page. Just a couple of examples:
    • there is a privacy-level dropdown menu on the post-new-entry page.
    • when you add a friend, you are prompted to add them to one or more friend lists.
  • Weak identity. The site does not require or encourage a user to use their real name. Many users choose to hide their real-life identity from everyone except their friend-list.
  • Livejournal doesn’t inform users when they are friended. From the privacy perspective, this is a feature(!) rather than a bug — it decreases the embarrassment of an unreciprocated friending by letting both users pretend that the friended user didn’t notice (even though most regular users use external tools to receive such notifications). The social norms around friending are in general far more complex than on Facebook, and there is a paper that analyzes them.

As you may have gathered from the above, social norms have a huge impact on the privacy outcome of a site; this explains both why privacy is about more than technology, as well as why privacy can never be achieved as an afterthought — because norms that have evolved can hardly ever be undone. Regrettably, but unsurprisingly, the CS literature on social network privacy has been largely blind to this aspect. (Fortunately, economists, philosophers, some hard-to-categorize researchers, and needless to say, sociologists and legal scholars have been researching social network privacy.)

Returning to my main thesis, I believe that privacy has been the central selling-point of Livejournal, even though it was never marketed to users in those terms. The privacy-centric view explains why the userbase is so notoriously vocal, why the site is able to get users to pay, why they have a huge fanfic community, much of it illegal, and why Livejournal users find it impossible to migrate to other mainstream social networks, which all lack any semblance of the privacy norms that exist on Livejournal.

Livejournal is dying, at least in the U.S., which I believe is largely due to erratic design decisions. While the decay of the site has been obvious to most users (who have seen the frequency of new posts basically fall off a cliff in the last few months), I don’t have concrete data on post frequency. Fortunately, it is not essential to the point I’m making, which is that Livejournal got a few things right but also made a lot of mistakes. We now know a lot more about privacy by design in social networks than we did a decade ago, and it is possible to do much better by starting from scratch. There is now a huge unfulfilled need in the market for someone to take a crack at.

Finally, I’m going to throw in two examples of design decisions that Livejournal (or any other network) never implemented but I believe would be hugely beneficial in achieving positive privacy outcomes:

“Everyone-but-X” access control. This is an example of a whole class of access control primitives that make no sense from the traditional computer science security perspective. If an item is visible to every logged-in user except X, X can always create a fake (“sybil”) account to get around it.

However, let me give you one simple example that I hope will immediately convince you that everyone-but-X is a good idea: your sibling is on your friends list and you want to post about your sex life. It’s not so much that you want to prevent X from having access to your post, but rather that both of you prefer that X didn’t have access to it. The relationship is not adversarial. Extrapolating a little bit, most users can benefit from everyone-but-X privacy in one context or another, but amazingly, no social network has thought of it.
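To make the semantics concrete, here is a minimal sketch of how an everyone-but-X check might behave. All names here are hypothetical illustrations of mine, not an actual site’s API; the point is that the exclusion is cooperative rather than adversarial, so it need not survive X creating a sybil account:

```python
# Minimal sketch of "everyone-but-X" access control (hypothetical names).
# The exclusion is honored on a best-effort basis; it is a social
# convenience, not a cryptographic guarantee against a determined X.

def can_view(post, viewer):
    """A viewer sees the post if they are in the audience and not excluded."""
    if viewer in post["excluded"]:
        return False
    return viewer in post["audience"]

post = {
    "author": "alice",
    "audience": {"bob", "carol", "dave"},  # e.g. alice's friends list
    "excluded": {"bob"},                   # everyone-but-bob
}
```

Under this sketch, carol sees the post while bob (the sibling) does not, even though both are on the friends list.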

The problem here is that traditional CS security theory lacks even the vocabulary to express what’s going on here. Fortunately, researchers are wising up to this, and a new paper that will be presented at ESORICS later this month argues that we need a new access control model to reason about social network privacy, and presents one that is based on Facebook (I really like this paper).

Stupidly easy friend lists. Manually managing friend lists is beyond the patience level of the average user, and offers no hope of getting users who already have several hundred uncategorized friends to start categorizing. But technology can help: I’ve written about automated friend-list clustering and classification before.
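As a toy illustration of the kind of automation I have in mind (my own sketch, not a description of any deployed feature): group a user’s friends by whether they know each other, so that connected components of the mutual-friendship graph become first-cut friend lists:

```python
from collections import defaultdict

def cluster_friends(my_friends, friendships):
    """Group my_friends into candidate friend lists: two friends land in
    the same cluster when they are connected through chains of mutual
    friendship, a crude proxy for belonging to the same social circle."""
    adj = defaultdict(set)
    for a, b in friendships:
        if a in my_friends and b in my_friends:
            adj[a].add(b)
            adj[b].add(a)
    clusters, seen = [], set()
    for friend in my_friends:
        if friend in seen:
            continue
        component, stack = set(), [friend]
        while stack:  # depth-first search over the mutual-friendship graph
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        clusters.append(component)
    return clusters
```

A real system would use fuzzier signals (message frequency, shared groups), but even this crude grouping gives users something to edit rather than a blank slate.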

Summary. As promised, in bullet points:

  • Livejournal is the only major social network whose users regularly share highly private material.
  • Livejournal achieved this largely because they made it easy for users to communicate their mental access control rules to the system.
  • To habituate users into doing this, social norms are crucial. They matter more than technology in affecting privacy outcomes.
  • Designing privacy is therefore largely about building the right tools to get the right social norms to evolve.
  • Livejournal doesn’t seem to have a bright future. Besides, they made many mistakes and never realized their full potential.
  • Therefore, privacy-conscious users form a large and currently severely underserved segment of the social networking audience.
  • The lessons of Livejournal and recent research can help us design privacy effectively from the ground up. The time is right, and the market is ripe.

Final note. I will be presenting the gist of this essay (preceded by a survey of the academic attempts at privacy by design) at the Social Networking Security Workshop at Stanford this Friday.

Some of the ideas in this post were inspired by these essays by Matthew Skala.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

September 9, 2009 at 11:52 am 21 comments

Privacy Law Scholars Conference

I had a great time at the Privacy Law Scholars Conference in Berkeley last week, perhaps more so than at any CS conference I’ve attended. A major reason was that there were — get this — no talks. Well, just one keynote speech. The format centered around 75-minute discussion sessions (which seem to be called workshops), with 5 parallel tracks; in each session, you pick which track you want to attend. You are supposed to have read the paper beforehand, and usually everyone in the room has something to say and gets a chance to do so.

This seems way more sensible to me than the format of CS conferences, where there is only one track. I can’t imagine that anyone would genuinely want to attend all the talks. Ideally, for any given talk, half the people should skip it and spend their time networking instead, but in my experience this never happens. Worse, the talks are only 20-30 minutes long; while this is enough time to motivate the paper and inspire the listeners to go read it afterward, it is never enough to explain the whole paper. Sometimes speakers don’t get this concept, and the results are not pretty.

Anyways, I was surprised by the ease with which I could read law papers and participate in the discussions, even if my understanding was (obviously) not nearly as deep as that of a law scholar. This is something to ponder — while legalese is dense and frequently obfuscated, law papers are a breeze to read, at least based on my small sample size.

There is one paper, by Paul Ohm, that I particularly enjoyed: it is about re-examining privacy laws and regulatory strategies in the light of re-identification techniques. This generated a lot of interest at the conference, and I found the discussion fascinating. A major reason I started 33bits was to be able to play a part in informing these developments; it seems that this blog has indeed helped, which is highly gratifying. I learnt a lot about privacy and anonymity in general, and I look forward to writing more about it in future posts, to the extent that I can do so without talking about specific workshop discussions, which are confidential.

June 10, 2009 at 8:16 pm 8 comments

Graduation and plans

I defended my Ph.D. thesis earlier this month, and I will soon be starting as a post-doctoral researcher at Stanford supervised by Dan Boneh. I’m very excited! I will still work on data anonymity, but it will not be my sole research focus.

Here is the introductory chapter to my thesis, formatted as a stand-alone document. I expect it to be useful mainly as a glossary and a very brief survey of data collection and sharing. It explains why non-interactive data sharing is popular and why anonymization is so tempting as a privacy protection mechanism.

As you can see, the chapter is less than 4 pages long, excluding references; the rest of my thesis consists of my papers concatenated together. Fortunately, the doctoral dissertation is generally treated as a formality in Computer Science, a fact that I am very grateful for since a dissertation is a stupendously inefficient way of communicating research results. I’m glad that my committee members made my life easy, while also providing useful comments on my defense talk.

I presented the social network de-anonymization paper at the S&P conference today at Oakland. Email me for the slides.

May 20, 2009 at 6:35 am 3 comments

Your Morning Commute is Unique: On the Anonymity of Home/Work Location Pairs

Philippe Golle and Kurt Partridge of PARC have a cute paper (pdf) on the anonymity of geo-location data. They analyze data from the U.S. Census and show that for the average person, knowing their approximate home and work locations — to a block level — identifies them uniquely.

Even if we look at the much coarser granularity of a census tract — tracts correspond roughly to ZIP codes; there are on average 1,500 people per census tract — for the average person, there are only around 20 other people who share the same home and work location. There’s more: 5% of people are uniquely identified by their home and work locations even if it is known only at the census tract level. One reason for this is that people who live and work in very different areas (say, different counties) are much more easily identifiable, as one might expect.
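The arithmetic behind an anonymity-set estimate of this kind is easy to sketch. The numbers below are illustrative assumptions of mine, not figures from the paper: if the roughly 1,500 residents of a home tract commuted evenly to, say, 75 distinct work tracts, each home/work pair would be shared by about 20 people:

```python
residents_per_tract = 1500   # average census tract population (see above)
distinct_work_tracts = 75    # illustrative assumption, not from the paper

# Under a (very crude) uniform-commute assumption, the expected number of
# people sharing a given home-tract/work-tract pair:
anonymity_set = residents_per_tract / distinct_work_tracts
print(anonymity_set)  # 20.0
```

Real commute distributions are heavily skewed rather than uniform, which is exactly why some pairs (the unusual long-distance commutes) end up shared by nobody else, making 5% of people unique even at tract granularity.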

The paper is timely, because Location Based Services are proliferating rapidly. To understand the privacy threats, we need to ask the two usual questions:

  1. who has access to anonymized location data?
  2. how can they get access to auxiliary data linking people to location pairs, which they can then use to carry out re-identification?

The authors don’t say much about these questions, but that’s probably because there are too many possibilities to list! In this post I will examine a few.

GPS navigation. This is the most obvious application that comes to mind, and probably the most privacy-sensitive: there have been many controversies around tracking of vehicle movements, such as NYC cab drivers threatening to strike. The privacy goal is to keep the location trail of the user/vehicle unknown even to the service provider — unlike in the context of social networks, people often don’t even trust the service provider. There are several papers on anonymizing GPS-related queries, but there doesn’t seem to be much you can do to hide the origin and destination except via charmingly unrealistic cryptographic protocols.

The accuracy of GPS is a few tens or few hundreds of feet, which is the same order of magnitude as a city block. So your daily commute is pretty much unique. If you took a (GPS-enabled) cab home from work at a certain time, there’s a good chance the trip can be tied to you. If you made a detour to stop somewhere, the location of your stop can probably be determined. This is true even if there is no record tying you to a specific vehicle.

Location-based social networking. Pretty soon, every smartphone will be capable of running applications that transmit location data to web services. Google Latitude and Loopt are two of the major players in this space, providing some very nifty social networking functionality on top of location awareness. It is quite tempting for service providers to outsource research/data-mining by sharing de-identified data. I don’t know if anything of the sort is being done yet, but I think it is clear that de-identification would offer very little privacy protection in this context. If a pair of locations is uniquely identifying, a trail is emphatically so.

The same threat also applies to data being subpoena’d, so data retention policies need to take into consideration the uselessness of anonymizing location data.

I don’t know if cellular carriers themselves collect a location trail from phones as a matter of course. Any idea?

Plain old web browsing. Every website worth the name identifies you with a cookie, whether you log in or not. So if you browse the web from a laptop or mobile phone from both home and work, your home and work IP addresses can be tied together based on the cookie. There are a number of free or paid databases for turning IP addresses into geographical locations. These are generally accurate up to the city level, but beyond that the accuracy is shaky.
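A hypothetical sketch of that linking step (my own illustration, with made-up data): given server logs of (cookie, IP, IP-derived location) observations, any cookie seen from two or more distinct coarse locations yields a candidate home/work location pair:

```python
from collections import defaultdict

def link_locations(weblog):
    """weblog: iterable of (cookie_id, ip, coarse_location) observations.
    A cookie seen from two or more distinct locations yields a candidate
    home/work location pair for its (so far anonymous) owner."""
    seen = defaultdict(set)
    for cookie, _ip, location in weblog:
        seen[cookie].add(location)
    return {cookie: locs for cookie, locs in seen.items() if len(locs) >= 2}
```

Note that the tracker never needs your name at this stage; the output of this step becomes the input to the re-identification join described later in the post.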

A more accurate location fix can be obtained by IDing WiFi access points. This is a curious technological marvel that is not widely known. Skyhook, Inc. has spent years wardriving the country (and abroad) to map out the MAC addresses of wireless routers. Given the MAC address of an access point, their database can tell you where it is located. There are browser add-ons that query Skyhook’s database and determine the user’s current location. Note that you don’t have to be browsing wirelessly — all you need is at least one WiFi access point within range. This information can then be transmitted to websites which can provide location-based functionality; Opera, in particular, has teamed up with Skyhook and is “looking forward to a future where geolocation data is as assumed part of the browsing experience.” The protocol by which the browser communicates geolocation to the website is being standardized by the W3C.

The good news from the privacy standpoint is that the accurate geolocation technologies like the Skyhook plug-in (and a competing offering that is part of Google Gears) require user consent. However, I anticipate that once the plug-ins become common, websites will entice users to enable access by (correctly) pointing out that their location can only be determined to within a few hundred meters, and users will leave themselves vulnerable to inference attacks that make use of location pairs rather than individual locations.

Image metadata. An increasing number of cameras these days have (GPS-based) geotagging built-in and enabled by default. Even more awesome is the Eye-Fi card, which automatically uploads pictures you snap to Flickr (or any of dozens of other image sharing websites you can pick from) by connecting to available WiFi access points nearby. Some versions of the card do automatic geotagging in addition.

If you regularly post pseudonymously to (say) Flickr, then the geolocations of your pictures will probably reveal prominent clusters around the places you frequent, including your home and work. This can be combined with auxiliary data to tie the pictures to your identity.
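A toy sketch of how such cluster-finding might work (my own illustration, not an actual attack tool): round each geotag to a coarse grid and take the most frequently photographed cells, which for a typical account are likely to be home and work:

```python
from collections import Counter

def frequent_places(geotags, ndigits=2):
    """Round each (lat, lon) geotag to a coarse grid cell (two decimal
    places is roughly a kilometer) and return the two most-photographed
    cells -- for a typical account, likely home and work."""
    cells = Counter((round(lat, ndigits), round(lon, ndigits))
                    for lat, lon in geotags)
    return [cell for cell, _count in cells.most_common(2)]
```
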

Now let us turn to the other major question: what are the sources of auxiliary data that might link location pairs to identities? The easiest approach is probably to buy data from Acxiom, or another provider of direct-marketing address lists. Knowing approximate home and work locations, all that the attacker needs to do is to obtain data corresponding to both neighborhoods and do a “join,” i.e., find the (hopefully) unique common individual. This should be easy with Acxiom, which lets you filter the list by “DMA code, census tract, state, MSA code, congressional district, census block group, county, ZIP code, ZIP range, radius, multi-location radius, carrier route, CBSA (whatever that is), area code, and phone prefix.”
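The join itself is trivial; a sketch with made-up names:

```python
def join_lists(home_neighborhood, work_neighborhood):
    """Intersect two purchased address lists; a singleton result means
    the home/work location pair re-identifies exactly one person."""
    return set(home_neighborhood) & set(work_neighborhood)

home = {"A. Smith", "B. Jones", "C. Lee"}  # residents near the home location
work = {"C. Lee", "D. Brown"}              # people working near the work location
```
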

Google and Facebook also know my home and work addresses, because I gave them that information. I expect that other major social networking sites also have such information on tens of millions of users. When one of these sites is the adversary — such as when you’re trying to browse anonymously — the adversary already has access to the auxiliary data. Google’s power in this context is amplified by the fact that they own DoubleClick, which lets them tie together your browsing activity on any number of different websites that are tracked by DoubleClick cookies.

Finally, while I’ve talked about image data being the target of de-anonymization, it may equally well be used as the auxiliary information that links a location pair to an identity — a non-anonymous Flickr account with sufficiently many geotagged photos probably reveals an identifiable user’s home and work locations. (Some attack techniques that I describe on this blog, such as crawling image metadata from Flickr to reveal people’s home and work locations, are computationally expensive to carry out on a large scale but not algorithmically hard; such attacks, as can be expected, will rapidly become more feasible with time.)

Summary. A number of devices in our daily lives transmit our physical location to service providers whom we don’t necessarily trust, and who might keep this data around or transmit it to third parties we don’t know about. The average user simply doesn’t have the patience to analyze and understand the privacy implications, making anonymity a misleadingly simple way to assuage their concerns. Unfortunately, anonymity breaks down very quickly when more than one location is associated with a person, as is usually the case.

May 13, 2009 at 6:42 am 24 comments



About 33bits.org

I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
