33 Bits of Entropy

Is Anonymity Research Ethical?

A researcher who is working on writing style analysis (“stylometry”), after reading my post on related de-anonymization techniques, wonders what the positive impact of such research could be, given my statement that the malicious uses of the technology are far greater than the beneficial ones. He says:

Sometimes when I’m thinking of an interesting research topic it’s hard to forget the Patton Oswalt line “Hey, we made cancer airborne and contagious! You’re welcome! We’re science: we’re all about coulda, not shoulda.”

This was my answer:

To me, generic research on algorithms always has a positive impact (if you’re breaking a specific website or system, that’s a different story; a bioweapon is a whole different category.) I do not recognize a moral question here, and therefore it does not affect what I choose to work on.

My belief that the research will have a positive impact is not at odds with my belief that the uses of the technology are predominantly evil. In fact, the two are positively correlated. If we’re talking about web search technology, if academics don’t invent it, then (benevolent) companies will. But if we’re talking about de-anonymization technology, if we don’t do it, then malevolent entities will invent it (if they haven’t already), and of course, keep it to themselves. It comes down to a choice between a world where everyone has access to de-anonymization techniques, and hopefully defenses against it, versus one in which only the bad guys do. I think it’s pretty clear which world most people will choose to live in.

I realize I lean toward the “coulda” side of the question of whether Science is—or should be—amoral. Someone like Prof. Benjamin Kuipers here at UT seems to be close to the other end of the spectrum: he won’t take any DARPA money.

Part of the problem with allowing morality to affect the direction of science is that it is often arbitrary. The Patton Oswalt quote above is a perfect example: he apparently said that in response to news of science enabling a 63-year-old woman to give birth. The notion that something is wrong simply because it is not “natural” is one that I find most repugnant. If the freedom of a 63-year-old woman to give birth is not an important issue to you, let me note that more serious issues such as stem cell research, which could save lives, fall under the same category.

Going back to anonymity, it is interesting that tools like Tor face much criticism, but for enabling the anonymity of “bad” people rather than breaking the anonymity of “good” people. Who is to be the arbiter of the line between good and bad? I share the opinion of most techies that Tor is a wonderful thing for the world to have.

There are many sides to this issue and many possible views. I’d love to hear your thoughts.

April 9, 2009 at 8:42 pm 8 comments

Anonymous Data Collection: Lessons from the A-Rod Affair

Recently, the Alex Rodriguez steroid controversy has been in the news. The aspect that interests me is the manner in which it came to attention: A-Rod provided a urine sample as part of a supposedly anonymous survey of Major League Baseball players in 2003, the goal of which was to determine if more than 5% of players were using banned substances. When Federal agents came calling, the sample turned out to be not so anonymous after all.

The failure of anonymity here was total–the testing lab simply failed to destroy the samples or even take the labels off them, and the Players’ Union, which conducted the survey, failed to call the lab and ask them to do so during the more than one-week window that they had before the subpoena was issued.

However, there are a number of ways in which things could have gone wrong even if one or more of the parties had followed proper procedure. None of the scenarios below result in as straightforward an association between player and steroid use as we have seen. On the other hand, they can be just as damaging in the court of public opinion.

If the samples were not destroyed, but simply de-identified, DNA can be recovered even after years, and the DNA can be used to match the player to the sample. You might argue the feds can’t easily get hold of players’ DNA to run such a matching, but once the association between drug test result and DNA has been made, it is a sword of Damocles hanging over the player’s head (note that A-Rod’s drug test happened six years ago.) The trend in recent years has been toward increased DNA profiling and bigger and bigger databases, and unlabeled samples therefore pose a clear danger.
If the samples are destroyed, and the test results are stored in de-identified form, anonymity could still be compromised. A drug test measures the concentrations of a bunch of different chemicals in the urine. It is likely that this results in a “profile” that is characteristic of a person–just like a variety of other biometric characteristics. If the same player, having stopped the use of banned substances, provides another urine sample, it is possible that this profile can be matched to the old one based on the fact that most of the urine chemicals have not changed in concentration. It is an interesting research question to see how stable the “profiles” are, and what their discriminatory power is.
Even more sophisticated attacks are possible. Let’s say that participant names are known, but other than that the only thing that’s released is a single statistic: the percentage of players that tested positive. Now, if the survey is performed on a regular basis, and a certain player (who happens to use steroids) participates only some of the time, the overall statistic is going to be slightly higher whenever that player participates. In spite of confounding factors, such as the fact that other players might also drop in and out, statistical techniques can be used to tease out this correlation.
This might sound like a tall order at first, but it is a proven attack strategy. The technique was used recently in a PLoS Genetics paper to identify if an individual had contributed DNA to an aggregate sample of hundreds of individuals.

I performed a quick experiment, assuming that there are 1,000 players in the sample, of which 100 participate half the time (the rest participate all the time). 5% of the players dope, and each player either dopes throughout the study period or not at all. Testing is done every 3 months; the list of participants in each wave of the survey is known, as well as the percentage of players who tested positive in each wave. I found that after 3 years, there is enough information to identify 80% of the cheating players who participate irregularly. (Players who participate regularly are clearly safe.)

[Technical note: that’s an equal error rate of 20%; i.e., 20% of the cheating players are not accused, and 20% of the accused are innocent. There is a trade-off between the two numbers, as always; if a higher accuracy is required, say only 10% of accused players are innocent, then 65% of the cheating players can be identified.]
When applicable, a combination of the above techniques such as matching de-identified profiles across different time-periods of a survey (or different surveys) can greatly increase the attacker’s potential.
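
For readers who want to play with the irregular-participation attack, here is a minimal sketch of an experiment along the lines described above. The population sizes match the description (900 regular players, 100 who show up half the time, 5% doping, 12 quarterly waves), but the scoring rule is an assumption chosen for simplicity: each irregular player is scored by how much higher the positive count is in waves they attend than in waves they skip. It is not necessarily the method behind the 80% figure, and the function names are made up for illustration.

```python
import random

def simulate(n_regular=900, n_irregular=100, dope_rate=0.05,
             waves=12, seed=0):
    """Simulate quarterly survey waves (12 waves = 3 years).

    Returns (participation, positives, dopes):
      participation[t][i] is True if irregular player i took part in wave t
      positives[t]        is the number of positive tests in wave t
      dopes[i]            is the ground truth for irregular player i
    The published statistic is a percentage, but since the participant
    list is known, the count of positives can be recovered from it.
    """
    rng = random.Random(seed)
    # Regular players attend every wave, so their dopers add a constant.
    regular_dopers = sum(rng.random() < dope_rate for _ in range(n_regular))
    dopes = [rng.random() < dope_rate for _ in range(n_irregular)]
    participation = [[rng.random() < 0.5 for _ in range(n_irregular)]
                     for _ in range(waves)]
    positives = [
        regular_dopers + sum(1 for i, here in enumerate(row)
                             if here and dopes[i])
        for row in participation
    ]
    return participation, positives, dopes

def scores(participation, positives):
    """Score each irregular player: mean positives when present minus
    mean positives when absent. Higher means more suspicious."""
    n_players = len(participation[0])
    result = []
    for i in range(n_players):
        present = [p for row, p in zip(participation, positives) if row[i]]
        absent = [p for row, p in zip(participation, positives) if not row[i]]
        if not present or not absent:
            result.append(0.0)  # no contrast available for this player
        else:
            result.append(sum(present) / len(present)
                          - sum(absent) / len(absent))
    return result

if __name__ == "__main__":
    participation, positives, dopes = simulate()
    s = scores(participation, positives)
    suspects = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:5]
    print("top suspects:", suspects)
    print("actual dopers:", [i for i, d in enumerate(dopes) if d])
```

Thresholding the scores at different levels is what produces the trade-off between missed cheaters and falsely accused players mentioned in the technical note.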

The point of the above scenarios is to convince you that you can never, ever be certain that the connection between a person and their data has been definitively severed. Regular readers of this blog will know that this is a recurring theme of my research. The quantity of data being collected today and the computational power available have destroyed the traditional and ingrained assumptions about anonymity. Well-established procedures have been shown to be completely inadequate, and it is far from clear that things can be fixed. Anyone who cares about their privacy must be vigilant against giving up their data under false promises of anonymity.

February 19, 2009 at 2:24 am Leave a comment

Social Network Analysis: Can Quantity Compensate for Quality?

Science magazine has labeled Christakis and Fowler the “dynamic duo”

Nicholas Christakis of Harvard and James Fowler of UC San Diego have produced a series of ground-breaking papers analyzing the spread of various traits in social networks: obesity, smoking, happiness, and most recently, in collaboration with John Cacioppo, loneliness. The Christakis-Fowler collaboration has now become well-known, but from a technical perspective, what was special about their work?

It turns out that they found a way to distinguish between the three reasons why people who are related in a social network are similar to each other.

Homophily is the tendency of people to seek others who are alike. For example, most of us restrict our dates to smokers or non-smokers, mirroring our own behavior.
Confounding is the phenomenon of related individuals developing a trait because of a (shared) environmental circumstance. For example, people living right next to a McDonald’s might all gradually become obese.
Induction is the process of one individual passing a trait or behavior on to their friends, whether by active encouragement or by setting an example.

Clearly, only induction can cause a trait to actually spread in a social network. To distinguish between the three effects and to prove causality, according to the authors, the key is longitudinal data–data from the same individuals collected over a period of years or decades. All of the works cited above are based on the Framingham Heart Study. This corpus of data is ideally suited in several ways:

It contains data from three generations of individuals.
Very few of the participants (10 out of over 5,000) dropped out:

“Even subjects who migrate out of the town of Framingham (to points throughout the U.S.) remain in the study and, remarkably, come back every few years to be examined and to complete survey forms.”

The original study sample comprised the majority of the population of Framingham, which is (presumably) a somewhat closed social network.

This illustrates the traditional way of doing things, using carefully selected high-quality data. With the growth of online social networking websites, however, a radically different approach is gaining prominence. A good example is this Slate article that analyzes the recent “25 random things” Facebook meme using well-known epidemiological models, and concludes that marketers should “introduce a wide variety of schemes into the wild and pray like hell that one of them evolves into a virulent meme.” For a more academic/rigorous example, see the paper “Characterizing Social Cascades in Flickr” (pdf), which looks at how information disseminates through social links.

Many analogies come to mind when comparing the Old School to the New School: the Cathedral vs. the Bazaar, or Britannica vs. Wikipedia. Information in social networking sites is collected through a chaotic, organic, unsupervised process. The set of participants is entirely self-selected. Against these objections stands the indisputable fact that the process produces several orders of magnitude more data at a fraction of the cost.

Despite being only a few years old, online social network analysis has already produced deep insights: the work of Jon Kleinberg springs to mind. But will it supplant the traditional approach? I think so. My hypothesis is that with sufficiently powerful analytical methods, quantity can compensate for noise in the data. Don’t take my word for it: Harvard professor Gary King considers the availability of data from online social networks to be the “most significant turning point in the history of sociology.”

The amount and variety of social network data available to researchers, marketers, etc. is rapidly increasing; there is a detailed survey in my forthcoming paper (at IEEE S&P) on de-anonymizing social networks. In spite of the rather serious privacy concerns that are identified in the paper, the balance of business incentives appears to be towards more openness, and my prediction is that social networks will continue to move in that direction. Facebook alone has an incredible wealth of as-yet untapped data on information flow–the recent feed-focused redesign instantly transformed posted items, group memberships and fan pages into meme propagation mechanisms.

The new approach to social network analysis has benefits other than the quantity of data available. Equally important is the fact that users of social networking sites are not participating in a study; we get to observe their lives directly. The data is thus closer to reality. Furthermore, there is the possibility of studying the population actively rather than passively. For instance, if the goal is to study meme propagation, why not introduce memes into the population? This gives the researcher much greater control over the timing, point of introduction, and content of the memes being studied. Of course, this raises ethical and methodological questions, but they will be worked out in due course.

A third benefit of the new approach is that social network users often express themselves using free form text; utilized properly, this could yield much deeper data than making study participants check boxes on a Likert scale in response to canned questions (such as the now famous “How does it feel to be poor and black?”). The Flickr paper cited above analyzes the tags people use to describe pictures. With more technical sophistication, it should be possible, for example, to apply automated sentiment analysis to blog posts, tweets, etc. to determine how your opinion of a movie or book is influenced by those of your friends.

True, we don’t yet have data spanning several decades, but then things happen on a far faster timescale in the online world. There will always be research questions that fundamentally depend on studying aspects of the real world that are not replicable virtually. By and large, however, I believe the new approach is about to supplant the old. There is still a ways to go in terms of developing the techniques we need for analyzing massive, noisy datasets, but we will get there in a few short years. The Christakis-Fowler papers may soon exemplify the exception, rather than the rule, for social network analysis.

February 15, 2009 at 7:40 am 1 comment

De-anonymizing the Internet

I’ve been thinking about this problem for quite a while: is it possible to de-anonymize text that is posted anonymously on the Internet by matching the writing style with other Web pages/posts where the authorship is known? I’ve discussed this with many privacy researchers but until recently never written anything down. When someone asked essentially the same question on Hacker News, I barfed up a stream of thought on the subject :-) Here it is, lightly edited.

Each one of us has a writing style that is idiosyncratic enough to have a unique “fingerprint”. However, it is an open question whether it can be efficiently extracted.

The basic idea for constructing a fingerprint is this. Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.
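
To make the marker idea concrete, here is a minimal sketch in Python. The marker pairs other than ‘since’/‘because’ are made up for illustration, and the distance measure (mean absolute difference of usage ratios) is just one simple assumption; a real system would need many more markers and a proper statistical model.

```python
import re
from collections import Counter

# Hypothetical marker pairs: words that are nearly interchangeable.
# Only the first pair comes from the text; the rest are placeholders.
MARKER_PAIRS = [("since", "because"), ("while", "whilst"),
                ("though", "although")]

def fingerprint(text, pairs=MARKER_PAIRS):
    """For each pair (a, b), return the fraction of uses that are `a`,
    or None when the author used neither word."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    profile = []
    for a, b in pairs:
        total = counts[a] + counts[b]
        profile.append(counts[a] / total if total else None)
    return profile

def distance(p, q):
    """Mean absolute difference over markers present in both profiles.
    Returns None when the profiles share no usable marker."""
    shared = [(x, y) for x, y in zip(p, q)
              if x is not None and y is not None]
    if not shared:
        return None
    return sum(abs(x - y) for x, y in shared) / len(shared)
```

Given an anonymous text and a set of texts with known authors, the candidate whose profile has the smallest distance to the anonymous one is the best guess under this toy model.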

The beginning of modern, rigorous research in this field was by Mosteller and Wallace in 1964: they identified the author of the disputed Federalist papers, almost 200 years after they were written (note that there were only three possible candidates!). They got on the cover of TIME, apparently. Other “coups” for writing-style de-anonymization are the identification of the author of Primary Colors, as well as the Unabomber (his brother recognized his style; it wasn’t done by statistical/computational means).

The current state of the art is summarized in this bibliography. Now, that list stops at 2005, but I’m assuming there haven’t been earth-shattering changes since then. I’m familiar with the results from those papers; the curious thing is that they stop at corpora of a couple hundred authors or so, i.e., identifying one anonymous poster out of say 200, rather than a million. This is probably because they had different applications in mind, such as identification within a company, instead of Internet-scale de-anonymization. Note that the amount of information you need is always logarithmic in the potential number of authors, and so if you can do 200 authors you can almost definitely push it to a few tens of thousands of authors.
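
The logarithmic scaling is easy to check numerically. The snippet below is only a back-of-the-envelope illustration (the function name is made up): it computes how many bits are needed to single out one author from a pool of a given size.

```python
import math

def bits_needed(num_authors):
    """Bits of information required to single out one author
    among num_authors equally likely candidates."""
    return math.log2(num_authors)

if __name__ == "__main__":
    for n in (200, 30_000, 1_000_000):
        print(f"{n:>9,} candidate authors: {bits_needed(n):.1f} bits")
```

Going from 200 authors to a million multiplies the pool by 5,000, but the information required grows only from roughly 7.6 bits to roughly 19.9 bits, a factor of about 2.6.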

The other interesting thing is that the papers are fixated on ‘topic-free’ identification, where the texts aren’t about a particular topic, making the problem harder. The good news is that when you’re doing this at Internet scale, nobody is stopping you from using topic information, making it a lot easier.

So my educated guess is that Internet-scale writing style de-anonymization is possible. However, you’d need fairly long texts, perhaps a page or two. It’s doubtful that anything can be done with a single average-length email.

Another potential de-anonymization strategy is to use typing pattern fingerprinting (keystroke dynamics), i.e., analyzing the timing between our keystrokes (yes, this works even for non-touch typists). This is already used in commercial products as an additional factor in password authentication. However, the implications for de-anonymization have not been explored, and I think it’s very, very feasible. For example, if Google were to insert JavaScript into Gmail to fingerprint you when you were logged in, they could use the same JavaScript to identify you on any web page where you type in text, even if you don’t identify yourself. Now think about the de-anonymization possibilities you can get by combining analysis of writing style and keystroke dynamics…
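
As a rough illustration of what a keystroke-dynamics fingerprint might look like, the sketch below averages the time gap between each pair of consecutive keys (a “digraph”) and compares two such profiles. Everything here is an assumption made for the example: the event format, the function names, and the distance measure. Commercial products presumably use richer features, such as how long each key is held down.

```python
from collections import defaultdict

def digraph_profile(events):
    """events: list of (key, press_time_ms) in typing order.
    Returns {(key1, key2): mean gap in ms between pressing key1 and key2}."""
    gaps = defaultdict(list)
    for (k1, t1), (k2, t2) in zip(events, events[1:]):
        gaps[(k1, k2)].append(t2 - t1)
    return {pair: sum(v) / len(v) for pair, v in gaps.items()}

def profile_distance(p, q):
    """Mean absolute timing difference over digraphs seen in both
    profiles, or None if they share no digraph."""
    shared = p.keys() & q.keys()
    if not shared:
        return None
    return sum(abs(p[k] - q[k]) for k in shared) / len(shared)
```

A profile recorded while someone is logged in could then be compared against the profile of an unidentified typing session; a small distance suggests the same typist.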

By the way, make no mistake: the malicious uses of this far overwhelm the benevolent uses. Once this technology becomes available, it will be very hard to post anonymously at all. Think of the consequences for political dissent or whistleblowers. The Great Firewall of China could simply insert a piece of JavaScript into every web page, and poof, there goes the anonymity of everyone in China.

I think it’s likely that one can build a tool to protect anonymity by taking a chunk of writing and removing your fingerprint from it, but it will need a lot of work, and will probably lead to a cat-and-mouse game between improved de-anonymization and obfuscation techniques. Note the caveats, however. First, most ordinary people will not have the foreknowledge to find and use such a tool. Second, think of all the compromising posts (rants about employers, accounts from cheating spouses, political dissent, etc.) that have already been written. The day will come when some kid will download a script, let a crawler loose on the web, and post the de-anonymized results for all to see. There will be interesting consequences.

If you’re interested in working on this problem–either writing style analysis for breaking anonymity or obfuscation techniques for protecting anonymity–drop me a line.

January 15, 2009 at 3:16 am 21 comments

The Fallacy of Anonymous Institutions

The graph below is from the paper “Chains of affection: The structure of adolescent romantic and sexual networks.” The name of the school that the data was collected from is not revealed, and is given the working name “Jefferson High.” It is part of the National Longitudinal Study of Adolescent Health, containing very detailed health information on 100,000 high school students in 140 schools. In 12 of the schools, the entire sexual network was mapped out.

Clearly, the authors felt that concealing the identity of the school is important for protecting the privacy of the participants. It’s not hard to see why: first, the aggregate information presented in the study could by itself be unpleasant, especially facts about the prevalence of adolescent sexual activity in a conservative rural town (see below). Second, and more importantly, knowing the identity of the school can lead to further de-anonymization of the individuals in the network.

The graph above is rich enough that a few individuals can identify themselves purely based on the local information available to them, and thus learn things about their neighbors in the graph. A group of individuals getting together will have an even easier time of it. Furthermore, the actual paper provides a richer, temporally ordered version of the graph above.

But even strangers may benefit: depending on how well the temporal information in the sexual graph correlates with other temporal information that may be available, say from Facebook, de-anonymization might be possible with little or no co-operation from the subjects themselves. Soon, I will have more to say about research results on de-anonymizing graphs with loosely correlated external/auxiliary data.

Having established the privacy risk, let’s see how easy it is to re-identify Jefferson High. The authors give us these helpful clues:

“Jefferson High School” is an almost all-white high school of roughly 1000 students located in a mid-sized mid-western town. Jefferson High is the only public high school in the town. The town, “Jefferson City” is over an hour away by car from the nearest large city. Jefferson City is surrounded by beautiful countryside, home to many agricultural enterprises. The town itself is working class, although there remain some vestiges of better times. At one period, the town served as a resort for city dwellers, drawing an annual influx of summer visitors. This is no longer the case, and many of the old resort properties show signs of decay. The community is densely settled. At the time of our fieldwork, students were reacting to the deaths of two girls killed in an automobile accident.

Some further facts presented have high amusement value, and are equally useful for re-identification:

Jefferson students earn lower grades, are suspended more, feel less attached to school, and come from poorer families than those at comparable schools. They are more likely than students in other high schools to have trouble paying attention, have lower self-esteem, pray more, have fewer expectations about college, and are more likely to have a permanent tattoo. Compared to other students in large disproportionately white schools, adolescents in Jefferson High are more likely to drink until they are drunk. In schools of comparable race and size, on average 30% of 10th-12th grade students smoke cigarettes regularly, whereas in Jefferson, 36% of all 10th to 12th graders smoke. Drug use is moderate, comparable to national norms. Somewhat more than half of all students report having had sex, a rate comparable to the national average, and only slightly higher than observed for schools similar with respect to race and size. Nevertheless, if Jefferson is not Middletown, it looks like an awful lot like it. The adolescents at Jefferson High are pretty normal. In describing the events of the past year, many students report that there is absolutely nothing to do in Jefferson. For fun, students like to drive to the outskirts of town and get drunk. Jefferson is a close-knit insular predominantly working-class community which offers few activities for its youth.

A database of public schools in the U.S. is available for sale for $75, containing very detailed information about each school. I’m quite confident that the information in there is sufficient to re-identify Jefferson High.

The thesis of this blog is that the amount of entropy required to de-anonymize an individual (33 bits) is low enough that it doesn’t offer meaningful protection in most circumstances. Obviously, the argument applies even more strongly to the anonymity of a well-defined group of people.
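
Here is a small sketch of the arithmetic, in Python. The 33 bits come from the base-2 logarithm of the world population (I assume roughly 6.7 billion here). The second half shows how each published clue about an institution eats into the remaining entropy; the school records and clue thresholds are entirely made up and merely stand in for a real database.

```python
import math

WORLD_POPULATION = 6.7e9  # rough figure, assumed for illustration

def bits_to_identify(population):
    """Entropy needed to single out one member of a population."""
    return math.log2(population)

def bits_revealed(before, after):
    """Information carried by a clue that narrows `before`
    candidates down to `after`."""
    return math.log2(before / after)

def narrow(records, clues):
    """Apply (label, predicate) clues in turn. Returns the surviving
    records and a list of (label, survivors, bits revealed) steps."""
    steps = []
    for label, pred in clues:
        kept = [r for r in records if pred(r)]
        bits = bits_revealed(len(records), len(kept)) if kept else float("inf")
        steps.append((label, len(kept), bits))
        records = kept
    return records, steps

# Made-up toy records; not real schools.
SCHOOLS = [
    {"name": "A", "region": "midwest", "only_public": True, "enrollment": 1020},
    {"name": "B", "region": "midwest", "only_public": True, "enrollment": 2400},
    {"name": "C", "region": "midwest", "only_public": False, "enrollment": 980},
    {"name": "D", "region": "midwest", "only_public": False, "enrollment": 450},
    {"name": "E", "region": "northeast", "only_public": True, "enrollment": 1000},
    {"name": "F", "region": "south", "only_public": True, "enrollment": 1010},
    {"name": "G", "region": "west", "only_public": False, "enrollment": 3000},
    {"name": "H", "region": "south", "only_public": False, "enrollment": 990},
]

CLUES = [
    ("mid-western", lambda r: r["region"] == "midwest"),
    ("only public high school in town", lambda r: r["only_public"]),
    ("roughly 1000 students", lambda r: 900 <= r["enrollment"] <= 1100),
]

if __name__ == "__main__":
    print(f"bits to identify a person: {bits_to_identify(WORLD_POPULATION):.1f}")
    survivors, steps = narrow(SCHOOLS, CLUES)
    for label, left, bits in steps:
        print(f"{label}: {left} candidates left ({bits:.1f} bits revealed)")
```

The number of schools in a country is far smaller than the number of people in the world, so an institution starts out with far fewer than 33 bits of anonymity to lose.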

Let’s be clear: the data dates from the mid-1990s; who slept with whom in high school is not a huge deal a decade and a half later. However, the problem is systemic, and IRBs (Institutional Review Boards) keep blithely approving releases of data with such nominal de-identification applied. The re-identification of the institutional affiliation of an entire population of a study is of more concern from the privacy perspective than the de-anonymization of individual identities: it needs to be done only once, and affects hundreds or thousands of individuals.

Recently, a group of researchers from the Berkman Center released a dataset of Facebook profile information from an entire cohort (the class of 2009) of college students from “an anonymous, northeastern American university.” It was promptly de-anonymized by Michael Zimmer, who revealed that it was Harvard College:

As I noted here, the press release and the public codebook for the dataset provided many clues to where the data came from: we know it is a northeastern US university, it is private, co-ed, and whose class of 2009 initially had 1640 students in it. A quick search for schools reveals there are only 7 private, co-ed colleges in New England states (CT, ME, MA, NH, RI, VT) with total undergraduate populations between 5000 and 7500 students (a likely range if there were 1640 in the 2006 freshman class): Tufts University, Suffolk University, Yale University, University of Hartford, Quinnipiac University, Brown University, and Harvard College.

[…]

Finally, and perhaps most convincingly, only Harvard College offers the specific variety of the subjects’ majors that are listed in the codebook. While nearly all universities offer the common majors of “History”, “Chemistry” or “Economics”, one only needs to search for the more uniquely phrased majors to discover a shared home institution.

Another amusing example is a paper on mobile phone call graphs which attempts to keep the identity of an entire country secret. I found that the approximate population of the country reported in the paper together with the mobile phone penetration rate is sufficient to uniquely identify it.

Suppressing the identity of your study population has some privacy benefits: at least, it won’t show up in Google searches. But relying on it for any kind of serious privacy protection would be foolish. Scrubbing an entire dataset or research paper of clues about the study population can be hard or impossible; further, a single study participant corroborating the published results or methodology might be sufficient for de-anonymization of the group. The only solution is therefore to assume that the identity of the study population will be discovered, and to try to ensure that individual identities will still be safe from re-identification.

December 15, 2008 at 10:48 am 3 comments

Graph Isomorphism: Deceptively Hard

This is the first in a series of loosely related posts that will lead up to an announcement of new research results.

Asymptotic complexity is often very useful in succinctly expressing how hard a problem is. Occasionally, though, it can obfuscate the picture. Graph isomorphism is a perfect example.

Many people mistakenly believe that graph isomorphism (GI) is hard: either NP-hard, or hard enough to be insoluble in practice for large problem sizes. Certainly the complexity of the best known algorithm, exp(O(sqrt(n log n))), due to Babai and Luks, would appear to support this belief. However, the truth is that GI belongs to its own complexity class, neither known to be NP-hard nor known to be soluble in polynomial time. More importantly, on real inputs, graph isomorphism is ridiculously easy. So easy that real-life solvers can plow through hundreds of thousands of instances per second.

Why is this the case? It turns out that on random graphs, GI is solvable in a very straightforward way. You only need to look at the local structure of each node — due to the randomness, every node has a distinctive neighborhood. Thus, the complexity of the worst case input and the average case input are on opposite ends of a spectrum. To see where “real” graphs fall on this spectrum, note that generative models of real graphs have a regular part and a random part. The regular part is captured in rules such as preferential attachment, but such rules only induce a probability distribution; there is still quite a bit of coin tossing needed to generate each edge. Intuitively, if there is “enough” randomness in the neighborhood of each node, then you only need to look at the local structure to tell different nodes apart. Thus, real-world graphs behave pretty much like random graphs.
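
The local-structure idea can be sketched in a few lines of Python. This is not how nauty works; it is a toy version of what is usually called color refinement, under the assumption that a couple of rounds are enough. Each node starts with its degree as its signature, and in every round the signature is extended with the sorted signatures of its neighbors. If every node ends up with a unique signature, two isomorphic graphs can be matched simply by pairing nodes with equal signatures. All function names are made up for the example.

```python
import random

def from_edges(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def refine(adj, rounds=2):
    """Return {node: signature}. Signatures depend only on the graph's
    structure around the node, never on node names."""
    sig = {v: len(adj[v]) for v in adj}  # round 0: degree
    for _ in range(rounds):
        sig = {v: (sig[v], tuple(sorted(sig[u] for u in adj[v])))
               for v in adj}
    return sig

def match_by_signature(adj_a, adj_b, rounds=2):
    """Try to find an isomorphism by pairing equal signatures.
    Returns a node mapping, or None if the signatures are not unique
    or do not line up (this toy method then simply gives up)."""
    sig_a, sig_b = refine(adj_a, rounds), refine(adj_b, rounds)
    if len(set(sig_a.values())) < len(adj_a):
        return None
    if len(set(sig_b.values())) < len(adj_b):
        return None
    by_sig = {s: v for v, s in sig_b.items()}
    if set(by_sig) != set(sig_a.values()):
        return None
    mapping = {v: by_sig[s] for v, s in sig_a.items()}
    # Double-check that the mapping really preserves every edge.
    ok = all({mapping[u] for u in adj_a[v]} == adj_b[mapping[v]]
             for v in adj_a)
    return mapping if ok else None

def random_graph(n, p, seed=0):
    """Random graph: each possible edge is present with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

if __name__ == "__main__":
    g = random_graph(200, 0.1)
    sigs = refine(g)
    print("distinct signatures:", len(set(sigs.values())), "of", len(g))
```

On highly symmetric inputs, such as a plain cycle, every node looks the same locally and the sketch returns None; those are exactly the kinds of graphs where real solvers have to work harder.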

It gets better: one of the uses of nauty, the leading graph isomorphism solver, is to find hard instances of graph isomorphism. The way this is done is by generating hundreds of candidate hard instances, solving them, and picking the ones that are the hardest to solve. (These are usually highly specialized graphs which are strongly regular and have fearful symmetry groups.) It would appear, then, that finding hard inputs for GI is GI-hard! I wonder if this property can be formally defined, and if there are other known problems for which this is true. This is a similar notion to Impagliazzo’s Heuristica, a world where NP != P, but for any problem in NP, and for inputs drawn from any samplable distribution (i.e., for any “real-world” input), there exists a polynomial time algorithm to decide it.

An example of the kind of input that gives graph isomorphism solvers any trouble.

While I find this an interesting theory problem, and would love to hear opinions on it from theorists, my reason for posting this is to point out that with all the current evidence, graph isomorphism can be assumed to be solvable in randomized polynomial time for any input that you would actually encounter in reality.

November 20, 2008 at 6:03 am Leave a comment

Lendingclub.com: A De-anonymization Walkthrough

The AOL and Netflix privacy incidents have shown that people responsible for data release at these companies do not put themselves in the potential attacker’s shoes in order to reason about privacy. The only rule that is ever applied is “remove personally identifiable information,” which has been repeatedly shown not to work. This fallacy deserves a post of its own, and so I will leave it at that for now.

The reality is that there is no way to guarantee privacy of published customer data without going through complex, data-driven reasoning. So let me give you an attacker’s-eye-view account of a de-anonymization I carried out last week—perhaps an understanding of the adversarial process will help reason about the privacy risks of data release.

Lending Club, a company specializing in peer-to-peer loans, makes the financial information collected from their customers (borrowers) publicly available. I learned of this a week ago, and there are around 4,600 users in the dataset as of now. This could be a textbook example illustrating a variety of types of sensitive information and a variety of attack methods to identify the individuals! Each record contains the following information:

I. Screen name
II. Loan Title, Loan Description
III. Location, Hometown, Home Ownership, Current Employer, Previous Employers, Education, Associations
IV. Amount Requested, Interest Rate, APR, Loan Length, Amount Funded, Number of Lenders, Expiration Date, Status, Application Date
V. Credit Rating, Tenure, Monthly Income, Debt-To-Income Ratio, FICO Range, Earliest Credit Line, Open Credit Lines, Total Credit Lines, Revolving Credit Balance, Revolving Line Utilization, Inquiries in the Last 6 Months, Accounts Now Delinquent, Delinquent Amount, Delinquencies (Last 2 yrs), Months Since Last Delinquency, Public Records On File, Months Since Last Record

What data is sensitive?

Of course, any of the above fields might be considered sensitive by one or another user, but there are two types of data that are of particular concern: financial data and the loan description. The financial data includes monthly income, credit rating and FICO credit score; enough said. Loan description is an interesting column. A few users just put in “student loans” or “consolidate credit card debt.” However, a more informative description is the norm, such as this one:

This loan will be used to pay off my 19% Business Credit Card with AMEX. I have supporting documentation to prove my personal Income. I would much rater get a loan and pay back fixed amount each month rather then being charged more and more each month on the same balance. I can afford to pay at min $800 a month. I have 4 Reserves in the bank and have over 70% of my credit limit open for use.

Often, users reveal a lot about their personal life in the hope of appealing to the emotions of the prospective lender. Here’s an example (this is fairly common in the data):

My husband’s lawyer has told us that we need $5000 up front to pay for his child custody case. We are going to file for primary custody. Right now he has no visitation rights according to their divorce agreement. His ex-wife has been evicted twice in the four months and is living with 2 of their 3 daughters in a two bedroom apartment with her boyfriend. She has no job or car and the only money they have is what we give them in child support and she blows all of it on junk. We have a 2000+ square foot house, both have stable jobs, and our own cars. Both girls(12 and 15 years old) are allowed to go and do whatever they please even though they are failing classes at school. We are clearly the better situation for them to be raised in but we simply do not have that much money all at once. We would be able to pay around $200 per month for repayment.

A few loan descriptions are quite hilarious. This one is my personal favorite.

Who’s the “bad guy” and what might they do with data of this kind, assuming it can be re-identified with the individuals in question? Certainly, it would help shady characters carry out identity theft. But there is also the unpleasant possibility that a customer’s family members or a boss might learn something about them that the customer didn’t intend them to know. The techniques below focus on the former threat model, en masse de-anonymization. The latter is even easier to carry out since human intelligence can be applied.

How to de-anonymize

The “screen name” field

Releasing the screen name seems totally unnecessary. Many people use a unique username everywhere (in fact, this tendency is so strong that there is a website to automate the process of testing your favorite username across websites). Often, googling a username brings up a profile page on other websites. Furthermore, these results can be pruned in an automated way by looking at the profile information on the page. Here is an example (mjchrissy) taken from the Lending Club dataset. By observing that the person in the MySpace profile is in the same geographical location (NJ) as the person in the dataset, we can be reasonably sure that it’s the same person.

To measure the overall vulnerability, I wrote a script to find the estimated Google results count for each username in the dataset, using Google’s search API. If there are fewer than 100 results, I consider the person to be highly vulnerable to this attack; if there are between 100 and 1,000, they are moderately vulnerable. The Google count is only an approximate measure. For example, the estimated count for my standard username (randomwalker) is in the tens of thousands, but most of the results in the first few pages relate to me, and again, this can be confirmed by parsing the profile pages that are found by the search. Also, the query can be made more specific by using auxiliary terms such as “user” and “profile.” For example, the username radiothermal, also from the dataset, appears to be a normal word with tens of thousands of hits, but with the word “profile” thrown in, we get their identity right away.
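The thresholds above are simple enough to write down directly. This is a minimal sketch, not the script mentioned in the post: the function name is made up, it takes a result count that has already been fetched, and it treats a count of zero as low risk, following the summary table further down.

```python
def risk_from_result_count(count):
    """Map an estimated search-result count for a username to a risk level.

    Hypothetical helper: the cut-offs (100 and 1,000) come from the post;
    everything else is an illustrative assumption.
    """
    if count < 0:
        raise ValueError("result count cannot be negative")
    if count == 0:
        return "low"     # nothing found, so nothing to link the username to
    if count < 100:
        return "high"    # few hits: likely a distinctive username
    if count < 1000:
        return "medium"
    return "low"         # too many hits for the raw count alone to identify anyone


for username, count in [("rare_handle", 12), ("commonword", 48000)]:
    print(username, risk_from_result_count(count))
```

As the randomwalker and radiothermal examples show, a "low" label from the raw count is only a first approximation; adding terms such as "profile" to the query can move a username into a higher-risk bucket.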

Some users choose their email address as their username. This can be considered as immediately compromising their identity even if there are no Google search results for it. Finally, there are users who use their real name as their screen name. This is harder to measure, but we can get a lower bound with a clever enough script. (You can find my script here; I’m quite proud of it :-)) The table below summarizes the different types and levels of risk. Note that some of the categories are overlapping; the total number of high-risk records is 1725 and the total number of medium-risk records is 939.

Risk type	Risk level	No. of users
result count = 0	low	1198
0 < result count < 100	high	1610
100 <= result count < 1000	medium	560
1000 <= result count	low	1196
username is email	high	51
either first or last name	medium	429
both first and last name	high	204
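The last three rows of the table can be approximated with a few lines of string matching. The sketch below is an assumption about how such a check might look, not the script linked above: the two name sets are tiny placeholders (a real run would need large lists of first and last names), and the email pattern is deliberately loose.

```python
import re

# Loose pattern: something@something.something, no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical placeholder lists, for illustration only.
FIRST_NAMES = {"john", "mary", "chris"}
LAST_NAMES = {"smith", "jones", "lee"}


def username_name_risk(username):
    """Return "high", "medium" or None based on email/real-name patterns."""
    u = username.lower()
    if EMAIL_RE.match(u):
        return "high"
    found_first = found_last = False
    # Only whole alphabetic tokens are matched, so "lee" inside "sleepy"
    # does not count; this keeps the estimate closer to a lower bound.
    for token in re.findall(r"[a-z]+", u):
        if token in FIRST_NAMES:
            found_first = True
        if token in LAST_NAMES:
            found_last = True
        # Also try splitting a run-together token such as "johnsmith".
        for i in range(1, len(token)):
            if token[:i] in FIRST_NAMES and token[i:] in LAST_NAMES:
                found_first = found_last = True
    if found_first and found_last:
        return "high"
    if found_first or found_last:
        return "medium"
    return None


for name in ["someone@example.com", "johnsmith", "mary1982", "sleepy"]:
    print(name, username_name_risk(name))
```

Because one username can fall into several rows (for example, a rare handle that also contains a full name), the per-row counts overlap, which is why the totals quoted above are smaller than the column sums.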

Location and work-related fields

The combination of hometown, current location, employer, previous employer and education (i.e., college) should be uniquely identifying for modern Americans, considering how mobile we are (except if you live in a rural town and have never left there). In fact, any 3 or 4 of those fields will probably do. As a sanity check, I looked for records that share all of these fields within the database itself.

Amusingly, there were around 40 duplicates and even a few triplicates, but all of these turned out to be people re-listing their information in the hope of increasing their chances of getting funded. Since the dataset consists of only approved loans, all of these people were approved multiple times! This is a great example of how k-anonymity breaks down in a natural way. [k-anonymity is an intuitively appealing but fundamentally flawed approach to protecting privacy that tries to make each record indistinguishable from a few other records. Here is a technical paper showing that k-anonymity and related methods such as l-diversity are useless. This is again something that deserves its own post, and so I won’t belabor the point.]
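The duplicate check described above amounts to grouping records by their quasi-identifier fields and looking for groups larger than one. A minimal sketch follows; the field names are hypothetical stand-ins for the dataset's columns, and the normalization (trimming and lowercasing) is an assumption.

```python
from collections import Counter

# Hypothetical column names standing in for the fields listed in group III.
QUASI_IDENTIFIERS = ("hometown", "location", "employer",
                     "previous_employer", "education")


def duplicate_groups(records, fields=QUASI_IDENTIFIERS):
    """Return {field-tuple: count} for every combination seen more than once."""
    def key(record):
        return tuple((record.get(f) or "").strip().lower() for f in fields)

    counts = Counter(key(r) for r in records)
    return {k: n for k, n in counts.items() if n > 1}


records = [
    {"hometown": "Austin", "location": "Dallas", "employer": "Acme",
     "previous_employer": "Initech", "education": "UT"},
    {"hometown": "austin", "location": "Dallas ", "employer": "Acme",
     "previous_employer": "Initech", "education": "UT"},   # a re-listing
    {"hometown": "Boston", "location": "Dallas", "employer": "Acme",
     "previous_employer": "Initech", "education": "MIT"},
]
print(duplicate_groups(records))
```

Any combination that does not show up in the result is shared by exactly one record, i.e. it is unique within the dataset.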

While I’m sure that auxiliary information exists to de-anonymize people based on these fields, I’m not sure what the easiest way to get it is, considering that it needs to be put together from a variety of different sources. Companies such as Choicepoint probably have this data in one place already, but you need a name or social security number to search. Instead, screen-scraping social network sites would be a good way to start aggregating this information. Once auxiliary information is available, the re-identification process is trivial algorithmically.
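To make "trivial algorithmically" concrete, here is one way the matching step could look once auxiliary profiles are in hand. It is a sketch under stated assumptions: the field names, the `name` key on auxiliary profiles, and the rule of requiring at least three agreeing fields and a single surviving candidate are all illustrative choices.

```python
def reidentify(record, aux_profiles, fields, min_matches=3):
    """Return the single auxiliary profile agreeing with `record` on at
    least `min_matches` of `fields`, or None if zero or several qualify."""
    def norm(value):
        return (value or "").strip().lower()

    hits = []
    for profile in aux_profiles:
        agree = sum(
            1 for f in fields
            if norm(record.get(f)) and norm(record.get(f)) == norm(profile.get(f))
        )
        if agree >= min_matches:
            hits.append(profile)
    # Only an unambiguous match counts as a re-identification.
    return hits[0] if len(hits) == 1 else None


fields = ("hometown", "location", "employer", "education")
record = {"hometown": "Austin", "location": "Dallas",
          "employer": "Acme", "education": "UT"}
aux = [
    {"name": "Person A", "hometown": "Austin", "location": "Dallas",
     "employer": "Acme", "education": "Rice"},
    {"name": "Person B", "hometown": "Boston", "location": "Dallas",
     "employer": "Globex", "education": "MIT"},
]
match = reidentify(record, aux, fields)
print(match["name"] if match else "no unique match")
```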

The “Associations” field

I love this field, since it is very similar to the high dimensional data in the Netflix paper. Since Lending Club was launched as a Facebook application, it appears that they are asking for everyone’s Facebook groups. Anyone who is familiar with de-anonymizing high-dimensional data would know that you only need 3-4 items to uniquely identify a person. It gets worse: the Facebook API allows you to get users’ names and affiliations by searching for group membership. You can use the affiliations field (which is a list of networks you belong to, and is distinct from the group memberships) to narrow things down once you get to a few tens or even hundreds of candidate users. This gives you a person’s identity in the most concrete manner possible: a Facebook id, name and picture.

How many users are vulnerable? Based on manually analyzing a small sample of users, it appears that (roughly) anyone with three or more groups listed is vulnerable, so around 300. (Users with two listed groups may be vulnerable if they are both not very popular, and users with many groups may not be vulnerable if they are all popular, but let’s ignore that.)

Now, automating the de-anonymization is hard, since the group name is presented as free form text. The field separator (comma) that separates different group names in the same cell appears in the names of groups as well! Secondly, the Facebook API doesn’t allow you to search by group name.

I managed to overcome both of these limitations. I wrote a script that evaluates the context around a comma and determines if it occurs at the boundary of a group name or in the middle of it. Mapping a group name to a Facebook group id is a much harder problem. One possible solution is to use a Google search, and parse the “gid” parameter from the URL of matching search results. Example: “Addicted to Taco Bell site:facebook.com.” There are various hacks that can be used to refine it, such as putting the group name in quotes or using Google’s “allinurl:” to match the pattern of the Facebook group page URLs.
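Pulling the "gid" parameter out of a search-result URL is the easy part of this step. A minimal sketch using Python's standard library is below; the function name is made up, and the example URL shape (`group.php?gid=...`) is an assumption about what group page links looked like, so treat it as illustrative.

```python
from urllib.parse import urlparse, parse_qs


def extract_gid(url):
    """Return the value of the 'gid' query parameter for a facebook.com URL,
    or None if the URL is on another host or has no such parameter."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host != "facebook.com" and not host.endswith(".facebook.com"):
        return None
    values = parse_qs(parsed.query).get("gid")
    return values[0] if values else None


# Assumed URL shapes, for illustration only.
print(extract_gid("http://www.facebook.com/group.php?gid=12345"))
print(extract_gid("http://www.facebook.com/profile.php?id=678"))
```

The hard part, as noted next, is deciding which of several returned gids actually corresponds to the group name in the dataset.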

The other strategy, and the one that I pursued, is to use the search feature on Facebook itself. A higher percentage of searches succeed with this approach, but it is harder because I needed to parse the HTML that is returned. With either strategy, the hardest part is in distinguishing between multiple groups that often have almost identical names. My current strategy succeeds for about one-third of the groups, and maps the group name to either a single gid or a short list of candidate gids. I suspect that a combination of Google and Facebook searches would work best. Of course, using human intelligence would increase the yield considerably.

The final step here is to get the group members via the Facebook Query Language, find the users who are common to all the listed groups, and use the affiliations to further prune the set of users. I’ve written the FQL query and verified that it works. Running it en masse is a little slow, however, since the query takes a long time to return. I’ll probably run it when I have some more free time to analyze the results.
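Once the member lists have been fetched, the rest of this step is a set intersection followed by a filter. The sketch below assumes the membership and affiliation data are already in plain dictionaries; it does not show the FQL query itself, and all names and ids are hypothetical.

```python
def candidate_users(listed_gids, members_by_gid,
                    affiliations_by_user=None, affiliation=None):
    """Users who belong to every listed group that could be resolved,
    optionally restricted to those with a given affiliation (network)."""
    member_sets = [set(members_by_gid[g]) for g in listed_gids
                   if g in members_by_gid]
    if not member_sets:
        return set()
    common = set.intersection(*member_sets)
    if affiliation is not None and affiliations_by_user is not None:
        common = {u for u in common
                  if affiliation in affiliations_by_user.get(u, ())}
    return common


# Hypothetical group ids, user ids and networks.
members_by_gid = {
    "g1": ["u1", "u2", "u3"],
    "g2": ["u2", "u3", "u4"],
    "g3": ["u2", "u3", "u5"],
}
affiliations_by_user = {"u2": ["Texas"], "u3": ["New Jersey"]}

print(candidate_users(["g1", "g2", "g3"], members_by_gid))
print(candidate_users(["g1", "g2", "g3"], members_by_gid,
                      affiliations_by_user, "New Jersey"))
```

In this toy example three groups narrow the field to two users, and the affiliation narrows it to one, which mirrors the "3-4 items" observation above.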

Let’s summarize

The interesting thing about this dataset is that Lending Club makes it very clear in their privacy policy that they publish the data in this fashion. And yet, it seems that intuitively, this is an egregious violation of privacy, no matter what the privacy policy might say. I will have more to say on this soon.

Almost everyone in the dataset can be re-identified if their location and work information is known, although this information is a little hard to gather on a large scale. The majority of customers are vulnerable to some extent because of identifying usernames, and more than a third are highly vulnerable. The privacy policy does state that the username will be shared in conjunction with other information, but can users really be expected to be aware of how easy it is to automate re-identification via their username? More importantly, why publish the username? What were they thinking? And certainly, the possibility of re-identification via their group associations must come as a complete surprise to most customers.

In general, what does an attacker need to carry out de-anonymization attacks of the sort described here? A little ingenuity in looking for auxiliary information is a must. Being able to write clients for different APIs, as well as screen-scraping code, is very helpful. Finally, there are a number of tasks involving a little bit of “AI,” such as matching group names, for which there is no straightforward algorithm but where using different heuristics can get you very close to an optimal solution.

Thanks to David Molnar for helping me figure out Facebook’s and Google’s APIs. Thanks to Vitaly Shmatikov and David Molnar for reading a draft of this essay.

November 12, 2008 at 8:14 pm 8 comments

Bay Area Visit/Talk Schedule

I’m visiting the Bay Area for a couple of weeks. (I’m likely to graduate next Summer and looking to move here.) I gave a talk about my work at Stanford today. It went great, I got lots of feedback and met many cool people. Thanks everyone who showed up!

I will be visiting the Research and Search groups at Microsoft on Monday. I’m also speaking at PARC on Tuesday (11-12) and (tentatively) Berkeley on Wednesday. A couple of other people invited me to give talks at their respective organizations today; I will update once I schedule those.

Here’s the talk abstract:

Privacy and Anonymity in a World of Interconnected Data

The new Web economy relies on the collection of personal data on an ever-increasing scale. Data is collected about our tastes, purchases, searches, browsing history, friendships and relationships, health history, genetics and so forth. The aggregated datasets are not stationary: they are shared with advertisers, marketers and researchers for business reasons. Nor does each such dataset exist in isolation: it contains implicit or explicit references to other datasets. Unsurprisingly, this has led to a host of privacy issues.

In this talk, I survey the different types of data that are being collected and shared, and propose theoretical models for analyzing privacy in such datasets. Next, I discuss the subtle relationship between anonymity and privacy and present a few related techniques for de-anonymizing large datasets, accompanied by the results of experiments. Finally, I will touch upon the broader threats to privacy arising from these techniques and discuss possible solutions, which, out of necessity, will have a non-technological dimension.

October 15, 2008 at 9:09 am Leave a comment

Eccentricity Explained

When trying to find someone in an ‘anonymous’ collection of data, two major questions need to be answered:

Which is the best match among all the data records, given what I know about the person I’m looking for?
Is the match meaningful, and not an unrelated record that co-incidentally happens to be similar?

The first question is conceptually simple: one needs to come up with a “scoring function” that compares two sets of data and produces a numerical match score. However, this needs to be done with domain-specific knowledge. In the case of the Netflix dataset, for instance, the scoring function incorporated our intuitions about how long a person might take to review a movie after watching it.

The second question is harder but, perhaps somewhat surprisingly, can be answered in a domain-independent way. This is the notion of “eccentricity” that we came up with, and it might be of independent interest. During my talks, I could see that there was some confusion and misunderstanding time and again; hence this post.

The concept behind eccentricity is to measure how much the matching record “stands out” from the crowd. One way to do that would be to measure the difference between the top score and the mean score, as a multiple of the standard deviation. (You always need to divide by the standard deviation to derive a dimensionless quantity.)

The problem with this intuitive measure is that in a large enough collection of data, there will always be entries that have a high enough matching score, purely by chance. To be able to model what scores you’d expect by chance, you need to know everything about how people rate movies (or browse the web, or the equivalent in your dataset) and the correlations between them. And that’s clearly impossible.

The trick is to look at the difference between the best match and the second best match as a multiple of the standard deviation. If the scores are distributed according to an exponential distribution (that’s what you’d expect in this case by pure chance, not a Gaussian), then the difference between the top two matches also follows an exponential distribution. That’s a nifty little lemma.
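In code, the measure is only a few lines. The following is a minimal sketch assuming the match scores have already been computed by some domain-specific scoring function; the use of the population standard deviation, and returning 0 when all scores are identical, are my assumptions rather than details given here.

```python
import statistics


def eccentricity(scores):
    """(best score - second-best score) / standard deviation of all scores."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate scores")
    ordered = sorted(scores, reverse=True)
    sigma = statistics.pstdev(scores)   # population std dev (an assumption)
    if sigma == 0:
        return 0.0                      # all scores equal: nothing stands out
    return (ordered[0] - ordered[1]) / sigma


clear_winner = [10.0, 2.0, 2.0, 2.0]
near_tie = [10.0, 9.9, 2.0, 2.0]
print(eccentricity(clear_winner))
print(eccentricity(near_tie))
```

A match would then be accepted only if the eccentricity exceeds a chosen threshold; in the second list the top score is just as high, but the runner-up is nearly as good, so the value is close to zero.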

So, if the best match is 10 standard deviations away from the second best, it argues very strongly against the “null hypothesis,” which is that the match occurred by fluke. Visually, the results of applying eccentricity are immediately apparent:

Perhaps the reason that eccentricity is at first counterintuitive is that it looks at only two items and throws away the rest of the numbers. But this is common in statistical estimation. Here’s a puzzle that might help: given n samples a₁, a₂, … a_n drawn uniformly at random from the interval [0, x], where x is unknown, what is the best predictor of x?

–Select whitespace below for answer–

Answer: max(a_i) * (n+1)/n.

In other words, throw away all but one of the samples! Counterintuitive?
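For readers who want to see where the (n+1)/n factor comes from, here is a short derivation, assuming the samples are independent and uniform on [0, x]. It shows that the rescaled maximum is an unbiased estimate of x; the symbols M and f_M are introduced here for the maximum and its density.

```latex
\begin{align*}
M &= \max_i a_i, \qquad
\Pr[M \le t] = \left(\frac{t}{x}\right)^{n} \quad (0 \le t \le x), \\
f_M(t) &= \frac{d}{dt}\left(\frac{t}{x}\right)^{n} = \frac{n\,t^{n-1}}{x^{n}}, \\
\mathbb{E}[M] &= \int_0^x t \cdot \frac{n\,t^{n-1}}{x^{n}}\,dt
              = \frac{n}{n+1}\,x, \\
\mathbb{E}\!\left[\frac{n+1}{n}\,M\right] &= x.
\end{align*}
```

The maximum on its own always falls a little short of x on average, and multiplying by (n+1)/n corrects for exactly that shortfall.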

—

October 3, 2008 at 2:38 am 4 comments