Adversarial Thinking Considered Harmful (Sometimes)
This article starts from the example of a simple privacy mishap and argues that the flawed thinking it exposes is a symptom of a deeper malaise and that the structure of privacy research in computer science might require rethinking.
I was surprised by a statement in a recent blog post by Geni, a genealogy-based social networking site, that plainly asserted, “following does not have any privacy implications.” This was in reference to the feature to “follow” a user or profile on the site, which among other things notifies you instantly of new information or activity about the person. (Admirably, however, Geni listened to their users and made some changes to the feature.)
Of course following has privacy implications. Without the follow feature — not just on Geni but on virtually every site that provides an equivalent capability — to obtain the same level of up-to-date information about a person, you’d have to either sit around constantly refreshing their profile or else write a bot that will do that for you and notify you of any updates by email. It is precisely because of this vast difference in the ease of keeping track of people that there was a backlash when Facebook introduced News Feed several years ago.[1]
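To make the comparison concrete, here is a minimal sketch of the change-detection core of such a bot. It is an illustration under stated assumptions, not a description of any real tool: the `fetch` callable is hypothetical and stands in for whatever would retrieve the profile text, and the email notification step is left out.

```python
import hashlib
from typing import Callable, Optional


def fingerprint(profile_text: str) -> str:
    """Return a digest so two snapshots can be compared cheaply."""
    return hashlib.sha256(profile_text.encode("utf-8")).hexdigest()


class ProfileWatcher:
    """Remembers the last snapshot of one profile and reports changes."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        self._fetch = fetch  # hypothetical: returns the current profile text
        self._last: Optional[str] = None

    def poll(self) -> bool:
        """Fetch once; return True if the profile changed since the last poll."""
        current = fingerprint(self._fetch())
        changed = self._last is not None and current != self._last
        self._last = current
        return changed


# Simulated profile that changes between the second and third poll.
snapshots = iter(["status: hello", "status: hello", "status: moved to Boston"])
watcher = ProfileWatcher(lambda: next(snapshots))
results = [watcher.poll() for _ in range(3)]
```

Even this toy version needs somewhere to run, a schedule, and a way to deliver notifications, which is the effort the follow feature removes.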
Why then would anyone claim that following has no privacy implications? The culprit here is “adversarial thinking,” an analytical process that computer scientists and security engineers are trained in. Under this paradigm, users are viewed as all-powerful “adversaries” (limited only by the fundamental computational limits of nature), typically interested in learning as much information about everyone as possible. Clearly, if everyone is an “adversary,” the follow feature makes not a whit of difference, since anyone could create and operate the bot mentioned above with no effort at all.[2]
Weird as it may seem to the uninitiated, adversarial thinking is second nature to computer scientists. It is adversarial thinking that leads to the formulation of privacy as an access-control problem, something that I’ve criticized; the Geni blog post explicitly mentions this as their formulation of privacy. Privacy-as-access-control makes for neat papers but tends to break down quickly in the real world.
Let me be clear: adversarial thinking is a deep and valuable skill that is indispensable in the context that it is meant for — designing cryptosystems. However, it is not always the right paradigm in the privacy context. The theoretical study of database privacy seems to be doing rather well by borrowing methods from cryptography, and I’ve argued in support of adversarial thinking therein. On the other hand, social networking privacy falls squarely in the class of studies in which I find the adversarial approach to have limited value.
There’s a bigger take-away here: the structure of privacy research within computer science might require rethinking. Privacy is currently not considered a first-rate topic but is instead a side-interest of different communities such as security, cryptography and databases/datamining. As a result of this lack of primacy, not only do we frequently use the wrong methods — when all you’ve got is a hammer, everything looks like a nail — we’re also missing out on the chance to borrow from the literature on privacy in fields like law, economics, sociology, and human-computer interaction.
Endnotes
[1] This is not the only reason why the follow feature has privacy implications. On LiveJournal, being followed by people with offensive usernames is sometimes a problem, compounded by the fact that due to the UI, it is not obvious who is following whom. In fact, the privacy changes made by Geni seem intended to address roughly this type of concern rather than the ease-of-tracking issue.
[2] While the term adversary is standard, adversarial thinking is a term I’ve coined here to describe a somewhat loose collection of axioms (including, for example, Kerckhoffs’s principle) that constitute the dominant paradigm of cryptography/security. I don’t think there is an extant term; I’d love to be corrected.
Thanks to Aleksandra Korolova for comments on a draft.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
Facebook’s Instant Personalization: An Analysis of Fundamental Privacy Flaws
Facebook has begun to accelerate the web-wide roll-out of the Instant Personalization program. The number of partner websites recently jumped from three to five, and a partnership with early stage venture firm YCombinator is set to greatly expand that number in the coming months.[1]
Instant Personalization allows a partner website to automatically learn the identity of a visitor (as well as some data about them) without any explicit user action, provided that the visitor is a logged-in Facebook user. It is probably the most privacy-intrusive change introduced by the company this year, and could lead to a profound change in how the web works and is perceived.
Facebook’s superficially reassuring line is that only data that is already public is shared with partner sites. Even ignoring the fact that it is hard for users to figure out exactly what data is public, and is only getting harder, I find the official explanation to be a red herring. In this article I will examine the various fundamental flaws of Instant Personalization.
1. Sneakiness. All the information transmitted via Instant Personalization is available via Facebook Connect; the sole purpose of Instant Personalization is to eliminate the element of user authorization from the process. Thus, I find the very raison d’être to be questionable. If a user declines to use Facebook Connect, perhaps they had a good reason for doing so. Think about a porn site; I don’t think I need to elaborate.
2. Identity. To me, what is much more worrisome than third parties getting your data is third parties getting your identity when you browse. The idea that a website knows who you are as soon as you land on it is inherently creepy because it violates users’ mental model of how the web works. The cumulative effect is worse — people are intensely uncomfortable when they feel they are being “followed around” as they browse the web.
From a technical perspective, an Instant Personalization partner could itself turn around and become an Instant Personalization provider, and so could any website that this partner provided Instant Personalization services for, ad infinitum. This is because any number of tracking devices (invisible iframes) can be nested within a page.
Implementation bugs on partner sites also have the effect of leaking your identity to other parties. In my ubercookies series, I documented a series of bugs that can be exploited by an arbitrary website to learn the visitor’s identity. All of these apply to Instant Personalization, i.e., if any one of the partner websites has such a bug, that can be exploited by an arbitrary attacker to instantly de-anonymize a visitor to his site. Security researcher theharmonyguy has a great post on cross-site scripting vulnerabilities on both Rotten Tomatoes and Scribd that compromise Instant Personalization in this fashion.[2]
3. Facebook gets your clickstream. Instant Personalization is a two-way street: while the partner site gets access to the user’s identity, Facebook learns the URLs of the pages the user visits. In a world where Instant Personalization is widely deployed, Facebook will be able to monitor a large fraction, perhaps the majority, of clicks that you make around the web.
While troubling, this is not unprecedented: the Facebook like button constitutes a very similar privacy problem. Facebook sees you whenever you visit any page with the like button (or another social plugin) installed, even if you don’t click the like button.[3] Facebook bowed to pressure from privacy advocates and agreed to delete the logs from social plugins after 90 days; I would like to see the same policy applied to Instant Personalization logs as well.
4. Third parties could get your clickstream. Normally, an Instant Personalization partner can only see your clicks on their own site. However, think of an Instant Personalization partner whose product is a social widget or an analytics plugin that is intended to be installed on many client sites. From a technical perspective, loading a page or widget in an iframe is not fundamentally different from visiting the site directly. That means it is feasible for an Instant Personalization partner with a social widget to monitor your clicks — tied to your real identity, of course — on all sites with the widget installed.[4]
5. Lack of enforcement. So far I have described the lack of technological barriers to various types of misuse and abuse of Instant Personalization. However, Facebook contractually prohibits partners from misusing the data. The natural question is whether this is effective.
It is too early to tell yet, because there are currently only five partners. To predict how things will turn out once numerous startups — without the resources or incentive for security testing and privacy compliance — get on board, we can look to the track-record of Facebook’s third party application platform. As you may recall, this has been rather poor, with enforcement of Terms of Service violations being haphazard at best.
Mitigation. In my opinion these flaws are inherent, and I don’t think Instant Personalization will turn out well from a security and privacy perspective. User expectations are not malleable, cross-site scripting bugs will always exist, there will soon be too many partner sites to monitor closely, and some of them will look for ways to push the boundaries of what they can do.
However, there are two things Facebook can do to mitigate the extent of the damage. The first is to make public both the technical specification and the Terms of Use of the Instant Personalization program, so that there can be some independent monitoring of bugs and policy violations. The second is to commit resources to ToS enforcement — Facebook needs to signal that their enforcement efforts have some teeth, and that there will be penalties for partners with buggy sites or noncompliant data use practices.
Footnotes.
[1] YCombinator-funded companies will get “priority access” to various Facebook technologies including “Facebook Credits, Instant Personalization and upcoming beta features”. Interestingly, Instant Personalization seems to be the feature that YCombinator is most interested in.
[2] Yelp.com was also found vulnerable to a cross-site scripting bug soon after Instant Personalization launch. This means the majority of partner sites — 3 out of 5 — have had vulnerabilities that compromise Instant Personalization.
[3] In Instant Personalization, Facebook and the partner site communicate invisibly in the background each time the user visits a page on the partner site; in this way the mechanism is different from social widgets.
[4] Large-scale clickstream data is prone to misuse in various ways: government coercion, hacking, or being purchased as part of bankruptcy settlements (especially when we’re talking about startups).
Thanks to Kevin Bankston for pointing me to Facebook’s log retention policy for social plugins.
“Do Not Track” Explained
While the debate over online behavioral advertising and tracking has been going on for several years, it has recently intensified due to media coverage — for example, the Wall Street Journal What They Know series — and congressional and senate attention. The problems are clear; what can be done? Since purely technological solutions don’t seem to exist, it is time to consider legislative remedies.
One of the simplest and potentially most effective proposals is Do Not Track (DNT), which would give users a way to opt out of behavioral tracking universally. It is a way to move past the arms race between tracking technologies and defense mechanisms, focusing on the actions of the trackers rather than their tools. A variety of consumer groups and civil liberties organizations have expressed support for Do Not Track; Jon Leibowitz, chairman of the Federal Trade Commission, has also indicated that DNT is on the agency’s radar.
Not a list. While Do Not Track is named in analogy to the Do Not Call registry, and the two are similar in spirit, they are very different in implementation. Early DNT proposals envisaged a registry of users, or a registry of tracking domains; both are needlessly complicated.
The user-registry approach has various shortcomings, at least one of which is fatal: there are no universally recognized user identifiers in use on the Web. Tracking is based on ad-hoc identification mechanisms, including cookies, that the ad networks deploy; by mandating a global, robust identifier, a user registry would in one sense exacerbate the very problem it attempts to solve. It also leaves the user little flexibility to configure DNT on a site-by-site basis.
The domain-registry approach involves mandating ad networks to register domains used for tracking with a central authority. Users would have the ability to download this list of domains and configure their browser to block them. This strategy has multiple problems, including: (i) the centralization required makes it fragile; (ii) it is not clear how to block tracking domains without blocking ads altogether, since displaying an ad requires contacting the server that hosts it; and (iii) it requires a level of consumer vigilance that is unreasonable to expect, such as making sure that the domain list is kept up-to-date by every piece of installed web-enabled software.
The header approach. Today, consensus has been emerging around a far simpler DNT mechanism: have the browser signal to websites the user’s wish to opt out of tracking, specifically via an HTTP header such as “X-Do-Not-Track”. The header is sent out with every web request: the page the user wishes to view, as well as each of the objects and scripts embedded within the page, including ads and trackers. It is trivial to implement in the web browser; indeed, there is already a Firefox add-on that implements such a header.
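As a rough illustration of how little machinery the header mechanism needs, the sketch below attaches the header on the client side and checks for it on the receiving side. The header name comes from the proposal above; the value "1", the function names, and the example URL are assumptions made only for this example. No network request is actually sent.

```python
import urllib.request
from typing import Mapping

DNT_HEADER = "X-Do-Not-Track"  # name from the text; the value "1" is an assumption


def build_request(url: str, opt_out: bool) -> urllib.request.Request:
    """Client side: attach the opt-out header to an outgoing request."""
    request = urllib.request.Request(url)
    if opt_out:
        request.add_header(DNT_HEADER, "1")
    return request


def wants_no_tracking(headers: Mapping[str, str]) -> bool:
    """Receiving side: header names are case-insensitive, so normalize first."""
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get(DNT_HEADER.lower()) == "1"


# The same helper would be used for the page itself and for every embedded
# object, including requests that go to third-party ad servers.
page = build_request("http://example.com/article", opt_out=True)
page_opted_out = wants_no_tracking(dict(page.header_items()))
```

What a compliant server must do once `wants_no_tracking` returns true is exactly the policy question of how tracking is defined; the code only shows that signalling the preference is cheap.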
The header-based approach also has the advantage of requiring no centralization or persistence. But in order for it to be meaningful, advertisers will have to respect the user’s preference not to be tracked. How would this be enforced? There is a spectrum of possibilities, ranging from self-regulation via the Network Advertising Initiative, to supervised self-regulation or “co-regulation,” to direct regulation.
At the very least, by standardizing the mechanism and meaning of opt-out, the DNT header promises a greatly simplified way for users to opt-out compared to the current cookie mechanism. Opt-out cookies are not robust, they are not supported by all ad networks, and are interpreted variously by those that do (no tracking vs. no behavioral advertising). The DNT header avoids these limitations and is also future-proof, in that a newly emergent ad network requires no new user action.
In the rest of this article, I will discuss the technical aspects of the header-based Do Not Track proposal. I will discuss four issues: the danger of a tiered web, how to define tracking, detecting violations, and finally user-empowerment tools. Throughout this discussion I will make a conceptual distinction between content providers or publishers (2nd party) and ad networks (3rd party).
Tiered web. Harlan Yu has raised a concern that DNT will lead to a tiered web in which sites will require users to disable DNT to access certain features or content. This type of restriction, if widespread, could substantially undermine the effectiveness of DNT.
There are two questions to address here: how likely is it that DNT will lead to a tiered web, and what, if anything, should be done to prevent it. The latter is a policy question — should DNT regulation prevent sites from tiering service — so I will restrict myself to the former.
Examining ad blocking allows us to predict how publishers, whether acting by themselves or due to pressure from advertisers, might react to DNT. From the user’s perspective, assuming DNT is implemented as a browser plug-in, ad blocking and DNT would be equally easy to install and, as necessary, disable for certain sites. And from the site’s perspective, ad blocking would result in a far greater decline in revenue than merely preventing behavioral ads. We should therefore expect that DNT will be at least as well tolerated by websites as ad blocking.
This is encouraging, since there are very few mainstream sites today that refuse to serve content to visitors with ad blocking enabled. Ad blocking is quite popular (indeed, the most popular extensions for both Firefox and Chrome are ad blockers). A few sites have experimented with tiering for ad-blocking users, but rescinded it soon after due to user backlash. Public perception is another factor that is likely to skew things even further in favor of DNT being well-tolerated: access to content in exchange for watching ads sounds like a much more palatable bargain than access in exchange for giving up privacy.
One might nonetheless speculate what a tiered web might look like if the ad industry, for whatever reason, decided to take a hard stance against DNT. It is once again easy to look to existing technologies, since we already have a tiered web: logged-in vs anonymous browsing. To reiterate, I do not believe that disabling DNT as a requirement for service will become anywhere near as prevalent as logging in as a requirement for service. I bring up login only to make the comforting observation that there seems to be a healthy equilibrium between sites that require login always, some of the time, or never.
Defining tracking. It is beyond the scope of this article to give a complete definition of tracking. Any viable definition will necessarily be complex and comprise both technological and policy components. Eliminating loopholes and at the same time avoiding collateral damage — for example, to web analytics or click-fraud detection — will be a tricky proposition. What I will do instead is bring up a list of questions that will need to be addressed by any such definition:
- How are 2nd parties and 3rd parties delineated? Does DNT affect 2nd-party data collection in any manner, or only 3rd parties?
- Are only specific uses of tracking (primarily, targeted advertising) covered, or is all cross-site tracking covered by default, save possibly for specific exceptions?
- Under use-cases covered (i.e., prohibited) under DNT, can 3rd parties collect any individual data at all or should no data be collected? What about aggregate statistical data?
- If individual data can be collected, what categories? How long can it be retained, and for what purposes can it be used?
Detecting violations. The majority of ad networks will likely have an incentive to comply voluntarily with DNT. Nonetheless, it would be useful to build technological tools to detect tracking or behavioral advertising carried out in violation of DNT. It is important to note that since some types of tracking might be permitted by DNT, the tools in question are merely aids to determine when a further investigation is warranted.
There are a variety of passive (“fingerprinting”) and active (“tagging”) techniques to track users. Tagging is trivially detectable, since it requires modifying the state of the browser. As for fingerprinting, everything except for IP address and the user-agent string requires extra API calls and network activity that is in principle detectable. In summary, some crude tracking methods might be able to pass under the radar, while the finer grained and more reliable methods are detectable.
Detection of impermissible behavioral advertising is significantly easier. Intuitively, two users with DNT enabled should see roughly the same distribution of advertisements on the same web page, no matter how different their browsing history. In a single page view, there could be differences due to fluctuating inventories, A/B testing, and randomness, but in the aggregate, two DNT users should see the same ads. The challenge would be in automating as much of this testing process as possible.
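One way to put a number on "roughly the same distribution" is the total variation distance between the ad frequencies that two test profiles observe over many page loads. The sketch below is an assumption about how such a test could be automated, not an established method: the ad labels and the sample data are invented, and choosing a threshold that separates normal inventory noise from targeting is left open.

```python
from collections import Counter
from typing import Dict, Iterable


def ad_distribution(impressions: Iterable[str]) -> Dict[str, float]:
    """Turn a log of observed ad labels into relative frequencies."""
    counts = Counter(impressions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no impressions recorded")
    return {ad: n / total for ad, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """0.0 means identical distributions; 1.0 means no ads in common."""
    ads = set(p) | set(q)
    return 0.5 * sum(abs(p.get(ad, 0.0) - q.get(ad, 0.0)) for ad in ads)


# Invented data: three DNT-enabled profiles loading the same page 100 times.
profile_a = ad_distribution(["cars"] * 50 + ["shoes"] * 50)
profile_b = ad_distribution(["cars"] * 48 + ["shoes"] * 52)
profile_c = ad_distribution(["golf"] * 90 + ["cars"] * 10)

noise_level = total_variation(profile_a, profile_b)  # small: looks like randomness
suspicious = total_variation(profile_a, profile_c)   # large: worth investigating
```

A large distance is only a signal that further investigation is warranted, in keeping with the caveat above that these tools are aids rather than proof of a violation.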
User empowerment technologies. As noted earlier, there is already a Firefox add-on that implements a DNT HTTP header. It should be fairly straightforward to create one for each of the other major browsers. If for some reason this were not possible for a specific browser, an HTTP proxy (for instance, based on privoxy) is another viable solution, and it is independent of the browser.
A useful feature for the add-ons would be the ability to enable/disable DNT on a site-by-site basis. This capability could be very powerful, with the caveat that the user-interface needs to be carefully designed to avoid usability problems. The user could choose to allow all trackers on a given 2nd party domain, or allow tracking by a specific 3rd party on all domains, or some combination of these. One might even imagine lists of block/allow rules similar to the Adblock Plus filter lists, reflecting commonly held perceptions of trust.
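The three kinds of exception described above (per publisher, per tracker, or a specific combination) can be modeled as a small rule table. This is a minimal sketch under assumed names; the class, its fields, and the example domains are all hypothetical, and a real add-on would also need the carefully designed user interface mentioned above.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class DntPreferences:
    """Per-user exceptions to a default of sending the DNT header everywhere."""

    # All trackers are allowed on these publisher (2nd party) domains.
    allowed_publishers: Set[str] = field(default_factory=set)
    # These tracker (3rd party) domains are allowed on every site.
    allowed_trackers: Set[str] = field(default_factory=set)
    # Specific (publisher, tracker) combinations that are allowed.
    allowed_pairs: Set[Tuple[str, str]] = field(default_factory=set)

    def send_dnt(self, publisher: str, tracker: str) -> bool:
        """Return True if the header should accompany this request."""
        if publisher in self.allowed_publishers:
            return False
        if tracker in self.allowed_trackers:
            return False
        if (publisher, tracker) in self.allowed_pairs:
            return False
        return True


prefs = DntPreferences(
    allowed_publishers={"news.example"},
    allowed_trackers={"trusted-ads.example"},
    allowed_pairs={("shop.example", "recs.example")},
)
decision = prefs.send_dnt("blog.example", "recs.example")  # no rule matches
```

Shared rule lists in the spirit of the Adblock Plus filter lists would then amount to distributing pre-filled instances of such a table.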
To prevent fingerprinting, web browsers should attempt to minimize the amount of information leaked by web requests and APIs. There are 3 contexts in which this could be implemented: by default, as part of the existing private browsing mode, or in a new “anonymous browsing mode.” While minimizing information leakage benefits all users, it helps DNT users in particular by making it harder to implement silent tracking mechanisms. Both Mozilla and reportedly the Chrome team are already making serious efforts in this direction, and I would encourage other browser vendors to do the same.
A final avenue for user empowerment that I want to highlight is the possibility of achieving some form of browser history-based targeting without tracking. This gives me an opportunity to plug Adnostic, a Stanford-NYU collaborative effort which was developed with just this motivation. Our whitepaper describes the design as well as a prototype implementation.
This article is the result of several conversations with Jonathan Mayer and Lee Tien, as well as discussions with Peter Eckersley, Sid Stamm, John Mitchell, Dan Boneh and others. Elie Bursztein also deserves thanks for originally bringing DNT to my attention. Any errors, omissions and opinions are my own.
Women in Tech: How Anonymity Contributes to the Problem
Like Michael Arrington, I too have sat on the sidelines of the debate on women in tech. Unlike Michael Arrington, I did so because nobody asked for my opinion. There is, however, one aspect of the debate that I’m qualified to comment on.
The central issue seems to be whether the low participation rate of women in technology is due to a hostile environment in the tech industry (e.g., sexism, overt or covert) or due to external factors, whether genetic or social, that influence women to pick career paths other than technology without even giving it a shot.
Arrington thinks it’s the latter, and makes a strong case for his position. In response, many have pointed out various behaviors common in the tech industry that make it unappealing to women. Jessica B. Hamrick talks about rampant elitism which affects women disproportionately. What I’m more interested in today is Michelle Greer’s account of being viciously attacked for a relatively innocuous comment on Arrington’s post.
Let me come right out and say it: while I am a defender of the right to anonymous speech, I believe it has no place whatsoever in the vast majority of discussion forums. The reason is simple: there is something about anonymity that completely dismantles our evolved social norms and civility and makes us behave like apes. Not all of us, to be sure, but it only takes a few to ruin it for everyone. Or to put it in plainer terms:
There is no doubt that sexist comments online — the vast majority of them anonymous — contribute hugely to the problem of tech being a hostile environment for women. While there are rude comments directed at everyone, just look around if you need convincing that the ones that attack someone specifically for being female tend to be much more depraved. It is also true that rude behavior online is not limited to tech fields, but it creates more of a barrier there because online participation is essential for being relevant.
Here’s my suggestion to everyone who’d like to do something to make tech less hostile to women: perhaps the best return on your time that you can get is by making anonymous, unmoderated comments a thing of the past. Abolish it on your own sites, and write to other site admins and educate them about the importance of this issue. And when you see an uncivil comment, either educate or ignore the person, but try not to get enraged — you’d be feeding the troll.
Thanks to Ann Kilzer for reviewing a draft.
Thoughts on the White House/DHS Identity Plan
The White House and the Department of Homeland Security have come out with an initiative called the National Strategy for Trusted Identities in Cyberspace (NSTIC). Before you ask, no, government people don’t ever plan to stop using the word “cyberspace.”☺
The NSTIC is a vision for identities online, with the Government exerting a significant level of authority over the system and the implementation being carried out largely by the private sector. I earlier submitted technical comments on the draft coauthored with Stanford colleague Jonathan Mayer. This post reflects my personal view of the NSTIC.
Depth. The first thing one notices about the NSTIC strategy document is that it is quite high-level. A lot of the details are going to hinge on a separate implementation plan document. There is an early draft of the implementation document being kicked around, but it is not yet available to the public. The public comment period for the strategy document has closed.
Scope. Identity is a highly overloaded term. The NSTIC doesn’t offer a clear definition; in our comment we analyzed the different flavors and aspects of identity that are being addressed: (i) plain old self-asserted online identity, (ii) linking real-world ID to online ID, (iii) public-key infrastructure, (iv) attribute authentication, (v) anonymous credentials, (vi) credential management, and (vii) identity interoperability. The NSTIC tries to do a lot.
Why identity? What problems does it solve? Many others have commented on the fact that the hard problems of cybersecurity, such as malware, cannot really be solved by identity; moreover, solving them might be a requirement for getting an identity infrastructure to work, rather than an outcome. In our technical document we offer a detailed breakdown of what security threats exist today and how the identity plan addresses (or fails to address) each of them.
The NSTIC claims to have been developed in response to a variety of cybersecurity threats. It appears to me, however, that the main goal here is to develop an identity system, and the cybersecurity motivations were tacked on as afterthoughts. While I don’t see a problem with wanting an identity infrastructure for its own sake, it is important not to view it as some kind of panacea.
Process. The only public comment process was via the ideascale website and lasted about 3 weeks. Creating a Web 2.0 crowdsourcing site with a cute theme and opening up the gates is a solution to some problems, but not all. Sometimes actual expertise is required. I find it hilarious that one of the few comments with a deep grasp of the technical issues — by the ACM — was voted down to a -2.
It’s unfortunate that there was no effort to get the input of CS security researchers in the drafting of the NSTIC. There are many aspects of the plan whose implementation involves research questions, and it is not clear how they are going to be solved. I sure hope there will be more outreach to the security community as the process moves forward.
Summary. I definitely see a role for government leadership in the identity space, and I am glad that this is being worked on. At the same time, there are a variety of concerns with the proposal, including scope creep (look what happened to SSN), malware and software security, hardware security and vested interests (especially since they’re talking about smart cards, DRM, etc.), and usability. It is too early to tell how this is going to turn out. Let’s keep our fingers crossed.
Thanks to Jonathan Mayer for comments on a draft of this blog post.
What Every Developer Needs to Know About “Public” Data and Privacy
It is natural for developers building web applications to operate under a public/private dichotomy, the assumption being that if a user made a piece of data public, then they’ve given up any privacy expectation. But as we saw in a previous article, users often expect more subtle distinctions, and many unfortunate privacy blunders have resulted. To avoid repeats of these, engineers need to be able to reason about the privacy implications of specific technical features. This article presents a set of criteria for doing so.
1. Archiving
Computers are designed to keep data around forever unless explicitly deleted. But this assumption makes many nontechnical people deeply uncomfortable. There have been a number of proposals to “make the Internet forget,” bringing it in line with humans’ anthropomorphic expectations. While nothing much will probably result from these broad proposals, there need to be some controls on archiving, especially by third parties. Here are three examples that illustrate why this is important:
- A woman was fired from her job recently because her employer found some of her online revelations objectionable. She got caught because Topsy, a Twitter search engine, retained her personal data in its cache even after she had deleted it from Twitter.
- Joe Bonneau revealed that the vulnerability of photo-sharing sites failing to delete photos from their CDN caches persists on many sites, a full year after it was first made public and received media attention.
- Facebook acted in a heavy-handed manner in its recent spat with Pete Warden. The company’s rationale for prohibiting crawlers seems to be that they want to impose fine-grained restrictions on third party data use. Nontrivial policies can be specified via the Terms of Use, but not via robots.txt.
The examples above show a clear need for a standard for machine-readable third-party data retention policy — a robots.txt on steroids, if you will. Pete Warden proposed expanding robots.txt a few months ago; now that multiple sites are facing this problem, perhaps there will be some momentum in this direction.
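To give a feel for what a "robots.txt on steroids" might look like, the sketch below parses an invented extension of the robots.txt syntax. No such standard exists: the `Max-retention-days` and `Honor-deletions` directives, the sample policy, and the function names are all hypothetical and serve only to show that a machine-readable retention policy would be easy for a crawler to consume.

```python
from typing import Dict, Optional

# Hypothetical policy file; only the "User-agent" line mirrors real robots.txt.
SAMPLE_POLICY = """\
User-agent: *
Max-retention-days: 30   # cached copies must expire within a month
Honor-deletions: yes

User-agent: examplebot
Max-retention-days: 0    # this crawler may not archive anything
"""


def parse_retention_policy(text: str) -> Dict[str, Dict[str, str]]:
    """Group directives by the user agent they apply to."""
    policies: Dict[str, Dict[str, str]] = {}
    current: Optional[Dict[str, str]] = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() == "user-agent":
            current = policies.setdefault(value.lower(), {})
        elif current is not None:
            current[key.lower()] = value
    return policies


def max_retention_days(policies: Dict[str, Dict[str, str]], agent: str) -> Optional[int]:
    """Look up the limit for a crawler, falling back to the wildcard section."""
    rules = policies.get(agent.lower(), policies.get("*", {}))
    value = rules.get("max-retention-days")
    return int(value) if value is not None else None


policies = parse_retention_policy(SAMPLE_POLICY)
default_limit = max_retention_days(policies, "somecrawler")
```

A crawler that honored such a file would consult `max_retention_days` before caching a page and schedule its own deletion accordingly.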
2. Real-time
The real-time web relies on “pushing” updates to clients instead of the traditional model of crawling. The push model greatly improves timeliness and machine load, but the problem is that there is typically no way to delete or update existing items in real-time.
This fact bites me on a regular basis. When I make a blog post, Google Reader gets hold of it immediately, but if I realize I wrote something stupid and update the post, it doesn’t show up for several hours because updates don’t propagate through the real-time mechanism.
Or consider tweets: if you tweet something inappropriate and delete it a second later, it might be too late: Twitter’s partners could have already gotten hold of it through the “firehose,” and it might already be displayed on a sidebar on some other site.
Google’s “undo send” feature is a great solution to this type of problem — it holds the message in a queue for a few seconds before sending it out. Every real-time system needs such a panic feature!
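The hold-then-send idea generalizes to any push system. The following is a minimal sketch of such a queue, not a description of how Google implements the feature: the class name, the five-second window, and the explicit `now` argument (used instead of a real clock so the behavior is easy to follow) are all assumptions.

```python
import itertools
from typing import Dict, List, Tuple


class DelayedOutbox:
    """Holds each message for a short window during which it can be cancelled."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self._pending: Dict[int, Tuple[float, str]] = {}
        self._ids = itertools.count(1)

    def submit(self, message: str, now: float) -> int:
        """Queue a message; it becomes eligible to send after the hold window."""
        msg_id = next(self._ids)
        self._pending[msg_id] = (now + self.hold_seconds, message)
        return msg_id

    def cancel(self, msg_id: int) -> bool:
        """The panic button: succeeds only if the message has not gone out yet."""
        return self._pending.pop(msg_id, None) is not None

    def release_due(self, now: float) -> List[str]:
        """Remove and return, oldest first, the messages whose window has passed."""
        due = sorted(
            (release_at, msg_id)
            for msg_id, (release_at, _) in self._pending.items()
            if release_at <= now
        )
        return [self._pending.pop(msg_id)[1] for _, msg_id in due]


outbox = DelayedOutbox(hold_seconds=5.0)
kept = outbox.submit("Lunch at noon?", now=0.0)
regretted = outbox.submit("Something inappropriate", now=1.0)
cancelled = outbox.cancel(regretted)   # withdrawn inside the window
sent = outbox.release_due(now=5.0)     # only the first message goes out
```

The trade-off is a few seconds of added latency for every message, which is why the window has to stay short in a system whose selling point is timeliness.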
3. Search
While making data searchable greatly increases its utility, it also dramatically increases the privacy risks. It is tempting to tell users to get used to the fact that everything they write is searchable, but that hasn’t been successful so far, as IRSeek found out when they tried to launch an IRC search engine. There are entire companies like ReputationDefender that help you clean up the web search results for your name.
The lack of searchability of your site can be a feature. This is obviously not true for the majority of sites, but it is worth keeping in mind. One major reason why LiveJournal has a “closed” feel (which is a big part of its appeal) is that posts don’t rank well in Google searches, if they are indexed at all. For example, LiveJournal posts have a numeric ID instead of title words in the URL. Although it sounds like someone skipped SEO 101, it is actually by design.
4. Aggregation
By aggregate data I mean data from a single source or website, comprising all or a significant fraction of the users. The appeal of aggregate data for research is clear: not only are larger quantities better, aggregation avoids the bias problems of sampling. On the other hand, the privacy concerns are also clear: the fear is that the data will end up in the hands of the wrong people, such as one of the database marketing companies.
Aggregation is the most common of the privacy problems among the 7 examples I listed in my previous article. In some cases the original source made the data available and then backtracked, in other cases a third party crawled the data and got into trouble, and some were a mix of both.
For websites sitting on interesting data, an excellent compromise would be in-house data analysis (or perhaps a partnership program with outside researchers), as an alternative to making data public. OkCupid has been doing this extremely well, in my opinion — they have a great series of blog posts on race, looks and everything else that affects online dating. The man-hours spent on data analysis are well worth the increased pageviews and mindshare. Facebook has a data team as well, but given the quantity of data they have, they could be publishing quite a bit more.
5. Linkage
By linkage I refer to connecting the same person across multiple websites. Confusingly, this is sometimes referred to as aggregation. Linkage can take the form of database marketers connecting different databases of personal information, or in the online context, it can take the form of tools that link together individual profiles on different websites.
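To show how little it takes, here is a deliberately naive sketch of a linkage tool that matches profiles across two sites by normalized handle. The profile data and the matching rule are invented for illustration; real tools combine many more signals, such as email addresses, friend lists and writing style.

```python
def normalize(handle: str) -> str:
    """Lowercase a handle and drop punctuation so 'Jane_Doe' matches 'janedoe'."""
    return "".join(ch for ch in handle.lower() if ch.isalnum())

def link_profiles(site_a: list, site_b: list) -> list:
    """Return (handle_on_a, handle_on_b) pairs whose normalized handles agree."""
    index = {}
    for profile in site_b:
        index.setdefault(normalize(profile["handle"]), []).append(profile)
    links = []
    for profile in site_a:
        for match in index.get(normalize(profile["handle"]), []):
            links.append((profile["handle"], match["handle"]))
    return links

# Made-up profiles on two hypothetical sites.
photo_site = [{"handle": "Jane_Doe"}, {"handle": "bob42"}]
forum_site = [{"handle": "janedoe"}, {"handle": "carol"}]

links = link_profiles(photo_site, forum_site)
```

Here `links` contains the single pair `("Jane_Doe", "janedoe")`. Matching on handles alone produces false positives for common names, but it is exactly the kind of cheap heuristic that makes unwanted linkage so easy.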
Pervasive online identities are becoming the norm, which is something I’ve been writing about. All of your online activities are going to be easily linkable sooner or later unless you explicitly take steps to keep your identities separate. But again, users haven’t quite woken up to this yet. Unwanted linkage is therefore something that can upset users greatly. The auto-connect feature in Google Buzz is the best example. Opt-in rather than opt-out is probably the way to go, at least for a few years until everyone gets used to it.
Summary. While well-understood access control principles tell us how to implement the privacy of data marked private, the privacy of “public” data is just as big a concern. So far there has been no systematic way of analyzing exactly what it is that users object to. In this article I’ve presented five such features. To avoid nasty surprises, developers building websites need to think carefully about privacy and user behavior when implementing any of these features.
Thanks to Ann Kilzer for reviewing a draft.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
Myths and Fallacies of “Personally Identifiable Information”
I have a new paper (PDF) with Vitaly Shmatikov in the June issue of the Communications of the ACM. We talk about the technical and legal meanings of “personally identifiable information” (PII) and argue that the term means next to nothing and must be greatly de-emphasized, if not abandoned, in order to have a meaningful discourse on data privacy. Here are the main points:
The notion of PII is found in two very different types of laws: data breach notification laws and information privacy laws. In the former, the spirit of the term is to encompass information that could be used for identity theft. We have absolutely no issue with the sense in which PII is used in this category of laws.
On the other hand, in laws and regulations aimed at protecting consumer privacy, the intent is to compel data trustees who want to share or sell data to scrub “PII” in a way that prevents the possibility of re-identification. As readers of this blog know, this is essentially impossible to do in a foolproof way without losing the utility of the data. Our paper elaborates on this and explains why “PII” has no technical meaning, given that virtually any non-trivial information can potentially be used for re-identification.
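A toy example shows why. The records below are made up, and none of the fields would count as “PII” on its own, yet combining them singles out most of the people in the table. The function name and the data are mine, not from the paper.

```python
from collections import Counter

# Invented records; no name, address or ID number appears anywhere.
RECORDS = [
    {"zip": "78701", "birth_year": 1980, "sex": "F"},
    {"zip": "78701", "birth_year": 1980, "sex": "M"},
    {"zip": "78701", "birth_year": 1975, "sex": "F"},
    {"zip": "78705", "birth_year": 1980, "sex": "F"},
    {"zip": "78705", "birth_year": 1980, "sex": "F"},
    {"zip": "78705", "birth_year": 1962, "sex": "M"},
]

def unique_fraction(records: list, fields: list) -> float:
    """Fraction of records that are the only one with their combination of `fields`."""
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)

by_sex = unique_fraction(RECORDS, ["sex"])                       # nobody is unique
by_all = unique_fraction(RECORDS, ["zip", "birth_year", "sex"])  # 4 of 6 are unique
```

Anyone who knows those three facts about a person in the uniquely identified group can pick out their row, along with whatever sensitive attributes are attached to it. Real datasets have far more columns, which makes the problem worse rather than better.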
What we are gunning for is the get-out-of-jail-free card, a.k.a. “safe harbor,” particularly in the HIPAA (health information privacy) context. In current practice, data owners can absolve themselves of responsibility by performing a syntactic “de-identification” of the data (although this isn’t the spirit of the law). Even your genome is not considered identifying!
Meaningful privacy protection is possible if account is taken of the specific types of computations that will be performed on the data (e.g., collaborative filtering, fraud detection, etc.). It is virtually impossible to guarantee privacy by considering the data alone, without carefully defining and analyzing its desired uses.
We are well aware of the burden that this imposes on data trustees, many of whom find even the current compliance requirements onerous. Often there is no one available who understands computer science or programming, and there is no budget to hire someone who does. That is certainly a conundrum, and it isn’t going to be fixed overnight. However, the current situation is a farce and needs to change.
Given that technologically sophisticated privacy protection mechanisms require a fair bit of expertise (although we hope that they will become commoditized in a few years), one possible way forward is by introducing stronger acceptable-use agreements. Such agreements would dictate what the collector or recipient of the data can and cannot do with it. They should be combined with some form of informed consent, where users (or, in the health care context, patients) acknowledge their understanding that there is a re-identification risk. But the law needs to change to pave the way for this more enlightened approach.
Thanks to Vitaly Shmatikov for comments on a draft of this post.
Conferences: The Good, the Bad and the Ugly aspects
I attended a couple of conferences this week that are outside my usual community. Taking stock of and interacting with a new crowd is always a very interesting experience.
The first was the IAPP Practical Privacy Series. The International Association of Privacy Professionals came about as a result of the fact that the Chief Privacy Officer (and equivalent) positions have suddenly emerged — over the last decade — and become ubiquitous. The role can be broadly described as “privacy compliance.” A big part of the initial impetus seems to have been HIPAA compliance, but the IAPP composition has now diversified greatly, because virtually every company is sitting on a pile of consumer data. There was even someone from Starbucks.
I spoke about anonymization. I was trying to answer the question, “I need to share/sell my data and you’re telling me that anonymization is broken. So what should I do?” It’s always a fun challenge to make computer science accessible to a non-tech audience (largely lawyers in this case). I think I managed reasonably well.
Next was the ACM Computers, Freedom and Privacy conference (which goes on until Friday). As I understand it, CFP was born at a time when “Cyberspace” was analogous to the Wild West, and there was a big need for self-governance and figuring out the emerging norms. The landscape is of course very different now, since the Internet isn’t a band of outlaws anymore but integrated into normal society. The conference has accordingly morphed somewhat, although a lot of the old crowd still definitely comes here.
The quality of the events I attended was highly variable. I checked out the “unconferences,” but only a couple had a meaningful level of participation and the one I went to seemed to devolve pretty quickly into a penis-waving contest. The session I liked best was a tutorial by Mike Godwin (of Godwin’s law, now counsel for the Wikimedia Foundation) on Cyberlaw, mainly First Amendment law.
CFP has parallel sessions. I had a great experience with that format at the Privacy Law Scholars Conference, but this time I’m not so sure — I’m regularly finding conflicts among the sessions I want to attend.
I’m bummed about the fact that there is really no mechanism for me to learn about conferences that are relevant to my interests but are outside my community. (I only learned about the IAPP workshop because I was invited to speak, and CFP purely coincidentally.) Do other researchers face this problem as well? I’m curious to hear about how people keep abreast. I mean, it’s 2010, and this is exactly the kind of problem that social media is supposed to be great at solving, but it’s not really working for me.
Yet Another Identity Stealing Bug. Will Creeping Normalcy be the Result?
Elie Bursztein points me to a “Cross Site URL Hijacking” attack which, among other things, allows a website to identify a visitor instantly (if they are using Firefox) by finding their Google and possibly Facebook IDs. Here is a live demo and here’s a paper.
For the security geeks, the attack works by exploiting a Firefox bug that allows a page in the attacker domain to infer URLs of pages in the target domain. If a page like target.com/home redirects to target.com/?user=[username] (which is quite common), the attacker can learn the username by requesting the page target.com/home in a script tag.
Let us put this attack in context. Stealing the identity of a web visitor should be familiar to readers of this blog. I’ve recently written about doing this via history stealing, then a bug in Google spreadsheets, and now we have this. While the spreadsheets bug was fixed, the history stealing vulnerability remains in most browsers. Will new bugs be found faster than existing ones get fixed? The answer is probably yes.
Something that is of much more concern in the long run is Facebook’s instant personalization, which is basically like identity stealing, except it is a feature rather than a bug. Currently Facebook identities are available without user consent to only 3 partners (Yelp, Pandora and docs.com) but there will be inevitable competitive pressures both for Facebook to open this up to more websites and for other identity providers to offer a similar service.
Legitimate methods and hacks based on bugs are not entirely distinct. Two XSS attacks on yelp.com were found in quick succession, either of which could have been exploited by a third (fourth?) party for identity stealing. Instant personalization (and similar attempts at an “identity layer”) greatly increases the chance of bugs that leak your identity to every website, authorized or not.
As identity-stealing bugs as well as identity-sharing features proliferate, the result is going to be creeping normalcy — users will get slowly inured to the idea that any website they visit might have their identity. And that will be a profound change for the way the web works. Of course, savvy users will know how to turn off the various tracking mechanisms, but most people will be left in the lurch.
We are still at the early stages of this shift. It is clear that it will have both good and ill effects. For example, people are much more civil when interacting under their real-life identity. For this reason, there is quite a clamor for identity. For instance, see News Sites Rethink Anonymous Online Comments and The Forces Align Against Anonymity. But like every change, this one is going to be hard to get used to.
Facebook, Privacy, Public Opinion and Pitchforks
As just about everyone is already aware, Facebook has been up to a bunch of big brotherly stuff lately, including “instant personalization” — making your identity and data available to 3rd party sites you visit, arguing to treat ToS violations as criminal violations, and forcing you to make your “interests” public (or delete them). Overall, it looks like they’re making a bold move to take control of everyone’s identity and connections, privacy be damned.
The entirely predictable effect of this has been that everything the company now does is being viewed with extreme suspicion. The pitchforks have been sharpened, and the mob gets set off on almost any excuse. In the last week, one somewhat questionable feature, one minor bug and one utter non-event have each been reported as sinister privacy disasters:
- The questionable feature was linking your statuses to “connections” pages. The outrage was based on the meme “if your status contains the word FBI then the FBI will have a record of it,” which appears to have started here. That article is full of hyperbole and understandably appears to have been widely misunderstood to be claiming that even private statuses appear on Connection pages (they don’t). There’s really nothing new in terms of the visibility of your statuses: Facebook already had real-time search for public statuses, and the only difference is that someone can now click on the “FBI” page instead of having to type in “FBI” into the search box.
- The minor bug was that Facebook started listing Connect-enabled websites you visit in the “Applications” tab in your privacy settings. The sites didn’t get your identity or any of your data, nor did they have privileges to post to your wall. The fact that you visited them was not visible to anyone else. No actual harm was done. And yet an article titled Facebook’s new features secretly add apps to your profile alleged all of these things without making any real effort to check with Facebook. Facebook quickly fixed the bug and contacted the authors, and they updated the story, but it did little to quell the rumors, which took on a life of their own.
- The non-issue was Facebook leaking your IP address in email notifications. This is normal behavior: most webmail providers, except Gmail, put the sender’s IP into the message header as a spam-prevention technique. This kicked up another shitstorm.
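For the curious, here is a small sketch of where a sender's IP typically shows up in a message header, using Python's standard `email` module. The message is made up (the addresses come from ranges reserved for documentation), and the choice of `X-Originating-IP` and `Received` as the headers to inspect is my assumption about common practice, not a claim about what Facebook's notifications contained.

```python
import re
from email import message_from_string

# An invented message; the IP addresses are from reserved documentation ranges.
RAW = """\
Received: from mail.example.com (mail.example.com [198.51.100.20])
\tby mx.example.net with ESMTP; Mon, 10 May 2010 10:00:00 -0700
X-Originating-IP: [203.0.113.7]
From: notification@example.com
To: user@example.net
Subject: You have a new message

Body text.
"""

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def sender_ip_hints(raw_message: str) -> list:
    """Collect IPv4 addresses from headers that commonly record the sender's IP."""
    msg = message_from_string(raw_message)
    hints = []
    for name in ("X-Originating-IP", "Received"):
        for value in msg.get_all(name, []):
            hints.extend(IPV4.findall(value))
    return hints

hints = sender_ip_hints(RAW)
```

Any recipient can read these headers by choosing “show original” or the equivalent in their mail client, which is why this counts as ordinary email behavior rather than a leak.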
In spite of these unfair accusations, it is hard for me to feel any sympathy for the beleaguered company. This is how public opinion works, and they can’t claim not to have seen it coming. As this fantastic visualization by Matt McKeon shows, Facebook has been on a long and consistent path to make all of your information public, essentially pulling a giant bait-and-switch on their users. They stepped up the pace recently, asked their users to give up too much too fast, and something just snapped.
I think Facebook underestimated the extent to which privacy correlates with trust. They were forgiven for Beacon and other problems in the past, but after the most recent series of privacy violations, it became clear that these were not missteps but deliberate actions. I believe that Facebook’s relationship with its users has changed fundamentally, and isn’t going to mend any time soon. Perhaps Facebook’s reckoning is that they are now big enough that it doesn’t matter any more. That remains to be seen.
On a personal note, someone pretty high up at Facebook emailed me a couple of months ago (although “not in an official capacity”) to have a discussion about privacy issues with some of their upcoming product launches. Unfortunately I was traveling at the time, and when I got back they were no longer interested. I guess by then it was too close to f8 and all the important decisions had been made. I can’t help wondering if the outcome might have been different if I’d been able to meet with them — perhaps they might have eased off just a little bit on their world-domination plans and avoided the straw that broke the camel’s back. But I suspect that that’s just wishful thinking, given that the imperative for their current push in all likelihood came from the very top.


