2013 Apr 9;110(15):5802-5.
doi: 10.1073/pnas.1218772110. Epub 2013 Mar 11.

Private traits and attributes are predictable from digital records of human behavior

Michal Kosinski et al. Proc Natl Acad Sci U S A.

Abstract

We show that easily accessible digital records of behavior, Facebook Likes, can be used to automatically and accurately predict a range of highly sensitive personal attributes including: sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender. The analysis presented is based on a dataset of over 58,000 volunteers who provided their Facebook Likes, detailed demographic profiles, and the results of several psychometric tests. The proposed model uses dimensionality reduction for preprocessing the Likes data, which are then entered into logistic/linear regression to predict individual psychodemographic profiles from Likes. The model correctly discriminates between homosexual and heterosexual men in 88% of cases, African Americans and Caucasian Americans in 95% of cases, and between Democrat and Republican in 85% of cases. For the personality trait "Openness," prediction accuracy is close to the test-retest accuracy of a standard personality test. We give examples of associations between attributes and Likes and discuss implications for online personalization and privacy.

Conflict of interest statement

D.S. received revenue as owner of the myPersonality Facebook application.

Figures

Fig. 1.
The study is based on a sample of 58,466 volunteers from the United States, obtained through the myPersonality Facebook application (www.mypersonality.org/wiki), which included their Facebook profile information, a list of their Likes (n = 170 Likes per person on average), psychometric test scores, and survey information. Users and their Likes were represented as a sparse user–Like matrix, the entries of which were set to 1 if there existed an association between a user and a Like and 0 otherwise. The dimensionality of the user–Like matrix was reduced using singular-value decomposition (SVD) (24). Numeric variables such as age or intelligence were predicted using a linear regression model, whereas dichotomous variables such as gender or sexual orientation were predicted using logistic regression. In both cases, we applied 10-fold cross-validation and used the k = 100 top SVD components. For sexual orientation, parents’ relationship status, and drug consumption only k = 30 top SVD components were used because of the smaller number of users for which this information was available.
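The pipeline in this caption can be sketched roughly as follows. This is a minimal illustration on synthetic data, assuming NumPy is available; it is not the authors' code. The function names (`svd_components`, `cross_validated_predictions`), the toy matrix size, and k = 5 are hypothetical choices made for the example, and the SVD is computed once on all users before cross-validation, which is an assumption about the order of steps. Only the linear-regression branch for a numeric attribute is shown; a dichotomous attribute would use logistic regression on the same components.

```python
import numpy as np


def svd_components(user_like, k):
    """Project each user onto the top-k singular directions of the user-Like matrix."""
    u, s, _ = np.linalg.svd(user_like, full_matrices=False)
    return u[:, :k] * s[:k]


def cross_validated_predictions(features, target, n_folds=10, seed=0):
    """Return out-of-fold linear-regression predictions (ordinary least squares)."""
    n = len(target)
    order = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(order, n_folds)
    design = np.column_stack([np.ones(n), features])  # column of ones = intercept
    predictions = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        coef, *_ = np.linalg.lstsq(design[train_mask], target[train_mask], rcond=None)
        predictions[test_idx] = design[test_idx] @ coef
    return predictions


# Synthetic stand-in for the data: a hidden trait drives which Likes a user has.
rng = np.random.default_rng(42)
n_users, n_likes, k = 200, 50, 5
latent = rng.normal(size=n_users)
like_bias = rng.normal(size=n_likes)
prob = 1.0 / (1.0 + np.exp(-(np.outer(latent, like_bias) - 1.0)))
user_like = (rng.random((n_users, n_likes)) < prob).astype(float)  # 1 = user has the Like
age = 30.0 + 5.0 * latent + rng.normal(scale=1.0, size=n_users)    # toy numeric attribute

components = svd_components(user_like, k)
predicted = cross_validated_predictions(components, age)
r = np.corrcoef(predicted, age)[0, 1]  # Pearson correlation, the metric used in Fig. 3
print(f"out-of-fold Pearson r = {r:.2f}")
```

With real data the user-Like matrix would be large and sparse, so a truncated sparse SVD routine would be a more practical choice than the dense decomposition used here.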
Fig. 2.
Prediction accuracy of classification for dichotomous/dichotomized attributes expressed by the AUC.
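As a reminder of what the AUC measures, it can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, which is how the abstract's "in 88% of cases" style of statement can be interpreted. The sketch below computes it by direct pairwise comparison; the function name `auc` and the toy scores are hypothetical and not taken from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: predicted probabilities and true class labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
result = auc(scores, labels)  # 8 of the 9 positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of users and is meant only to make the definition concrete; a rank-based formula gives the same value more efficiently.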
Fig. 3.
Prediction accuracy of regression for numeric attributes and traits expressed by the Pearson correlation coefficient between predicted and actual attribute values; all correlations are significant at the P < 0.001 level. The transparent bars indicate the questionnaire’s baseline accuracy, expressed in terms of test–retest reliability.
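For numeric attributes, the accuracy measure named in this caption is the Pearson correlation between predicted and actual values. A minimal sketch of that computation follows; the function name `pearson_r` and the toy age values are hypothetical and serve only to illustrate the formula.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy example: actual ages and hypothetical model predictions.
actual = [20, 25, 30, 35, 40]
predicted = [22, 24, 31, 33, 41]
r = pearson_r(actual, predicted)
```

Because the coefficient is invariant to linear rescaling, it rewards predictions that order and space users correctly even if they are offset from the true values.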
Fig. 4.
Accuracy of selected predictions as a function of the number of available Likes. Accuracy is expressed as AUC (gender) and Pearson’s correlation coefficient (age and Openness). About 50% of users in this sample had at least 100 Likes and about 20% had at least 250 Likes. Note that for gender (a dichotomous variable) the random-guessing baseline corresponds to an AUC of 0.50.

References

    1. Lazer D, et al. Computational social science. Science. 2009;323(5915):721–723.
    1. Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer. 2009;42(8):30–37.
    1. Chen Y, Pavlov D, Canny JF. 2009. Large-scale behavioral targeting. International Conference on Knowledge Discovery and Data Mining, pp 209–218.
    1. Butler D. Data sharing threatens privacy. Nature. 2007;449(7163):644–645.
    1. Narayanan A, Shmatikov V. 2008. Robust de-anonymization of large sparse datasets. IEEE Symposium on Security and Privacy, pp 111–125.
