Private traits and attributes are predictable from digital records of human behavior

Introduction

Social media sites and apps have become an important part of many people’s lives. These people live not only in the real world but also in the cyberspace of social media. From a data science point of view, an important difference between the two is that data in cyberspace can be easily recorded and analyzed. As a result, machine learning techniques make it possible to predict private traits and attributes that social media users never intended to reveal. The following picture shows the different words used by social media users of different ages, which means that even if you did not fill in your age on the website, it is still not a secret to data scientists.

Predicting private traits on social media is still a new research area, and many of its problems have not yet been studied or discussed. Because reported prediction accuracy keeps improving from paper to paper, it is still unknown to what extent our private traits can be predicted. At the same time, since the predictions rely only on users’ public data, the ethical questions have no simple answer.

Building a predictor of social media users’ private traits and attributes from their public data involves three main steps: data collection, feature extraction, and model training.

Data Collection

Research on predicting social media users’ private traits needs two types of data: social media data (e.g. texts, photos, egocentric networks, activities) and the labels of interest, that is, the traits to be predicted. Social media data can generally be downloaded in two ways. The first is to use a web crawler to download the HTML pages. The second is to download the data through an official interface (API), which often requires OAuth authorization. Official interfaces also usually limit how frequently data can be downloaded; this limit can be worked around by using multiple tokens and IP addresses.
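As a minimal sketch of downloading through a rate-limited interface, the snippet below spaces out requests and follows pagination. It assumes a single token and simply stays within the limit. `Throttle`, `download_all`, and the `fetch_page` callable are hypothetical names introduced for illustration; they do not belong to any real platform API.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def download_all(fetch_page, throttle):
    """Collect items from a paginated source.

    fetch_page is a hypothetical callable: cursor -> (items, next_cursor),
    where next_cursor is None on the last page.
    """
    items, cursor = [], None
    while True:
        throttle.wait()
        page, cursor = fetch_page(cursor)
        items.extend(page)
        if cursor is None:
            return items
```

In practice `fetch_page` would wrap an authenticated request to the platform’s interface, and `min_interval` would be derived from the documented rate limit.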

Feature Extraction

The data on social media are complex and heterogeneous, composed of texts, images, network data, temporal data, and so on. To apply machine learning techniques to such datasets, it is necessary to extract features and form a structured dataset.

Text data

Text is the most common type of data on social media. The usual ways of extracting features from text are all based on the bag-of-words model: a text (such as a post on Twitter) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Several specific methods follow:
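The bag-of-words representation can be sketched with the standard library alone. The tokenizer (lowercasing and a simple regular expression) and the example posts are assumptions made for illustration; real pipelines usually tokenize more carefully.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a multiset (word -> count) for one post, ignoring word order."""
    # Lowercase, then keep runs of letters, digits and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def to_vector(bag, vocabulary):
    """Project a bag onto a fixed vocabulary to get a fixed-length feature row."""
    return [bag[word] for word in vocabulary]


# Hypothetical example posts.
posts = ["I love my dog, my dog loves me", "Homework again tonight"]
bags = [bag_of_words(p) for p in posts]

# The vocabulary is the sorted set of all words seen in the corpus.
vocabulary = sorted(set().union(*bags))

# One row per post, one column per vocabulary word.
matrix = [to_vector(b, vocabulary) for b in bags]
```

Each row of `matrix` keeps how often a word occurs but not where, which is exactly the information the bag-of-words model retains.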

Lexicon-based methods. These methods rely on lexicons that encode domain knowledge. One example is a sentiment lexicon, which can be used to identify the positive or negative emotions of social media users. Another example is LIWC, a psychological lexicon with more than 80 classes of keywords. With lexicon-based methods, features are extracted simply by counting how often the keywords of each class appear.
Topic models. LDA (latent Dirichlet allocation) is the most popular topic model. An LDA model can be trained on a large corpus and then used to extract features from texts.
Open-vocabulary method. This method was proposed by Prof. Schwartz and his co-authors in the paper Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. It is called “open” in contrast to traditional lexicon-based methods, which have a fixed number of features. The method extracts a mixture of topics and keywords by calculating the correlation coefficient between the features and the target variables.
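Two of the methods above can be sketched in a few lines: counting lexicon categories, and ranking every observed word by its correlation with a target variable. This is a simplified illustration, not the procedure from the cited paper. The toy `LEXICON`, the example texts, and the ages are hypothetical, and a plain Pearson correlation stands in for whatever statistical controls a real study would apply.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon: category -> keywords. Real lexicons such as LIWC
# are far larger and are not reproduced here.
LEXICON = {
    "positive": {"happy", "love", "great"},
    "negative": {"sad", "hate", "tired"},
}


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def lexicon_features(text, lexicon=LEXICON):
    """Relative frequency of each lexicon category in one text."""
    tokens = tokenize(text)
    total = len(tokens) or 1  # avoid division by zero on empty posts
    counts = Counter(tokens)
    return {cat: sum(counts[w] for w in words) / total
            for cat, words in lexicon.items()}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_words_by_correlation(texts, targets):
    """Correlate each observed word's relative frequency with the target."""
    token_lists = [tokenize(t) for t in texts]
    vocab = sorted({w for toks in token_lists for w in toks})
    scores = {}
    for word in vocab:
        freqs = [toks.count(word) / (len(toks) or 1) for toks in token_lists]
        scores[word] = pearson(freqs, targets)
    # Strongest correlations (positive or negative) first.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical users and ages.
texts = ["homework school tonight", "school exam tomorrow",
         "mortgage meeting today", "meeting kids mortgage"]
ages = [15, 16, 40, 42]
ranked = rank_words_by_correlation(texts, ages)
```

In this toy data, words used by the younger users receive negative scores and words used by the older users receive positive ones. The lexicon approach fixes the feature set in advance, while the correlation ranking lets the data decide which words matter.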

Images

Nowadays image feature extraction is dominated by deep learning methods. Pre-trained CNN models (e.g. VGG16/19, GoogLeNet, ResNet) can be used to extract image features with a fixed number of dimensions. The usual approach is to drop the last layer (the original prediction layer) and use the output of one of the remaining final layers (for example, a fully connected layer) as the feature vector. It has been shown that features extracted by a pre-trained model often transfer well to other domains.

Platform Specific Features

Different platforms have different functions, and as a result each platform has its own specific features.

A good example is Facebook Likes. This picture shows the method used to extract structured features from Facebook Likes. It has been shown that Likes data alone are enough to predict a wide range of people’s private traits.
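One common way to turn Likes into structured features, assumed here for illustration, is a binary user-by-Like matrix in which each row is a user and each column is a page. The function name and the example pairs below are hypothetical.

```python
def build_user_like_matrix(likes):
    """Turn (user_id, like_id) pairs into a binary user-by-Like matrix.

    Returns (users, like_ids, matrix) where matrix[i][j] is 1 if
    users[i] liked like_ids[j] and 0 otherwise.
    """
    pairs = set(likes)  # duplicates carry no extra information
    users = sorted({user for user, _ in pairs})
    like_ids = sorted({like for _, like in pairs})
    row = {user: i for i, user in enumerate(users)}
    col = {like: j for j, like in enumerate(like_ids)}
    matrix = [[0] * len(like_ids) for _ in users]
    for user, like in pairs:
        matrix[row[user]][col[like]] = 1
    return users, like_ids, matrix


# Hypothetical example data.
likes = [("u1", "pageA"), ("u1", "pageB"), ("u2", "pageB"), ("u3", "pageC")]
users, like_ids, matrix = build_user_like_matrix(likes)
```

Such a matrix is typically very wide and sparse, so a dimensionality reduction step (for example, singular value decomposition) is usually applied before a model is trained on it.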

Model Training

After the features are extracted, standard supervised machine learning methods can be used to train a prediction model. The most commonly used algorithms are SVM and Random Forest, both of which can model nonlinear relationships (SVM through kernel functions). They are popular because they are usually among the best-performing off-the-shelf algorithms across many tasks.

A more advanced way to train a model is to formalize the problem based on its specific characteristics, instead of using a supervised learning algorithm as a black box. Take depression prediction as an example: considering that people behave differently during and outside working hours, two groups of features can be extracted based on working time. A loss function can then be written based on psychological assumptions about the relationship between working time and depression. Finally, an optimization method needs to be proposed to find the solution.
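The first step of this example, splitting activity into working-time and off-work feature groups, can be sketched as follows. The working hours (09:00 to 17:00, Monday to Friday), the feature names, and the example timestamps are assumptions for illustration. The loss function and the optimization method depend on the psychological assumptions chosen and are not sketched here.

```python
from datetime import datetime

WORK_START, WORK_END = 9, 17  # assumed working hours in local time, Mon-Fri


def is_work_time(ts):
    """True if the timestamp falls on a weekday within the assumed hours."""
    return ts.weekday() < 5 and WORK_START <= ts.hour < WORK_END


def split_activity_features(timestamps):
    """Count posts made during and outside working time."""
    work = sum(1 for ts in timestamps if is_work_time(ts))
    off = len(timestamps) - work
    total = len(timestamps) or 1  # avoid division by zero for inactive users
    return {
        "work_posts": work,
        "off_work_posts": off,
        "off_work_share": off / total,
    }


# Hypothetical post times for one user.
post_times = [
    datetime(2024, 1, 1, 10, 0),   # Monday morning: working time
    datetime(2024, 1, 1, 23, 30),  # Monday night: off work
    datetime(2024, 1, 6, 11, 0),   # Saturday: off work
]
features = split_activity_features(post_times)
```

The same split can be applied to any per-post feature, such as word counts or sentiment scores, to produce one feature group per time period.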

Discussion

How can we prevent privacy disclosure on social media? My answer is that there is no way. As long as you have posted enough text or been active enough on social media websites, you have already released too much information to the public. And as long as data scientists gather enough labelled data, which already exist, a prediction model can be trained to predict a wide range of private information about you that you do not want people to know.