Data Mining & Privacy: How Anonymous Are You Really?

Sam Carroll //

When I started at BHIS I was surprised at how sensitive personal data, such as my birthday, is considered to be. It reminded me of a data mining class I took last year, which Dr. Karlsson (South Dakota School of Mines & Technology) opened with a section on ethics. Specifically, he warned us about the anonymization of users’ data and the re-identification of personal data.

Sensitive information that has been poorly obfuscated can be reversed to discover very specific information about individuals. This has been a major concern for individuals and corporations since 1998, when GeoCities told customers their information would not be shared but then sold the data to third parties. The FTC responded by establishing that companies must not lie about their privacy policies.

Think about how many companies have you agree to a privacy policy. Sometimes, due to bad anonymization, sensitive information leaks out anyway. One of the most egregious examples is from the 1990s, when Latanya Sweeney discovered that about 87% of the US population could be uniquely identified by their zip code, birthdate, and sex. To prove this point, Sweeney bought voter rolls (public record) and combined them with data from GIC, a health insurance purchaser for Massachusetts state employees. Though GIC had removed names, SSNs, and home addresses, Sweeney was able to identify the health records, including prescriptions, of the governor, who had personally vouched for the security of the anonymization.
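To make the idea concrete, here is a minimal sketch of that kind of linkage, assuming two tiny made-up data sets. Every name, record, and field name below is hypothetical and only illustrates the technique; it is not Sweeney's actual procedure or data. The point is that a plain join on zip code, birthdate, and sex is enough once the combination is unique.

```python
from collections import defaultdict

# Hypothetical "anonymized" health records: names removed, quasi-identifiers kept.
health_records = [
    {"zip": "02138", "birthdate": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02139", "birthdate": "1962-07-04", "sex": "F", "diagnosis": "condition B"},
    {"zip": "02139", "birthdate": "1962-07-04", "sex": "F", "diagnosis": "condition C"},
]

# Hypothetical public voter roll: names plus the same three fields.
voter_roll = [
    {"name": "Alex Example", "zip": "02138", "birthdate": "1950-01-15", "sex": "M"},
    {"name": "Blair Sample", "zip": "02139", "birthdate": "1962-07-04", "sex": "F"},
    {"name": "Casey Placeholder", "zip": "02140", "birthdate": "1971-03-22", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birthdate", "sex")


def key(row):
    """Build the join key from the quasi-identifier fields."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def reidentify(records, roll):
    """Return {name: diagnosis} wherever the key is unique in both data sets."""
    records_by_key = defaultdict(list)
    for record in records:
        records_by_key[key(record)].append(record)

    voters_by_key = defaultdict(list)
    for voter in roll:
        voters_by_key[key(voter)].append(voter)

    matches = {}
    for k, voters in voters_by_key.items():
        candidates = records_by_key.get(k, [])
        # Only claim a match when exactly one voter and one record share the key.
        if len(voters) == 1 and len(candidates) == 1:
            matches[voters[0]["name"]] = candidates[0]["diagnosis"]
    return matches


matches = reidentify(health_records, voter_roll)
print(matches)
```

In this toy data only one person is re-identified, because the other matching key is shared by two health records. The more fields a data set keeps, the fewer people share a key, and the more of these joins succeed.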

Though this incident of health care re-identification was ‘contained’ to Massachusetts, re-identification is a problem that affects just about everyone, including tech giants.

2006 saw two famous privacy breaches at two well-known companies, Netflix and AOL. Netflix announced a competition to beat its recommendation algorithm and released a data set so people could train and test their solutions. Netflix removed the usernames of roughly 500,000 users from their ratings, but put unique identifiers in place of the usernames. In a study conducted on this data, researchers cross-referenced ratings on IMDb (which were associated with IMDb usernames) with the Netflix data set and found that, with just 6 movie ratings, nearly all users in Netflix’s data set could be uniquely identified.
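The matching step can be sketched roughly as follows. This is a simplified illustration, not the researchers' actual method (their scoring was more tolerant of small differences in ratings and dates). The user IDs, movie titles, ratings, and the `min_margin` threshold are all made up for the example.

```python
# Hypothetical "anonymized" ratings: pseudonymous ID -> {movie: stars}.
anonymized = {
    1001: {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    1002: {"Movie A": 5, "Movie B": 4, "Movie E": 3},
    1003: {"Movie C": 4, "Movie F": 5, "Movie G": 2},
}

# Hypothetical public profile scraped from a review site under a real username.
public_profile = {"Movie A": 5, "Movie B": 2, "Movie D": 1}


def overlap(public, private):
    """Count movies that appear in both profiles with the same rating."""
    return sum(1 for movie, stars in public.items() if private.get(movie) == stars)


def best_match(public, dataset, min_margin=2):
    """Return the pseudonymous ID that clearly stands out, or None if ambiguous.

    Assumes the data set holds at least two users.
    """
    scored = sorted(
        ((overlap(public, ratings), user_id) for user_id, ratings in dataset.items()),
        reverse=True,
    )
    (best_score, best_id), (runner_up_score, _) = scored[0], scored[1]
    # Require the top candidate to beat the runner-up by a clear margin.
    if best_score - runner_up_score >= min_margin:
        return best_id
    return None


print(best_match(public_profile, anonymized))
```

Requiring a margin over the runner-up is one simple way to avoid claiming a match when several users look alike. A single shared rating is not enough in this toy data, but a handful of them quickly becomes distinctive.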

AOL similarly released tens of millions of search queries from a 3 month time span. It anonymized the data by removing usernames and IP addresses, but again assigned a unique identifier to each user, meaning each user was still uniquely marked but not immediately identifiable. Using this data, researchers were able to combine all the searches for a single user and discover personal information about them, from queries such as “how is the weather in New York City” and “what’s fun for a 18 year old to do on Saturday” to searches for their own name or SSN. This gave anyone who was interested, and committed enough, personal information that should have been left undisclosed. Some of the searches were of a more private nature, such as how to come out as an abuse victim to your family, or how to get out of an abusive relationship.
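A short sketch shows why swapping a username for a stable ID does so little. The log entries and IDs below are invented for illustration and are not taken from the AOL data; the only assumption is that each query carries the same pseudonymous ID for a given user.

```python
from collections import defaultdict

# Hypothetical search log: (pseudonymous user ID, query).
search_log = [
    (101, "weather in new york city"),
    (202, "cheap flights to denver"),
    (101, "things for an 18 year old to do on saturday"),
    (101, "jordan example new york"),
    (202, "denver hotels"),
]


def build_profiles(log):
    """Group every query under the pseudonymous ID that issued it."""
    profiles = defaultdict(list)
    for user_id, query in log:
        profiles[user_id].append(query)
    return dict(profiles)


profiles = build_profiles(search_log)
for user_id, queries in profiles.items():
    print(user_id, queries)
```

No single query here identifies anyone, but grouped together the queries for user 101 suggest a city, an age, and possibly a name.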

In 2009, researchers at Carnegie Mellon discovered a way to analyze data to predict the SSN of an individual. They did this using only the person’s birthdate and birth location, since the first 5 digits of an SSN were tied to where and when the number was issued. For the last four digits, they narrowed the possibilities to only about 1000 combinations by using public death records, which include SSNs, to find patterns in the last 4 digits that correlate highly with birthdate. Thus with only two little pieces of information (both of which are pretty much provided by any social networking website) and a little work, it’s relatively easy to uncover an individual’s Social Security number.

Congress passed a bill last week that allows the government and commercial operators of drones to collect potentially personally identifiable data about individuals (including through facial recognition) without disclosing it. The bill also did not include provisions on how the data would be used or if and when it would be destroyed, showing that we still face concerns over our privacy.

Pokemon Go even had a terrible breach of privacy on the iOS version of the app, which originally required rights to a user’s entire Google account. Some even went as far as to say this included the ability to send emails, read calendar events, and access contacts and photos. Though developer Niantic says no information was gathered, one thing is clear: privacy is a growing concern. You must be careful about what you share, even with Pokemon.

Be careful what you share. The most insidious (and perhaps one of the best) ways to extract personal information is to ask for it. People are very careful with data they know will compromise them, but they freely distribute data they feel won’t. However, even professionals will sometimes fail to keep data truly anonymous, and sometimes deeply embarrassing or highly private data can be compromised because of it. Assume you’re already compromised, and take precautions to stay secure.