Essentialness of Data

by jefftk · 1 min read · 14th Jul 2021 · 16 comments


PrivacyWorld Modeling
Frontpage

In discussing online privacy, people will sometimes say things like "if the data isn't required to operate, you don't need it." If I can turn on tracking protection, stop sending data, and nothing breaks, then clearly it wasn't needed, right? But consider these three cases:

  • Learning from implicit feedback: dictation software can operate without learning what corrections people make, or a search engine can operate without learning what links people click on, but the overall quality will be lower. Each individual piece of information isn't required, but the feedback loop allows building a substantially better product.

  • Incremental rollouts: when you make changes to software that operates in complex environments, it can be very difficult to ensure through testing alone that it operates correctly. Incremental rollouts, with telemetry to verify that there are no regressions or that relevant bugs have been fixed, produce better software. Even Firefox collects telemetry by default.

  • Ads: most websites are able to offer their writing for free, without a paywall, because they can get paid for showing ads. Collecting more data makes ads more efficient, which makes them more profitable for the sites, which translates into more competition to provide users with things to read.

Instead of pushing for "don't collect data", I think it would make a lot more sense for advocates to push for "only collect data privately" and work to make that easier (carrot) or mandatory (stick). None of these uses requires individual-level data; they're just easiest to implement by sending all of the data back to a central server and processing it there.

(What does "private" mean? Ideally it means that no one reviewing the data can tell what your, or any other individual's, contribution was. This is formalized as differential privacy, and is typically implemented by adding noise proportional to the maximum contribution any individual could have. In some cases k-anonymity may also provide good protection, but it's trickier. And this is only the beginning; privacy researchers and engineers have been putting a lot of work into this space.)
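As a rough illustration of "adding noise proportional to the maximum contribution any individual could have", here is a minimal sketch of the Laplace mechanism in Python. The function names, the per-user cap, and the `epsilon` value are assumptions made for this example, not taken from any particular library or deployed system.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential samples with mean
    # `scale` follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_sum(values, max_contribution, epsilon, rng=None):
    """Noisy sum of per-user values (hypothetical helper for illustration).

    Each user's value is clipped to [0, max_contribution], so adding or
    removing one user changes the true sum by at most max_contribution.
    The noise scale is that bound divided by epsilon.
    """
    rng = rng or random.Random()
    clipped = [min(max(v, 0), max_contribution) for v in values]
    return sum(clipped) + laplace_noise(max_contribution / epsilon, rng)


# Example: clicks per user, with each user's contribution capped at 5.
clicks_per_user = [3, 0, 12, 1, 5]
print(private_sum(clicks_per_user, max_contribution=5, epsilon=1.0))
```

A smaller `epsilon` means more noise and stronger privacy. The cap matters as much as the noise: without it, a single heavy user could shift the total by more than the noise hides.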



16 comments

One significant problem is that differential privacy requires trusting that the implementation is free from bugs, both intentional and unintentional ones. In some cases you can monitor the network packets (though who monitors all the network packets all the time?), but in many cases you can't. That's especially an issue with hardware devices that communicate using encryption.

I generally would trust that Firefox isn't going to have an intentional bug in their telemetry, and I don't think that Google would either (they have too much to lose from the bad publicity), but what about all of the miscellaneous ad companies? And anyone can have an unintentional bug in their implementation.

The companies involved have, as a general group, already shown that they will act maliciously, so I can't trust them when they say they aren't being malicious.

what about all of the miscellaneous ad companies?

The approach the major browsers (except Firefox) have been taking is to provide new APIs that allow ad-related functionality without individual-level tracking (and then try to block cross-site tracking). Examples:

This seems like a good place to put it, to me. Users choose their browsers, and browsers are generally open source. This still does not do anything about same-site tracking, but in that case users are choosing which sites they interact with. Also, while this is being built with cross site tracking use cases in mind, I would like to see it built in a way where individual sites can also use it to demonstrate that their data collection is private.

Instead of pushing for "don't collect data", I think it would make a lot more sense for advocates to push for "only collect data privately" and work to make that easier (carrot) or mandatory (stick).

Why not both? A fine for collecting data without explicit free informed consent; prison time for providing the collected data to a third party...

I lock my door even though we already have laws against theft, because the extra protection feels useful and because I focus on the part that is under my control.

I do not consent to companies collecting data about me. In a nice society, the discussion should end here.

In the post I'm arguing that collecting data has downstream effects users value: services that improve as people use them, break less often, and/or fund themselves without charging money. Today, with typical individual-level data collection, this means there's a tradeoff between individual privacy and overall value, but if we had only private data collection then we'd get the value without the tradeoff. If I'm reading you correctly, however, you are opposed even to differentially private data collection?

I guess there are two issues here: lack of consent, and lack of trust.

Each of them is important for me separately. But together, it's like, I say "no", you say "but trust me, you are going to like it, it will be good for you", and I say "first, I don't trust you, and second, I already said no".

To explain the problem with trust, let's use some outside view here. I don't want to read a hundred pages of the latest specification written by Google, because I am not a security expert, I cannot verify that the functionality will actually be implemented 100% according to the specification and without bugs, and, most importantly, the specification can be updated in the future after people get used to it.

The historical experience is that first you have a TV, then you have a TV full of ads, then you agree to pay for cable TV because you acknowledge that they need money and you want to get rid of the ads, and then you find yourself paying for a cable TV which is full of ads anyway. (And the story doesn't really end here. Then you find out that your TV is collecting data about your watching habits and sending it god knows where. And maybe there is a hidden camera in the TV that looks at your face and analyzes whether you are smiling or bored, and how much attention you are paying. And I am pretty sure the story doesn't end here either, and I wonder what my children and grandchildren will consider "normal" when they are my age.) So the outside view says that whenever you give up some X in return for a promise of getting some Y back, in the long term you are going to lose both X and Y and also some unexpected Z.

Whatever value you are promising me to get back after accepting your proposal, I expect to lose it at the very next update. (That is, if the first version is implemented flawlessly, and if it truly contains no security holes. Otherwise, I expect to lose the value again five seconds after we shake hands.)

From the perspective of consent, I am not really free to say "no" here, am I? The corporations want my private data, and given the difference in power, they will get them. The bottom line is already written, and the only thing this debate could achieve is to make me feel less bad about having my preferences violated. My opinion about differential privacy or anything else makes zero difference, it's just shouting in the wind.

The description of differential privacy on Wikipedia sounds interesting, and I don't for a moment assume that it will be implemented correctly. The idea, if I understand it correctly, is that whenever someone makes a query, you add a certain (mathematically determined) amount of noise to the results. Which means the data in the database itself are without the noise. Which means we just have to trust that no one implements an extra button that would return the data without the added noise. Not even if their boss, or the government tells them to. I don't believe this for a moment.

This viewpoint on how media has changed over time seems really strange to me. My history would have been almost the reverse - a history of going from everything being expensive (pay per view TV?) and usually low quality as well to almost everything being cheap and many things being literally free, including a lot of high-quality things.

Viliam's description matches my understanding of the history. TV started out with free over-the-air broadcasting, with admittedly just a few offerings (only three channels originally). They were ad-supported, but generally had far fewer ads than today. Between the 1960s and 1980s cable was introduced, and initially it was ad-free but paid. Then they gradually introduced ads to the format once people were locked in and hooked on the shows available. Now the quantity of ads is greater than ever, plus it costs money. Today a few channels still offer that free over-the-air broadcasting, but far fewer than there used to be at the peak. So now we're in a state where you both need to pay and get a large quantity of ads.

Of course, streaming services have somewhat disrupted that model, but many streaming services both cost money and have ads. Plus to get all the same shows as on cable you need a large number of streaming services and they all add up to cost more than the cable package would!

You might say that the quality of the television shows has gone up, and you can definitely say that it's worth the tradeoff, but it's not obviously worth it to everyone.

It seems strange to say that you do not consent to people gathering data about you when you are providing it to them yourself.

If you don't want me to see your baby pictures, the easy approach is for you to not send them to me. Instead, the more common approach seems to be to send me your baby pictures, then claim that you 'do not consent' to me looking at them.

If that refers to people uploading their baby pictures on Facebook, okay.

But e.g. telemetry in operating systems is more like: "by living in a house I built (in a country where 90% of houses are built by me) you automatically consent to streaming videos from your shower".

I mostly agree, but the underlying difficulty is not technical implementation, but social (and legal) acceptance. It's almost impossible to explain the topic to a layperson who's worried about it but not very sophisticated. And it's very hard, even for experts, to define "good" and "bad" uses of anonymized-but-segmented (by business/interest/demographics/etc) data.

This topic seems to be a lot about trust. With Google's recent decision to introduce political censorship to Google Docs, it's not a company to be trusted with a lot of data.

What is this in reference to?

https://support.google.com/docs/answer/148505?fbclid=IwAR0fmSQihmU1GHJ8oU5SySyrMzTtGXR23ga92KazXxsToHpJD1gnyEOQhF0#zippy=%2Cmisleading-content

Forbidding Google Docs from being used for journalistic purposes to question official government narratives is pretty totalitarian. And there's no limit to Western governments in that policy either.

Do not distribute content that deceives, misleads, or confuses users. This includes:

  • Misleading content related to civic and democratic processes: Content that is demonstrably false and could significantly undermine participation or trust in civic or democratic processes. This includes information about public voting procedures, political candidate eligibility based on age / birthplace, election results, or census participation that contradicts official government records. It also includes incorrect claims that a political figure or government official has died, been involved in an accident, or is suffering from a sudden serious illness.
  • Misleading content related to harmful health practices: Misleading health or medical content that promotes or encourages others to engage in practices that may lead to serious physical or emotional harm to individuals, or serious public health harm.
  • Manipulated media: Media that has been technically manipulated or doctored in a way that misleads users and may pose a serious risk of egregious harm.

Misleading content may be allowed in an educational, documentary, scientific, or artistic context, but please be mindful to provide enough information to help people understand this context. In some cases, no amount of context will allow this content to remain on our platforms.

This is about content distribution, so it's not about what sort of documents you can compose on the platform, but about what sorts of documents they're willing to host for sharing. The general idea of having this kind of restriction makes sense to me: if you don't have these restrictions then people use your platform to host harmful content, and I think any individual company should be able to decide they don't want to be associated with that.

I definitely think there should be some places where you can post whatever you want, and the market for general web hosting is big enough and competitive enough that it does work out this way. Additionally, hosts are much less concerned when it's not their domain in the URL.

As for the specifics of this policy, how do you see it as prohibiting journalistic use for documenting lies by governments? It looks to me like the exceptions at the end cover that. I do think, in practice, this is a difficult distinction for content moderators to enforce, which is not great. But again, there are lots of places you can publish journalism, and Google Docs is not even a common place to do that?

(Disclosure: I work for Google, speaking only for myself)

I think any individual company should be able to decide they don't want to be associated with that.

Yes, and everybody should be able to decide not to give that company any data because they are not trustworthy. "A company should be legally able to do X" and "A company that does X deserves trust" are two very different claims.

As for the specifics of this policy, how do you see it as prohibiting journalistic use for documenting lies by governments? It looks to me like the exceptions at the end cover that.

There's a list of uses that are allowed and journalistic use is not among them. Which also makes sense as journalists are the kind of people who are usually in the best position to spread misinformation. 

This is about content distribution

The term "content distribution" has a legal meaning, and "sharing" isn't that meaning. Any use of Google Docs includes content distribution in the legal sense. If the legal sense isn't meant, the policy should be worded differently. Besides, the policy speaks of content "remaining on our platforms" and not of "we will stop your ability to share objectionable content".

Thanks for this. This is the kind of post which seems obvious in retrospect but I didn't think/know beforehand.