One of the responses to my Uber self-driving car post was objecting to Uber experimenting on public roads:
Self-driving research as practiced across the industry is in violation of basic research ethics. They should not be allowed to toss informed consent out the window, no matter how cool or revolutionary they think their research is.

I've seen this general sentiment before: if you want to run an experiment involving people you need to get their consent, and get approval from an IRB, right?
While academia and medicine do run on a model of informed consent, it's not required or even customary in most fields. Experimentation is widespread, as organizations want to learn what effect their actions have. Online companies run tons of a/b tests. UPS ran experiments on routing and found it was more efficient if they planned routes to avoid left turns. Companies introduce new products in test markets. This is all very standard and has been happening for decades, though automation has made it easier and cheaper, so there's more now.
When you look at historical cases of experimentation gone wrong, the problem is generally that the intervention was unethical on its own. Leaving syphilis untreated, infecting people with diseases, telling people to shock others, and dropping mosquitoes from planes are all things you normally shouldn't do. The problem in these cases wasn't that they were experimenting on people, but that they were harming people.
Similarly, the problem with Uber's car was that if you have an automatic driving system that can't recognize pedestrians, can't anticipate the movements of jaywalkers, freezes in response to dangerous situations, and won't brake to mitigate collisions, it is absolutely nowhere near ready to guide a car on public roads.
We have a weird situation where the rules for experimentation in academia and medicine are much more restrictive than everywhere else. So restrictive that even a very simple study where you do everything you normally do but also record whether two diagnostics agreed with each other can be bureaucratically impractical to run. We should remove most of these restrictions: you should still have to get approval and informed consent if you want to hurt people or violate a duty you have to them, but "if it's ok to do A or B then it's fine to run an experiment on A vs B" should apply everywhere.
(I wrote something similar earlier, after facebook's sentiment analysis experiment.)
Comment via: facebook
The first sentence is almost true,* but the second sentence is false. The first sentence is the relevant point for this post, so maybe all I have to say is nitpicking, but I'm going to say it. Scott's experience is not representative. There are many aspects that caused it to be held to a higher standard. Your use of the word "even" implies that it was held to a lower standard, and thus is representative, even a conservative estimate. While it is true that such a study should be held to a lower standard, it was in fact held to a standard much higher than usual in academia. Or, rather, it was probably not held to a standard, but simply sabotaged. Indeed, the very nature of the project, attacking existing tools, showed that he was "not a team player" and probably contributed to his treatment.
IRBs have arbitrary power, not high standards. At research universities, research is the principal revenue center and thus IRBs allow research to occur. Since academics do publish human subjects research, we know that IRBs often have lower standards than Scott's hospital's. The hospital was not a research hospital and thus the IRB was a vestigial organ without much pressure to actually function. Moreover, Scott did not have a research grant, so his project was a pure cost center.
*The standards in practice are, on average, very high, but the first sentence is wrong to claim that academia has rules. Sometimes academics are acclaimed for attempting murder. If they actually had rules, low or high, they would require an IRB for joining a gang.
Edit: inserted disclaimer as second sentence.
Edit2: Also, I meant to object to the word "bureaucratically." To some people this implies consistency, which is exactly what I was trying to rebut in this comment. But to others it means plausible deniability, which is what I am claiming. More: For example, in Scott's case, it was not clear who had the authority to grant him exemptions. Whether this was incompetence or malice, this was a standard example of a bureaucracy failing to do what it claims to do, not an example of stringent standards. It reminds me of this post and its distinction between types of compliance costs.
Allowing A and B, and allowing an experiment on A vs. B, may create different incentives, and these incentives may be different enough to change whether we should allow the experiment versus allowing A and B.
Maybe, but if this is common enough to justify limiting experimentation I'd expect people to be able to easily find examples.
I'm not making that argument, but I do think that it's easy to produce examples. For example, that was the problem with the Tuskegee experiment. The original incarnation was harmless, merely failing to provide expensive treatment that wouldn't have been provided by default, a problem only under Schrödinger's ethics. But later the investigators interfered with several other groups (e.g., the WWII military) who wanted to provide treatment.
Isn't the problem that the human driver wasn't paying attention? My car also cannot recognize pedestrians etc, but it's fine to allow it on public roads because I am vigilant.
To the extent that Uber is at fault (rather than their employee), it seems to me that it's not that they let their cars on the road before they were advanced enough; it's that they didn't adequately ensure that their drivers would be vigilant (via eye-tracking, training, etc.).
The NTSB report was released last week, showing that Uber's engineering was doing some things very wrong (with specifics that had not been reported before). Self-driving programs shouldn't go on public roads with that kind of system, even with a driver ready to take over.
I've seen that. Maybe I'm missing something, but I still stand by my comment. My car is even less capable than the vehicles described there and it's fine to drive.
Seems like the only reason that my car should be allowed on the roads but these should not be is some kind of expectation of moral hazard or false confidence on the part of the driver. No?
Perhaps one could argue that the car is in an uncanny valley where no one can be taught to monitor it correctly. But then it seems like that should be the emphasis rather than simply that the car was not good enough yet at driving itself.
Humans are known to be extremely bad at this kind of task (passively watching something for hours while remaining ready to respond to danger within a few seconds) and Uber should have known this. If Uber wanted to go ahead with this bad strategy anyway, it should have screened its employees to make sure they were capable of the task they were given.
I don't think anyone is capable of it. A system that depends on passive vigilance and instant response from a human is broken from the start. Selection and training will not change this. You cannot select for what does not exist, nor train for what cannot be done. There's a gap that has to be crossed between involving the human at all times and involving the human not at all.
For those who haven't seen it, starting at second 15 here, the driver can be seen to be looking down (presumably at their phone), for 6 full seconds before looking up and realizing that they're about to hit someone. This would not be safe to do in any car.
There are now actual driverless cars in Phoenix that you can hail. If they get into an emergency situation they need to resolve it entirely on their own because there isn't time to bring anyone else in.
The step before this was probably having a safety driver in the car who isn't expected to take over immediately, but can do things like move the car to the side of the road after an emergency stop. In that case the person in the driver's seat spending most of their time reading their phone would be safe.
This presupposes that the fact that online companies have been doing it makes it ethical. Giving different results to different people for the same input is unethical. Even in just the online realm, it can cause major issues for people with learning disabilities or older people who aren't able to deal with change. If they need help with software, it can be a blocker for them if what they experience is different from what they see in help pages or on other people's computers. If either A or B is better for the user, they are getting discriminated against by the random algorithm that chooses which version of the software to show them. The only thing to make it possibly ethical is to allow the users to choose between the A and B versions in the settings, and even that is iffy because users will likely not know that they can choose.
You're going to need to give more justification for this. Here are some examples that I think even someone who's skeptical should be ok with:
If we both get mystery-flavor dum-dum lollipops they won't taste the same.
If we both open packs of Magic cards you might get much better cards than I do.
If we search Gmail for a phrase we'll get different results.
If we search Facebook for "John Smith" we should see different profiles, since FB considers the friend graph in ranking responses.
If I search Amazon for "piezos" it shows me piezo pickup disks, but if I search it in an incognito window I get "Showing results for piezas". This is because it has learned something about what sort of products I'm likely to want to buy.
If we ask for directions on Waze we may get different routings. All the routes it sends people on are reasonable ones (as far as it knows) and you get much better routing than you'd get from a hypothetical Waze that didn't have all its users as an experimental pool.
You give two arguments:
It sounds like you're mostly talking about user-interface experiments? Like, if Tumblr shows me different results than it shows you, that doesn't limit your ability to help me, or my ability to use help pages. Even just with UI experiments, your argument proves too much: it says it's unethical for companies to ever change their UI. Now people who are used to it working one way all need to learn how to use the new interface. And all the Stack Overflow answers are wrong now. But clearly making changes to your UI is ok!
Companies run A/B tests when they don't know which of A or B is better, and running these tests allows them to make products that are better than if they didn't run the tests. Giving everyone worse outcomes to make sure everyone always gets identical outcomes would not be an improvement.
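As a toy illustration of how such a test resolves "we don't know which of A or B is better" (my own sketch; the function name and all counts are hypothetical, not from any company's tooling), here is the standard two-proportion z-test a company might apply to conversion counts from the two variants:

```python
import math

def ab_z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: how surprising is B's conversion rate vs A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se                         # |z| > 1.96 ~ significant at 5%

# Hypothetical numbers: A converts 200/10,000 users, B converts 260/10,000.
z = ab_z_score(200, 10_000, 260, 10_000)
print(round(z, 2))  # → 2.83, so B's lift is unlikely to be noise
```

With these made-up numbers the company would ship B; with a |z| under 1.96 it would usually keep collecting data or call the variants equivalent.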
Are there other reasons behind your claim?
Addendum to my other comment:
Empirically, as a trend across the industry, this has turned out to be false. “Design by A/B test” has dramatically eroded the quality of UI/UX design over the last 10-15 years.
On the contrary, it quite often would be an improvement—and a big one. Not only are “worse” outcomes by the metrics usually used in A/B tests often not even actually worse by any measure that users might care about, but the gains from consistency (both synchronic and diachronic) are commonly underestimated (for example—clearly—by you); in fact such gains are massive, and compound in the long run. Inconsistency, on the other hand, has many detrimental knock-on effects (increased centralization and dependence on unaccountable authorities, un-democratization of expertise, increased education and support costs, the creation and maintenance of a self-perpetuating expert class and the power imbalances that result—all of these things are either directly caused, or exacerbated, by the synchronic and diachronic UI inconsistency that is rampant in today’s software).
One man’s modus tollens is another’s modus ponens. I wouldn’t go so far as to say “ever”, but the frequency of UI changes that is commonplace today, I would say, is indeed unethical. I do not agree that “clearly making changes to your UI is ok”. It may be fine—there may be good reasons to do it—but as far as I’m concerned, the default is that it’s not fine.
The fact is, “people who are used to it working one way … all need to learn how to use the new interface” is a serious, and seriously underappreciated, problem in today’s UX design practices. Many, many hours of productivity are lost to constant, pointless UI changes; a vast amount of frustration is caused. What, in sum, is the human toll of all of this—this self-indulgent experimentation by UX designers, this constant “innovation” and chasing after novelty? It’s not small; not small by any means.
I say that it is unethical. I say that if we, UX designers, had a stronger sense of professional ethics, then we would not do this, and instead would enshrine “thou shalt not change the UI unless you’re damn sure that it’s necessary and good for all users—existing ones most especially” in our professional codes of conduct.
In short: the argument given in the grandparent proves exactly as much as it should.
And they needn’t be terribly dramatic reasons; “we added a feature” is a fine reason to change the UI… just enough to accommodate that feature.
Changing UIs has costs to users. So does charging for your service. Is charging for your service unethical? Think about the vast amount of frustration caused by people not having enough money, just so the company can shovel even more money onto already overpaid CEOs. (Want to modus again?)
I do think companies should seriously consider the disruption UI changes cause, just like they seriously consider the disruption of price increases, and often it will make sense for the company to put in extra development to save their users' frustration. For example, for large changes like the ~2011 Gmail redesign you can have a period of offering both UIs with a toggle to switch between them. (And stats on how people use that toggle give you very useful information about how the redesign is working.)
Companies that followed your suggestions would, over the years, look very dated. Their UIs wouldn't be missing features, exactly, but their features would be clunky, having been patched onto UIs that were designed around an earlier understanding of the problem. As the world changed, and which features were most useful to users changed, the UI would keep emphasizing whatever was originally most important. Users would leave for products offered by new companies that better fit their needs, and the company would especially have a hard time getting new users.
“Dated” is not a problem unless you treat UX design like fashion. UIs don’t rust.
The “earlier understanding” of many problems in UX design was more correct. Knowledge and understanding in the industry has, in many cases, degenerated, not improved.
Yes, this is certainly the story that designers, engineers, and managers tell themselves. Sometimes it’s even true. Often it’s a lie, to cover the design-as-fashion dynamic.
Charging for your service isn’t unethical—though overcharging certainly might be! If companies didn’t charge for their service, they couldn’t provide it (and in cases where this isn’t true, the ethics of charging should certainly be examined). So, yes, once again.
But that’s not the important point. Consider this thought experiment: how much value, translated into money, does the company gain from constant, unnecessary UI changes? Does the company even gain anything from this, or only the designers within it? If the company does gain some value from it, how much of this value is merely from not losing in zero-sum signaling/fashion races with other companies in the industry? And, finally, having arrived at a figure—how does this compare with the aggregate value lost by users?
The entire exercise is vastly negative-sum. It is destructive of value on a massive scale. Nothing even remotely like “charging money for products or services” can compare to it. Every CEO in the world can go and buy themselves five additional yachts, right now, and raise prices accordingly, and if in exchange this nonsense of “UX design as fashion” dies forever, I will consider that to be an astoundingly favorable bargain.
That is, changes not motivated by specific usability flaws, specific feature additions, etc.
"Dated" is a problem for companies because users care about it in selecting products. Compare:
Original GMail: https://upload.wikimedia.org/wikipedia/en/6/67/Gmail_2004.png
Current GMail: https://upload.wikimedia.org/wikipedia/en/1/1b/Gmail_inbox_in_Japanese.png
The first UI isn't "rusted", but users looking at it will have a low impression of it and will prefer competing products with newer UIs. I don't think fashion is the main motivator here, but it is real and you can't make it go away just by unilaterally stopping playing. (I mean I can but I'm an individual running a personal website, not a company.)
How so? I can think of cases where earlier UX was a better fit for experienced users and newer UXes are "dumbed down", is that what you mean?
Let's take a case where all the externalities should be internalized: internal tooling at a well run company. I use many internal UIs in my day-to-day work, and every so often one of them is reworked. There's not much in the way of fashion here, since it's internal, but there are still UI changes. The kind of general "let's redo the UI and stop being stuck in a local maximum" is the main motivation, and I'm generally pretty happy with it.
I don't think the public-facing version is that different. If there was massive value destruction then users would move to software that changed UI less.
Mistakenly, of course. This is a well-attested problem, and is fundamental to this entire topic of discussion.
No, the halo effect is the main motivator.
I never said that you could. (Although, in fact, I will now say that you can do so to a much greater extent than people usually assume, though not, of course, completely.)
In part. A full treatment of this question is beyond the scope of a tangential comment thread, though indeed the question is worthy of a full treatment. I will have to decline to elaborate for now.
In practice this is often impossible. For example, how do I move to a browser with which I can effectively browse every website, but whose UI stays static? I can’t (in large part because of anti-competitive behavior and general shadiness on the part of Google, in part because of other trends).
The fact is that such simplistic, spherical-cow models of user behavior and systemic incentives fail to capture a large number and scope of “Molochian” dynamics in the tech industry (and the world at large).
I'm not sure that this is mistaken: companies that can keep their UI current can probably, in general, make better software. This probably only holds for large companies: small companies face more of a choice of what to prioritize, while large companies that look like they're from 2005 are more likely to be environments that can't get anything done.
I'm generally pretty retrogrouch, and do often prefer older interfaces (I live on the command line, code in emacs, etc). But I also recognize that different interfaces work well for different people and as more people start using tech I get farther and farther from the norm.
That was how I interpreted your suggestion that UX people start to follow a "change UIs only when functionality demands". Anyone who tried to do the "responsible" thing would lose out to less responsible folks. Even if you got a large group of UX people to refuse work they considered to be changing UIs for fashion, companies are in a much stronger position since the barrier to entry for UX work is relatively low.
The rendering engines of Chrome/Edge/Opera (Blink), Safari (WebKit), and Firefox (Gecko) are all open source and there are many projects that wrap their own UI around a rendering engine. The amount of work is really not that much, especially on mobile (where iOS requires you to take this approach). If this was something that many people cared about it would not be hard for open source projects to take it on, or companies to sell it. That no one is prioritizing a UI-stable browser really is strong evidence that there's not much demand.
Not sure what you're referring to here?
To the contrary: companies that update their UI to be “current” probably, in general, make worse software (and not only in virtue of the fact that the UI updates often directly make the software worse).
Do they? It’s funny; I’ve seen this sort of sentiment quite a few times. It’s always either “well, actually, I like older UIs, but newer UIs work better [in unspecified ways] for some people [but not me]”, or “I prefer newer UIs, because they’re [vague handwaving about ‘modern’, ‘current’, ‘clean’, ‘not outdated’, etc.,]”. Much less frequent, somehow—to the point of being almost totally absent from my experience—are sentiments along the lines of “I prefer modern UIs, for the following specific reasons; they are superior to older UIs, which have the following specific flaws (which modern UIs lack)”.
But note that this objection essentially concedes the point: that the pressure toward “modernization” of UX design is a Molochian race to the bottom.
I have a hard time believing that you are serious, here. I find this to be an absurd claim.
Once again, it is difficult for me to believe that you actually don’t know what I’m talking about—you would have to have spent the last five years, at the very least, not paying any attention to developments in web technologies. But if that’s so, then perhaps the inferential distance between us is too great.
I think maybe what's going on is that people who are good at talking about what they like generally prefer older approaches? But if you run usability tests, focus groups, A/B tests, etc you see users do better with modern UIs.
I do think there's a coordination failure here, as there is in any signaling situation. I think it explains less of what's going on than you do, and I also don't think getting UX people to agree on a code of ethics that prohibited non-feature-driven UI changes would be useful. (I also can't tell if that's a proposal you're still pushing.)
To be specific, I'm estimating that the amount of work required to build and maintain a simple and constant UI wrapper around a browser rendering engine is about one full time experienced engineer for two weeks to build and then 10% of their time (usually 0% but occasionally a lot of work when the underlying implementation changes) going forward. The interface between the engine and the UI is pretty clean. For example, have a look at Apple's documentation for WebView:
The situation on Android is similar. Hundreds of apps, including many single-developer ones, use WebView to bring a web browser into their app, with the UI fully under their control.
I've been paying a lot of attention to this, since that's been the core of what I've worked on since 2012: first on mod_pagespeed and now on GPT. When I look back at the last five years of web technology changes the main things I see (not exhaustive, just what I remember) are:
I'm still not sure what you're referring to?
(As before: I work at Google, and am commenting only for myself.)
At first glance this seems to me like "everything was better in the past". It seems to me like a website that's stuck in how things were done in the past, like Wikipedia, which doesn't do any A/B tests, loses in usability compared to more modern websites that are highly optimized.
In the company where I work we don't have A/B tests, and plenty of changes are made for reasons of internal company politics; as a result the users still suffer from bad UI changes.
How do you get “everything was better in the past” out of what I wrote?
I am saying that one specific category of thing was better in the past. For this to be unbelievable to you, to trigger this sort of response, you must believe that nothing was better in the past—which is surely absurd, yes?
Wikipedia has considerably superior usability to the majority of modern websites.
To write a comment on this website I can click on "reply", then write my text and click "submit". On Wikipedia I would have to click on "edit" then find the right section to reply to. Once I have found it I have to decide on the right combination of * and : to put in front of my reply. After writing my comment I have to sign it by writing ~~~~. After jumping through those hoops I can click on "publish" (a recent change because user research suggested people were confused by "save").
Then if I'm lucky my post is published. If I'm unlucky I have to deal with a merge conflict. It's hard for me to see Wikipedia here as user-friendly.
This creates a pressure where some discussions about wiki editing get pushed to Facebook or Telegram groups that are more user-friendly because it takes a lot less effort to write a new message.
When it comes to menus, you have a left-side menu. You have the menus on the left and right sides at the top of the article. Then you have the top menu on the right side. It's not clear to a new user why "related changes" is somewhere completely different than "history".
More importantly, the kind of results that A/B testing reveals are often not as obvious, but their effects accumulate. The fact that Wikipedia lost editors over the last decade is for me a sign that they weren't effective at evolving software that people actually want to use to contribute.
Wikipedia is generally pretty good, but the "lines run the full width of your monitor on desktop no matter how wide your screen" is terrible.
If that’s the most severe (or one of the most severe) problems with Wikipedia’s UI that you can think of, then this only proves my point. As you say, Wikipedia is generally pretty good—which cannot be said for the overwhelming majority of modern websites, even—especially!—those that (quite correctly and reasonably) conform to the “limit text column width” typographic guideline.
I didn't introduce Wikipedia as an example of a site with poor UI. I think it's pretty good aside from, as I said, the line width issue. It's also in a space that people have a lot of experience with: displaying textual information to people. Wikipedia could likely benefit from some A/B tests to optimize their page load times, but that's all behind the scenes.
Another ethical consideration is that most A/B tests aren't aimed to help the user, but to improve metrics that matter to the company, like engagement or conversion rate. All the sketchy stuff you see on the web - sign up for your free account, install our mobile app, allow notifications, autoplay videos, social buttons, fixed headers and footers, animated ads - was probably justified by A/B testing at some point.
Companies optimize for making money, and while ideally they do that by providing value for people, in some situations they'll do that best by annoying users. The problem here is bad incentives, though, and if you took away A/B testing you'd just see cargo culting instead.
I agree that A/B tests aren't evil, and are often useful. All I'm saying is, sometimes they give ammo to dark pattern thinking in the minds of people within your company, and reversing that process isn't easy.
It's not just money, but short term profits. A/B testing is an exceptionally good tool for measuring short term profits, but not as good a tool for measuring long term changes in behavior that come as a result of "optimized" design.
Anyone who has a long-term view into user identity (FB, email providers, anywhere you log in) can totally do long-term experiments and account for user learning effects. Google published a good paper about this: Focusing on the Long-term: It’s Good for Users and Business (2015)
(Disclosure: I work for Google)
And, as that paper inadvertently demonstrates (among others, including my own A/B testing), most companies manage to not run any of those long-term experiments and do things like overload ads to get short-term revenue boosts at the cost of both user happiness and their own long-term bottom line.
That includes Google: note that at the end of a paper published in 2015, for a company which has been around for a while in the online ad business, let us say, they are shocked to realize they are running way too many ads and can boost revenue by cutting ad load.
Ads are the core of Google's business and the core of all A/B testing as practiced. Ads are the first, second, third, and last thing any online business will A/B test, and if there's time left over, maybe something else will get tested. If even Google can fuck that up for so long so badly, what else are they fucking up UI-wise? A fortiori, what else is everyone else online fucking up even worse?
The claim was that A/B testing was "not as good a tool for measuring long term changes in behavior" and I'm saying that A/B testing is a very good tool for that purpose. That companies generally don't do it I think is mostly a lack of long-term focus, independent of experiments. I'm sure Amazon does it.
The paper was published in 2015, but describes work on estimating long-term value going back to at least 2007. It sounds like you're referring to the end of section five, where they say "In 2013 we ran experiments that changed the ad load on mobile devices ... This and similar ads blindness studies led to a sequence of launches that decreased the search ad load on Google’s mobile traffic by 50%, resulting in dramatic gains in user experience metrics." By 2013 they were certainly already taking into account long-term value, even on mobile (which was pretty small until just around 2013). This section isn't saying "we set the threshold for the number of ads to run too high" but "we were able to use our long-term value measurements to better figure out which ads not to run". So I don't think "if even Google can fuck that up for so long so badly" is a good reading of the paper.
I work in display ads and I don't think this is right. Where you see the most A/B testing is in funnels. If you're selling something the gains from optimizing the flow from "user arrives on your site" to "user finishes buying the thing" are often enormous, like >10x. Whereas with ads if you just stick AdSense or something similar on your page you're going to be within, say, 60% of where you could be with a super complicated header bidding setup. And if you want to make more money with ads your time is better spent on negotiating direct deals with advertisers than on A/B testing. I dearly wish I could get publishers to A/B test their ad setups!
And the paper you linked showed that it wasn't being done for most of Google's history. If Google doesn't do it, I would be doubtful if anyone, even a peer like Amazon, does. Is it such a good tool if no one uses it?
Which is just another way of saying that before then they hadn't used their long-term value measurements to figure out what threshold of ads to run before. Whether 2015 or 2013, this is damning. (As are, of course, the other ones I collate, with the exception of Mozilla who don't dare make an explosive move like shipping adblockers installed by default, so the VoI to them is minimal.)
The result which would have been exculpatory is if they said, "we ran an extra-special long-term experiment to check we weren't fucking up anything, and it turns out that, thanks to all our earlier long-term experiments dating back many years which were run on a regular basis as a matter of course, we had already gotten it about right! Phew! We don't need to worry about it after all. Turns out we hadn't A/B-tested our way into a user-hostile design by using wrong or short-sighted metrics. Boy it sure would be bad if we had designed things so badly that simply reducing ads could increase revenue so much." But that is not what they said.
This is a nitpick, but 2000-2007 (the period between when AdWords launched and when the paper says they started quantitative ad blindness research) is 1/3 of Google's history, not "most".
I'm also not sure if the experiments could have been run much earlier, because I'm not sure identity was stable enough before users were signing into search pages.
Also, this sort of optimization isn't that valuable compared to much bigger opportunities for growth they had in the early 2000s.
Why are you saying Google doesn't do it? I understand arguing about whether Google was doing it at various times, whether they should have prioritized it more highly, etc, but it's clearly used and I've talked to people who work on it.
Would you be interested in betting on whether Amazon has quantified the effects of ad blindness? I think we could probably find an Amazon employee to verify.
It's specifically about mobile, which in 2013 was only about 10% of traffic and much less by monetization. Similar desktop experiments had been run earlier.
But I also think you're misinterpreting the paper to be about "how many ads should we run" and that those launches simply reduced the number of ads they were running. I'm claiming that the tuning of how many ads to run to maximize long-term value was already pretty good by 2013, but having a better experimental framework allowed them to increase long-term value by figuring out which specific kinds of ads to run or not run. As a rough example (from my head, I haven't looked at these launches) imagine an advertiser is willing to pay you a lot to run a bad ad that makes people pay less attention to your ads overall. If you turn down your threshold for how many ads to show, this bad ad will still get through. Measuring this kind of negative externality that varies on a per-ad basis is really hard, and it's especially hard if you have to run very long experiments to quantify the effect. One of the powerful tools in the paper is estimating long-term impacts from short-term metrics so you can iterate faster, which makes it easier to evaluate many things including these kinds of externalities.
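To make the "estimate long-term impacts from short-term metrics" idea concrete, here's a toy sketch (all numbers invented, not from the paper): use a handful of past long-running experiments to learn how a short-term metric change predicts the long-term outcome, then apply that model to new experiments so they only need to run long enough to measure the short-term metric.

```python
# Toy illustration: learn a mapping from short-term metric deltas to
# long-term outcome deltas using past long-running experiments, then
# use it to evaluate a new experiment quickly. Numbers are made up.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Past long-running experiments:
# (short-term metric change %, long-term outcome change %)
short_term = [-2.0, -1.0, 0.0, 1.0, 2.0]
long_term  = [-3.1, -1.4, 0.1, 1.6, 2.9]

a, b = fit_line(short_term, long_term)

# A new experiment only needs to measure the short-term metric;
# the long-term impact is then estimated from the learned model.
predicted_long_term = a + b * (-0.5)
print(round(predicted_long_term, 2))
```

Real systems use much richer models than a single regression, but this is the basic structure: pay the cost of slow experiments once, then iterate fast on the learned proxy.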
(As before, speaking only for myself and not for Google)
This is really cool, thanks for the link!
The question is whether the cost of the test itself (users being confused by new UIs) outweighs the benefit of running the test. In my personal experience, both as a user and as tech-support, the benefits of new UIs are, at best, marginal. The costs, however, are considerable.
The unstated assumption in your assertion is that A/B testing is the only way for companies to get feedback on their UIs. It isn't. They can do user-testing with focus groups, and I would be willing to wager that they would learn as much from the focus groups as they would from the A/B tests on their production UI. The only reason to prefer A/B tests in production is because it's cheaper, and the only reason it's cheaper is because you've offloaded the externality of having to learn a new UI onto the user.
(Assuming we're still talking about A/B testing significant changes to UIs on products that a lot of people use, which is a very small part of A/B testing)
Wait, I don't think this. Running lots of tiny tests and dogfooding can both give you early feedback about product changes before rolling them out. You can run extensive focus groups with real users once you have something ready to release. But if you take the results from those tests and just launch to 100%, sometimes you're going to make bad decisions. Real user testing is especially good for catching issues that apply infrequently, affect populations that are hard to bring in for focus groups, or only come up after a long time using the product.
Here's an example of how I think these should be approached:
Say eBay was considering a major redesign of their seller UI. They felt like their current UI was near a local maximum, but if they reworked it they could get somewhere much better.
They run mockups by some people who don't currently sell on eBay, and they like how much easier it is to list products.
They build out something fake but interactive and run focus groups, which are also positive.
They implement the new version and make it available under a new URL, and add a link to the old version that says "try the new eBay" (and a link to the new one that says "switch back to the old eBay").
When people try the new UI and then choose to switch back they're offered a comment box where they can say why they're switching. Most people leave it blank, and it's annoying to triage all the comments, but there are some real bugs and the team fixes them.
At first they just pay attention to the behavior of people who click the link: are they running into errors? Are they more or less likely to abandon listings? This isn't an A/B test and they don't have a proper control group, because users are self-selected and learning effects make clean comparisons hard, but they can get rough metrics that let them know if there are major issues they didn't anticipate. Some things come up, they fix them.
They start a controlled experiment where people opening the seller UI for the first time get either the new or old UI, still with the buttons for switching in the upper corner. They use "intention to treat" and compare selling success between the two groups. Some key metrics are worse, they figure out why, they fix them. This experiment starts looking positive.
They start switching a small fraction of existing users over, and again look at how it goes and how many users chose to switch back to the old UI. Not too many switch back, and they ramp the experiment up.
They add a note to the old UI saying that it's going away and encouraging people to try out the new UI.
They announce a deprecation date for the old UI and ramp up the experiment to move people over. At this point the only people on the old UI are people who've tried the new UI and switched back.
They put popups in the old UI asking people to say why they're not switching. They fix issues that come up there.
They turn down the old UI.
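The "intention to treat" analysis in the controlled-experiment step above can be sketched in a few lines (data and field names invented for illustration): users are compared by the UI they were *assigned*, even if they later clicked "switch back to the old eBay", which avoids the bias of self-selected switchers.

```python
# Hypothetical intention-to-treat comparison for the rollout above.
# Users are grouped by assigned arm, ignoring whether they switched back.

def itt_success_rate(users):
    """Listing-completion rate per assigned arm."""
    totals = {}
    for u in users:
        arm = u["assigned"]  # "old" or "new", fixed at first visit
        completed, count = totals.get(arm, (0, 0))
        totals[arm] = (completed + u["completed_listing"], count + 1)
    return {arm: completed / count
            for arm, (completed, count) in totals.items()}

users = [
    {"assigned": "new", "switched_back": False, "completed_listing": 1},
    {"assigned": "new", "switched_back": True,  "completed_listing": 0},
    {"assigned": "old", "switched_back": False, "completed_listing": 1},
    {"assigned": "old", "switched_back": False, "completed_listing": 0},
]
print(itt_success_rate(users))
```

The second user switched back to the old UI but still counts against the "new" arm; analyzing only users who stayed on their assigned UI would make the new UI look better than it is.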
It sounds like you're saying they should skip all the steps after "They implement the new version and make it available under a new URL" and jump right to "They turn down the old UI"?
That whole process seems plausibly ethical. The problem is that most companies go straight from "considering a major redesign" to "implement the new version" and then switch half of users over to the new UI and leave half on the old UI. And even with that whole process, I have literally seen dissociative episodes occur because of having a user interface changed (specifically, the Gmail interface update that happened last year). It should be done only with extreme care.
Are you talking about the Inbox deprecation?
No, the one described in https://www.theverge.com/2018/4/12/17227974/google-gmail-design-features-update-2018-redesign that came in April 2018
That's not talking about a UI refresh, but about Gmail adding new features:
Is that what you're talking about or am I still looking at the wrong thing?
That rollout of new features also included a UI refresh making it look "cleaner."
See https://www.cultofmac.com/544433/how-to-switch-on-new-gmail-redesign/, this HN post, and https://www.theverge.com/2018/4/25/17277360/gmail-redesign-live-features-google-update which says "The new look, which exhibits a lot of softer forms and pill-shaped buttons, will have to prove itself over time"