There are at least two types of confirmation bias.
The first is selective attention: a tendency to pay attention to, or recall, that which confirms the hypothesis you are thinking about rather than that which speaks against it.
The second is selective experimentation: a tendency to do experiments which will confirm, rather than falsify, the hypothesis.
The standard advice for both cases seems to be "explicitly look for things which would falsify the hypothesis". I think this advice is helpful, but it is subtly wrong, especially for the selective-experimentation type of confirmation bias. Selective attention is relatively straightforward, but selective experimentation is much more complex than it initially sounds.
Looking for Falsification
What the standard (Popperian) advice tells you to do is try as hard as you can to falsify your hypothesis. You should think up experiments where your beloved hypothesis really could fail.
What this advice definitely does do is guard against the mistake of running experiments which could not falsify your hypothesis. Such a test either violates conservation of expected evidence (by claiming to provide evidence one way without having any possibility of providing evidence the other way), or provides only very weak evidence for your claim (by looking much the same whether your claim is true or false). Looking for tests which can falsify your result steers you towards tests which would provide strong evidence, and helps you avoid violating conservation of expected evidence.
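Conservation of expected evidence can be checked numerically: averaging your posterior over the possible outcomes of an experiment, weighted by how likely you think each outcome is, must give back your prior. Here's a minimal sketch with invented numbers for a binary hypothesis and a binary test result:

```python
# Numerical check of conservation of expected evidence: the prior must equal
# the expectation of the posterior over possible outcomes. All probabilities
# below are illustrative, not drawn from any real experiment.

def posterior(prior_h, p_pos_given_h, p_pos_given_not_h, observed_positive):
    """Bayes update for a binary hypothesis on a binary test result."""
    p_e_h = p_pos_given_h if observed_positive else 1 - p_pos_given_h
    p_e_nh = p_pos_given_not_h if observed_positive else 1 - p_pos_given_not_h
    return prior_h * p_e_h / (prior_h * p_e_h + (1 - prior_h) * p_e_nh)

prior = 0.3
p_pos_given_h = 0.9      # test is likely positive if the hypothesis is true
p_pos_given_not_h = 0.2  # but sometimes comes out positive anyway

p_pos = prior * p_pos_given_h + (1 - prior) * p_pos_given_not_h

expected_posterior = (
    p_pos * posterior(prior, p_pos_given_h, p_pos_given_not_h, True)
    + (1 - p_pos) * posterior(prior, p_pos_given_h, p_pos_given_not_h, False)
)

assert abs(expected_posterior - prior) < 1e-12  # prior recovered exactly
```

A test that "couldn't come out the other way" would correspond to a likelihood of 1 under every hypothesis, and the update above would leave the prior untouched no matter what you observe.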
However, there are more subtle ways in which confirmation bias can act.
Predicting Results in Advance
You can propose a test which would indeed fit your hypothesis if it came out one way, and which would disconfirm your hypothesis if it came out the other way -- but where you can predict the outcome in advance. It's easy to not realize you are doing this. You'll appear to provide significant evidence for your hypothesis, but actually you've cherry-picked your evidence before even looking at it; you knew enough about the world to know where to look to see what you wanted to see.
Suppose Dr. Y studies a rare disease, Swernish syndrome. Many scientists have formed an intuition that Swernish syndrome has something to do with a chemical G-complex. Dr. Y is thinking on this one night, when the intuition crystallizes into G-complex theory, which would provide a complete explanation of how Swernish syndrome develops. G-complex theory makes the novel prediction that G-complex in the bloodstream will spike during early onset of the disease; if this were false, G-complex theory would have to be false. Dr. Y does the experiment, and finds that the spike does occur. No one has measured this before, nor has anyone else put forward a model which makes that prediction. However, it happens that anyone familiar with the details of Dr. Y's experimental results over the past decade would have strongly suspected the same spike to occur, whether or not they endorsed G-complex theory. Does the experimental result constitute significant evidence?
This is a subtle kind of double-counting of evidence. You have enough evidence to know the result of the experiment; also, your evidence has caused you to generate a hypothesis. You cannot then claim the success of the experiment as more evidence for your hypothesis: you already know what would happen, so it can't alter the certainty of your hypothesis.
If we're dealing only with personal rationality, we could invoke conservation of expected evidence again: if you already predict the outcome with high probability, you cannot simultaneously derive much evidence from it. However, in group rationality, there are plenty of cases where you want to predict an experiment in advance and then claim it as evidence. You may already be convinced, but you need to convince skeptics. So, we can't criticize someone just for being able to predict their experimental results in advance. That would be absurd. The problem is, the hypothesis isn't what did the work of predicting the outcome. Dr. Y had general world-knowledge which allowed him to select an experiment whose results would be in line with his theory.
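The point can be made quantitative. What matters to a skeptic is the likelihood ratio: how probable the result was with the new theory versus without it, where "without it" includes all the background knowledge that might predict the same result. A sketch with made-up numbers:

```python
# How much an observed result should move a skeptic depends on how likely
# the result was *without* the new theory. Numbers are invented for
# illustration.

def odds_update(prior_odds, p_obs_given_h, p_obs_given_not_h):
    """Posterior odds = prior odds times the likelihood ratio."""
    return prior_odds * (p_obs_given_h / p_obs_given_not_h)

# If background knowledge alone already made the spike likely, the
# likelihood ratio is close to 1 and the experiment barely moves the odds:
weak = odds_update(1.0, 0.95, 0.90)    # ratio ~1.06: nearly no evidence

# If the spike would have been genuinely surprising without the theory:
strong = odds_update(1.0, 0.95, 0.10)  # ratio 9.5: strong evidence
```

Dr. Y's mistake is, in effect, treating the denominator as if it were small (the spike being surprising absent G-complex theory) when his own accumulated data had already pushed it close to 1.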
To Dr. Y, it just feels like "if I am right, we will see the spike. If I am wrong, we won't see it." From the outside, we might be tempted to say that Dr. Y is not "trying hard enough to falsify G-complex theory". But how can Dr. Y use this advice to avoid the mistake? A hypothesis is an explicit model of the world, which guides your predictions. When asked to try to falsify, though, what's your guide? If you find your hypothesis very compelling, you may have difficulty imagining how it could be false. A hypothesis is solid, definite. The negation of a hypothesis includes anything else. As a result, "try to falsify your hypothesis" is very vague advice. It doesn't help that the usual practice is to test against a null hypothesis. Dr. Y tests against the spike not being there, and thinks this sufficient.
Part of the problem here is that it is not at all clear what could and could not have been predicted in advance. There's an interaction between your general world knowledge, which is not explicitly articulated, and your scientific knowledge, which is.
If all of your knowledge were explicit scientific knowledge, many biases would disappear. You couldn't possibly have hindsight bias; each hypothesis would predict the observation with a precise probability, which you could calculate.
Similarly, the failure mode I'm describing would become impossible. You could easily notice that it's not really your new hypothesis doing the work of telling you which experimental result to expect; you would know exactly what other world-knowledge you're using to design your experiment.
I think this is part of why it is useful to orient toward gear-like models. If our understanding of a subject is explicit rather than implicit, we can do a lot more to correct our reasoning. However, we'll always have large amounts of implicit, fuzzy knowledge coming into our reasoning process; so, we have to be able to deal with that.
Is "Sufficient Novelty" The Answer?
In some sense, the problem is that Dr. Y's experimental result isn't novel enough. It might be a "novel prediction" in the sense that it hasn't been explicitly predicted by anyone, but it is a prediction that could have been made without Dr. Y's new hypothesis. Extraordinary claims require extraordinary evidence, right? It isn't enough that a hypothesis makes a prediction which is new. The hypothesis should make a prediction which is really surprising.
But, this rule wouldn't be any good for practical science. Surprisingness is too subjective a standard, and it is too easy for hindsight bias to make it feel as if the result of the experiment could have been predicted all along. Besides: if you want science to be able to provide compelling evidence to skeptics, you can't throw out experiments as unscientific just because most people can predict their outcome.
Method of Multiple Hypotheses
So, how could Dr. Y have avoided the mistake?
It is meaningless to confirm or falsify a hypothesis in isolation; all you can really do is provide evidence which helps distinguish between hypotheses. This will guide you away from "mundane" tests where you actually could have predicted the outcome without your hypothesis, because there will likely be many other hypotheses which would be able to predict the outcome of that test. It guides you toward corner cases, where otherwise similar hypotheses make very different predictions.
We can unpack "try to falsify" as "come up with as many plausible alternative hypotheses as you can, and look for experiments which would rule out the others." But actually, "come up with alternative hypotheses" is more than an unpacking of "try to falsify"; it shifts you to trying to distinguish between many hypotheses, rather than focusing on "your" hypothesis as central.
The actual, exactly correct criterion for an experiment is its value-of-information. "Try to falsify your hypothesis" is a lousy approximation of this: it judges experiments by how likely they are to provide evidence against your hypothesis, or by the likelihood ratio against your hypothesis in the case where the experiment doesn't go as your hypothesis predicts, or something along those lines. Don't optimize for the wrong metric; things'll tend to go poorly for you.
Some might object that trying-to-falsify is a good heuristic, since value of information is too difficult to compute. I'd say that a much better heuristic is to pretend distinguishing the right hypothesis is equally valuable in all cases, and look for experiments that allow you to maximally differentiate between them. Come up with as many possibilities as you can, and try to differentiate between the most plausible ones.
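This heuristic can be sketched concretely: score each candidate experiment by its expected information gain over the whole set of hypotheses, i.e., how much the experiment is expected to shrink your uncertainty about which hypothesis is right. The hypotheses and likelihoods below are invented for illustration:

```python
# Score candidate experiments by expected information gain over several
# rival hypotheses, rather than by whether they could falsify one favorite.
# All numbers are invented for illustration.
import math

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def expected_info_gain(prior, likelihoods):
    """likelihoods[i] = P(positive result | hypothesis i)."""
    gain = 0.0
    for outcome in (True, False):
        like = [l if outcome else 1 - l for l in likelihoods]
        p_outcome = sum(p * l for p, l in zip(prior, like))
        if p_outcome == 0:
            continue
        post = [p * l / p_outcome for p, l in zip(prior, like)]
        gain += p_outcome * (entropy(prior) - entropy(post))
    return gain

prior = [0.4, 0.4, 0.2]          # three rival hypotheses
spike_test = [0.9, 0.9, 0.8]     # all rivals predict the spike: uninformative
corner_test = [0.9, 0.1, 0.5]    # rivals disagree sharply: informative

# The corner-case experiment has much higher expected information gain
# than the test whose outcome every rival hypothesis predicts anyway.
```

On this scoring, Dr. Y's spike experiment comes out nearly worthless once rival hypotheses that also predict the spike are on the table, while an experiment on which the rivals disagree scores high.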
Given that the data was already very suggestive of a G-complex spike, Dr. Y would most likely generate other hypotheses which also involve a G-complex spike. This would make the experiment which tests for the spike uninteresting, and suggest other more illuminating experiments.
I think "coming up with alternatives" is a somewhat underrated debiasing technique. It is discussed more in Heuer's Psychology of Intelligence Analysis and Chamberlin's Method of Multiple Working Hypotheses.