Zachary Robertson

Comments

MDP models are determined by the agent architecture and the environmental dynamics

I don't understand your point in this exchange.

Play or exercise.

I explicitly said I was going to be pedantic. It seems like a useful/necessary role to play if you, a domain expert, were confused and then switched your viewpoint. This is usually where being formal becomes useful. First, it uncovers potentially subtle hidden assumptions. Second, it may offer a general result. Third, it protects the reader (me) from 'catching' your confusion by constraining communication to just things that can be independently verified.

Having said that,

You used the word 'model' in both of your prior comments, and so the search-replace yields "state-abstraction-irrelevant abstractions." Presumably not what you meant?

This does not come off as friendly. I asked you to search for 'model-irrelevant' which is distinct from 'model'. It's just a type of state-abstraction.

That's not a "concrete difference."

I claim there is an additional alternative. Two does not equal three. Just because you don't understand something doesn't mean it's not concrete.

I suppose those comments are part of the natural breakdown of civility at the end of an internet exchange, and I'm probably no better myself. Anyway, I certainly hope you figure out your confusion, although I see it's a far stretch that my commentary is going to help you :)

MDP models are determined by the agent architecture and the environmental dynamics

I don’t think it’s a good use of time to get into this if you weren’t being specific about your usage of ‘model’ or about the claim you made previously, because I already pointed out a concrete difference: I claim it’s reasonable to say there are three alternatives, while you claim there are two alternatives.

(If it helps you, you can search-replace ‘model-irrelevant’ with ‘state-abstraction’, because I don’t use the term ‘model’ in my previous reply anyway.)

MDP models are determined by the agent architecture and the environmental dynamics

This was why I gave a precise definition of model-irrelevance. I'll step through your points using the definition,

  1. Consider the underlying environment (assumed Markovian).
  2. Consider different state/action encodings (model-irrelevant abstractions) we might supply the agent.
  3. For each, fix a reward function distribution.
  4. See what the theory predicts.

The problem I'm trying to highlight lies in point three. Each task is a reward function you could have the agent attempt to optimize. Every abstraction/encoding fixes a set of rewards under which the abstraction is model-irrelevant. This means the agent can successfully optimize these rewards.

[I]f you say "the MDP has a different model", you're either disagreeing with (1) the actual dynamics, or claiming that we will physically supply the agent with a different state/action encoding (2).

My claim is that there is a third alternative: you may claim that the reward function given to the agent does not satisfy model-irrelevance. This can be the case even if the underlying dynamics are Markovian and the abstraction of the transitions satisfies model-irrelevance.

I don't follow. Can you give a concrete example?

That may take a while. The argument above is a reasonable candidate for a lemma. A useful example would show that the third alternative exists. Do you agree this is the crux of your disagreement with my objection? If so, I might try to formalize it.
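In the meantime, here is a toy sketch of the kind of case the third alternative points at (my own construction; the numbers are made up and this is not the promised formalization): the dynamics aggregate consistently under the abstraction, yet whether model-irrelevance holds still depends on which reward function we hand the agent.

    import numpy as np

    # Toy ground MDP (all numbers made up): 3 states {s0, s1, s2}, one action.
    # T[s, s'] = P(s' | s, a).
    T = np.array([
        [0.2, 0.3, 0.5],   # from s0
        [0.4, 0.1, 0.5],   # from s1
        [0.0, 0.0, 1.0],   # from s2 (absorbing)
    ])

    # Abstraction phi: s0 and s1 are merged into abstract state 0; s2 maps to 1.
    phi = np.array([0, 0, 1])

    def transitions_aggregate(T, phi):
        """Dynamics half of model-irrelevance: ground states that share an
        abstract state must put the same total mass on every abstract state."""
        for x in np.unique(phi):
            block = T[phi == x]
            mass = np.stack([block[:, phi == z].sum(axis=1) for z in np.unique(phi)], axis=1)
            if not np.allclose(mass, mass[0]):
                return False
        return True

    def reward_respects(R, phi):
        """Reward half of model-irrelevance: merged states share the same reward."""
        return all(np.allclose(R[phi == x], R[phi == x][0]) for x in np.unique(phi))

    R_a = np.array([1.0, 1.0, 0.0])   # task A: agrees on the merged states
    R_b = np.array([1.0, 0.0, 0.0])   # task B: distinguishes s0 from s1

    print(transitions_aggregate(T, phi))   # True  -> dynamics aggregate consistently
    print(reward_respects(R_a, phi))       # True  -> model-irrelevant for task A
    print(reward_respects(R_b, phi))       # False -> fails for task B

The same encoding works for task A but not for task B even though nothing about the underlying dynamics changed, which is the sense in which this differs from alternatives (1) and (2).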

MDP models are determined by the agent architecture and the environmental dynamics

I still see room for reasonable objection.

An MDP model (technically, a rewardless MDP) is a tuple $\langle S, A, T \rangle$

I need to be pedantic. The equivocation here is where I think the problem is. To assign a reward function we need a map from the state-action space to the reals. It's not enough to just consider a 'rewardless MDP'.

When we define state and action encodings, this implicitly defines an "interface" between the agent and the environment.

As you note, the choice of state-action encoding is an implicit modeling assumption. It could be wrong, but to even discuss that we do have to be technical. To be concrete, perhaps we agree that there’s some underlying dynamics that is Markovian. The moment we give the agent sensors, we create our state abstraction for the MDP. Moreover, say we agree that our state abstraction needs to be model-irrelevant. Given a 'true' MDP $M = \langle S, A, T, R \rangle$ and a state abstraction $\phi$ that operates on $S$, we'll say that $\phi$ is model-irrelevant if, whenever $\phi(s_1) = \phi(s_2)$, for every action $a$ and every abstract state $x$ we have $R(s_1, a) = R(s_2, a)$ and $\sum_{s' \in \phi^{-1}(x)} T(s' \mid s_1, a) = \sum_{s' \in \phi^{-1}(x)} T(s' \mid s_2, a)$.

Strictly speaking, model-irrelevance is at least as hard to satisfy for a collection of MDPs as for a single MDP. In other words, we may be able to properly model a single task with an MDP, but a priori there should be skepticism that all tasks can be modeled with a specific state-abstraction (see the sketch at the end of this comment). Later on you seem to agree with this conclusion,

That's also a claim that we can, in theory, specify reward functions which distinguish between 5 googolplex variants of red-ghost-game-over. If that were true, then yes - optimal policies really would tend to "die" immediately, since they'd have so many choices.

Specifically, the agent architecture is an implicit constraint on available reward functions. I'd suspect this does generalize into a fragility/impossibility result any time the reward is given to the agent in a way that's decoupled from the agent's sensors, which is really going to be the prominent case in practice. In conclusion, you can try to work with a variable/rewardless MDP, but then this argument will apply and severely limit the usefulness of the generic theoretical analysis.
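To illustrate the 'at least as hard for a collection' point above, here is a small sketch (toy numbers, my own construction, not from the post): fix the dynamics, enumerate the state abstractions of a 3-state MDP, and count how many remain model-irrelevant as tasks are added.

    from itertools import product
    import numpy as np

    # Fixed toy dynamics: T[s, s'] = P(s' | s, a) for a single action.
    T = np.array([
        [0.2, 0.3, 0.5],
        [0.4, 0.1, 0.5],
        [0.0, 0.0, 1.0],
    ])

    def model_irrelevant(T, R, phi):
        """Check both halves of model-irrelevance for abstraction phi: merged
        ground states must share rewards and must put equal total transition
        mass on every abstract state."""
        phi = np.asarray(phi)
        for x in np.unique(phi):
            rows = np.where(phi == x)[0]
            if not np.allclose(R[rows], R[rows[0]]):
                return False
            for z in np.unique(phi):
                mass = T[np.ix_(rows, np.where(phi == z)[0])].sum(axis=1)
                if not np.allclose(mass, mass[0]):
                    return False
        return True

    def canon(labels):
        """Relabel an assignment by first occurrence so each partition of the
        state space appears exactly once."""
        seen = {}
        return tuple(seen.setdefault(v, len(seen)) for v in labels)

    abstractions = {canon(p) for p in product(range(3), repeat=3)}  # the 5 partitions

    tasks = [np.array([1.0, 1.0, 0.0]),   # task 1
             np.array([1.0, 0.0, 0.0])]   # task 2

    valid = set(abstractions)
    for k, R in enumerate(tasks, start=1):
        valid = {phi for phi in valid if model_irrelevant(T, R, phi)}
        print(f"abstractions still model-irrelevant after {k} task(s): {len(valid)}")
    # Each added task can only shrink the set; here it goes 2, then 1 (the identity).

Every added task intersects in another constraint, so asking one encoding to serve a whole distribution of tasks is strictly more demanding than asking it to serve any single one.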

The Variational Characterization of KL-Divergence, Error Catastrophes, and Generalization

Because . They are the same. Does that help?

SGD's Bias

I’m assuming we can indeed box the bias as “drift from high noise to low noise”. I wonder if flat minima necessarily have lower noise, under empirical approximation, than sharp minima. If that were the case, then you could use this to conclude that SGD does bias towards generalizable minima.

I’d look at this, but I figure you understand the SGD framework better and may have an idea about this?
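For what it's worth, here is the kind of toy measurement I have in mind, treating 'noise at a minimum' as the variance of per-example gradients there (entirely my own construction; in this quadratic family the link between sharpness and noise is built in, so it only shows how one could measure the two quantities, not that the conjecture holds for real networks):

    import numpy as np

    rng = np.random.default_rng(0)

    def grad_noise_at_minimum(curvature, n=10_000):
        """Per-example losses l_i(w) = 0.5 * curvature * (w - c_i)^2.
        The full-batch loss is minimized at mean(c) and has second derivative
        `curvature` (its sharpness). Return the variance of the per-example
        gradients at that minimum, a proxy for SGD noise."""
        c = rng.normal(0.0, 1.0, size=n)   # per-example targets
        w_star = c.mean()                  # full-batch minimizer
        per_example_grads = curvature * (w_star - c)
        return per_example_grads.var()

    print("flat  minimum (curvature 0.1):", grad_noise_at_minimum(0.1))
    print("sharp minimum (curvature 10.0):", grad_noise_at_minimum(10.0))

In this family the noise scales like curvature squared times the data spread; the empirical question is whether anything like that relationship survives for real networks, where sharpness and gradient noise are not tied together by construction.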

The Variational Characterization of KL-Divergence, Error Catastrophes, and Generalization

The term is meant to be a posterior distribution after seeing data. If you have a good prior, you could even take the posterior to be the prior itself; the cost of updating is then zero, but note the loss could be high. You want a trade-off between the cost of updating the prior (the KL term) and the loss reduction.

For example, say we have a neural network. Then our prior would be the initialization and the posterior would be the distribution of outputs from SGD.
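Schematically (my notation, not necessarily the post's), the trade-off is the usual variational / PAC-Bayes-style objective over a posterior $Q$ given a prior $P$,

$$\min_{Q} \; \mathbb{E}_{\theta \sim Q}\big[\hat{L}(\theta)\big] + \frac{1}{\beta}\, D_{\mathrm{KL}}(Q \,\|\, P),$$

where $\hat{L}$ is the empirical loss and $\beta$ sets the exchange rate. Taking $Q = P$ zeroes the KL term but leaves whatever loss the prior already achieves; pushing $Q$ onto low-loss parameters reduces the first term at the price of a larger KL.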

(Btw thanks for the correction)

Open and Welcome Thread - May 2021

They don't speak about having a PhD but the ability to get into a top 5 graduate program.

Yes they do. On the same page,

The first step on this path is usually to pursue a PhD in machine learning at a good school. It’s possible to enter without a PhD, but it’s close to a requirement in research roles at the academic centres and DeepMind, which represent a large fraction of the best positions.

Certainly there’s a bottleneck on ‘good’ schools also, but then we can strengthen the claim using what they say later about ‘top’ schools being a proxy for success.

Open and Welcome Thread - May 2021

They do say that a PhD from a top 5 program is a reasonable proxy for a position at an AI research center. These programs are supply-limited. Therefore, they are implying that top PhDs are a bottleneck. This is far upstream of everything else, so a top PhD does seem to be a reasonable proxy for the bottleneck.

NTK/GP Models of Neural Nets Can't Learn Features

I think that 'universal function approximation' and 'feature learning' are basically unrelated dimensions along which a learning algorithm can vary.

We may have reached the crux here. Say you take a time series and extract the Fourier features. By universal approximation, these features will be sufficient for any downstream learning task. So the two are related. I agree that there is no learning taking place and that such a method may be inefficient. However, that goes beyond my original objection.
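As a toy version of what I mean (synthetic data, my own construction): fix the Fourier features once, then let a purely linear read-out do the downstream task.

    import numpy as np
    from numpy.fft import rfft

    rng = np.random.default_rng(0)

    # Synthetic time series: each example is a sinusoid with a random frequency.
    n, length = 200, 128
    t = np.linspace(0.0, 1.0, length)
    freqs = rng.uniform(1.0, 10.0, size=n)
    X_time = np.sin(2 * np.pi * freqs[:, None] * t)

    # Fixed (non-learned) features: magnitudes of the Fourier coefficients.
    X_feat = np.abs(rfft(X_time, axis=1))

    # Downstream task: is the frequency above 5? A linear read-out on the
    # fixed features (least squares on +/-1 labels) handles it.
    y = np.where(freqs > 5.0, 1.0, -1.0)
    design = np.c_[X_feat, np.ones(n)]
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    acc = np.mean(np.sign(design @ coef) == y)
    print("training accuracy with fixed Fourier features:", acc)

Nothing about the features adapts to the task; only the linear read-out does, which is the sense in which 'sufficient features' and 'feature learning' come apart.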

This issue of 'embedding efficiency' seems only loosely related to the universal approximation property.

This is not a trivial question. In the paper I referenced, the authors show that the approximation efficiency of the NTK for deep and shallow networks is equivalent. However, infinitely differentiable activations can only approximate smooth functions. On the other hand, ReLU seems capable of approximating a larger class of potentially non-smooth functions.
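A minimal illustration of the smooth-versus-non-smooth point (not the efficiency result from the paper): a two-unit ReLU network represents $|x|$ exactly, while any finite network built from infinitely differentiable activations is itself smooth and so can only approximate it.

    import numpy as np

    relu = lambda z: np.maximum(z, 0.0)
    x = np.linspace(-2.0, 2.0, 9)

    # relu(x) + relu(-x) equals |x| exactly, kink at 0 included.
    print(np.allclose(relu(x) + relu(-x), np.abs(x)))   # True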
