
crabman's Comments

How to Throw Away Information in Causal DAGs

Instead of saying "f(X) contains all information in X relevant to Y", it would be better to say that f(X) contains all information in X that is relevant to Y if you don't condition on anything. Because it may be the case that if you condition on some additional random variable Z, f(X) no longer contains all relevant information.

Example:

Let X and Y be i.i.d. binary uniform random variables, i.e. each of the variables takes the value 0 with probability 0.5 and the value 1 with probability 0.5. Let Z = X ⊕ Y be another random variable, where ⊕ is the xor operation. Let f be the constant function f(x) = 0.

Then f(X) contains all information in X that is relevant to Y, because X and Y are independent, so there is no relevant information to preserve. But if we know the value of Z, then f(X) no longer contains all information in X that is relevant to Y: given Z, X determines Y = Z ⊕ X, while f(X) reveals nothing.
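A quick numerical check of this example (a sketch using the variables defined above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X and Y are i.i.d. binary uniform; Z = X xor Y; f(X) is the constant 0.
x = rng.integers(0, 2, n)
y = rng.integers(0, 2, n)
z = x ^ y

# Unconditionally, X carries no information about Y, so f(X) = 0
# throws away nothing relevant.
print(np.corrcoef(x, y)[0, 1])              # ~0.0

# Conditioned on Z, X determines Y exactly (Y = Z xor X),
# but f(X) = 0 still reveals nothing.
print(np.mean(y[z == 0] == x[z == 0]))      # 1.0
print(np.mean(y[z == 1] == 1 - x[z == 1]))  # 1.0
```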

ozziegooen's Shortform

It's definitely the first. The second is bizarre. The third can be steelmanned as "Given my evidence, an ideal thinker would estimate the probability to be 20%, and we all here have approximately the same evidence, so we all should have 20% probabilities", which is almost the same as the first.

Understanding Machine Learning (I)

Two nitpicks:

like if you want it to recognize spam emails, but you only show it aspects of the emails such that there is at best a statistically weak correlation between them and whether the email is spam or not-spam

Here "statistically weak correlation" should be "not a lot of mutual information", since correlation is only about linear dependence between random variables.

i.d.d.

Should be i.i.d.

[Personal Experiment] One Year without Junk Media

What about videogames?

[This comment is no longer endorsed by its author]
crabman's Shortform

In my understanding, here are the main features of deep convolutional neural networks (DCNNs) that make them work really well. (Disclaimer: I am not a specialist in CNNs; I have done one masters-level deep learning course, and I have worked on accelerating DCNNs for 3 months.) For each feature, I give my probability that having it is an important component of DCNN success, compared to having it only to the extent that an average non-DCNN machine learning model does (e.g. a DCNN has weight sharing; an average model doesn't).

  1. DCNNs heavily use transformations that are the same for each window of the input, i.e. weight sharing (see the sketch after this list) - 95%
  2. For any set of pixels of the input, large distances between pixels in the set make the DCNN model interactions between these pixels less accurately - 90% (perhaps the usage of dilation in some DCNNs is a counterargument to this)
  3. Large depth (together with the use of activation functions) lets us model complicated features, interactions, logic - 82%
  4. Having a lot of parameters lets us model complicated features, interactions, logic - 60%
  5. Given 3 and 4, SGD-like optimization works unexpectedly fast for some reason - 40%
  6. Given 3 and 4, SGD-like optimization with early stopping doesn't overfit too much for some reason - 87% (I am not sure whether the S in SGD is important, or how important early stopping is)
  7. Given 3 and 4, ReLU-like activation functions work really well (compared to, for example, sigmoid).
  8. Modern deep neural network libraries are easy to use compared to the baseline of not having specific well-developed libraries - 60%
  9. Deep neural networks work really fast when using modern deep neural network libraries and modern hardware - 33%
  10. DCNNs find features in photos that are invisible to the human eye and to most ML algorithms - 20%
  11. Dropout helps reduce overfitting a lot - 25%
  12. Batch normalization improves the quality of the model a lot for some reason - 15%
  13. Batch normalization makes the optimization much faster - 32%
  14. Skip connections (or residual connections, I am not sure if there's a difference) help a lot - 20%
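As a back-of-the-envelope illustration of items 1 and 2 (a sketch with made-up layer sizes, not taken from any particular architecture), weight sharing plus local windows cut the parameter count of a single layer by several orders of magnitude:

```python
# One layer mapping a 32x32 grayscale image to 8 feature maps of the
# same spatial size (illustrative numbers only).
h = w = 32   # input height and width
k = 3        # convolution kernel size (k x k window)
c_out = 8    # number of output channels

# Item 1 (weight sharing): a conv layer reuses the same k*k filters at
# every window, so its parameter count is independent of image size.
conv_params = k * k * c_out + c_out                         # 80

# Without weight sharing or locality, a fully connected layer needs a
# separate weight for every (input pixel, output unit) pair.
dense_params = (h * w) * (h * w * c_out) + (h * w * c_out)  # 8,396,800

print(conv_params, dense_params)
```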

Let me clarify how I assigned the probabilities and why I created this list. I am trying to come up with a tensor-network-based machine learning model which would have the main advantages of DCNNs, but which would not itself be a deep ReLU neural network. So I made this list to see which important components my model has.

cousin_it's Shortform

Would it be correct to say that (2) and (3) can be replaced with just "apply any linear operator"?

Also, what advantages does working with amplitudes have compared to working with probabilities? Why don't we just use probability theory?

Raemon's Scratchpad

Talk to your roommates and make an agreement that each of you, in round-robin order, orders an apartment cleaning service every X weeks. This will alleviate part of the problem.

What's going on with "provability"?

So it is perfectly okay to have a statement that is obviously true, but still cannot be proved using some set of axioms and rules.

The underlying reason is that if you imagine a Platonic realm where all abstractions allegedly exist, the problem is that there are actually multiple abstractions ["models"] compatible with ZF, but different from each other in many important ways.

So, when you say Gödel's sentence is obviously true, in which "abstraction" is it true?

What funding sources exist for technical AI safety research?

Are you interested in AI safety jobs, i.e. being hired by a company and working in their office?

The first step of rationality

The article's title is misleading. He didn't harass or rape anyone. He had sex with prostitutes and hid that from his wife.
