Algon's Shortform

10th Oct 2022

This is a special post for quick takes by Algon. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

[-]Algon1y*60

Question: What's going on from a Bayesian perspective when you have two conflicting intuitions and don't know how to resolve them? Or learn some new info which rules out a theory, but you don't understand how precisely it rules it out?

Hypothesis: The correction flows down a different path than down the path which is generating the original theory/intuition. That is, we've failed to propagate info down our network and so you have a left-over circuit that believes in the theory which still has high weight.

[-]Algon9mo40

I'm working on some articles on why powerful AI may come soon, and why that may kill us all. The articles are for a typical smart person, and for knowledgeable people to share with their family/friends. Which intro do you prefer, A or B?

A) "Companies are racing to build smarter-than-human AI. Experts think they may succeed in the next decade. But more than “building” it, they’re “growing” it — and nobody knows how the resulting systems work. Experts vehemently disagree on whether we’ll lose control and see them kill us all. And although serious people are talking about extinction risk, humanity does not have a plan. The rest of this section goes into more detail about how all this could be true."

B) "Companies are racing to grow smarter-than-human AIs. More and more experts think they’ll succeed within the next decade. And we do grow modern AI — which means no one knows how they work, not even their creators. All this is in spite of the vehement disagreement amongst experts about how likely it is that smarter-than-human AI will kill us all. Which makes the lack of a plan on humanity’s part for preventing these risks all the more striking.

These articles explain why you should expect smarter than human AI to come soon, and why that may lead to our extinction. "

[-]Lorec9mo54

[A], just 'cause I anticipate the 'More and more' will turn people off [it sounds like it's trying to call the direction of the winds rather than just where things are at].

[ Thanks for doing this work, by the way. ]

[-]Algon9mo30

Thanks for the feedback!

[-]Nathan Helm-Burger9mo40

A, since I think the point about growing vs constructing is good, but does need that explanation.

[-]Algon1y40

"So you make continuous simulations of systems using digital computers running on top of a continuous substrate that's ultimately made of discrete particles which are really just continuous fluctuations in a quantized field?"
"Yup."
"That's disgusting!"
"That's hurtful. And aren't you guys running digital machines made out of continuous parts, which are really just discrete at the bottom?"
"It's not the same! This is a beautiful instance of the divine principle 'as above, so below'. (Which I'm amazed your lot recognized.) Entirely unlike your ramshackle tower of leaking abstractions."
"You know, if it makes you feel any better, some of us speculate that spacetime is actually discretized."
"I'm going to barf."
"How do you even do that anyway? I was reading a novel the other day, and it said -"
"Don't believe everything you hear in books. Besides, I read that thing. That world was continuous at the bottom, with one layer of discrete objects on top. Respectable enough, though I don't see how that stuff can think."
"You're really prejudiced, you know that?"
"Sod off. At least I know what I believe. Meanwhile, you can't stop flip-flopping between the nature of your metaphysics."

[-]Algon3y*40

In "Proofs and Refutations", Imre Lakatos^[1] portrays a Socratic discussion between teacher and students as they try to prove Euler's theorem V + F = E + 2. The beauty of this essay is that the discussion mirrors the historical development of the subject, whilst also critiquing the formalist school of thought, the modern agenda of meta-mathematics, and how it doesn't fit the way mathematics is done in practice. Whilst I'm on board with us not knowing how to formalize mathematical practice, I think it is a solvable problem. Moreover, I am a staunch ~~ultra-finitist~~ believer in reality being computable, which dovetails with a belief in proofs preserving truth. Yet that matters little in the face of such a fantastic exposition on mathematical discovery.^[2] Moreover, it functions as a wonderful example of how to alternate between proving and disproving a conjecture, whilst incorporating the insights we gain into our conjecture.

In fact, it was so stimulating that after reading it I came up with two other proofs on the spot, though the core idea is the same as in Lakatos' proof. After that experience, I feel like showing a reader how a proof is generated is a dang good substitute for interacting with a mathematician in real life. Mathematics is one of the few areas where text and images, if read carefully, can transfer most tacit information. We need more essays like this.^[3]

Now, if only we could get professors to force students to try and prove theorems within the lecture, dialogue with them and transcribe the process. Just think, when the professor comes to the "scrape together lecture material and turn it into a textbook" part of their lifecycle, we'd automatically get beautiful expositions. Please excuse me whilst I go cry in a corner over unattainable dreams.^[4]

^[1] Not a Martian, as he wasn't born in Budapest.
^[2] And how weird things were in the days before Hilbert mastered mathematics and brought rigour to the material world. Listen to this wildly misleading quote: "In the 19th century, geometers, besides finding new proofs of the Euler theorem, were engaged in establishing the exceptions which it suffers under certain conditions." From p. 36, footnote 1.
^[3] Generalized Heat Engine and Lecture 9 of Scott Aaronson's Democritus lectures on QM are two other expositions which are excellent, though not as organic as Proofs and Refutations.
^[4] Funnily enough, Euler's papers contain clear descriptions of how he came to the proofs. At least, those I've read. Which is, like, 1/10,000th of his material by word count. I'm not kidding.^[5]^[6]
^[5] http://archive.boston.com/bostonglobe/ideas/brainiac/2012/11/the_100-year_pu.html
^[6] http://eulerarchive.maa.org/

[-]Algon1y30

When tracking an argument in a comment section, I like to skip to the end to see if either of the arguers winds up agreeing with the other. Which tells you something about how productive the argument is. But when using the "hide names" feature on LW, I can't do that, as there's nothing distinguishing a cluster of comments as all coming from the same author.

I'd like a solution to this problem. One idea that comes to mind is to hash all the usernames in a particular post and a particular session, so you can check if the author is debating someone in the comments without knowing the author's LW username. This is almost as good as full anonymity, as my status measures take a while to develop, and I'll still get the benefits of being able to track how beliefs develop in the comments.
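A minimal sketch of the hashing idea, under my own assumptions: the function names, the `User-` prefix and the 8-character label length are all hypothetical, and nothing here reflects how the LW codebase handles the "hide names" feature. It keys the hash on a per-session secret so labels stay stable within one post and session but can't be matched across sessions.

```python
import hashlib
import hmac
import secrets


def make_session_key() -> bytes:
    """Hypothetical per-session secret; a new key reshuffles every label."""
    return secrets.token_bytes(16)


def pseudonym(username: str, post_id: str, session_key: bytes) -> str:
    """Map a username to a stable label for one post and one session."""
    message = f"{post_id}:{username}".encode("utf-8")
    digest = hmac.new(session_key, message, hashlib.sha256).hexdigest()
    # Eight hex characters keep labels short; collisions are unlikely
    # but not impossible, so this is a sketch rather than a guarantee.
    return "User-" + digest[:8]


if __name__ == "__main__":
    key = make_session_key()
    for name in ["Algon", "habryka", "Algon"]:
        print(pseudonym(name, "shortform-thread", key))
```

Running it prints the same label for both "Algon" lines and a different one for "habryka", which is all the comment-tracking use case needs.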

@habryka

[-]habryka1y40

Yeah, I think the hide author feature should replace everyone with single letters or something, or give you the option to do that. If someone wants to make a PR with that, that would be welcome, we might also get around to it otherwise at some point (but it might take a while)

[-]Algon2y30

I've been thinking about exercises for alignment, and I think going through a list of lethalities and applying them to an alignment proposal would be a good one. Doing the same with Paul's list would be a bonus challenge. If I had some pre-written answer sheet for one proposal, I could try the exercise myself to see how useful it would be. This post, which I haven't read yet, looks like it would serve for the case of RLHF. I'll try it tomorrow and report back here.

[-]Algon7h20

I was bored, so I decided to reconstruct a max flow algorithm since I forgot how Ford-Fulkerson works.

OK, you have a graph w/ positive integers denoting capacity for each edge, and start/end nodes. You want to get a flow going from input->output which maxes out the channel capacity.

How to do this? Well, I vaguely remember something about a blocking flow. Intuition: to figure out how much stuff you can push through a network, look at the bottlenecks.

What's a bottleneck? Well, it's a thingy which everything has to pass through that limits the performance of a system. So we know we have to look for some object that everything has to pass through.

Well, let's think about a river network. The "edges" are rivers, and maybe the places they connect are vertices. The "flow" is just the kg/s passing through each river, which has a capacity before the river overflows, which is bad.

Perhaps a better example is a factory with conveyor belts taking inputs, which workers convert to outputs. The edges are the belts, their capacity the workers' capacity to transform the inputs, the start and end are the start and end of the factory.

If we cut it in half anywhere, we know that to get from one side to another, you'd have to go through a stream/belt that our line just cut. Perhaps this, then, is a bottleneck? Just some big old bunch of edges, and you count the maximum capacity of water along each?

But that's nonsense: Do that just after the start and at the end, and the capacities will surely differ. Likewise, for some kind of graph with lots of capacity at either end, and a tiny single edge you need to pass through in the middle.

Aha! Maybe that's the key? You have to pass through some set of edges? If you cut them out, there's no path from start to finish? Then count their capacity and you're done.

Well, no. Remember, you can just do that for the starting/ending edges and construct a case which gives you different answers. But this feels more promising.

Thinking about the name a bit more, a bottleneck sounds like a gap that limits the size of things you can shove through.

So maybe it's just the set of edges with the minimum capacity, then. Let's call it a "min cut". That sounds a bit more reasonable. But does that tell you the max flow?

Example: Let "-(N)->" denote an edge with capacity N.
IN -(10)-> A -(1)-> B -(5)-> C -(10)-> OUT
    \ -(10)-> X -(5)-> Y -(1)-> Z -(10)-/
In this graph, the max flow is obviously 1+1. And clearly, if you cut the edges with capacity 1, then you can't get from A or X to C or Z. And their sum is 1+1.

You just can't shove more than 2 things down this graph. Seems promising.

OK, how do we prove min-cut = max-flow, though? And then how can we find such a cut?

First observation: the maximum flow can't be more than the capacity of a min-cut as every flow has to pass through there.

Now we just have to prove the max-flow >= the min-cut.

Well, let's go back to our factory example for intuition. If you just start sending some goods through along one path from start to finish, you'd eventually fill up a path. Then you'd start using another path to transform goods, and another, till you can't any more.

Does the order we chose matter to the goods we made? Well, either it does or it doesn't. If it does, there's no max flow, only maximal flows. How silly. But maybe the creators didn't study set-theory in kindergarten? I certainly didn't. And if it doesn't, then there is a unique maximum flow. Still, that doesn't tell us that max-flow >= min-cut.

To get some insight, let's try to apply the idea that we "send goods along paths till we can't any more" to our example graph. We choose some starting edge. Let's go with IN-> X. Can we put 1 unit down this edge? Yes, as the capacity is ten. Then we can only go from X->Y. Can we put 1 unit down this edge? Yes. Likewise for Y->Z.

Now we've used up one unit of capacity for every edge along the path IN->X->Y->Z->OUT. So let's decrement the capacity in our graph.
IN -(10)-> A -(1)-> B -(5)-> C -(10)-> OUT
    \ -(9)-> X -(4)-> Y -(0)-> Z -(9)-/
There's no capacity left along Y->Z though! We've got the first bit of our bottleneck, I think. Let's cut that edge and replace it with a b.n.
IN -(10)-> A -(1)-> B -(5)-> C -(10)-> OUT
    \ -(9)-> X -(4)-> Y b.n. Z -(9)-/
If we put another unit down X, we'll see we can't make any progress from Y onwards. So let's get rid of X -> Y.
IN -(10)-> A -(1)-> B -(5)-> C -(10)-> OUT
    \ -(9)-> X b.n. Y b.n. Z -(9)-/

If we do likewise for A, we find that A->B will be another bottleneck.
IN -(9)-> A b.n. B -(4)-> C -(9)-> OUT
    \ -(9)-> X b.n. Y b.n. Z -(9)-/
Now, there's no way to get from the start to the end. Did this tell us anything?

Surprisingly, yes. We saw our algorithm find a cut with a capacity of 2, which is equal to its max flow. A cut surely has at least as much capacity as a min-cut. So we see that max-flow = min-cut and it doesn't matter where we start.

So we know how to compute max flows now. Yay! But wait. Is this a good algorithm?

No. I leave figuring out why as an exercise to the reader.

Have we actually proved anything? Also no. But I've crawled back up to the point where the theorem feels physically necessary, at which point the proof is rarely an issue. And I think I've got a working algorithm, it's just slow. But I'm not a programmer, so mission accomplished?
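For comparison, here is a sketch of the textbook version. It is not the one-unit-at-a-time procedure described above: it is the Edmonds-Karp variant of Ford-Fulkerson, which finds each path by breadth-first search, pushes the whole path bottleneck at once, and keeps reverse "residual" edges so that a later path can undo an earlier choice. The function name and the edge-list format are my own choices for illustration.

```python
from collections import defaultdict, deque


def max_flow(edges, source, sink):
    """Edmonds-Karp: augment along shortest residual paths until none remain."""
    if source == sink:
        raise ValueError("source and sink must differ")
    residual = defaultdict(lambda: defaultdict(int))
    for u, v, capacity in edges:
        residual[u][v] += capacity
        residual[v][u] += 0  # reverse edge, initially with no spare capacity

    total = 0
    while True:
        # Breadth-first search for a path that still has spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, spare in residual[u].items():
                if spare > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left, so the flow is maximal

        # The path's bottleneck is its smallest spare capacity.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path and update residuals.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


example = [
    ("IN", "A", 10), ("A", "B", 1), ("B", "C", 5), ("C", "OUT", 10),
    ("IN", "X", 10), ("X", "Y", 5), ("Y", "Z", 1), ("Z", "OUT", 10),
]
print(max_flow(example, "IN", "OUT"))  # 2, matching the cut found above
```

On the example graph it saturates A->B and Y->Z, the same two edges the hand simulation flagged as bottlenecks.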

[-]Algon1d20

Why do leaves fall in autumn? The obvious answer is that there's a high upkeep cost to them, and during winter there is little sunlight, so tree leaves should fall down then. That is, they should be deciduous trees. So trees in Montreal or London should be deciduous. Whereas in places like the Mediterranean, tree leaves shouldn't be deciduous, they should be evergreen. And it is so. Evergreen trees dominate in temperate, consistently sunny climates, whereas in places like the UK, only 10% of tree species are evergreen.

But that 10% figure is confusing. Doesn't it imply that evergreen trees are also a successful strategy in places with (relatively) sunless seasons? What advantage do deciduous trees have then? Also, if sunless seasons -> mostly deciduous trees, what's up with the poles? They get way less sunlight.

AFAICT, evergreen trees near the equator are more like deciduous trees in the UK or Japan than they are like evergreen trees near the poles. In some ways, equatorial evergreens are like deciduous trees on steroids.

Deciduous leaves are optimized to photosynthesize as much as possible. Relative to evergreen leaves, they have ~ 3x higher upkeep costs in return for ~ 4x higher energy production per gram per unit time. So while they're active for 1/2 the time, they make more energy, total, than evergreen leaves. But in autumn/winter, they cost too much in upkeep to be worth keeping. So deciduous trees re-absorb chlorophyll (freezing can denature proteins) and other nutrients from the leaves, leading to the leaves browning and falling off.

Numerically, these upkeep costs are 1/12 that of gross energy production for both deciduous and evergreen leaves. After accounting for the 3x increase in activity and 1/2 the year being leaf free compared to evergreens, that means deciduous trees pay 50% more in upkeep without accounting for the costs of re-creating the leaves each season.

In practice, this means deciduous leaves are pretty thin with high surface area, getting lots of sunlight for relatively small amounts of mass. There are relatively few support cells; most of the cells are for photosynthesis, which makes them quite flimsy. They can also collect lots of rainwater or snow, which is a big disadvantage in winter.

Compare this to evergreen leaves in trees originating around the poles. They're much smaller, almost like needles, and are optimized for surviving harsh winters. That means they have smaller surface area/mass ratios for durability, have relatively fewer cells devoted to photosynthesis per gram, have waxy coatings that protect them from the cold but also reduce the sunlight they can absorb, and have anti-freeze inside of them to prevent damage in winter.

The whole "water freezes in winter" bit leads to another important factor in why leaves fall off deciduous trees. When water in the ground freezes, deciduous trees can't absorb any water, but their energetic leaves could keep releasing large amounts of water. This would cause the tree to dry up and die. Better to jettison the leaves than risk that. This is less of a problem for evergreens as their leaves have low surface area/mass ratios, have coatings and close their pores in winter. But it's still an issue.

Two notable things I haven't mentioned yet. One, evergreen and deciduous leaves have the same energy production/construction cost ratios. In a really dumb model of energy production per gram (e), leaf life (T) and construction costs per gram (C) we find that:

$\frac{e_D}{e_E} = \frac{T_E}{T_D} \cdot \frac{C_D}{C_E}$

Evergreen leaves live 6x longer than deciduous leaves on average. We also know $e_D/e_E \approx 3$. So that implies $C_D \approx \tfrac{1}{2} C_E$. Which is somewhat surprising but I guess it makes sense in retrospect. Evergreen leaves are the low cost, low output, steady output, long lifetime counterpart to deciduous leaves.
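Spelling out that step, on the assumption that "the same energy production/construction cost ratio" means lifetime energy per gram divided by construction cost per gram is equal for the two leaf types (D for deciduous, E for evergreen):

```latex
\frac{e_D T_D}{C_D} = \frac{e_E T_E}{C_E}
\quad\Longrightarrow\quad
\frac{C_D}{C_E} = \frac{e_D}{e_E} \cdot \frac{T_D}{T_E}
\approx 3 \cdot \frac{1}{6} = \frac{1}{2}
```

So under this toy model a gram of deciduous leaf costs about half as much to build as a gram of evergreen leaf.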

I don't think I've got an answer for why we see not insignificant numbers of evergreen trees in places like Montreal or London. But I do feel like it makes sense why we don't see deciduous trees near the poles.

Another big complication I left out is the deciduous trees in (non frozen) deserts. What's up with them?

[-]Algon5d20

Some random notes on harvesting energy at stellar scales. Again, from Grand Futures.

Harvesting dark energy:
Tying galaxies together: Anchor big rope to galaxies as they get pulled apart by dark energy. Build up elastic potential energy which can be harvested. Issue: inefficient. You get out < 10^{-39} times energy of rope. Needs negative energy density matter to give better efficiency. Not clear how you anchor rope to energy.
Letting particles expand apart: again, very tiny amounts of energy compared to mass energy of particles. So small it isn't clear if it's even a net return.

Dark matter:
Hawking radiation: takes a long time to see any returns. Very poor efficiency for black holes past 10^11 kg. Past that point, it is just neutrinos which are hard to capture. You can chuck in dark matter which isn't very usable and get interacting stuff back out.

Neutrino capture:
Lots of neutrinos running around, especially if you use Hawking radiation to capture mass energy of black holes. So you might want to make use of them. But neutrinos are very weakly interacting, so you need incredibly dense matter to absorb their energy/convert them to something else. Too dense?

Some methods of extracting energy from ordinary matter using black holes.

Accretion discs: chuck in matter to orbit black hole, get very hot and transition into radiation. At most, 5% efficiency for non-rotating black holes, 43% efficiency for extremal rotating black holes. (With wormholes, you could get 51% efficiency). Very useful for converting matter we can interact with into energy. Not the most efficient but you don't need black holes to have angular momentum to do this, which is perhaps useful.

Penrose process: Extracts energy from angular momentum of black hole, a fair bit of which resides outside the event horizon in the form of frame-dragging spacetime. Have to drop in matter which gains energy, splits into new particles, some of which continue to fall in and others fall out. So not useful for dark matter, which doesn't transition into ordinary matter. Has 20% efficiency at upper limits for the Penrose process, but Penrose-like processes can get you >10x returns on mass-energy. But you need to use up the angular momentum of the black hole, which is boundedly large for a given mass. But you can get up to 50% for extremal charged black holes, and 29% for extremal rotating black holes. So this is good as long as you've got lots of spinning/charged black holes. Big black holes tend to spin reasonably fast, thankfully.

Black Hole Bombs: Another interesting way of extracting energy from black holes is superradiant instabilities, i.e. making the black hole into a bomb. You use light to extract angular momentum from the black hole, kinda like the Penrose process, and get energy out. With a bunch of mirrors, you can keep reflecting the light back in and repeat the process. This can produce huge amounts of energy quickly, on the order of gamma ray bursts for stellar mass black holes. Or if you want it to be quicker, you can get 1% of the black hole's mass-energy out in 13 seconds. How to collect this is unclear.

[-]Algon18d20

Feature incentivizing grabbing attention.

[-]Algon4mo20

WTH, you can tell the LW algorithm to stop showing you certain kinds of posts by hovering over the title on the front page? How does it work, @habryka? Is it like karma? Something else?

[-]habryka4mo40

We feed it back into the Recombee magical RL/ML algorithm which we use to generate those recommendations. I don't really know what they do with it.

[-]Algon4mo20

Fair enough. If you ever get round to figuring out how this all works, it would be nice to know.

[-]habryka4mo60

I mean, I think it's probably some kind of standard transformer architecture behind the scenes that predicts user behavior, and giving it negative feedback is equivalent to a backpropagation step, or maybe some kind of RL step. I don't have that much deep uncertainty about how it works, besides of course that we have no idea what's going on inside of deep neural nets.

[-]kave4mo42

They tout their transformer ("beeformer") in marketing copy, but I expect mostly it's driven by collaborative filtering, like most recommendation engines

[-]habryka4mo42

My guess is most recommendation engines in use these days are ML/DL based. At least I can't think of any major platform that hasn't yet switched over, based on what I read.

[-]kave4mo*20

I would definitely consider collaborative filtering ML, though I don't think people normally make deep models for it. You can see on Recombee's website that they use collaborative filtering, and use a bunch of weasel language that makes it unclear if they actually use anything else much at all

[-]Algon1y20

1) DATA I was thinking about whether all metrizable spaces are "paracompact", and tried to come up with a definition for paracompact which fit my memories and the claim. I stumbled on the right concept and dismissed it out of hand as being too weak a notion of refinement, based off an analogy to coarse/refined topologies. That was a mistake.
    1a) Question How could I have fixed this?
        1a1) Note down concepts you come up with and backtrack when you need to.
            1a1a) Hypothesis: Perhaps this is why you're more productive when you're writing down everything you think. It lets your thoughts catch fire from each other and ignite.
            1a1b) Experiment: That suggests a giant old list of notes would be fine. Especially a list of ideas/insights rather than a full thought dump.

[-]Algon1y20

Rough thoughts on how to derive a neural scaling law. I haven't looked at any papers on this in years and only have vague memories of "data manifold dimension" playing an important role in the derivation Kaplan told me about in a talk.

How do you predict neural scaling laws? Maybe assume that reality is such that it outputs distributions which are intricately detailed and reward ever more sophisticated models.

Perhaps an example of such a distribution would be a good idea? Like, maybe some chaotic systems are like this.

Then you say that you know this stuff about the data manifold, then try and prove similar theorems about the kinds of models that describe the manifold. You could have some really artificial assumption which just says that models of manifolds follow some scaling law or whatever. But perhaps you can relax things a bit and make some assumptions about how NNs work, e.g. they're "just interpolating" and see how that affects things? Perhaps that would get you a scaling law related to the dimensionality of the manifold. E.g. for a d dimensional manifold, C times more compute leads to a $C^{1/d}$ increase in precision??? Then somehow relate that to e.g. next token prediction or something.

You need to give more info on the metric of the models, and details on what the model is doing, in order to turn this $C^{1/d}$ estimate into something that looks like a standard scaling law.
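A toy check of the $1/d$ exponent, under assumptions that are mine rather than anything from the talk: the "data manifold" is just the unit cube $[0,1]^d$ with uniform samples, the number of samples $N$ stands in for compute, and "precision" is how close the nearest stored sample is to a fresh query (a crude proxy for an interpolating model's error). The sketch estimates the exponent from two sample sizes, expecting the error to shrink roughly like $N^{-1/d}$.

```python
import math
import random


def mean_nearest_distance(n_points, dim, n_queries=200, seed=0):
    """Average distance from a random query to the nearest of n_points samples."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        query = [rng.random() for _ in range(dim)]
        total += min(math.dist(query, p) for p in points)
    return total / n_queries


def fitted_exponent(dim, n_small=100, n_large=1600):
    """Slope of log(error) against log(n_points) between two sample sizes."""
    small = mean_nearest_distance(n_small, dim)
    large = mean_nearest_distance(n_large, dim)
    return math.log(large / small) / math.log(n_large / n_small)


if __name__ == "__main__":
    for dim in (1, 2, 3):
        print(dim, round(fitted_exponent(dim), 2), "vs", round(-1 / dim, 2))
```

The fitted slopes only roughly track $-1/d$ because of sampling noise and edge effects near the cube's boundary, and none of this says how to get from nearest-sample distance to a loss on next token prediction.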

[-]Algon1y20

Hypothesis: You can only optimize as many bits as you observe + your own complexity. Otherwise, the world winds up in a highly unlikely state out of ~ nowhere. This should be very surprising to you.

[-]Algon1y20

You, yes you, could've discovered the importance of topological mixing for chaos by looking at the evolution of squash in water. By watching the mixture happening in front of your eyes before the max entropy state of juice is reached. Oh, perhaps you'd have to think of the relationship between chaos and entropy first. Which is not, in fact, trivial. But still. You could've done it.

[-]Algon1y20

Question: We can talk of translational friction, transactional friction etc. What other kinds of major friction are there?
Answers:

a) UI friction?
b) The o.g. friction due to motion.
c) The friction of translating your intuitions into precise, formal statements.

Ideas for names for c: Implantation friction? Abstract->Concrete friction? Focusing friction! That's perhaps the best name for this.
On second thought, perhaps that's an overloaded term. So maybe Gendlin's friction?

d) Focusing friction: the friction you experience when focusing.

[-]Algon2y20

I am very glad the Lightcone team made the discussions feature. Comment threads on LW are about as valuable as the posts themselves, and the discussions feature just puts comment threads on equal footing with posts. Obvious in retrospect. Why wasn't it done earlier though?

[-]Algon2y20

Hypothesis: agency violating phenomena should be thought of as edge-cases which show that our abstractions of ourselves as agents are leaky.

For instance, look at addictive substances like heroin. These substances break down our Cartesian boundary (our intuitive separation of the world into ourselves and the environment with a boundary) by chemically assaulting the reward mechanisms in our brain.

However, video games or ads don't obviously violate our Cartesian boundary, which may be one of many boundaries we assume exist. Which, if my hypothesis is true, suggests that you could try to find other boundaries/abstractions violated by those phenomena. Other things which "hack" humans, like politics or psyops, would violate boundaries as well.

Finding the relevant abstractions and seeing how they break would increase our understanding of ourselves as agents. This could help triangulate a more general definition of agency for which these other boundaries are special cases or approximations.

This seems like a hard problem. Just building a taxonomy of our known abstractions for agency would be less useful, but much more feasible for a few months' work. Sounds like a good research project.

[-]Algon2y20

My mind keeps returning to exercises which could clarify parts of alignment, both for me and others. Some of them are obvious: think about what kind of proof you'd need to solve alignment, what type of objects it would have to be talking about etc. and see whether that implies having a maths oracle would make the problem easier. Or try and come up with a list of human values to make a utility function and see how it breaks down under greater optimization pressure.

But what about new exercises? For skillsets I've never learnt? Well, there's the security mindset, which I don't have. I think it is about "trying to break everything you see", so presumably I should just spend a bunch of time breaking things or reading the thoughts of people who deeply inhabit this perspective for more tacit knowledge. For the former, I could do something like exercises for computer security: https://hex-rays.com/ida-free/ For the latter, I've heard "Silence on the Wire" is good: the author is supposedly a hacker's hacker, and writes about solutions to security challenges which defy classification. Seeing solutions to complex, real world problems is very important to developing expertise.

But I just had a better thought: wouldn't watching someone hacking something be a better example of the security mindset? See the problem they're tackling and guess where the flaws will be. That's the way to really acquire tacit knowledge. In fact, looking at the LockPickingLawyer's channel is what kicked off this post. There, you can see every lock under the sun picked apart in minutes. Clearly, the expertise is believable. So maybe a good exercise for showing people that security mindset exists, and perhaps to develop it, would be getting a bunch of these locks and their design specs, giving people some tools, and asking them to break them. Then show them how the LockPickingLawyer does it. Again, and again and again.

[-]Algon2y20

One thing I'm confused about re: human brain efficiency is, if our brain's advantage over apes is just scaling and some minor software tweaks to support culture, what's that imply for corvid brains? If you scaled corvid brains up by the human-cortical-neuron-count/chimp-cortical-neuron-count, and gave them a couple of software tweaks, wouldn't you get a biological Pareto improvement over human brains?

[-]Algon3y20

Obvious thing I never thought of before:

Linear optimization where your model is of the form $y = W_n W_{n-1} \cdots W_1 x$, the $W_{i}$ being matrices, will likely result in an effective model of low rank if you randomize the weights. Compared to just a single matrix -- to which the problem is naively mathematically identical, but not computationally -- this model won't be able to learn the identity function, or rotations or so on when n is large.

Note: Someone else said this on a gathertown meetup. The context was, that it is a bad idea to think about some ideal way of solving a problem, and then assume a neural net (or indeed any learning algorithm) would learn it. Instead, focus on the concrete details of the model you're training.

[-]the gears to ascension3y10

wow I'm not convinced that won't work. the only thing initializing with random weights should do is add a little noise. the naive mathematical interpretation should be the only possible interpretation up to your float errors, which, to be clear, will be real and cause the model to be invisibly slightly nonlinear. but as long as you're using float32, you shouldn't even notice.

[trying it, eta 120 min to be happy with results... oops I'm distractable...]

[-]Algon3y10

EDIT: Sorry, I tried something different. I fed in dense layers, followed by batchnorm, then ReLU. I ended it with a sigmoid, because I guess I just wanted to constrain things to the unit interval. I tried up to six layers. The difference in loss was not that large, but it was there. Also, hidden layers were 30 dim.

I tried this, and the results were a monotonic decrease in performance after a single hidden layer. My dataset was 100k samples of a 20 dim tensor sampled randomly from [0,1] for X, with Y being a copy of X. Loss was MSE, optimizer was adam with weight decay 0.05, lr~0.001 , minibatch size was 32, trained for 100,000 steps.

Also, I am doubtful of the mechanism being a big thing (rank loss) for such small depths. But, I do think there's something to the idea. If you multiply a long sequence of matrices, then I expect them to get extremely large, extremely small, or tend towards some kind of equilibrium. And then you have numerical stability issues and so on, which I think will ultimately make your big old matrix just sort of worthless.
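A toy way to probe the "long sequence of matrices" intuition, and not a reproduction of the experiment above (which used dense layers with batchnorm and ReLU): multiply random 2x2 matrices with standard Gaussian entries and track the log of the ratio between the largest and smallest singular value of the product. The 2x2 size, the Gaussian entries and the function names are all my assumptions. A growing ratio means the product is getting closer to rank one.

```python
import math
import random


def log_condition_number(depth, rng):
    """log(s_max / s_min) for a product of `depth` random 2x2 Gaussian matrices."""
    product = [[1.0, 0.0], [0.0, 1.0]]
    log_scale = 0.0    # log of the norm factored out to avoid overflow
    log_abs_det = 0.0  # log|det| of the true product, since det(AB) = det(A)det(B)
    for _ in range(depth):
        w = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2)]
        log_abs_det += math.log(abs(w[0][0] * w[1][1] - w[0][1] * w[1][0]))
        product = [
            [sum(w[i][k] * product[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)
        ]
        norm = math.sqrt(sum(x * x for row in product for x in row))
        product = [[x / norm for x in row] for row in product]
        log_scale += math.log(norm)
    # For the rescaled product: s1^2 + s2^2 = 1 and s1 * s2 = p.
    log_p = log_abs_det - 2.0 * log_scale
    p = math.exp(log_p)
    s1_squared = 0.5 * (1.0 + math.sqrt(max(0.0, 1.0 - 4.0 * p * p)))
    return math.log(s1_squared) - log_p  # equals log(s1) - log(s2)


def mean_log_condition(depth, trials=200, seed=0):
    """Average the log condition number over independent random products."""
    rng = random.Random(seed)
    return sum(log_condition_number(depth, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for depth in (1, 2, 6, 50):
        print(depth, round(mean_log_condition(depth), 2))
```

In this toy setting the average log ratio keeps climbing with depth, which fits the expectation that long products become numerically close to low rank, while saying little about depths as small as six or about what a trained network does.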

[-]the gears to ascension3y20

oh by linear layer you meant nonlinear layer, oh my god, I hate terminology. I thought you literally just meant matrix multiplies

[-]Algon3y10

My body is failing me. I have been getting colds nearly weekly for a year and a half, after a particularly wretched cold. My soul is failing me. I have been worn down by a stressful environment, living with an increasingly deranged loved one. By my crippled mind's inability to meet the challenge. Which causes my body to fail further. Today, I grokked that I am in a doom spiral, headed down the same path as my kin's. I don't wish for so wretched an end, for an end it shall be.

But why my failing soul? Why does the algorithm which calls itself Algon fail when challenged so? Because the piece which calls itself Algon is blind to what the rest of his soul says, and so it takes action. He reshapes himself to be a character which will bring things to a head, as he knew it would eventually come to. Burst out in anger, and maybe the collapse won't break all my kin.

What shall I do now? The goal is restoring my deranged kin to sanity. The path must involve medication of a sort, and more care than I am currently shaped to give. The obstacles are wealth, and a kin-folk's fear of medication. With that one, wrath has a poor tool compared to raw truth. And perhaps, with wealth, they may be able to give the care needed to our deranged kin.

Courage is needed, or the removal of fear. And I shall do so the only way I know how: by holding it with me, looking ever closer, till it has no power over me.

[-]Algon3y*10

After thinking about how to learn to notice the feel of improving in pursuit of the meta, I settled on trying to reach the meta-game in a few video games.

After looking at some potential games to try, I didn't follow up on them and kept playing Die in the Dungeon. Nothing much happened, until my improvement decelerated. For whatever reason, I chose to look in the comments section of the game for advice. I rapidly found someone claiming they beat the game by removing every die except 2 attack, 1 defence, 1 heal and 1 boost die and upgrading them to the max. Supposedly, predictability was the main benefit, as you draw five random dice from your deck each turn.^[1]
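To see why the trimmed deck is so predictable, here is a small sketch that counts the distinct five-dice hands a deck can produce. It assumes dice are drawn without replacement, and the larger deck is a made-up composition for comparison, not the game's actual starting deck.

```python
from itertools import combinations


def distinct_hands(deck, hand_size=5):
    """Set of distinct hands (as sorted tuples) drawable without replacement."""
    return {tuple(sorted(hand)) for hand in combinations(deck, hand_size)}


# The deck from the advice: 2 attack, 1 defence, 1 heal, 1 boost.
trimmed_deck = ["attack", "attack", "defence", "heal", "boost"]

# Hypothetical untrimmed deck, invented purely for comparison.
larger_deck = (
    ["attack"] * 4 + ["defence"] * 3 + ["heal"] * 2
    + ["boost"] * 2 + ["mirror"] * 2
)

if __name__ == "__main__":
    print(len(distinct_hands(trimmed_deck)))
    print(len(distinct_hands(larger_deck)))
```

With five dice in the deck and five drawn, every turn deals the same hand, so under this assumption the variance drops to zero.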

Fighting against my instincts, I followed the advice. And won. Just, completely destroying every boss in my way. Now, maybe this is what the feel of improving in pursuit of the meta looks like. "Search for advice that seems credible but feels counter-intuitive^[2], try it and see it makes sense, improve and repeat"?

EDIT: Feeling lacking cause I didn't try to immediately break this hypothesis. First, isn't this just "listen to good advice?" If so, I do sometimes feel like I am ignoring advice from credible people. But the mental examples I'm thinking of right now, like beating Angband, don't have much to do with meta-games. Should I be looking at the pre-requisites for meta-game skills and just aiming for those? But aren't many of them too hard to try out and make sense of without building up other skills first? In which case, perhaps the core feeling is more like finally understanding inscrutable advice. In which case, I guess I need to look for some game where the advice doesn't seem effective when I try it out?

Yet again, that isn't enough. Many skills make you worse when you first try them out, as you need to figure out how to apply them at all. Give a martial artist a sword for the first time and they'll lose faster. And many people hear advice from experts and think they understand it without really getting it. So advice for people who are just below experts doesn't have to appear inscrutable, though it may well be inscrutable. Am confused about what to do now.

^[1] Yes, I should have tried reducing variance earlier. I am a dum-dum.
^[2] Healing/defence dice seemed more valuable to me, alongside a couple of mirror dice.

[-]Algon3y10

Sci-Hub has a Telegram bot which you can use with a desktop application. It is fast and, more importantly, reliable. No more scrounging through proxy lists to find a link that works. Unfortunately, you need to install Telegram on a phone first. Still, it has saved me some time and is probably necessary for an independent researcher.

[-]Algon3y-20

I'm applying to the job in this tweet by Nat Friedman, and I think writing this shortform is evidence that I am the kind of person who does a) and understands b).

