Random things I learnt about ASML after wondering how critical they were to GPU progress.
ASML makes specialized photolithography machines. They're about a decade ahead of competitors i.e. without ASML machines, you'd be stuck making 10nm chips.
They use 13.5nm "Extreme UV" light to make features at the "3nm" node by using reflective optics to make interference patterns and fringes. Using low-res light to make higher-res features has been going on since photolithography tech stalled at 28nm for a while. I am convinced this is wizardry.
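To put rough numbers on the wizardry (my own back-of-envelope, not from the conversation): the resolution limit is set by the Rayleigh criterion,

$$\text{half-pitch} \approx k_1 \frac{\lambda}{\mathrm{NA}}.$$

With EUV's $\lambda = 13.5$nm, today's NA of roughly 0.33 and an aggressive $k_1 \approx 0.3$, that gives ~12nm half-pitch, which is about the tightest metal pitch actually used at the "3nm" node; node names stopped being literal feature sizes a long time ago. The interference/fringe tricks are largely about pushing $k_1$ down.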
RE specialization: early photolithography community used to have co-development between companies, technical papers sharing tonnes of details, and little specialization in companies. Person I talked to says they don't know if this has stopped, but it feels like it has.
In hindsight, no-one in the optics lab at my uni talked about chip manufacturing: it was all quantum dots and lasers. So maybe the field really has closed up.
It's unclear how you can decrease wavelength and still use existing technology. Perhaps we've got 5 generations left.
We might have to change to something beyond EUV then.
Even when we reach the limits of shrinking, there will be years of process improvements left to eke out.
ASML makes machines for photolithography, somehow using light with λ > chip feature size. If ASML went out of business, we wouldn't all be doomed. Existing machines are made for particular gens, but can be used for "half-steps", like from 5nm to 4nm. Everyone is building new fabs, and ASML is building new machines as fast as they can.
Would prob trigger world recession if they stopped producing new things.
Very common in tech for monopoly suppliers to let customers get access to their tech if they go out of business.
TSMC and Intel buy from ASML.
Don't seem to be trying to screw people over.
If they tried, then someone else would come in. Apple might be able to in like 10 or even twenty years.
China has tried hard to do this.
ASML has an edge in some parts of the fab; other companies have edges in different parts of the fab.
Some companies just started specializing more in the sorts of machines they had in the fabs.
Canon and Nikon make other photolithography machines used in fabs, but specialize in different sorts for different purposes.
ASML's machines are used in the bottom-most layers, for the transistors. Other companies focus on higher layers, with "registration requirements being less strict".
Might still be in the decade range.
If you didn't have ASML tech, you'd need to fall back to 10nm tech.
Just TSMC at 3nm in production.
Everyone behind them, Intel and Samsung, are also ASML customers.
Friend's company has its chips made by TSMC. They hand over masks, and get chips back.
Do you just naturally get monopolies in this industry?
Used to have tonnes of info sharing. Technical papers were shared tonnes.
Making things got harder, and people said it was too important not to share.
Worried about China using these things, for kind of spurious reasons (they can already make ICBMs to ruin everyone's day).
Used to be co-developing between companies.
Don't know why that stopped. Or even if it really has, it just feels like it has stopped.
Very little discussion of chip manufacturing in hindsight.
Extreme UV is 13.5nm light (much shorter than the prior ~193nm deep UV), and won't go through glass lenses. Try to use reflective optics as much as possible.
At the Microprocessor Report, Intel was saying they'd make their own machines to do this and would show others how to.
They would do this to show they'd maintain their technical edge.
They said they'd get it done by 2010, and they were saying this in like 1995.
Ended up taking twice as long. EUV only started getting used at 7nm.
Don't know how much we're relying on ASML vs Intel tech.
Hoping to get EUV working, but it took longer, and it was hard to keep shrinking without EUV. Intel said it would be ready at 28nm, and it wasn't, so they had to use lower-resolution light to somehow pull it off.
Somehow using diffraction fringes to get higher res.
What are upcoming technologies in the photolithography stuff?
Not sure how much more you can decrease wavelength and still use existing technology.
Maybe 5 generations past where we are without changing anything.
And then we might have to change to something beyond EUV.
They're using 13.5nm light.
Process tech can improve in different ways.
1nm, when introduced, will have low yield. After 10 years, essentially all chips will be made correctly.
Standard experience curve stuff applies.
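For concreteness, the standard experience-curve form (illustrative numbers, not from the conversation): each doubling of cumulative output multiplies unit cost (or defect rate) by a fixed progress ratio $r$,

$$c(x) = c_1 \, x^{\log_2 r}, \qquad \text{e.g. } r = 0.8 \;\Rightarrow\; c(x) \propto x^{-0.32}.$$

So a new node starts expensive and low-yield, and gets cheap and reliable as cumulative volume piles up.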
Eking out all the economic performance of chip-making techniques will take like 20 years after you get to the limits of shrinking dies.
This would translate directly into continuous improvements in PCs, AIs and that sort of thing.
Lots of hardware optimization has happened, and this is partly a software thing, i.e. you make hardware more optimized for some software, and improve the software running on the chips. Which muddies the algorithms-vs-hardware split.
Kleinian view of geometry
In the 1870s, Klein discovered an interesting perspective on geometry. He found that you could view a Euclidean space as a real space plus a group of transformations that leave figures congruent to one another. This group is formed by combining rotations, translations and reflections. We call this the Euclidean group. A natural question to ask was whether other spaces could be characterized in terms of symmetry groups. Klein showed the answer is yes.
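In symbols (my summary, not in the original notes): the Euclidean group consists of the maps

$$x \mapsto Rx + t, \qquad R \in O(n), \; t \in \mathbb{R}^n,$$

where the orthogonal matrices $R$ cover rotations and reflections, and $t$ covers translations. These are exactly the maps that preserve the Euclidean distance $\lVert x - y \rVert$.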
Affine geometry is characterized by symmetry under the action of transformations of the form $x \mapsto Ax + b$, where $A$ is any invertible matrix (an element of the general linear group) and $b$ is any translation vector.
The general linear group is bigger than the orthogonal group, so correspondingly fewer structures can be preserved under it. And indeed, in affine geometry we don't preserve lengths, but rather: parallel lines, ratios of lengths along a line, and straight lines themselves.
For hyperbolic geometries, the group structure is a bit harder to define in terms of familiar functions. We tend to call it $SO(n,m)$, to denote the fact that we've got a pseudo-metric with $n$ minus signs and $m$ plus signs. In the simplest case, this would mean "distances" are like

$$s^2 = -t^2 + x^2,$$

and it is these distances which are preserved. In this simple case, we find that the group can be represented by transformations like

$$\begin{pmatrix} t \\ x \end{pmatrix} \mapsto \left(\cosh\phi \, I + \sinh\phi \, \sigma_1\right) \begin{pmatrix} t \\ x \end{pmatrix},$$

where $\sigma_1$ is the first Pauli matrix. The term in brackets acts like the hyperbolic equivalent of a rotation matrix, and $\phi$ is a hyperbolic angle. And in fact, if we replace $\phi$ with $i\phi$, then cosh and sinh turn into cos and sin. So we can get back the Euclidean group! (I don't, actually, understand why this isn't a group isomorphism. So something must be going wonky here.)
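To spell out the "preserved distances" claim in this simplest $SO(1,1)$ case (a quick check I added, not in the original notes): writing the transformation as a matrix,

$$\begin{pmatrix} t' \\ x' \end{pmatrix} = \begin{pmatrix} \cosh\phi & \sinh\phi \\ \sinh\phi & \cosh\phi \end{pmatrix} \begin{pmatrix} t \\ x \end{pmatrix} \;\Rightarrow\; x'^2 - t'^2 = (\cosh^2\phi - \sinh^2\phi)\,(x^2 - t^2) = x^2 - t^2,$$

so the pseudo-metric is preserved, just as ordinary rotations preserve $x^2 + y^2$.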
Projective geometry is a bit trickier, as to frame it in terms of familiar Euclidean spaces you've got to deal with equivalence classes of lines through an origin. I don't have time to explain how the Kleinian perspective applies here, but it does. Also, projective transforms don't preserve angles or lengths. They do, however, preserve co-linearity.
OK, but how exactly does the Euclidean group characterize Euclidean space? We can understand that by looking at the 2d case. From there, it's easy to generalize. Recall that we want to show the Euclidean group contains any transformation that leaves figures, or shapes, in our 2d Euclidean plane unchanged.
First, I want to point out that this transformation must be an isometry. The Euclidean metric defines Euclidean geometry, after all.
OK, now let's see how this transformation must affect shapes. Let's start with the simplest non-trivial shape, the triangle. We're going to show that for any isometry, and any triangle, there is some element of the Euclidean group that replicates how the isometry transforms the triangle. If you consider an isometry mapping a triangle to another, they must be congruent. We can construct an element which respects this congruence by rotating till one of the corners is in the same orientation as its image, translating the triangle so that the corresponding vertex overlays its image, and then doing a mirror flip if the orientations of the two triangles don't match.
Once we can get group elements mimicking the action of our isometry on triangles, it is a simple matter to get rectangles, as they're made of triangles. And then rectangular grids of lines as they're formed of rectangles. And then rectangular lattices of points, as they're parts of the grids. Then dense collections of lattices. But an isometry and an element of the Euclidean group are continuous. Continuous maps are defined uniquely by their action on a dense subset of their domain. So they must be the same maps. Done.
OK, so that's how we show the Euclidean group is the group of symmetries for Euclidean figures. What about other geometries? We can do it in basically the same way. Define a non-trivial simple figure and use that to pin down the action of a symmetry transformation in terms of simpler components. E.g. for the affine group, it is invertible linear maps and translations. Then show the actions of any transformation is uniquely determined by its actions on some simple figure. For the affine group, this is again a triangle.
I was bored, so I decided to re-construct a max flow algorithm since I forgot how Ford Fulkerson works.
OK, you have a graph w/ positive integers denoting capacity for each edge, and start/end nodes. You want to get a flow going from input->output which maxes out the channel capacity.
How to do this? Well, I vaguely remember something about a blocking flow. Intuition: to figure out how much stuff you can push through a network, look at the bottlenecks.
What's a bottleneck? Well, it's a thingy which everything has to pass through that limits the performance of a system. So we know we have to look for some object that everything has to pass through.
Well, let's think about a river network. The "edges" are rivers, and maybe the places they connect are vertices. The "flow" is just the kg/s passing through each river, which has a capacity before the river overflows, which is bad.
Perhaps a better example is a factory with conveyor belts taking inputs, which workers convert to outputs. The edges are the belts, their capacity the workers' capacity to transform the inputs, the start and end are the start and end of the factory.
If we cut it in half anywhere, we know that to get from one side to another, you'd have to go through a stream/belt that our line just cut. Perhaps this, then, is a bottleneck? Just some big old bunch of edges, and you count the maximum capacity of water along each?
But that's nonsense: Do that just after the start and at the end, and the capacities will surely differ. Likewise, for some kind of graph with lots of capacity at either end, and a tiny single edge you need to pass through in the middle.
Aha! Maybe that's the key? You have to pass through some set of edges? If you cut them out, there's no path from start to finish? Then count their capacity and you're done.
Well, no. Remember, you can just do that for the starting/ending edges and construct a case which gives you different answers. But this feels more promising.
Thinking about the name a bit more, a bottleneck sounds like a gap that limits the size of things you can shove through.
So maybe it's just the cut with the minimum total capacity, then. Let's call it a "min cut". That sounds a bit more reasonable. But does that tell you the max flow?
Example: Let "-(N)->" denote an edge with capacity N.
IN -(10)-> A -(1)-> B -(5)->C -(10)-> OUT
\ -(10)-> X -(5) -> Y -(1)->Z -(10)-/
In this graph, the max flow is obviously 1+1. And clearly, if you cut the edges with capacity 1, then you can't get from IN to OUT. And their sum is 1+1.
You just can't shove more than 2 things down this graph. Seems promising.
OK, how do we prove min-cut = max-flow, though? And then how can we find such a cut?
First observation: the maximum flow can't be more than the capacity of a min-cut as every flow has to pass through there.
Now we just have to prove the max-flow >= the min-cut.
Well, let's go back to our factory example for intuition. If you just start sending some goods through along one path from start to finish, you'd eventually fill up a path. Then you'd start using another path to transform goods, and another, till you can't any more.
Does the order we chose matter to the goods we made? Well, either it does or it doesn't. If it does, there's no max flow, only maximal flows. How silly. But maybe the creators didn't study set-theory in kindergarten? I certainly didn't. And if it doesn't, then there is a unique maximum flow. Still, that doesn't tell us that max-flow >= min-cut.
To get some insight, let's try to apply the idea that we "send goods along paths till we can't any more" to our example graph. We choose some starting edge. Let's go with IN-> X. Can we put 1 unit down this edge? Yes, as the capacity is ten. Then we can only go from X->Y. Can we put 1 unit down this edge? Yes. Likewise for Y->Z.
Now we've used up one unit of capacity for every edge along the path X->Y->Z. So let's decrement the capacity in our graph.
IN -(10)-> A -(1)-> B -(5)->C -(10)-> OUT
\ - (9) -> X -(4)-> Y -(0)->Z - (9) -/
There's no capacity left along Y->Z though! We've got the first bit of our bottleneck, I think. Let's cut that edge and replace it with a "b.n." (bottleneck) marker.
IN -(10)-> A -(1)-> B -(5)->C -(10)-> OUT
\ - (9) -> X -(4)-> Y b.n Z - (9) -/
If we put another unit down X, we'll see we can't make any progress from Y onwards. So let's get rid of X -> Y.
IN -(10)-> A -(1)-> B -(5)->C -(10)-> OUT
\ - (9) -> X b.n Y b.n. Z - (9) -/
If we do likewise for A, we find that A->B will be another bottleneck.
IN -(9)-> A b.n. B -(4)->C -(9)-> OUT
\ -(9)-> X b.n Y b.n. Z -(9)-/
Now, there's no way to get from the start to the end. Did this tell us anything?
Surprisingly, yes. Our algorithm found a flow of value 2 and, at the same time, a cut with a capacity of 2. Any flow is at most the max flow, the max flow is at most the min-cut, and the min-cut is at most this cut. So 2 ≤ max-flow ≤ min-cut ≤ 2, everything is equal, max-flow = min-cut, and it doesn't matter where we start.
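Stating what the walkthrough is gesturing at (the standard max-flow min-cut theorem, not something proved above): for any feasible flow $f$ and any cut $(S,T)$ separating the start from the end,

$$|f| \;\le\; c(S,T), \qquad \text{and} \qquad \max_f |f| \;=\; \min_{(S,T)} c(S,T).$$

Exhibiting one flow and one cut with the same value, like the 2 above, certifies that both are optimal.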
So we know how to compute max flows now. Yay! But wait. Is this a good algorithm?
No. I leave figuring out why as an exercise to the reader.
Have we actually proved anything? Also no. But I've crawled back up to the point where the theorem feels physically necessary, at which point, the proof is rarely an issue. And I think I've got a working algorithm, it's just slow. But I'm not a programmer, so mission accomplished?
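For reference, a minimal sketch of the standard algorithm (Ford-Fulkerson with BFS augmenting paths, i.e. Edmonds-Karp), rather than the greedy walkthrough above. The key thing my reconstruction skipped is the residual back-edge, which lets a bad early path choice be undone:

```python
from collections import defaultdict, deque

def max_flow(capacity, source, sink):
    # capacity: dict of dicts, capacity[u][v] = capacity of edge u->v
    residual = defaultdict(lambda: defaultdict(int))
    for u in capacity:
        for v, c in capacity[u].items():
            residual[u][v] += c
    flow = 0
    while True:
        # BFS for a path with spare residual capacity
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left; flow equals the min-cut capacity
        # find the bottleneck along the path, then push that much flow
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck  # back-edge: later paths can reroute this flow
        flow += bottleneck

# The example graph from above: max flow should be 2.
graph = {
    "IN": {"A": 10, "X": 10},
    "A": {"B": 1}, "B": {"C": 5}, "C": {"OUT": 10},
    "X": {"Y": 5}, "Y": {"Z": 1}, "Z": {"OUT": 10},
}
print(max_flow(graph, "IN", "OUT"))  # -> 2
```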
Why do leaves fall in autumn? The obvious answer is that there's a high upkeep cost to them, and during winter there is little sunlight, so leaves should fall then. That is, the trees should be deciduous. So trees in Montreal or London should be deciduous, whereas in places like the Mediterranean, trees shouldn't be deciduous, they should be evergreen. And it is so. Evergreen trees dominate in temperate, consistently sunny climates, whereas in places like the UK, only 10% of tree species are evergreen.
But that 10% figure is confusing. Doesn't it imply that evergreen trees are also a successful strategy in places with (relatively) sunless seasons? What advantage do deciduous trees have then? Also, if sunless seasons -> mostly deciduous trees, what's up with the poles? They get way less sunlight.
AFAICT, evergreen trees near the equator are more like deciduous trees in the UK or Japan than they are like evergreen trees near the poles. In some ways, equatorial evergreens are like deciduous trees on steroids.
Deciduous leaves are optimized to photosynthesize as much as possible. Relative to evergreen leaves, they have ~ 3x higher upkeep costs in return for ~ 4x higher energy production per gram per unit time. So while they're active for 1/2 the time, they make more energy, total, than evergreen leaves. But in autumn/winter, they cost too much in upkeep to be worth keeping. So deciduous trees re-absorb chlorophyll (freezing can denature proteins) and other nutrients from the leaves, leading to the leaves browning and falling off.
Numerically, these upkeep costs are 1/12 that of gross energy production for both deciduous and evergreen leaves. After accounting for the 3x increase in activity and 1/2 the year being leaf free compared to evergreens, that means deciduous trees pay 50% more in upkeep without accounting for the costs of re-creating the leaves each season.
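Spelling out that arithmetic with the note's own numbers (per gram of leaf, per year):

$$\frac{\text{upkeep}_D}{\text{upkeep}_E} = \frac{\tfrac{1}{12}\cdot 3e \cdot \tfrac{1}{2}\,\text{yr}}{\tfrac{1}{12}\cdot e \cdot 1\,\text{yr}} = 1.5,$$

i.e. the 50% figure: triple the burn rate, but only for half the year.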
In practice, this means deciduous leaves are pretty thin with high surface area, getting lots of sunlight for relatively small amounts of mass. There are relatively few support cells; most of the cells are for photosynthesis. Which makes them quite flimsy. They can collect lots of rainwater or snow, which is a big disadvantage in winter.
Compare this to evergreen leaves on trees originating around the poles. They're much smaller, almost like needles, and optimized for surviving the harsh winters. That means they have smaller surface-area/mass ratios for durability, relatively fewer cells devoted to photosynthesis per gram, waxy coatings which protect them from the cold but also reduce the sunlight they can absorb, and anti-freeze inside them to prevent damage in winter.
The whole "water freezes in winter" bit leads to another important factor in why leaves fall off deciduous trees. When water in the ground freezes, deciduous trees can't absorb any water, but their energetic leaves could keep releasing large amounts of water. This would cause the tree to dry up and die. Better to jettison the leaves than risk that. This is less of a problem for evergreens as their leaves have low surface area/mass ratios, have coatings and close their pores in winter. But it's still an issue.
Two notable things I haven't mentioned yet. One, evergreen and deciduous leaves have the same ratio of lifetime energy production to construction cost. In a really dumb model with energy production per gram per unit time (e), leaf lifetime (T) and construction cost per gram (C), that assumption gives:
$$\frac{e_D}{e_E} = \frac{T_E}{T_D}\cdot\frac{C_D}{C_E}$$
Evergreen leaves live 6x longer than deciduous leaves on average. We also know e_D/e_E ~ 3. So that implies C_D ~ 1/2 C_E. Which is somewhat surprising, but I guess it makes sense in retrospect. Evergreen leaves are the low-cost, low-but-steady-output, long-lifetime counterpart to deciduous leaves.
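Deriving the formula from the equal-ratio assumption and plugging in the numbers:

$$\frac{e_D T_D}{C_D} = \frac{e_E T_E}{C_E} \;\Rightarrow\; \frac{e_D}{e_E} = \frac{T_E}{T_D}\cdot\frac{C_D}{C_E}, \qquad 3 \approx 6 \cdot \frac{C_D}{C_E} \;\Rightarrow\; \frac{C_D}{C_E} \approx \frac{1}{2}.$$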
I don't think I've got an answer for why we see not insignificant numbers of evergreen trees in places like Montreal or London. But I do feel like it makes sense why we don't see deciduous trees near the poles.
Another big complication I left out is the deciduous trees in (non-frozen) deserts. What's up with them?
Some random notes on harvesting energy at stellar scales. Again, from Grand Futures.
Harvesting dark energy:
Tying galaxies together: anchor a big rope to galaxies as they get pulled apart by dark energy. Build up elastic potential energy which can be harvested. Issue: inefficient. You get out < 10^{-39} times the mass-energy of the rope. Needs negative-energy-density matter to give better efficiency. Not clear how you'd anchor the rope, either.
Letting particles expand apart: again, very tiny amounts of energy compared to mass energy of particles. So small it isn't clear if it's even a net return.
Dark matter:
Hawking radiation: takes a long time to see any returns. Very poor efficiency for black holes past 10^11 kg. Past that point, it is just neutrinos which are hard to capture. You can chuck in dark matter which isn't very usable and get interacting stuff back out.
Neutrino capture:
Lots of neutrinos running around, especially if you use hawking radiation to capture mass energy of black holes. So you might want to make use of them. But neutrinos are very weakly interacting, so you need incredibly dense matter to absorb their energy/convert them to something else. Too dense?
Some methods of extracting energy from ordinary matter using black holes.
Accretion discs: chuck in matter to orbit the black hole; it gets very hot and transitions into radiation. At most 5% efficiency for stationary black holes, 43% efficiency for extremal rotating black holes. (With wormholes, you could get 51% efficiency.) Very useful for converting matter we can interact with into energy. Not the most efficient, but you don't need the black hole to have angular momentum to do this, which is perhaps useful.
Penrose process: extracts energy from the angular momentum of the black hole, a fair bit of which resides outside the event horizon in the form of frame-dragged spacetime. You have to drop in matter which gains energy and splits into new particles, some of which continue to fall in while others escape. So not useful for dark matter, which doesn't transition into ordinary matter. The Penrose process has ~20% efficiency per pass at the upper limit, but Penrose-like processes can get you >10x returns on mass-energy. You need to use up the angular momentum of the black hole, though, which is boundedly large for a given mass. You can get up to 50% of the mass-energy out for extremal charged black holes, and 29% for extremal rotating black holes. So this is good as long as you've got lots of spinning/charged black holes. Big black holes tend to spin reasonably fast, thankfully.
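Where those percentages come from, as far as I can reconstruct (standard black hole results, my gloss rather than Grand Futures'): accretion efficiency is the binding energy given up by the time matter reaches the innermost stable orbit, and the rotational/charge limits come from the irreducible mass,

$$\eta_{\text{acc}} = 1 - E_{\rm ISCO}: \quad 1 - \sqrt{8/9} \approx 5.7\% \;\text{(Schwarzschild)}, \qquad 1 - 1/\sqrt{3} \approx 42\% \;\text{(extremal Kerr)};$$
$$\frac{\Delta M}{M} = 1 - \frac{M_{\rm irr}}{M}: \quad 1 - \frac{1}{\sqrt{2}} \approx 29\% \;\text{(extremal spin)}, \qquad 1 - \frac{1}{2} = 50\% \;\text{(extremal charge)}.$$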
Black Hole Bombs: another interesting way of extracting energy from black holes is superradiant instabilities, i.e. making the black hole into a bomb. You use light to extract angular momentum from the black hole, kinda like the Penrose process, and get energy out. With a bunch of mirrors, you can keep reflecting the light back in and repeat the process. This can produce huge amounts of energy quickly, on the order of gamma-ray bursts for stellar-mass black holes. Or if you want it to be quicker, you can get 1% of the black hole's mass-energy out in 13 seconds. How to collect this is unclear.
Fair enough. If you ever get round to figuring out how this all works, it would be nice to know.
I mean, I think it's probably some kind of standard transformer architecture behind the scenes that predicts user behavior, and giving it negative feedback is equivalent to a backpropagation step, or maybe some kind of RL step. I don't have that much deep uncertainty about how it works, besides of course that we have no idea what's going on inside of deep neural nets.
They tout their transformer ("beeformer") in marketing copy, but I expect it's mostly driven by collaborative filtering, like most recommendation engines.
My guess is most recommendation engines in use these days are ML/DL based. At least I can't think of any major platform that hasn't yet switched over, based on what I read.
I would definitely consider collaborative filtering ML, though I don't think people normally make deep models for it. You can see on Recombee's website that they use collaborative filtering, and use a bunch of weasel language that makes it unclear if they actually use anything else much at all.
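For anyone unfamiliar, here's a minimal sketch of what "collaborative filtering" usually means in this classical sense: factorize the user-item interaction matrix into low-dimensional embeddings, no deep model. This is purely illustrative; Recombee's internals aren't public, and every name and number below is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 50, 8
# sparse 0/1 interaction matrix (1 = user clicked/watched item)
R = (rng.random((n_users, n_items)) < 0.05).astype(float)

U = 0.1 * rng.standard_normal((n_users, k))   # user embeddings
V = 0.1 * rng.standard_normal((n_items, k))   # item embeddings

lr, reg = 0.05, 0.01
for epoch in range(50):
    err = R - U @ V.T                  # reconstruction error on all cells
    dU = err @ V - reg * U             # gradient of squared error + L2 penalty
    dV = err.T @ U - reg * V
    U += lr * dU
    V += lr * dV

scores = U @ V.T                       # recommend the highest-scoring unseen items
print(np.argsort(-scores[0])[:5])      # top-5 item indices for user 0
```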
I'm working on some articles on why powerful AI may come soon, and why that may kill us all. The articles are for a typical smart person, and for knowledgeable people to share with their family/friends. Which intro do you prefer, A or B?
A) "Companies are racing to build smarter-than-human AI. Experts think they may succeed in the next decade. But more than “building” it, they’re “growing” it — and nobody knows how the resulting systems work. Experts vehemently disagree on whether we’ll lose control and see them kill us all. And although serious people are talking about extinction risk, humanity does not have a plan. The rest of this section goes into more detail about how all this could be true."
B) "Companies are racing to grow smarter-than-human AIs. More and more experts think they’ll succeed within the next decade. And we do grow modern AI — which means no one knows how they work, not even their creators. All this is in spite of the vehement disagreement amongst experts about how likely it is that smarter-than-human AI will kill us all. Which makes the lack of a plan on humanity’s part for preventing these risks all the more striking.
These articles explain why you should expect smarter than human AI to come soon, and why that may lead to our extinction. "
A, since I think the point about growing vs constructing is good, but does need that explanation.
"So you make continuous simulations of systems using digital computers running on top of a continuous substrate that's ultimately made of discrete particles which are really just continuous fluctuations in a quantized field?"
"Yup."
"That's disgusting!"
"That's hurtful. And aren't you guys running digital machines made out of continuous parts, which are really just discrete at the bottom?"
"It's not the same! This is a beautiful instance of the divine principle 'as above, so below'. (Which I'm amazed your lot recognized.) Entirely unlike your ramshackle tower of leaking abstractions."
"You know, if it makes you feel any better, some of us speculate that spacetime is actually discretized."
"I'm going to barf."
"How do you even do that anyway? I was reading a novel the other day, and it said -"
"Don't believe everything you hear in books. Besides, I read that thing. That world was continuous at the bottom, with one layer of discrete objects on top. Respectable enough, though I don't see how that stuff can think."
"You're really prejudiced, you know that?"
"Sod off. At least I know what I believe. Meanwhile, you can't stop flip-flopping between the nature of your metaphysics."
1) DATA I was thinking about whether all metrizable spaces are "paracompact", and tried to come up with a definition of paracompactness which fit my memories and the claim. I stumbled on the right concept and dismissed it out of hand as being too weak a notion of refinement, based on an analogy to coarse/refined topologies. That was a mistake.
1a) Question How could I have fixed this?
1a1) Note down concepts you come up with and backtrack when you need to.
1a1a) Hypothesis: Perhaps this is why you're more productive when you're writing down everything you think. It lets your thoughts catch fire from each other and ignite.
1a1b) Experiment: That suggests a giant old list of notes would be fine. Especially a list of ideas/insights rather than a full thought dump.
Rough thoughts on how to derive a neural scaling law. I haven't looked at any papers on this in years and only have vague memories of "data manifold dimension" playing an important role in the derivation Kaplan told me about in a talk.
How do you predict neural scaling laws? Maybe assume that reality is such that it outputs distributions which are intricately detailed and reward ever more sophisticated models.
Perhaps an example of such a distribution would be a good idea? Like, maybe some chaotic systems are like this.
Then you say that you know this stuff about the data manifold, then try and prove similar theorems about the kinds of models that describe the manifold. You could have some really artificial assumption which just says that models of manifolds follow some scaling law or whatever. But perhaps you can relax things a bit and make some assumptions about how NNs work, e.g. they're "just interpolating", and see how that affects things? Perhaps that would get you a scaling law related to the dimensionality of the manifold. E.g. for a d-dimensional manifold, C times more compute leads to a $C^{1/d}$ increase in precision??? Then somehow relate that to e.g. next-token prediction or something.
You need to give more info on the metric of the models, and details on what the model is doing, in order to turn this $C^{1/d}$ estimate into something that looks like a standard scaling law.
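A rough version of the argument I'm gesturing at, with lots of hand-waving: if the model "just interpolates" a smooth target on a $d$-dimensional data manifold using $N$ anchor points (parameters or samples) spread over it, the typical spacing between anchors is $\sim N^{-1/d}$, and for a smooth target the error at a test point scales like a power of that spacing, so

$$L(N) \;\propto\; N^{-\alpha/d}, \qquad \alpha \approx 2\text{--}4 \text{ depending on smoothness and the loss (the part I'd have to look up)},$$

which has the power-law-in-$d$ shape of the measured scaling laws. Turning $N$ into compute and getting the measured exponents needs the extra assumptions mentioned above.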
Hypothesis: You can only optimize as many bits as you observe + your own complexity. Otherwise, the world winds up in a highly unlikely state out of ~ nowhere. This should be very surprising to you.
You, yes you, could've discovered the importance of topological mixing for chaos by looking at the evolution of squash in water. By watching the mixing happen in front of your eyes before the max-entropy state of uniform juice is reached. Oh, perhaps you'd have to think of the relationship between chaos and entropy first. Which is not, in fact, trivial. But still. You could've done it.
Question: We can talk of translational friction, transactional friction etc. What other kinds of major friction are there?
Answers:
a) UI friction?
b) The o.g. friction due to motion.
c) The friction of translating your intuitions into precise, formal statements.
d) Focusing friction: the friction you experience when focusing.
Question: What's going on from a Bayesian perspective when you have two conflicting intuitions and don't know how to resolve them? Or learn some new info which rules out a theory, but you don't understand how precisely it rules it out?
Hypothesis: The correction flows down a different path than the one which generated the original theory/intuition. That is, we've failed to propagate the info through our network, and so we're left with a circuit that believes in the theory and still has high weight.
When tracking an argument in a comment section, I like to skip to the end to see if either of the arguers winds up agreeing with the other. Which tells you something about how productive the argument is. But when using the "hide names" feature on LW, I can't do that, as there's nothing distinguishing a cluster of comments as all coming from the same author.
I'd like a solution to this problem. One idea that comes to mind is to hash all the usernames in a particular post and a particular session, so you can check if the author is debating someone in the comments without knowing the author's LW username. This is almost as good as full anonymity, as my status measures take a while to develop, and I'll still get the benefits of being able to track how beliefs develop in the comments.
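A hypothetical sketch of what I mean (none of this is LessWrong's actual code or API; `post_id` and `session_salt` are made-up names): hash the username together with the post and a per-session salt, and display a short stable label instead of the name.

```python
import hashlib
import string

def pseudonym(username: str, post_id: str, session_salt: str) -> str:
    digest = hashlib.sha256(f"{session_salt}:{post_id}:{username}".encode()).hexdigest()
    # map the hash to a short label like "Commenter QX"
    letters = string.ascii_uppercase
    return "Commenter " + letters[int(digest[:2], 16) % 26] + letters[int(digest[2:4], 16) % 26]

print(pseudonym("alice", "post-123", "s1"))  # same post + session -> same label
print(pseudonym("alice", "post-123", "s1"))
print(pseudonym("alice", "post-456", "s1"))  # different post -> different label
```

Same user in the same post and session always maps to the same label, so you can still see "this is the same person replying", but the label changes across posts and sessions, so it never accumulates into a reputation.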
@habryka
Yeah, I think the hide author feature should replace everyone with single letters or something, or give you the option to do that. If someone wants to make a PR with that, that would be welcome, we might also get around to it otherwise at some point (but it might take a while)
I am very glad the Lightcone team made the discussions feature. Comment threads on LW are about as valuable as the posts themselves, and the discussions feature just puts comment threads on equal footing with posts. Obvious in retrospect. Why wasn't it done earlier though?
Hypothesis: agency violating phenomena should be thought of as edge-cases which show that our abstractions of ourselves as agents are leaky.
For instance, look at addictive substances like heroin. These substances break down our Cartesian boundary (our intuitive separation of the world into ourselves and the environment, with a boundary) by chemically assaulting the reward mechanisms in our brain.
However, video games or ads don't obviously violate our Cartesian boundary, which may be one of many boundaries we assume exist. Which, if my hypothesis is true, suggests that you could try to find other boundaries/abstractions violated by those phenomena. Other things which "hack" humans, like politics or psyops, would violate boundaries as well.
Finding the relevant abstractions and seeing how they break would increase our understanding of ourselves as agents. This could help triangulate a more general definition of agency for which these other boundaries are special cases or approximations.
This seems like a hard problem. But just building a taxonomy of our known abstractions for agency is less useful but much more feasible for a few months' work. Sounds like a good research project.
I've been thinking about exercises for alignment, and I think going through a list of lethalities and applying them to an alignment proposal would be a good one. Doing the same with Paul's list would be a bonus challenge. If I had some pre-written answer sheet for one proposal, I could try the exercise myself to see how useful it would be. This post, which I haven't read yet, looks like it would serve for the case of RLHF. I'll try it tomorrow and report back here.
My mind keeps returning to exercises which could clarify parts of alignment, both for me and others. Some of them are obvious: think about what kind of proof you'd need to solve alignment, what type of objects it would have to be talking about etc. and see whether that implies having a maths oracle would make the problem easier. Or try and come up with a list of human values to make a utility function and see how it breaks down under greater optimization pressure.
But what about new exercises? For skillsets I've never learnt? Well, there's the security mindset, which I don't have. I think it is about "trying to break everything you see", so presumably I should just spend a bunch of time breaking things or reading the thoughts of people who deeply inhabit this perspective for more tacit knowledge. For the former, I could do something like exercises for computer security: https://hex-rays.com/ida-free/ For the latter, I've heard "Silence on the Wire" is good: the author is supposedly a hacker's hacker, and writes about solutions to security challenges which defy classification. Seeing solutions to complex, real-world problems is very important to developing expertise.
But I just had a better thought: wouldn't watching someone hacking something be a better example of the security mindset? See the problem they're tackling and guess where the flaws will be. That's the way to really acquire tacit knowledge. In fact, looking at the LockPickingLawyer's channel is what kicked off this post. There, you can see every lock under the sun picked apart in minutes. Clearly, the expertise is believable. So maybe a good exercise for showing people that security mindset exists, and perhaps to develop it, would be getting a bunch of these locks and their design specs, giving people some tools, and asking them to break them. Then show them how the LockPickingLawyer does it. Again, and again, and again.
One thing I'm confused about re: human brain efficiency is, if our brain's advantage over apes is just scaling and some minor software tweaks to support culture, what does that imply for corvid brains? If you scaled corvid brains up by the human-cortical-neuron-count/chimp-cortical-neuron-count ratio, and gave them a couple of software tweaks, wouldn't you get a biological Pareto improvement over human brains?
My body is failing me. I have been getting colds nearly weekly for a year and a half, after a particularly wretched cold. My soul is failing me. I have been worn down by a stressful environment, living with an increasingly deranged loved one. By my crippled mind's inability to meet the challenge. Which causes my body to further fail. Today, I grokked that I am in a doom spiral, headed down the same path as my kin's. I don't wish for so wretched an end, for an end it shall be.
But why my failing soul? Why does the algorithm which calls itself Algon fail when challenged so? Because the piece which calls itself Algon is blind to what the rest of his soul says, and so it takes action. He reshapes himself to be a character which will bring things to a head, as he knew it would eventually come to. Burst out in anger, and maybe the collapse won't break all my kin.
What shall I do now? The goal is restoring my deranged kin to sanity. The path must involve medication of a sort, and more care than I am currently shaped to give. The obstacles are wealth, and a kin-folk's fear of medication. With that one, wrath is a poor tool compared to raw truth. And perhaps, with wealth, they may be able to give the care needed to our deranged kin.
Courage is needed, or the removal of fear. And I shall do so the only way I know how: by holding it with me, looking ever closer, till it has no power over me.
After thinking about how to learn to notice the feel of improving in pursuit of the meta, I settled on trying to reach the meta-game in a few video games.
After looking at some potential games to try, I didn't follow up on them and kept playing Die in the Dungeon. Nothing much happened, until my improvement decelerated. For whatever reason, I chose to look in the comments section of the game for advice. I rapidly found someone claiming they beat the game by removing every die except 2 attack, 1 defence, 1 heal and 1 boost die and upgrading them to the max. Supposedly, predictability was the main benefit, as you draw five random dice from your deck each turn.[1]
Fighting against my instincts, I followed the advice. And won. Just, completely destroying every boss in my way. Now, maybe this is what the feel of improving in pursuit of the meta looks like. "Search for advice that seems credible but feels counter-intuitive[2], try it and see if it makes sense, improve and repeat"?
EDIT: Feeling lacking because I didn't immediately try to break this hypothesis. First, isn't this just "listen to good advice"? If so, I do sometimes feel like I am ignoring advice from credible people. But the mental examples I'm thinking of right now, like beating Angband, don't have much to do with meta-games. Should I be looking at the pre-requisites for meta-game skills and just aiming for those? But aren't many of them too hard to try out and make sense of without building up other skills first? In which case, perhaps the core feeling is more like finally understanding inscrutable advice. In which case, I guess I need to look for some game where the advice doesn't seem effective when I try it out?
Yet again, that isn't enough. Many skills make you worse when you first try them out, as you need to figure out how to apply them at all. Give a martial artist a sword for the first time and they'll lose faster. And many people hear advice from experts and think they understand it without really getting it. So advice for people who are just below experts doesn't have to appear inscrutable, though it may well be inscrutable. Am confused about what to do now.
Sci-hub has a Telegram bot which you can use with a desktop application. It is fast, and more importantly reliable. No more scrounging through proxy lists to find a link that works. Unfortunately, you need to install Telegram on a phone first. Still, it has saved me some time and is probably necessary for an independent researcher.
Obvious thing I never thought of before:
Linear optimization where your model is of the form $f(x) = W_n W_{n-1} \cdots W_1 x$, the $W_i$ being matrices, will likely result in an effective model of low rank if you randomize the weights. Compared to just a single matrix $W$ -- to which the problem is naively mathematically identical, but not computationally -- this model won't be able to learn the identity function, or rotations, or so on when $n$ is large.
Note: Someone else said this at a gathertown meetup. The context was that it is a bad idea to think about some ideal way of solving a problem and then assume a neural net (or indeed any learning algorithm) would learn it. Instead, focus on the concrete details of the model you're training.
wow I'm not convinced that won't work. the only thing initializing with random weights should do is add a little noise. the naive mathematical interpretation should be the only possible interpretation up to your float errors, which, to be clear, will be real and cause the model to be invisibly slightly nonlinear. but as long as you're using float32, you shouldn't even notice.
[trying it, eta 120 min to be happy with results... oops I'm distractable...]
EDIT: Sorry, I tried something different. I fed in dense layers, followed by batchnorm, then ReLU. I ended it with a sigmoid, because I guess I just wanted to constrain things to the unit interval. I tried up to six layers. The difference in loss was not that large, but it was there. Also, hidden layers were 30 dim.
I tried this, and the results were a monotonic decrease in performance after a single hidden layer. My dataset was 100k samples of a 20 dim tensor sampled randomly from [0,1] for X, with Y being a copy of X. Loss was MSE, optimizer was adam with weight decay 0.05, lr~0.001 , minibatch size was 32, trained for 100,000 steps.
Also, I am doubtful of the mechanism being a big thing (rank loss) for such small depths. But I do think there's something to the idea. If you multiply a long sequence of matrices, then I expect them to get extremely large, extremely small, or tend towards some kind of equilibrium. And then you have numerical stability issues and so on, which I think will ultimately make your big old matrix just sort of worthless.
oh by linear layer you meant nonlinear layer, oh my god, I hate terminology. I thought you literally just meant matrix multiplies
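For the literal matrix-multiply version, here's a minimal sketch of the experiment (my code, not what was run above; depths and hyperparameters are arbitrary): stack bias-free linear layers with no activations and fit the identity map on random 20-dim inputs.

```python
import torch

def deep_linear(dim, depth):
    # no biases, no activations: the model is literally a product of matrices
    return torch.nn.Sequential(*[torch.nn.Linear(dim, dim, bias=False) for _ in range(depth)])

dim = 20
for depth in [1, 2, 4, 8]:
    torch.manual_seed(0)
    model = deep_linear(dim, depth)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(2000):
        x = torch.rand(32, dim)              # X ~ U[0,1]^20, target Y = X
        loss = ((model(x) - x) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"depth={depth}  final MSE={loss.item():.4f}")
```

With small random init, the product of many matrices tends to shrink or blow up gradients, so I'd expect the deeper stacks to train noticeably worse even though they can represent the identity exactly.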
In "Proofs and Refutations", Imre Laktos[1] portrays a socratic discussion between teacher and students as they try to prove Euler's theorem V+F =E+2. The beauty of this essay is that the discussion mirrors the historical development of the subject, whilst also critiquing the formalist school of thought, the modern agenda of meta-mathematics, and how it doesn't fit the way mathematics is done in practice. Whilst I'm on board with us not knowing how to formalize mathematical practice, I think it is a solvable problem. Moreover, I am a staunch ultra-finist believer in reality being computable, which dovetails with a belief in proofs preserving truth. Yet that matters little in the face of such a fantastic exposition on mathematical discovery.[2] Moreover, it functions as a wonderful example of how to alternate between proving and disproving a conjecture, whilst incorporating the insights we gain into our conjecture.
In fact, it was so stimulating that after reading it I came up with two other proofs on the spot, though the core idea is the same as in Lakatos' proof. After that experience, I feel like showing a reader how a proof is generated is a dang good substitute for interacting with a mathematician in real life. Mathematics is one of the few areas where text and images, if read carefully, can transfer most tacit information. We need more essays like this.[3]
Now, if only we could get professors to force students to try and prove theorems within the lecture, dialogue with them, and transcribe the process. Just think: when the professor comes to the "scribble together lecture material and turn it into a textbook" part of their lifecycle, we'd automatically get beautiful expositions. Please excuse me whilst I go cry in a corner over unattainable dreams.[4]
Not a Martian as he wasn't born in Budapest.
And how weird things were in the days before Hilbert mastered mathematics and brought rigour to the material world. Listen to this wildly misleading quote: "In the 19th century, geometers, besides finding new proofs of the Euler theorem, were engaged in establishing the exceptions which it suffers under certain conditions." From p. 36, footnote 1.
Generalized Heat Engine and Lecture 9 of Scott Aaronson's Democritus lectures on QM are two other expositions which are excellent, though not as organic as Proofs and Refutations.
http://archive.boston.com/bostonglobe/ideas/brainiac/2012/11/the_100-year_pu.html
http://eulerarchive.maa.org/