scaling.jpg (906x388) [image not available]

🧵 Untitled Thread

Anonymous No. 16593460

These three graphs are the most important graphs in the world right now. If you understand what they imply, you know everything is about to change.

Scale is all you need.

2027-2028.

Anonymous No. 16593474

>>16593460
Are you trying to tell me they'll be able to accurately draw text or cite a court case soon?

Anonymous No. 16593482

>>16593474
Yes.

Anonymous No. 16594406

>>16593460
Here's the problem: arbitrary scale feels possible, but we're likely to run into walls on all three graphs, and in the not-too-distant future.

Compute scale is essentially inversely proportional to transistor size: the smaller the transistor, the more of them we can stuff into a fixed-size chunk of silicon. Computers have gotten so much more powerful over the last 40 years because we've kept making transistors smaller and smaller, and have therefore been able to pack more and more of them onto a chip. For most of that time, transistor counts have doubled roughly every 2 years. However, if transistors get much smaller than they are now, you run into quantum fuckery, where electrons just tunnel all over the place; that leakage basically turns the transistor into junk. We do occasionally make engineering choices that speed things up considerably without shrinking the transistors, e.g. using GPUs instead of CPUs for massively parallel simple tasks, or packaging the GPU right on top of the CPU so they share memory directly and data-pipeline latency drops. But these are sparks of inspiration that don't come with the regularity Moore's law had. Maybe we'll keep finding clever tricks, maybe we won't. But I think we're soon going to run into a situation where every gain will have to come from a clever trick.
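To put rough numbers on that "doubling every 2 years" claim, here's the back-of-the-envelope arithmetic in Python (the 2-year cadence and the 40-year window are just the round figures from this post, not measured data):

# Back-of-the-envelope Moore's law arithmetic, using the round numbers above.
years = 40                    # window discussed in the post
doubling_period = 2           # years per doubling, the classic Moore's law cadence
doublings = years / doubling_period
growth = 2 ** doublings       # total multiplier on transistor count
print(f"{doublings:.0f} doublings -> roughly {growth:,.0f}x more transistors per chip")
# prints: 20 doublings -> roughly 1,048,576x more transistors per chip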

Anonymous No. 16594413

>>16594406

Dataset size is a wall we're already running into. It's not enough to have a lot of data; it has to be good data. State-of-the-art LLMs have already been trained on basically all the good language data that exists and can be easily collected, so assembling bigger datasets than we've already used is becoming increasingly difficult and expensive. Worse, the internet is now being flooded with synthetic data generated by previous LLMs, and there's no reliable way to separate the real data from the synthetic stuff. When you train a generative model on synthetic data, it gets a little bit worse: text-to-text models get a little jankier, image generation models get a little blurrier. Do this over and over, each iteration training on the last generation's synthetic output, and eventually your model is complete garbage.
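If you want to see that death spiral in miniature, here's a toy sketch in Python: swap the LLM for a Gaussian that gets refit to its own samples each generation. It's a deliberately stripped-down stand-in (the sample size and generation count are arbitrary, and nothing here is how a real lab trains), but it shows the same qualitative failure mode of training on your own synthetic output:

import numpy as np

# Toy model-collapse demo: each generation fits a Gaussian to the previous
# generation's samples, then emits new synthetic samples from that fit.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=50)    # generation 0: "real" data

for gen in range(1, 501):
    mu, sigma = data.mean(), data.std()           # "train" on the previous generation
    data = rng.normal(mu, sigma, size=50)         # generate the next synthetic dataset
    if gen % 100 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.4f}")

# Because each fit only sees a finite sample of the last generation's output,
# the fitted std collapses toward 0 over the generations: the "model" forgets
# the spread and tails of the original data, the toy analogue of the blurrier,
# jankier outputs described above.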

Anonymous No. 16594416

>>16594413
Parameter count is limited by both compute and dataset size: the more parameters a model has, the more data you need to train it on, and the more compute it takes to train and run. Again, there are some improvements that don't just depend on scale. Low-rank adaptation (LoRA) can reduce training and serving costs, especially in a fine-tuning job. We've also seen models designed with specific hardware optimizations in mind benefit from it (see Mamba). But if compute and/or dataset scaling run into a wall, every improvement will have to come from a clever trick, and those may or may not keep coming.
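Since low-rank adaptation came up, here's a minimal sketch of the idea in Python/numpy (toy layer sizes, not taken from any particular model): freeze the pretrained weight W and learn only a small low-rank correction B @ A, so the trainable parameter count drops from d_out*d_in to r*(d_in + d_out).

import numpy as np

# Minimal LoRA sketch: frozen weight W plus a trainable low-rank update B @ A.
d_in, d_out, r = 1024, 1024, 8           # layer sizes and LoRA rank (illustrative)
alpha = 16                               # LoRA scaling factor
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))       # pretrained weight, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init so W is unchanged at start

def forward(x):
    # Effective weight is W + (alpha / r) * B @ A, applied without materializing it.
    return W @ x + (alpha / r) * (B @ (A @ x))

y = forward(rng.normal(size=d_in))

full_params = d_out * d_in               # what full fine-tuning would update
lora_params = r * (d_in + d_out)         # what LoRA actually trains
print(f"full fine-tune: {full_params:,} params | LoRA: {lora_params:,} params "
      f"(~{full_params / lora_params:.0f}x fewer)")

That only covers inference with the adapter, but the point is the size of the trainable part: you only store and update the small A and B matrices, which is why it helps most in fine-tuning jobs.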

Anonymous No. 16594480

>>16593460
Grok 3 had 15x the compute of Grok 2, but performs only marginally better, and I'm betting they folded in a bunch of improvements from other frontier models too.

Scaling is dead.

Anonymous No. 16594562

>>16593460
These graphs are just pitch-deck material to get investors to hand over more money. These results have never been peer reviewed.

Anonymous No. 16594733

>>16593460
Just a reformulation of complexity. Nothing new here.

Anonymous No. 16594776

>Scale is all you need.

You don't understand these graphs. That's the whole problem in the first place: our current AI tech isn't improving anymore except through economies of scale. That means we've hit another wall.

Anonymous No. 16594971

>>16594562
>These results have never been peer reviewed
Incorrect; scaling-law results have been replicated across labs, and the scaling hypothesis has held up remarkably consistently.
>>16594480
>Grok 3 had 15x the compute
Scaling pre-training still works, just not nearly as well as it used to. But pre-training is not the only scaling dimension: scaling data quality and quantity, scaling test-time compute, scaling context length, all of these still work. It's also an open question whether pre-training scaling is dead dead, or whether we just haven't figured out the most efficient way for our current models to wring information out of data. We almost certainly haven't; transformers aren't even a decade old.
>>16594406
>But I think we're soon going to run into a situation where every gain will have to come from a clever trick.
>Dataset size is a wall we're already running into. It's not enough to have a lot of data; it has to be good data.
Yes. You understand. The frontier AI labs understand this too. That's why they're pouring billions of dollars and tens of thousands of man-hours into synthetic data generation. They're not far from cracking this problem.