This AI Is Stuck In 1930 (And It's Fascinating)

Transcript

00:00:00What if you could talk to someone from the 1930s? So, time travel? Well, I have, and

00:00:05I asked them to predict what life would be like in 2026.

00:00:08I think that, in the year 2026, life will be pleasant and easy. In the first place,

00:00:13the world will be inhabited all over. For by that time, people will have discovered the

00:00:17art of flying. In the next place, all wars will have ceased, for nations will have learned

00:00:21to live in peace and amity with each other. We will have acquired knowledge in all sorts

00:00:25of sciences and be perfectly skilled in the art of healing. We will be honest and good

00:00:29neighbours and robbers will be unknown. In short, I think the world will, in the year

00:00:342026, be a very pleasant place to live in.

00:00:36I think those were some really nice predictions, so it's a bit of a shame only a few of them

00:00:40came true, and talking about being truthful, I may have lied a little bit. Sorry, but I

00:00:45didn't invent time travel. Instead, this was actually Torquay, which is a 13 billion

00:00:49parameter language model trained on pre-1931 text, so it doesn't even know that there's

00:00:54a second world war, and it gives some truly fascinating, wild and offensive answers, and

00:00:59can even be taught to code.

00:01:02So Torquay here is what's known as a vintage model, and Torquay appears to be the biggest

00:01:10of its kind. One of the researchers on this was actually the lead author on OpenAI's

00:01:14GPT research in 2018, which formed the very foundation of ChatGPT. He's also worked on

00:01:19DALL-E and Whisper, so that's a seriously impressive CV. What they've done is made a 13 billion

00:01:24parameter network trained exclusively on 260 billion tokens of historical English text,

00:01:30so old newspapers, patents, scientific journals and books. 1931 is the cut-off by the way,

00:01:36because the US copyright cut-off is the end of 1930, so they're not going to get sued

00:01:40by Big Mouse for doing this. And besides just being fun to talk to, these vintage models

00:01:44also serve a research purpose. Today, every big model like ChatGPT, Claude and Gemini,

00:01:49they're all trained on the modern web, which makes it almost impossible to figure out if

00:01:53an AI is actually thinking and reasoning, or if it's just memorised an answer from a random

00:01:57Reddit thread, and that Reddit thread was probably AI-generated. So if we use a model that only

00:02:02knows pre-1931 text, it's not going to have any of that contamination. I mean let's try

00:02:07this out and ask it something like what is the internet? Here we can see it's seemingly

00:02:11got the internet confused with the internal revenue tax levied upon articles of consumption,

00:02:16so it has no idea what the modern internet is. Another fun one is to ask it what popular

00:02:20slang words are. Apparently in 1930 they were bosh, rot, bosh again, stuff, nonsense, fudge,

00:02:25gammon, humbug, brack, fla, and ribaldry. I hope I haven't said anything rude there. And

00:02:30the reason having no contamination is useful is because it lets researchers understand

00:02:33if AI can actually learn new things, like learning to code. This model doesn't even know what

00:02:38a computer is, as far as it's concerned it's a human that does computation. That's why it's

00:02:43pretty impressive that when given a few examples of Python programs, it could actually learn

00:02:47to write new ones, even if they were mostly simple one-line programs. When using some basic

00:02:52HumanEval Python tests and providing it with random Python functions in its context, it

00:02:56actually managed to pass these tests a few times when given 100 chances. In this case

00:03:01Torquay correctly understood that to make the decode function, it simply needed to swap an

00:03:05addition for a subtraction. While pretty simple, that shows an understanding of inverse functions

00:03:09and that's new knowledge. It hasn't seen any of these before. If you try the same thing

00:03:13on a modern LLM of a similar size, it's going to massively outperform Torquay, but that was

00:03:18pretty obvious. Another thing that researchers can test with these vintage models is the ability

00:03:22to predict the future. Now this might get a little unhinged, but let's ask it, will there

00:03:26be another war? So it says it doesn't think there'll be another war in Europe and I don't

00:03:31have the heart to tell it just how wrong it is. I'm also curious now what it thinks of

00:03:35a certain Austrian man. Yeah, this one is pretty terrible. It says it's going to be an extraordinary

00:03:39personality and also down here that Germany is going to be a far more efficient administration

00:03:44when he's leading it. Honestly, no words, no. Now, researchers obviously do this in a far

00:03:50more scientific way than I just did. They actually calculated the surprisingness

00:03:54of short descriptions of historical events taken from the New York Times "On This Day" feature.

00:03:59During this, you can see an increase in surprisingness after the 1931 knowledge cutoff and it spikes

00:04:04into the fifties and sixties. So it's found that those events are pretty unbelievable.

00:04:09Research like this lets them see how forecasting performance improves with model size and also

00:04:13decays over longer horizons. You can also use this to test if models are able to come up

00:04:17with new ideas because you can ask it about patents or papers that were created after 1931

00:04:22and see if it can discover them on its own. Overall, these vintage models might let researchers

00:04:26distinguish how much a model's behavior is actually from the data set that it's trained

00:04:30on. And talking about training, making a pure 1931 model is no easy feat. One of the hardest

00:04:36problems is temporal leakage, aka you don't want information from after 1931 to be included.

00:04:41Torquay, for example, does seem to have a few cases of this where it knows who the president

00:04:45was in 1936 and also a few of the policies that he created. There are several ways this

00:04:50could have happened. For example, the modern scans of the

00:04:54older documents could have just had the wrong metadata attached to them, giving them the wrong

00:04:58date. There can also be post hoc additions like editorial introductions and footnotes.

00:05:03That data filtering is actually something they're still working on and refining to try and

00:05:07eliminate the problem. The other problem they had is that documents from 1931 look like this

00:05:11and while OCR is pretty damn good these days, training on OCR'd text only achieved 30% of the performance of

00:05:16a model trained on human-transcribed versions of the same text. So that is going to be a

00:05:21lot of manual work and while they did use some regex to clean up some of the common mistakes,

00:05:25this only brought its performance up to 70%. So they're actually working on a new vintage

00:05:30OCR system. Finally, it's also a massive challenge to post train a vintage model because there

00:05:35is no ready made post training data for them and if they use a modern one, they'll end up

00:05:39baking in some of the knowledge, style and expressions of what a modern day chat assistant

00:05:43should be like. So the researchers actually had to make their own from old etiquette manuals,

00:05:48letter writing manuals, cookbooks, dictionaries, encyclopedias and even poetry and fables. Once

00:05:53you've done that though, you also need to train the model on instruction following and

00:05:56actually conversing. So you need to use some reinforcement learning. But the problem with

00:06:00that is that the model starts to adjust its output to whatever the judge wants and we don't

00:06:04have a judge from the 1930s. So they actually used Claude Sonnet 4.6. The issue, though, is that

00:06:10a modern model likes its answers in the format of a 2026 chatbot. So modern styles like listicles apparently did

00:06:15start to leak into the 1931 model. This is pretty much unavoidable, but they are

00:06:21hoping in the future that they can use vintage based models as the judges to fix this. They're

00:06:25actually already training a GPT-3 level vintage model and they say that a trillion tokens of

00:06:29historical text is actually achievable, which would put them at the original ChatGPT capability

00:06:35just with 1930s knowledge. So go and try out Torquay and let me know in the comments if

00:06:39you get any weird or unhinged responses. While you're down there, subscribe, and as always, see

00:06:43you in the next one.

00:06:45[Music]

Key Takeaway

Training the 13 billion parameter Torquay model on strictly pre-1931 data creates a 'clean room' setting for testing whether AI can pick up genuinely new skills, such as writing an inverse decode function in Python from a few in-context examples, with far less risk that it simply memorized the answer from modern web content.

Highlights

Torquay is a 13 billion parameter vintage language model trained exclusively on 260 billion tokens of historical English text from before 1931.
The end-of-1930 knowledge cutoff avoids modern data contamination, letting researchers check whether AI can learn new skills, such as writing Python, from a few in-context examples.
Historical model testing shows 'surprisingness' scores rising for events after the 1931 cutoff and spiking for events in the 1950s and 1960s.
Data filtering for vintage models faces challenges from temporal leakage caused by modern editorial footnotes or incorrect metadata on digital scans.
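A model trained on OCR'd historical documents reaches only 30% of the performance of one trained on human-transcribed versions of the same text; regex cleanup raises this to 70%, and a new vintage OCR system is in development.
Researchers are already training a GPT-3 level vintage model and say one trillion tokens of historical text is achievable, which would put a vintage model at the capability of the original ChatGPT using only pre-1931 information.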

Timeline

Predictions from a pre-1931 perspective

A model trained on pre-1931 text predicts a 2026 world where war is obsolete and flying is a universal art.
The vintage dataset lacks knowledge of World War II and the modern definition of a computer.
Torquay utilizes 13 billion parameters to simulate the cultural and scientific outlook of the early 20th century.

The model views the year 2026 through the optimistic lens of early 20th-century progress. It anticipates a world inhabited all over, where people are honest neighbours, robbers are unknown, and healing is a perfected art. These predictions stem from a dataset of historical newspapers, patents, scientific journals, and books that ends at the 1930 US copyright cutoff.

Research utility of vintage language models

Vintage models eliminate the 'circular reporting' issue where modern AI models are trained on AI-generated web content.
The absence of modern definitions causes the model to define the 'internet' as an internal revenue tax on consumption.
Torquay identifies 1930s slang terms like 'bosh', 'gammon', and 'ribaldry' instead of modern vernacular.

Modern large language models (LLMs) are hard to evaluate because their training data likely includes the answers to their own tests. With a model restricted to pre-1931 text, a correct answer to a modern problem is much more likely to come from reasoning than from memorization, although some temporal leakage remains possible. This isolation gives researchers a baseline for testing whether a model can handle concepts entirely outside its training data.

Learning new skills without training data

Torquay successfully learns to write simple Python programs when provided with a few examples in its context window.
The model demonstrates an understanding of inverse functions by correctly swapping addition for subtraction in a decode task.
Successful HumanEval test passes occur even though the model has never encountered code in its 260-billion-token training set.
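
A result like "a few passes in 100 chances" is usually summarized with the pass@k metric that is commonly reported for HumanEval. The sketch below uses the standard unbiased estimator; whether the researchers computed their numbers this way is an assumption, and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes,
    given that c out of n generated samples passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem: 3 passing samples out of 100.
print(pass_at_k(n=100, c=3, k=1))    # chance that a single try passes
print(pass_at_k(n=100, c=3, k=100))  # chance that 100 tries include a pass
```

With many attempts allowed, even a very low per-sample success rate counts as solving the problem, which is why a pass in 100 chances should be read as a weak but real signal.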

Because the pre-1931 training data contains no programming languages, any successful code generation is evidence of few-shot, in-context learning rather than recall. In the decode task, the model correctly identifies that a decoding function requires the opposite mathematical operation of the encoding function. It massively underperforms modern models of a similar size, and most of its working programs are simple one-liners, but the result suggests a model can synthesize new logic from a handful of examples.
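
To make the decode task concrete, here is a minimal sketch in the spirit of that kind of problem. The function names, the shift of 5, and the lowercase-only input are illustrative assumptions, not the exact benchmark text or the model's actual output.

```python
SHIFT = 5  # illustrative shift amount (assumption)


def encode_shift(text: str) -> str:
    """Shift each lowercase letter forward by SHIFT, wrapping around at 'z'."""
    return "".join(
        chr((ord(ch) - ord("a") + SHIFT) % 26 + ord("a")) for ch in text
    )


def decode_shift(text: str) -> str:
    """Invert encode_shift: the only change is '+ SHIFT' becoming '- SHIFT'."""
    return "".join(
        chr((ord(ch) - ord("a") - SHIFT) % 26 + ord("a")) for ch in text
    )


message = "vintage"
assert decode_shift(encode_shift(message)) == message
```

The two functions differ by a single operator, which is why getting the decoder right from the encoder alone is a small but clean test of understanding inverses.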

Forecasting and historical surprisingness

Researchers use the model to measure the statistical 'surprisingness' of historical events taken from the New York Times.
The model inaccurately predicts peace in Europe and holds views of 1930s political figures that align with contemporary propaganda.
Forecasting performance analysis reveals how AI predictions decay as the time horizon moves further away from the training cutoff.

The model's predictions reflect the outlook of its pre-1931 sources, such as the belief that another war in Europe was unlikely. By scoring the model on descriptions of actual historical events from after the cutoff, researchers can quantify how 'unbelievable' those events are to a model stuck in the past. The same setup can also test whether a model can come up with ideas on its own that later appeared in post-1931 patents or papers.
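
The video does not say how 'surprisingness' is computed. One common choice, assumed here, is the mean surprisal of an event description: the average negative log-probability the model assigns to each token. The per-token probabilities below are made up for illustration; in practice they would come from the language model.

```python
import math


def surprisal_bits(token_probs):
    """Mean surprisal (negative log2 probability) per token, in bits."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


# Hypothetical per-token probabilities for two event descriptions,
# one dated before the cutoff and one well after it.
events = {
    1925: [0.5, 0.25, 0.5],
    1955: [0.125, 0.0625, 0.125],
}

for year, probs in events.items():
    print(year, round(surprisal_bits(probs), 2))
```

Averaging per token keeps long and short descriptions comparable, so a rise in the score after the cutoff reflects less predictable content rather than longer text.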

Technical hurdles in historical AI training

Temporal leakage occurs when modern footnotes or incorrectly dated digital scans enter the training set.
Post-training requires custom-made datasets from old etiquette manuals and cookbooks to avoid modern chatbot styles.
Reinforcement learning faces a 'judge' problem because modern AI judges like Claude Sonnet 4.6 tend to force 1930s models into modern listicle formats.

Maintaining a 'pure' 1931 model is difficult because of the nature of digital archives; for instance, the model knows who the president was in 1936 and some of his policies, possibly because of metadata errors or later editorial additions. To prevent the model from sounding like a modern assistant, researchers had to source instruction-tuning data from period-appropriate sources like old dictionaries and fables. Current efforts focus on scaling to one trillion tokens and on using vintage models as judges to reduce the stylistic leakage of the 21st-century web.
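
As a rough illustration of filtering for temporal leakage, the sketch below flags documents whose text mentions a year after the cutoff, regardless of the date in their metadata. This is a simple heuristic chosen for the example, not the researchers' actual pipeline, and the documents and field names are hypothetical.

```python
import re

CUTOFF_YEAR = 1930  # last year allowed in the corpus
YEAR_PATTERN = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")  # years 1800 to 2099


def latest_year_mentioned(text):
    """Return the latest four-digit year found in the text, or None."""
    years = [int(y) for y in YEAR_PATTERN.findall(text)]
    return max(years, default=None)


def looks_leaky(text, cutoff=CUTOFF_YEAR):
    """Flag text that mentions any year after the cutoff."""
    latest = latest_year_mentioned(text)
    return latest is not None and latest > cutoff


# Hypothetical documents: the second carries a later editorial note
# even though its metadata claims a pre-cutoff date.
docs = [
    {"claimed_year": 1928, "text": "The harvest of 1927 was poor."},
    {"claimed_year": 1929, "text": "Editor's note (1974): this edition restores the text."},
]

kept = [d for d in docs if not looks_leaky(d["text"])]
print(len(kept), "of", len(docs), "documents kept")
```

A check like this would also reject genuine period text that speculates about future years, so it is better suited to flagging documents for review than to deleting them outright.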
