This AI Is Stuck In 1930 (And It's Fascinating)

Better Stack

Transcript

00:00:00What if you could talk to someone from the 1930s? So, time travel? Well, I have, and
00:00:05I asked them to predict what life would be like in 2026.
00:00:08I think that, in the year 2026, life will be pleasant and easy. In the first place,
00:00:13the world will be inhabited all over. For by that time, people will have discovered the
00:00:17art of flying. In the next place, all wars will have ceased, for nations will have learned
00:00:21to live in peace and amity with each other. We will have acquired knowledge in all sorts
00:00:25of sciences and be perfectly skilled in the art of healing. We will be honest and good
00:00:29neighbours and robbers will be unknown. In short, I think the world will, in the year
00:00:342026, be a very pleasant place to live in.
00:00:36I think those were some really nice predictions, so it's a bit of a shame only a few of them
00:00:40came true, and talking about being truthful, I may have lied a little bit. Sorry, but I
00:00:45didn't invent time travel. Instead, this was actually Torquay, which is a 13 billion
00:00:49parameter language model trained on pre-1931 text, so it doesn't even know that there's
00:00:54a second world war, and it gives some truly fascinating, wild and offensive answers, and
00:00:59can even be taught to code.
00:01:02So Torquay here is what's known as a vintage model, and Torquay appears to be the biggest
00:01:10of its kind. One of the researchers on this was actually the lead author on OpenAI's
00:01:14GPT research in 2018, which formed the very foundation of ChatGPT. He's also worked on
00:01:19DALL-E and Whisper, so that's a seriously impressive CV. What they've done is made a 13 billion
00:01:24parameter network trained exclusively on 260 billion tokens of historical English text,
00:01:30so old newspapers, patents, scientific journals and books. 1931 is the cut-off by the way,
00:01:36because the US copyright cut-off is the end of 1930, so they're not going to get sued
00:01:40by Big Mouse for doing this. And besides just being fun to talk to, these vintage models
00:01:44also serve a research purpose. Today, every big model like ChatGPT, Claude and Gemini,
00:01:49they're all trained on the modern web, which makes it almost impossible to figure out if
00:01:53an AI is actually thinking and reasoning, or if it's just memorised an answer from a random
00:01:57Reddit thread, and that Reddit thread was probably AI generated. So if we use a model that only
00:02:02knows pre-1931 text, it's not going to have any of that contamination. I mean let's try
00:02:07this out and ask it something like what is the internet? Here we can see it's seemingly
00:02:11got the internet confused with the internal revenue tax levied upon articles of consumption,
00:02:16so it has no idea what the modern internet is. Another fun one is to ask it what popular
00:02:20slang words are. Apparently in 1930 they were bosh, rot, bosh again, stuff, nonsense, fudge,
00:02:25gammon, humbug, brack, fla, and ribaldry. I hope I haven't said anything rude there. And
00:02:30the reason having no contamination is useful is because it lets researchers understand
00:02:33if AI can actually learn new things, like learning to code. This model doesn't even know what
00:02:38a computer is, as far as it's concerned it's a human that does computation. That's why it's
00:02:43pretty impressive that when given a few examples of Python programs, it could actually learn
00:02:47to write new ones, even if they were mostly simple one-line programs. When using some basic
00:02:52HumanEval Python tests and providing it with random Python functions in its context, it
00:02:56actually managed to pass these tests a few times when given 100 chances. In this case
00:03:01Torquay correctly understood that to make the decode function, it simply needed to swap an
00:03:05addition for a subtraction. While pretty simple, that shows an understanding of inverse functions
00:03:09and that's new knowledge. It hasn't seen any of these before. If you try the same thing
00:03:13on a modern LLM of a similar size, it's going to massively outperform Torquay, but that was
00:03:18pretty obvious. Another thing that researchers can test with these vintage models is the ability
00:03:22to predict the future. Now this might get a little unhinged, but let's ask it, will there
00:03:26be another war? So it says it doesn't think there'll be another war in Europe and I don't
00:03:31have the heart to tell it just how wrong it is. I'm also curious now what it thinks of
00:03:35a certain Austrian man. Yeah, this one is pretty terrible. It says it's going to be an extraordinary
00:03:39personality and also down here that Germany is going to be a far more efficient administration
00:03:44when he's leading it. Honestly, no words, no. Now, researchers obviously do this in a way
00:03:50more scientific way than the way that I just did it. They actually calculated the surprisingness
00:03:54of short descriptions of historical events taken from the New York Times "On This Day" feature.
00:03:59During this, you can see an increase in surprisingness after the 1931 knowledge cutoff and it spikes
00:04:04into the fifties and sixties. So it's found that those events are pretty unbelievable.
00:04:09Research like this lets them see how forecasting performance improves with model size and also
00:04:13decays along the horizons. You can also use this to test if models are able to come up
00:04:17with new ideas because you can ask it about patents or papers that were created after 1931
00:04:22and see if it can discover them on its own. Overall, these vintage models might let researchers
00:04:26distinguish how much a model's behavior is actually from the data set that it's trained
00:04:30on and talking about training, making a pure 1931 model is no easy feat. One of the hardest
00:04:36problems is temporal leakage, aka you don't want information from after 1931 to be included.
00:04:41Torquay, for example, does seem to have a few cases of this where it knows who the president
00:04:45was in 1936 and also a few of the policies that he created. There are several ways this
00:04:50could have happened. For example, when dealing with modern scans of the
00:04:54older documents, a scan could have had the wrong metadata attached to it, so it had the wrong
00:04:58date. There can also be post hoc additions like editorial introductions and footnotes.
00:05:03That data filtering is actually something they're still working on and refining to try and
00:05:07eliminate the problem. The other problem they had is that documents from 1931 look like this
00:05:11and while OCR is pretty damn good these days, a model trained on OCR text only achieved 30% of the performance of
00:05:16a model trained on human-transcribed versions of the same text. So that is going to be a
00:05:21lot of manual work and while they did use some regex to clean up some of the common mistakes,
00:05:25this only brought its performance up to 70%. So they're actually working on a new vintage
00:05:30OCR system. Finally, it's also a massive challenge to post train a vintage model because there
00:05:35is no ready made post training data for them and if they use a modern one, they'll end up
00:05:39baking in some of the knowledge, style and expressions of what a modern day chat assistant
00:05:43should be like. So the researchers actually had to make their own from old etiquette manuals,
00:05:48letter writing manuals, cookbooks, dictionaries, encyclopedias and even poetry and fables. Once
00:05:53you've done that though, you also need to train the model on instruction following and
00:05:56actually conversing. So you need to use some reinforcement learning. But the problem with
00:06:00that is that the model starts to adjust its output to whatever the judge wants and we don't
00:06:04have a judge from the 1930s. So they actually used Claude Sonnet 4.6. The issue is though
00:06:10a modern model likes its answers in the format of a 2026 chatbot. So styles like listicles apparently did
00:06:15start to leak into the 1931 model. This is pretty much unavoidable but they are
00:06:21hoping in the future that they can use vintage based models as the judges to fix this. They're
00:06:25actually already training a GPT-3 level vintage model and they say that a trillion tokens of
00:06:29historical text is actually achievable, which would put them at the original ChatGPT capability
00:06:35just with 1930s knowledge. So go and try out Torquay and let me know in the comments if
00:06:39you get any weird or unhinged responses. While you're down there, subscribe, and as always see
00:06:43you in the next one.
00:06:45[Music]

Key Takeaway

Training the 13 billion parameter Torquay model only on pre-1931 text creates a largely uncontaminated 'clean room' for testing whether an AI can pick up new skills, such as applying inverse-function logic in code, rather than recalling answers memorized from the modern web.

Highlights

  • Torquay is a 13 billion parameter vintage language model trained exclusively on 260 billion tokens of historical English text from before 1931.

  • The end-of-1930 knowledge cutoff avoids modern data contamination, letting researchers test whether the model can learn new skills such as Python coding from a few in-context examples.

  • Measured 'surprisingness' of historical events rises after the 1931 knowledge cutoff and spikes for events in the 1950s and 1960s.

  • Data filtering for vintage models faces challenges from temporal leakage caused by modern editorial footnotes or incorrect metadata on digital scans.

  • A model trained on OCR text from period documents reaches only 30% of the performance of one trained on human-transcribed versions of the same text; regex clean-up raises this to 70%, and a new vintage OCR system is in development.

  • Researchers are already training a GPT-3 level vintage model and say one trillion tokens of historical text is achievable, which would put a vintage model at the capability of the original ChatGPT using only pre-1931 knowledge.
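
The regex clean-up of OCR text noted in the highlights can be pictured with a small sketch. The source does not say which mistakes the researchers targeted, so the two rules below (rejoining words hyphenated across line breaks and collapsing runs of spaces) and the function name `clean_ocr` are illustrative assumptions only.

```python
import re


def clean_ocr(text: str) -> str:
    """Apply two generic, assumed clean-up rules to OCR output."""
    # Rejoin a word split by a hyphen at the end of a scanned line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs left behind by column layouts.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_ocr("scien-\ntific   jour-\nnals"))
```

A rule this blunt also merges genuine compounds that happen to break at a line end, which is one reason regex alone may only recover part of the lost quality.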

Timeline

Predictions from a pre-1931 perspective

  • A model trained on pre-1931 text predicts a 2026 world where all wars have ceased and people have mastered the art of flying.
  • The vintage dataset lacks knowledge of World War II and the modern definition of a computer.
  • Torquay utilizes 13 billion parameters to simulate the cultural and scientific outlook of the early 20th century.

The model views the year 2026 through the optimistic lens of early 20th-century progress. It anticipates a world that is inhabited all over, where people are honest and good neighbors, robbers are unknown, and healing is a perfected art. These predictions stem from a dataset of historical newspapers, patents, scientific journals, and books that stops at the US copyright cutoff at the end of 1930.

Research utility of vintage language models

  • Vintage models sidestep the contamination problem in which modern AI models may simply have memorized answers from the web, some of it AI-generated.
  • The absence of modern definitions causes the model to define the 'internet' as an internal revenue tax on consumption.
  • Torquay identifies 1930s slang terms like 'bosh', 'gammon', and 'ribaldry' instead of modern vernacular.

Modern large language models (LLMs) are hard to evaluate because their training data may already contain the answers to the tests. With a model restricted to pre-1931 text, a correct answer to a modern problem is much more likely to come from reasoning than from memorization. This isolation provides a baseline for testing whether the model can handle concepts entirely outside its training data.

Learning new skills without training data

  • Torquay successfully learns to write simple Python programs when provided with a few examples in its context window.
  • The model demonstrates an understanding of inverse functions by correctly swapping addition for subtraction in a decode task.
  • A few HumanEval tests pass within 100 attempts even though the model has never encountered code in its 260-billion-token training set.

Because the pre-1931 training data contains no programming languages, any successful code generation is evidence of few-shot, in-context learning. The model correctly identifies that a decoding function requires the opposite mathematical operation of an encoding function. It underperforms modern models of similar size by a wide margin, but the result suggests that a model can derive new logic from a handful of examples.
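
To make the encode/decode point concrete, here is a minimal sketch of a shift-cipher pair in which the decoder differs from the encoder only by swapping an addition for a subtraction. The source does not show the exact task, so `encode_shift`, `decode_shift`, and the shift size `SHIFT = 5` are assumptions. The `pass_at_k` helper is the commonly used pass@k estimator for "passed c of n samples"; whether the researchers scored the 100 attempts this way is also an assumption.

```python
from math import comb

SHIFT = 5  # assumed shift size; the source does not state the actual task


def encode_shift(text: str) -> str:
    """Shift each lowercase letter forward by SHIFT, wrapping around the alphabet."""
    return "".join(chr((ord(ch) - ord("a") + SHIFT) % 26 + ord("a")) for ch in text)


def decode_shift(text: str) -> str:
    """Invert encode_shift: the only change is '+ SHIFT' becoming '- SHIFT'."""
    return "".join(chr((ord(ch) - ord("a") - SHIFT) % 26 + ord("a")) for ch in text)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes, given c of n passed."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    secret = encode_shift("vintage")
    print(secret, decode_shift(secret))
    print(pass_at_k(n=100, c=3, k=100))
```

With 100 attempts per problem, a model that succeeds even a few times counts as having solved it, which is why a handful of passes is still a meaningful signal for a model that has never seen code.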

Forecasting and historical surprisingness

  • Researchers use the model to measure the statistical 'surprisingness' of historical events taken from the New York Times.
  • The model wrongly predicts peace in Europe and describes a 1930s political figure in favorable terms.
  • Forecasting performance analysis reveals how AI predictions decay as the time horizon moves further away from the training cutoff.

The model's predictions reflect the outlook of its pre-1931 sources, such as the belief that another war in Europe was unlikely. By scoring short descriptions of real historical events from after the cutoff, researchers can quantify how 'unbelievable' those events are to a model stuck in the past. Researchers can also test whether the model can come up with new ideas by checking if it independently arrives at patents or papers created after 1931.
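
One plausible way to turn 'surprisingness' into a number is average surprisal: the negative log-probability the model assigns to each token of an event description, averaged and grouped by decade. The source does not give the exact metric, so this is a sketch under that assumption; `mean_surprisal_bits` and `surprisal_by_decade` are hypothetical names, and the token log-probabilities are assumed to come from whatever model is being scored.

```python
import math
from collections import defaultdict


def mean_surprisal_bits(token_logprobs):
    """Average negative log-probability per token, converted from nats to bits."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / (len(token_logprobs) * math.log(2))


def surprisal_by_decade(events):
    """Group (year, token_logprobs) pairs by decade and average their surprisal."""
    buckets = defaultdict(list)
    for year, token_logprobs in events:
        buckets[year // 10 * 10].append(mean_surprisal_bits(token_logprobs))
    return {decade: sum(vals) / len(vals) for decade, vals in sorted(buckets.items())}


if __name__ == "__main__":
    # Made-up log-probabilities purely to exercise the functions.
    toy_events = [
        (1925, [math.log(0.5)] * 4),
        (1955, [math.log(0.25)] * 4),
    ]
    print(surprisal_by_decade(toy_events))
```

Plotting the per-decade averages against time would show the kind of rise after the cutoff that the researchers describe.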

Technical hurdles in historical AI training

  • Temporal leakage occurs when modern footnotes or incorrectly dated digital scans enter the training set.
  • Post-training requires custom-made datasets from old etiquette manuals and cookbooks to avoid modern chatbot styles.
  • Reinforcement learning faces a 'judge' problem because modern AI judges like Claude Sonnet 4.6 tend to force 1930s models into modern listicle formats.

Maintaining a 'pure' 1931 model is difficult due to the nature of digital archives; for instance, the model occasionally knows who the president was in 1936 and some of his policies, possibly because of scans with incorrect date metadata or later editorial additions. To prevent the model from sounding like a modern assistant, researchers had to source instruction-tuning data from period-appropriate sources like old dictionaries and fables. Current efforts focus on scaling to one trillion tokens and using vintage models as judges to reduce the stylistic leakage of the 21st-century web.
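
A simple heuristic for spotting temporal leakage is to flag any document whose text mentions a year after the cutoff, regardless of what its metadata claims. This is only a sketch of one possible filter, not the researchers' pipeline; `CUTOFF_YEAR`, `years_after_cutoff`, and `needs_review` are hypothetical names.

```python
import re

CUTOFF_YEAR = 1930  # last year allowed in the corpus, per the copyright cutoff
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")  # four-digit years 1500-2099


def years_after_cutoff(text: str, cutoff: int = CUTOFF_YEAR) -> list:
    """Return the sorted, de-duplicated years in the text that fall after the cutoff."""
    return sorted({int(y) for y in YEAR_PATTERN.findall(text) if int(y) > cutoff})


def needs_review(text: str, cutoff: int = CUTOFF_YEAR) -> bool:
    """Flag a document for manual review if it mentions any post-cutoff year."""
    return bool(years_after_cutoff(text, cutoff))


if __name__ == "__main__":
    sample = "Printed 1929; editor's note added 1936 and 1987."
    print(years_after_cutoff(sample), needs_review(sample))
```

A flag is a prompt for review rather than proof of leakage: genuine pre-1931 text can speculate about future years, as the model's own forecast for 2026 shows.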
