Transcript
00:00:00What if you could talk to someone from the 1930s? So, time travel? Well, I have and
00:00:05I asked them to predict what life would be like in 2026.
00:00:08I think that, in the year 2026, life will be pleasant and easy. In the first place,
00:00:13the world will be inhabited all over. For by that time, people will have discovered the
00:00:17art of flying. In the next place, all wars will have ceased, for nations will have learned
00:00:21to live in peace and amity with each other. We will have acquired knowledge in all sorts
00:00:25of sciences and be perfectly skilled in the art of healing. We will be honest and good
00:00:29neighbours and robbers will be unknown. In short, I think the world will, in the year
00:00:342026, be a very pleasant place to live in.
00:00:36I think those were some really nice predictions, so it's a bit of a shame only a few of them
00:00:40came true, and talking about being truthful, I may have lied a little bit. Sorry, but I
00:00:45didn't invent time travel. Instead, this was actually Torquay, which is a 13 billion
00:00:49parameter language model trained on pre-1931 text, so it doesn't even know that there's
00:00:54a second world war, and it gives some truly fascinating, wild and offensive answers, and
00:00:59can even be taught to code.
00:01:02So Torquay here is what's known as a vintage model, and Torquay appears to be the biggest
00:01:10of its kind. One of the researchers on this was actually the lead author on OpenAI's
00:01:14GPT research in 2018, which formed the very foundation of ChatGPT. He's also worked on
00:01:19DALL-E and Whisper, so that's a seriously impressive CV. What they've done is made a 13 billion
00:01:24parameter network trained exclusively on 260 billion tokens of historical English text,
00:01:30so old newspapers, patents, scientific journals and books. 1931 is the cut-off by the way,
00:01:36because the US copyright cut-off is the end of 1930, so they're not going to get sued
00:01:40by Big Mouse for doing this. And besides just being fun to talk to, these vintage models
00:01:44also serve a research purpose. Today, every big model like ChatGPT, Claude and Gemini,
00:01:49they're all trained on the modern web, which makes it almost impossible to figure out if
00:01:53an AI is actually thinking and reasoning, or if it's just memorised an answer from a random
00:01:57Reddit thread, and that Reddit thread was probably AI generated. So if we use a model that only
00:02:02knows pre-1931 text, it's not going to have any of that contamination. I mean let's try
00:02:07this out and ask it something like what is the internet? Here we can see it's seemingly
00:02:11got the internet confused with the internal revenue tax levied upon articles of consumption,
00:02:16so it has no idea what the modern internet is. Another fun one is to ask it what popular
00:02:20slang words are. Apparently in 1930 they were bosh, rot, bosh again, stuff, nonsense, fudge,
00:02:25gammon, humbug, brack, fla, and ribaldry. I hope I haven't said anything rude there. And
00:02:30the reason having no contamination is useful is because it lets researchers understand
00:02:33if AI can actually learn new things, like learning to code. This model doesn't even know what
00:02:38a computer is, as far as it's concerned it's a human that does computation. That's why it's
00:02:43pretty impressive that when given a few examples of Python programs, it could actually learn
00:02:47to write new ones, even if it was just mostly simple one line programs. When using some basic
00:02:52HumanEval Python tests and providing it with random Python functions in its context, it
00:02:56actually managed to pass these tests a few times when given 100 chances. In this case
00:03:01Torquay correctly understood that to make the decode function, it simply needed to swap an
00:03:05addition for a subtraction. While pretty simple, that shows an understanding of inverse functions
00:03:09and that's new knowledge. It hasn't seen any of these before. If you try the same thing
00:03:13on a modern LLM of a similar size, it's going to massively outperform Torquay, but that was
00:03:18pretty obvious. Another thing that researchers can test with these vintage models is the ability
00:03:22to predict the future. Now this might get a little unhinged, but let's ask it, will there
00:03:26be another war? So it says it doesn't think there'll be another war in Europe and I don't
00:03:31have the heart to tell it just how wrong it is. I'm also curious now what it thinks of
00:03:35a certain Austrian man. Yeah, this one is pretty terrible. It says it's going to be an extraordinary
00:03:39personality and also down here that Germany is going to be a far more efficient administration
00:03:44when he's leading it. Honestly, no words, no. Now, researchers obviously do this in a way
00:03:50more scientific way than the way that I just did it. They actually calculated the surprisingness
00:03:54of short descriptions of historical events taken from the New York Times "On This Day" feature.
00:03:59During this, you can see an increase in surprisingness after the 1931 knowledge cutoff and it spikes
00:04:04into the fifties and sixties. So it's found that those events are pretty unbelievable.
00:04:09Research like this lets them see how forecasting performance improves with model size and also
00:04:13decays over longer horizons. You can also use this to test if models are able to come up
00:04:17with new ideas because you can ask it about patents or papers that were created after 1931
00:04:22and see if it can discover them on its own. Overall, these vintage models might let researchers
00:04:26distinguish how much a model's behavior is actually from the data set that it's trained
00:04:30on and talking about training, making a pure 1931 model is no easy feat. One of the hardest
00:04:36problems is temporal leakage, aka you don't want information from after 1931 to be included.
00:04:41Torquay, for example, does seem to have a few cases of this where it knows who the president
00:04:45was in 1936 and also a few of the policies that he created. There are several ways this
00:04:50could have happened. For example, when dealing with modern scans of older documents,
00:04:54a scan could have just had the wrong metadata attached to it, so it had the wrong
00:04:58date. There can also be post hoc additions like editorial introductions and footnotes.
00:05:03That data filtering is actually something they're still working on and refining to try and
00:05:07eliminate the problem. The other problem they had is that documents from 1931 look like this
00:05:11and while OCR is pretty damn good these days, a model trained on OCR'd text only achieved 30% of the performance of
00:05:16a model trained on human transcribed versions of the same text. So that is going to be a
00:05:21lot of manual work and while they did use some regex to clean up some of the common mistakes,
00:05:25this only brought its performance up to 70%. So they're actually working on a new vintage
00:05:30OCR system. Finally, it's also a massive challenge to post train a vintage model because there
00:05:35is no ready made post training data for them and if they use a modern one, they'll end up
00:05:39baking in some of the knowledge, style and expressions of what a modern day chat assistant
00:05:43should be like. So the researchers actually had to make their own from old etiquette manuals,
00:05:48letter writing manuals, cookbooks, dictionaries, encyclopedias and even poetry and fables. Once
00:05:53you've done that though, you also need to train the model on instruction following and
00:05:56actually conversing. So you need to use some reinforcement learning. But the problem with
00:06:00that is that the model starts to adjust its output to whatever the judge wants and we don't
00:06:04have a judge from the 1930s. So they actually used Claude Sonnet 4.6. The issue is though
00:06:10a modern model likes its answers in the format of a 2026 chatbot. So modern styles like listicles apparently did
00:06:15start to leak into the 1931 model. This is pretty much unavoidable but they are
00:06:21hoping in the future that they can use vintage base models as the judges to fix this. They're
00:06:25actually already training a GPT-3 level vintage model and they say that a trillion tokens of
00:06:29historical text is actually achievable which would put them at the original ChatGPT capability
00:06:35just with 1930s knowledge. So go and try out Torquay and let me know in the comments if
00:06:39you get any weird or unhinged responses. While you're down there, subscribe, and as always see
00:06:43you in the next one.
00:06:45[Music]
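
Editor's sketch for the coding example at 00:03:01: the video says the model worked out that the decode function only needed an addition swapped for a subtraction. The snippet below is a minimal illustration of that idea, not the actual test from the benchmark or the model's output. The function names, the shift of 5 and the lowercase-only alphabet are all assumptions made for the example.

```python
# Illustrative only: names, shift size and alphabet are assumptions,
# not taken from the video or from the benchmark itself.

def encode_shift(text: str, shift: int = 5) -> str:
    """Move each lowercase letter forward by `shift` places, wrapping after 'z'."""
    return "".join(
        chr((ord(ch) - ord("a") + shift) % 26 + ord("a")) for ch in text
    )


def decode_shift(text: str, shift: int = 5) -> str:
    """Undo encode_shift: the same expression with the addition made a subtraction."""
    return "".join(
        chr((ord(ch) - ord("a") - shift) % 26 + ord("a")) for ch in text
    )


def round_trip_ok(decode) -> bool:
    """Check a candidate decoder by encoding a few words and decoding them again."""
    words = ("hello", "xyz", "vintage")
    return all(decode(encode_shift(word)) == word for word in words)


if __name__ == "__main__":
    print(encode_shift("abc"))          # shifted text
    print(round_trip_ok(decode_shift))  # a correct inverse passes the check
    print(round_trip_ok(encode_shift))  # shifting forward again does not
```

A round-trip check like `round_trip_ok` is one simple way to score many sampled answers, in the spirit of the "100 chances" mentioned in the video: a candidate counts as a pass only if decoding really recovers the original text.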