00:00:00But whether it was over the last 10,
00:00:0320, or 30 years,
00:00:04how many profitable patterns do you think existed during that time?
00:00:09The answer is infinite. And I can actually prove it.
00:00:12Hello, everyone. This is Wall Street Guy.
00:00:21Today, we're going to dive into the world of quant investing.
00:00:24I've received many comments and emails
00:00:26asking me to explain quant trading.
00:00:29Given my future goals and the channel's curriculum,
00:00:33I didn't plan on covering algorithmic trading
00:00:36or quant investing anytime soon.
00:00:38However, I noticed that about 15% of my subscribers
00:00:40are already practicing quant trading,
00:00:44and I've recently developed some concerns about the field,
00:00:48which is why I decided to make this video.
00:00:50Today, we'll start with a general overview
00:00:53of quant classifications and principles,
00:00:56and then discuss 10 things to watch out for in quant trading.
00:00:59If you keep these 10 points in mind,
00:01:04you can avoid wasting a significant amount of time
00:01:06and prevent yourself from using flawed backtesting methodologies
00:01:09that could lead to massive losses.
00:01:11I believe this can prevent many of those cases.
00:01:15These points are the absolute basics, yet it seems
00:01:18that many expensive paid courses on the market
00:01:21don't actually cover them in detail.
00:01:24In fact, many of those courses tend to over-glamorize
00:01:27backtesting and quant investing.
00:01:31By remembering these 10 things, you'll be able to stay safe
00:01:35and protect yourself regardless of what information you hear
00:01:39or which services you use.
00:01:43Before we begin, I must admit I made a mistake
00:01:47in a short, somewhat heated post I wrote a few days ago.
00:01:49I think I overreacted a bit.
00:01:51I wrote that the viral marketing and exaggerated ads
00:01:54for quant investing have gone too far.
00:01:55But then, some people started leaving malicious comments
00:01:59targeting specific individuals or companies.
00:02:02I deleted the post because I didn't want to cause any trouble.
00:02:06To be honest, what they're doing isn't illegal like
00:02:09unauthorized signal rooms or shady brokerage accounts.
00:02:12So, I might just be meddling in others' business.
00:02:16However, with illegal accounts, people usually know
00:02:19it's wrong but get involved out of greed.
00:02:22They bear some of that responsibility themselves.
00:02:25But the current discourse around quant investing
00:02:28worries me because it targets everyday people
00:02:33who are simply trying to work hard and improve their finances.
00:02:35They enter the world of quant investing with good intentions,
00:02:37but they could end up getting hurt.
00:02:40Because “quant” is associated with keywords like “science” and “statistics,”
00:02:46it can mislead people into seeing it as foolproof, even when it's not.
00:02:51Illegal rooms and accounts are obviously shady,
00:02:55so you can avoid them if you want to.
00:02:56But here, well-meaning people who are trying their best can become victims.
00:03:01Claims like “Anyone can become a quant expert in a few days,”
00:03:04or “This strategy is proven by decades of data,”
00:03:08or implying that a 20% compound annual return over 10 years
00:03:11guarantees the same performance in the future...
00:03:14While these might just be offhand remarks without malicious intent,
00:03:18novice investors can easily be misled.
00:03:20They might mistake these claims for absolute truth,
00:03:23waste countless hours on backtesting,
00:03:25and eventually suffer significant financial losses.
00:03:27This happens when you have blind faith in backtesting results.
00:03:32In fact, under SEC regulations in the U.S.,
00:03:35marketing a fund that way is strictly illegal.
00:03:38I wish those discussing quant investing
00:03:41would feel the weight of other people's money a bit more.
00:03:45I don't know how long I'll be doing YouTube,
00:03:47but I'm not saying this to play the “good guy.”
00:03:51I'm saying this because I struggled a lot with money
00:03:52in my mid-twenties, so I know how it feels.
00:03:56Since I often talk about managing the psychology of loss
00:03:58and share stories about my own failures in my youth,
00:04:01I seem to receive a lot of related inquiries.
00:04:05Every single week, I receive several emails
00:04:09from subscribers who have lost hundreds of thousands of dollars
00:04:14asking for advice.
00:04:16I believe YouTubers covering finance, stocks, and real estate
00:04:20should practice some self-reflection at times.
00:04:24Lately, while doing my “80-Day Investment Journey,”
00:04:26I felt like I was starting to sound like a “signal room” leader.
00:04:29I thought I should return to my original purpose once this downturn ends.
00:04:33Anyway, that's why I'm making this video.
00:04:37I'm not trying to attack any specific person or company.
00:04:40Those marketing quant investment
00:04:43might be unaware of certain aspects themselves.
00:04:46My goal is for all of us to recognize these issues
00:04:49and work toward improving them.
00:04:51So, I hope viewers refrain from mentioning specific names
00:04:55or turning this into a call-out in the comments.
00:04:57That was a long introduction, so let's get into
00:04:58the classifications of quant.
00:05:01The term “quant” is defined very broadly.
00:05:04For convenience, if we categorize it by time horizons,
00:05:07first, there is High-Frequency Trading (HFT).
00:05:10Specifically, what we call “Ultra HFT.”
00:05:12This involves co-locating servers near the exchange,
00:05:14coding at the machine language level,
00:05:19and focusing heavily on hardware.
00:05:20That's the level we're talking about there.
00:05:22Next, with a slightly longer time horizon,
00:05:24is algorithmic trading.
00:05:28This involves using technical indicators or rule-based systems.
00:05:29This is popular among individual investors
00:05:33and is becoming more accessible through backtesting platforms.
00:05:35Then we have Statistical Arbitrage,
00:05:39including things like pairs trading.
00:05:41This uses statistical models and techniques
00:05:42to identify historical patterns,
00:05:44operating on the assumption of mean reversion.
00:05:46Next is Factor Investing.
00:05:48This has a longer-term outlook and involves factors like
00:05:50momentum, value, and carry.
00:05:52It seeks to identify factors that drive prices and find alpha.
00:05:54And a hot topic in recent years
00:05:59is “Quantamental.”
00:06:01This involves quantifying and automating fundamental analysis,
00:06:03incorporating various data analysis and alternative data
00:06:06for long-term investment strategies.
00:06:07In the same vein, things like machine learning,
00:06:10Big Data, and alternative data
00:06:12are expanding into various fields.
00:06:16These categories are just for convenience,
00:06:18and the boundaries are often quite blurred.
00:06:20Some people might refer to this whole spectrum
00:06:23simply as algorithmic trading.
00:06:26I'll be discussing general quant trading
00:06:28by grouping these all together.
00:06:30The fundamental principles of quant trading are:
00:06:31First, you need an investment idea or hypothesis.
00:06:33Second, you perform backtesting.
00:06:35This means testing your idea or hypothesis
00:06:37against historical data.
00:06:40You think, “If I do this, I can make money,”
00:06:42so you check if that specific approach
00:06:44actually worked in the past.
00:06:47If backtesting yields good returns,
00:06:50you move to live trading
00:06:51while implementing risk management.
00:06:54The process follows these four stages.
00:06:56Until the mid-2010s, quant trading
00:06:57was essentially the exclusive domain of institutions,
00:07:00specifically quant funds staffed with PhDs in STEM fields.
00:07:01But since then, it has spread beyond those quant funds
00:07:03into execution and other areas.
00:07:06Moreover, services like Quantopian in the U.S.
00:07:09made backtesting much easier,
00:07:13allowing individuals to easily access quant trading.
00:07:16It's a growing trend.
00:07:18However, misunderstandings about quant investing
00:07:21are also on the rise.
00:07:23For example, someone might say,
00:07:25“Over the last 15 years, investing in companies with a PBR under 0.9
00:07:28and rising prices over the last 12 months yielded 20.2% annually.”
00:07:30Then they tweak the PBR slightly
00:07:33and see returns of 14% or 17.8%.
00:07:35So, looking at the backtesting results,
00:07:38they conclude that because the first one was best,
00:07:40they should invest using those specific rules.
00:07:42I see this kind of conclusion quite often.
00:07:46But this is actually a bad example.
00:07:48If you think about it carefully,
00:07:51the backtesting process is based on the unproven assumption
00:07:53that past patterns will repeat in the future.
00:07:56It's just finding patterns that were profitable in the past.
00:07:58But whether it was over the last 10,
00:08:0120, or 30 years,
00:08:03how many profitable patterns do you think
00:08:04existed during that time?
00:08:07If you pause the video and think about it,
00:08:09the answer is infinite.
00:08:12And that's actually provable.
00:08:14Because the parameters for various strategies are continuous,
00:08:16there are effectively an infinite number of profitable strategies.
00:08:18But the real question is: how many will remain profitable in the future?
00:08:21This is the true “holy grail” of quant.
00:08:24Anyone can find a pattern that worked in the past
00:08:26if they have the right backtesting tools.
00:08:29But finding something that worked in the past
00:08:30AND will continue to work in the future
00:08:32is incredibly difficult.
00:08:34It's like finding a needle in a haystack.
00:08:36When I looked through various Korean blogs and sites,
00:08:38I noticed that Joel Greenblatt's “Magic Formula”
00:08:42is very famous.
00:08:46He wrote about a very simple formula
00:08:50that selects stocks based on things like
00:08:52market cap and other filters.
00:08:55This Magic Formula became a huge hit
00:08:56and became well-known among individual investors.
00:09:00Now, this person is legendary in the hedge fund world.
00:09:02He's been investing since the 1980s,
00:09:04and during that period, he actually recorded
00:09:07higher returns than Warren Buffett.
00:09:09That's why the Magic Formula received so much attention.
00:09:40To give you the bottom line first,
00:09:42Greenblatt isn't actually a quant,
00:09:44and his hedge fund didn't invest using only the Magic Formula.
00:09:47Those high returns weren't solely from the Magic Formula.
00:09:50His fund focused on value investing,
00:09:52but it also engaged in “special situation” investing.
00:09:54That involves things like spin-offs,
00:09:57for example, when a company splits off from another,
00:09:59allowing him to identify price discrepancies
00:10:01and gain an “edge” in those specific areas.
00:10:04He combined those methodologies.
00:10:07And for the value investing portion, I doubt he used
00:10:10such a simplistic formula alone.
00:10:12Of course, it likely reflected that framework,
00:10:14but I don't believe he generated those returns
00:10:18by just mechanically buying based on the formula.
00:10:20If we backtest the returns of the Magic Formula
00:10:22since it was made public in 2005,
00:10:26the gray line is the S&P index,
00:10:28and the green line is the Magic Formula.
00:10:29As you can see, after some high volatility,
00:10:32it has consistently underperformed.
00:10:34These types of results are similar to
00:10:37what you'd see from systematic equity ETFs.
00:10:40You could say that as the market becomes more efficient,
00:10:42that specific edge has vanished.
00:10:44As we can see from the performance of such a famous formula,
00:10:48finding patterns that were profitable in the past is very easy.
00:10:50You can even write a book about them.
00:10:53However, finding a pattern that will be profitable in the future
00:10:56requires an immense amount of work.
00:11:00Strategies that yield 20% annual returns with just
00:11:03a few days of thought and a few clicks simply don't exist.
00:11:06Another example is Quantopian.
00:11:08Quantopian was a startup founded around 2011,
00:11:12and it was a platform that made backtesting very easy in the US.
00:11:16300,000 people ran 12 million backtests,
00:11:20testing and creating countless quant strategies there.
00:11:24Steve Cohen, the famous billionaire
00:11:27hedge fund trader, even invested.
00:11:29The top-tier quants at Quantopian
00:11:32even published papers on
00:11:34which of these strategies would perform well in the future,
00:11:37investigating what criteria or statistical methods
00:11:40should be used to filter them.
00:11:41They researched this very intensively
00:11:44to select the best strategies
00:11:46with the idea of running a new hedge fund.
00:11:48That was the vision,
00:11:49but it failed miserably.
00:11:51In the end, it shut down last year.
00:11:53Why do these things happen?
00:11:55And for those of you looking to get into quant trading,
00:11:58how can you avoid such a result?
00:12:02Of course, you can't avoid it perfectly.
00:12:03And I believe it's an incredibly difficult task,
00:12:07but if you still want to take on the challenge,
00:12:10I want you to keep at least these 10 things in mind
00:12:12and be very cautious.
00:12:13I'll go through them one by one.
00:12:16If you just remember these 10 points,
00:12:17you should be able to avoid wasting time on bad backtests
00:12:22and potentially suffering losses.
00:12:24Though, doing a good backtest doesn't guarantee profit.
00:12:27First, you must always question your data.
00:12:31Some people use data from Google or Yahoo,
00:12:34but that data is often incredibly “dirty.”
00:12:37Those starting quant trading from scratch
00:12:41will face many obstacles regarding data.
00:12:45Free data is often messy and full of errors.
00:12:47When it comes to the task of
00:12:50cleaning that data,
00:12:51you might think you just need to find the errors,
00:12:54but it actually involves more subjective judgment
00:12:57and bias than you might expect.
00:12:59Let me give you an example.
00:13:01Let's say a stock was trading between $41 and $43
00:13:05and then the market closed.
00:13:06But right around the closing bell,
00:13:08a trader made an order error,
00:13:11and a single share was traded at $28.
00:13:14Technically speaking,
00:13:16the low for that day is $28.
00:13:18That trader took a big loss due to a mistake,
00:13:21but the low has to be recorded as $28.
00:13:24That's the fact.
00:13:25So, how do you set the high and the low?
00:13:28If you remove that and set $41 as the low,
00:13:31you are essentially deleting a trade
00:13:34and a low that actually occurred.
00:13:36But if you don't remove it,
00:13:38let's say you're testing a strategy
00:13:40that places a buy order if the price
00:13:44drops more than 5% within 5 minutes.
00:13:45In a backtest,
00:13:47it might recognize that
00:13:48you bought the stock at $28.
00:13:51Then, it assumes you bought at $28
00:13:53and sold at the closing price of $42,
00:13:55recording an immediate profit.
00:13:58This could lead to the returns of the strategy
00:13:59being massively inflated.
00:14:01Of course, since it was only a single share, you could simply delete it.
00:14:03But what if that trader's mistake
00:14:06was for 10 shares, 100 shares, or even 10,000?
00:14:09Cases like that actually happen
00:14:11from time to time.
00:14:14There have been massive cases where
00:14:17tens of millions of dollars were lost,
00:14:20but smaller mistakes of 100 or 1,000 shares
00:14:21are more common than you'd think.
00:14:23Of course, in recent years,
00:14:24since algorithms
00:14:25handle most executions,
00:14:27safeguards have been put in place.
00:14:29So it's not as frequent as it used to be,
00:14:31but when you look at backtesting data
00:14:33from before algorithmic execution was common,
00:14:36like 2005 or 2011,
00:14:37if you go back that far,
00:14:39you'll see these cases quite often.
00:14:41So, how are you going to handle that?
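There is no single right answer, but a common compromise is to neither silently delete nor silently keep such prints, and instead flag them for review. Below is a minimal Python sketch of that idea; the `flag_suspicious_lows` helper, its 20% threshold, and the toy bars are illustrative assumptions, not a standard recipe.

```python
import pandas as pd

def flag_suspicious_lows(bars, threshold=0.20):
    """Flag bars whose low sits more than `threshold` below the close.

    Purely a heuristic: a flagged bar may be a fat-finger print (like the
    single $28 trade in the example) or a genuine flash move, so each flag
    still needs a judgment call - which is exactly the subjective part.
    """
    deviation = (bars["close"] - bars["low"]) / bars["close"]
    return deviation > threshold

# Hypothetical daily bars: the third day carries the $28 fat-finger low.
bars = pd.DataFrame({
    "close": [42.0, 41.5, 42.3],
    "low":   [41.0, 40.8, 28.0],
})
print(flag_suspicious_lows(bars).tolist())  # [False, False, True]
```

Whatever you decide to do with the flagged bars, the point is that the decision is made explicitly and consistently, not buried inside the data feed.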
00:14:43Also, there are products traded
00:14:44on multiple exchanges.
00:14:45In those cases,
00:14:47you need to know if the data
00:14:49from all those various exchanges
00:14:50has been cleanly consolidated
00:14:52for the highs, lows, and volume.
00:14:53Or are you backtesting
00:14:56with incomplete trading data
00:14:57pulled from only a few exchanges?
00:14:59If the data cost is cheap,
00:15:01that's a distinct possibility.
00:15:02Also, when calculating MDD,
00:15:04do you use the low price or the closing price?
00:15:05For example, when backtesting
00:15:07a monthly rebalancing strategy,
00:15:09some use daily data
00:15:11but only look at the closing price.
00:15:13But in reality,
00:15:14to calculate true drawdowns,
00:15:15you also have to look at
00:15:17the intraday drawdown.
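To make the close-vs-low point concrete, here is a small sketch with made-up bars. The `max_drawdown` helper is an illustrative implementation, not a standard library function: measured against the running peak of closes, the close-only number understates the drawdown an investor actually sat through intraday.

```python
import pandas as pd

def max_drawdown(close, low=None):
    """Maximum drawdown measured against the running peak of closes.

    If intraday lows are supplied, the trough uses them; with closes
    only, the intraday dip is invisible and the drawdown looks milder.
    """
    peak = close.cummax()
    trough = low if low is not None else close
    return float(((trough - peak) / peak).min())

# Hypothetical bars: day 3 dipped to 93 intraday but closed back at 102.
close = pd.Series([100.0, 105.0, 102.0, 104.0])
low = pd.Series([98.0, 101.0, 93.0, 101.0])

print(round(max_drawdown(close), 4))       # -0.0286 from closes only
print(round(max_drawdown(close, low), 4))  # -0.1143 using intraday lows
```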
00:15:18These small details matter,
00:15:20like when backtesting with futures,
00:15:21how do you handle the rollover
00:15:22for products with expiration dates?
00:15:24Or in many backtests,
00:15:26they create a continuous futures dataset
00:15:27to run the test,
00:15:29but how is the rollover
00:15:31being treated?
00:15:33There are so many issues
00:15:34beyond just these.
00:15:35Have you really thought about
00:15:37these data problems?
00:15:38If you're using a backtesting service,
00:15:39are you just trusting that
00:15:40the provider did a good job with the data?
00:15:42You need to verify these things,
00:15:44because data issues cause
00:15:47far more errors than you'd think,
00:15:51often distorting the backtesting results.
00:15:53Another major data-related issue
00:15:57is survivorship bias.
00:15:59It's one of the most representative errors in backtesting.
00:16:01Look at this illustration—
00:16:04I'm not sure if it's WWI or WWII,
00:16:06but the Air Force wanted to reinforce their planes.
00:16:08They wanted to figure out
00:16:10where to add extra armor plating.
00:16:12To determine this,
00:16:16engineers examined all the planes
00:16:18that returned from dogfights
00:16:20to see where they had been shot the most.
00:16:21They found that certain areas
00:16:24took the most hits,
00:16:26so they concluded that
00:16:28they should make the armor thicker there.
00:16:29But that was a huge mistake.
00:16:33Because the planes that were hit
00:16:34in the other areas—
00:16:36like the engine or the cockpit—
00:16:38all crashed and never returned.
00:16:40It's a great example of how dangerous
00:16:42it is to draw conclusions
00:16:42based only on the data you have.
00:16:44In stock investing, survivorship bias means,
00:16:46for example, looking back and thinking,
00:16:49“I'd be rich if I bought Apple and Microsoft in the 80s.”
00:17:03So, with that thought,
00:17:05let's say you build a strategy to buy those kinds of tech stocks.
00:17:08But actually, back in the 80s,
00:17:10there were more than 30 companies
00:17:13just as promising as Apple or Microsoft.
00:17:14And 28 of them ended up disappearing.
00:17:17Only two of them survived.
00:17:19Even though only these two made it,
00:17:22people look at them and think,
00:17:23“If I invest like that now, I'll strike it rich.”
00:17:27So, if you use only currently surviving companies
00:17:30as your subjects for backtesting,
00:17:32the returns will inevitably be inflated.
00:17:35And this obviously becomes a bigger problem
00:17:38as the backtesting period gets longer.
00:17:40Because over that long period of time,
00:17:41there must have been many companies that existed at the start
00:17:43but have since vanished.
00:17:45However, a lot of novice investors,
00:17:47when they start backtesting,
00:17:48first define their stock universe.
00:17:51When they decide which stocks
00:17:54they are going to test,
00:17:55they populate it with companies that exist today.
00:17:58Then, within that pool,
00:17:59they backtest with various criteria
00:18:02to judge how to pick
00:18:05the “good” companies.
00:18:07But if you do it that way,
00:18:08from the start of the backtest until now,
00:18:11all the companies that went bankrupt are excluded.
00:18:13The backtesting is done while assuming
00:18:16that you have some sort of god-like foresight.
00:18:18Naturally, the returns will be higher than reality.
00:18:21So, when backtesting,
00:18:23if you are testing a 20-year period,
00:18:25you should start with the companies that existed in 2001
00:18:29and use them
00:18:30as your initial scope.
00:18:32That's what I wanted to mention.
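That idea can be sketched in a few lines of Python. Everything here is hypothetical (the ticker names and dates are made up): the key is that the universe is built from listing records as of the backtest's start date, so companies that later vanished are kept in and companies that didn't exist yet are kept out.

```python
from datetime import date

# Hypothetical listing records: (listing date, delisting date or None).
listings = {
    "AAPL-ish": (date(1980, 12, 1), None),
    "MSFT-ish": (date(1986, 3, 1), None),
    "GoneCorp": (date(1985, 1, 1), date(2003, 6, 1)),  # later went bankrupt
    "LateCorp": (date(2015, 5, 1), None),              # listed after the start
}

def universe_as_of(listings, as_of):
    """Stocks actually tradable on `as_of`, including ones that died later.

    Building the universe from point-in-time listings, instead of from
    today's survivors, is the basic fix for survivorship bias.
    """
    return sorted(
        name for name, (listed, delisted) in listings.items()
        if listed <= as_of and (delisted is None or delisted > as_of)
    )

# A 20-year backtest starting in 2001 must include GoneCorp
# and exclude LateCorp, which didn't exist yet.
print(universe_as_of(listings, date(2001, 1, 1)))
# ['AAPL-ish', 'GoneCorp', 'MSFT-ish']
```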
00:18:33As a side note,
00:18:34the “super ants” (famous big retail investors) you see on YouTube
00:18:37might also be subject to survivorship bias.
00:18:40While some became super ants through pure skill,
00:18:43others might have taken massive risks,
00:18:45buying a huge stake in a single stock,
00:18:48and if that stock took off,
00:18:49they became a super ant.
00:18:51But there were probably
00:18:5330 or 50 other people who did the exact same thing.
00:18:55Out of those 50 people who took high risks,
00:18:58only one survived,
00:18:59and viewers are only looking at that one person.
00:19:02This could also be a matter of survivorship bias.
00:19:05So, if you look at them now
00:19:06and think, “I should be like that too,”
00:19:08and dive into extremely high-risk investments,
00:19:11it's not a guaranteed path to success;
00:19:13you'd have to be that lucky 1 out of 50.
00:19:17Simply being aware of these biases
00:19:20allows for more rational and wise investing.
00:19:22When using backtesting platforms,
00:19:24you are essentially delegating the data issues
00:19:27and survivorship bias problems I mentioned
00:19:28entirely to that company.
00:19:31Quite naively.
00:19:32But you have to wonder if that company
00:19:33really addressed these issues
00:19:35with extreme rigor,
00:19:37truly worrying about the users' actual returns
00:19:39in the real world,
00:19:41and invested significant capital
00:19:43to clean up the data.
00:19:45You definitely need to verify those points.
00:19:48The second thing to watch out for
00:19:50is look-ahead bias,
00:19:52which is the mistake of using information from the future.
00:19:54If I were to give it a rough name,
00:19:57maybe “future-sight bias”?
00:19:58That's one way to interpret it.
00:20:00It means relying on information that was unattainable at the time of a trade.
00:20:03Since backtesting uses past data,
00:20:05imagine simulating a trade from last year:
00:20:07chronologically, certain information simply didn't exist yet at that point.
00:20:09But it's quite common to find cases
00:20:12where the logic is built to trade last year
00:20:14while referencing that future information.
00:20:15That is what we call look-ahead bias.
00:20:18A representative mistake of this kind would be,
00:20:21for example, as of this month, September 2021,
00:20:24it's hard to backtest all Korean stocks,
00:20:27so let's just do 100.
00:20:29That's what a user might think.
00:20:30So they pick the top 100 KOSPI companies by market cap
00:20:34and run a backtest on them.
00:20:35Say, a strategy to buy if the PER is at a certain level.
00:20:38They do that,
00:20:39and after backtesting for 10 years,
00:20:41the returns look fantastic.
00:20:42But what went wrong?
00:20:44You picked the top 100 KOSPI stocks as of September 2021.
00:20:50You only selected those stocks,
00:20:51but if you backtest for 10 years starting from 2011,
00:20:55it's like you already knew in 2011
00:20:59which companies would be in the top 100 in 2021.
00:21:01Being in the top tier of market cap
00:21:03essentially means that the stock price has risen steadily.
00:21:06Even if people are careful about other things,
00:21:08they often overlook this when they decide
00:21:11to just pick a few hundred stocks by market cap.
00:21:12They think that way
00:21:14and make a lot of mistakes.
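A sketch of the fix, with three made-up stocks and made-up market caps: rank by market cap as of the backtest's start date, not as of today. Ranking by today's caps quietly hands the 2011 version of you a list of 2021's winners.

```python
from datetime import date

# Hypothetical market caps (in billions) at two points in time.
caps = {
    "Winner": {date(2011, 9, 1): 1.0, date(2021, 9, 1): 50.0},
    "Steady": {date(2011, 9, 1): 5.0, date(2021, 9, 1): 6.0},
    "Faded":  {date(2011, 9, 1): 8.0, date(2021, 9, 1): 0.5},
}

def top_n_as_of(caps, as_of, n):
    """Rank by market cap on a given date - use the backtest START date."""
    return sorted(caps, key=lambda s: caps[s][as_of], reverse=True)[:n]

# Universe built from today's ranking: full of stocks that already won.
print(top_n_as_of(caps, date(2021, 9, 1), 2))  # ['Winner', 'Steady']

# Universe you could actually have known in 2011 - a different list.
print(top_n_as_of(caps, date(2011, 9, 1), 2))  # ['Faded', 'Steady']
```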
00:21:15Another example is
00:21:17when backtesting with fundamental financial data.
00:21:21Each company releases its quarterly earnings
00:21:24on different dates.
00:21:26But consider whether rebalancing
00:21:29or trading occurs
00:21:31after those reports are actually released.
00:21:33A company might report its earnings early the following month,
00:21:36but you rebalanced at the end of the previous month
00:21:40already knowing that information.
00:21:41You're trading while already knowing the future.
00:21:44That kind of thing can slip into a backtest.
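One way to guard against this is to key every fundamental figure to the date it actually became public and only ever look backward from the simulated trade date. The sketch below uses hypothetical report dates and EPS numbers; the `latest_known_eps` helper is illustrative.

```python
from datetime import date

# Hypothetical quarterly EPS, keyed by the date each report became public.
reports = [
    (date(2021, 4, 28), 1.40),  # Q1 figures, released at the end of April
    (date(2021, 7, 27), 1.55),  # Q2 figures, released at the end of July
]

def latest_known_eps(reports, trade_date):
    """Return the most recent EPS that was *public* on trade_date.

    Keying on the quarter-end date instead of the release date is
    exactly the look-ahead bias described above.
    """
    known = [eps for released, eps in reports if released <= trade_date]
    return known[-1] if known else None

# Rebalancing on July 1: the Q2 report isn't out yet, so only Q1 counts.
print(latest_known_eps(reports, date(2021, 7, 1)))  # 1.4
print(latest_known_eps(reports, date(2021, 8, 1)))  # 1.55
```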
00:21:46One more example would be,
00:21:48say you're trading based on closing prices.
00:21:50You assume that
00:21:52and do daily rebalancing,
00:21:54but the closing price is info you only get after the day is over.
00:21:57Yet, if you set the backtest
00:22:00to execute the order 5 minutes before the market closes,
00:22:03in that way, in terms of timing,
00:22:05you're gaining knowledge of the future,
00:22:07and a bias can occur.
00:22:09The third point is extremely important.
00:22:11Avoiding overfitting.
00:22:13I cannot overemphasize this.
00:22:16Overfitting is when
00:22:18you make a model perform excessively well
00:22:19only on the given sample data.
00:22:23For example, here is a sample.
00:22:25What we really want to know
00:22:27is the population behind it.
00:22:29We want to estimate
00:22:32the actual overall population,
00:22:34and in case some of you
00:22:36don't know what a population is,
00:22:38to explain it briefly,
00:22:40let's say we're doing a poll
00:22:41on an election result.
00:22:44If we survey every single citizen,
00:22:46that would be a perfect poll with 100% accuracy.
00:22:48But since we can't survey everyone,
00:22:50we take a sample from the population.
00:22:53We select a portion of the people and assume that sample represents
00:22:58the population behind it.
00:22:59We assume it's representative and make an estimation.
00:23:02So, the actual population data behind this
00:23:06would have a certain distribution,
00:23:08and we pull a few samples from that
00:23:10to estimate what the population might look like.
00:23:16This is an attempt to fit a model to that shape,
00:23:20but fitting a model means
00:23:22finding a trend line where the error
00:23:25between the sample and the model is minimized.
00:23:30Lines like these.
00:23:30But as you can see, if you fit a very wiggly,
00:23:34complex model like this,
00:23:37the error on the sample data is zero.
00:23:39It touches every single sample point.
00:23:41So, for this sample, it's a perfect,
00:23:44zero-error model.
00:23:47But is this a model that accurately represents the population?
00:23:51Probably not.
00:23:51If you pull a new sample, the error will be quite large.
00:23:54So you have to fit it appropriately
00:23:58so that when new samples come in,
00:24:00the sum of those errors remains small.
00:24:03On the other hand, if you fit
00:24:06an overly simple straight line,
00:24:08that's an “underfit,” meaning it's under-optimized.
00:24:10In that case, the error is large even on the sample.
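The wiggly-line picture can be reproduced numerically. The sketch below (synthetic data, made up for illustration) fits a straight line and a degree-9 polynomial to the same 10 noisy points from a linear population: the wiggly model achieves near-zero error on the sample it memorized, but on fresh data from the same population it falls apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_sample(n):
    # The "population": a noisy straight line y = 2x + noise.
    x = rng.uniform(-1, 1, n)
    return x, 2 * x + rng.normal(0, 0.3, n)

x_train, y_train = draw_sample(10)   # the small sample we get to see
x_test, y_test = draw_sample(200)    # fresh data from the same population

def train_test_error(degree):
    coefs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coefs, x) - y) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

simple_train, simple_test = train_test_error(1)  # the straight line
wiggly_train, wiggly_test = train_test_error(9)  # the wiggly curve

# The degree-9 curve passes through every training point (near-zero error),
# yet on fresh data it does far worse than on the points it memorized.
print(simple_train, simple_test)
print(wiggly_train, wiggly_test)
```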
00:24:13So, the most important thing in any modeling
00:24:16is to optimize it just right,
00:24:18but when many people backtest,
00:24:20they treat past data as their sample data.
00:24:24And on that sample data,
00:24:26they try to maximize returns within that specific sample
00:24:29by throwing in all sorts of rules
00:24:32to drive the returns as high as possible.
00:24:35For example, backtesting data from 2015 to 2021 might show
00:24:39that “if PER is between 13.75 and 17.23,
00:24:43market cap is between 51.7 and 62.3 billion won,
00:24:46and you buy stocks with a PBR of 1.17 or less,
00:24:50an annual return of 70% is possible.”
00:24:52This is the kind of backtesting result you might get.
00:24:54As you can tell, this is a clear case of overfitting.
00:24:57It is over-optimized.
00:24:58Perhaps a company with a PER of 17.24 that performed poorly
00:25:04was included in this specific dataset,
00:25:05or maybe there was a company with a market cap of 51.5 billion won
00:25:09that was a bad example, so the parameters were set this way.
00:25:12When you look only at sample or past data and try to be that specific
00:25:16just to maximize the returns at any cost,
00:25:19you end up with a model like this.
00:25:21Then, when actual data from that distribution appears in the future,
00:25:25the margin of error becomes massive.
00:25:27That is the point here.
00:25:28Let's take a look in more detail.
00:25:29This is another example of over-optimization.
00:25:31Suppose we want to find a line
00:25:34that separates the red dots from the blue dots.
00:25:36That's our model.
00:25:37Now, the black line represents a well-learned model,
00:25:40but that squiggly green line...
00:25:42based on the blue and red dots you see right now,
00:25:46it separates them perfectly.
00:25:48So, within this specific sample data,
00:25:50it's a perfect line with zero error.
00:25:52However, in the actual underlying population,
00:25:55blue dots might appear around here,
00:25:57and red dots might start appearing over there.
00:25:59When new data comes in in the future,
00:26:03we can assume this green line will have a lot of errors.
00:26:05That's a fair assumption.
00:26:07So, if you fit your model too closely to past data,
00:26:10it won't work in the future.
00:26:11Here is a similar example.
00:26:13Suppose detailed personal data was collected on last year's students, along with their grades.
00:26:15The goal is to predict which of this year's 100 students
00:26:16will have the best grades based on similar data.
00:26:19If you look at last year's top students and see things like
00:26:20their last name is Jung, or their height is in a certain range,
00:26:22and you over-optimize your identification rules
00:26:23based on those specific details from last year,
00:26:26and then apply that to this year's students,
00:26:28it could turn out to be completely absurd.
00:26:30Instead, if you set a rule based on
00:26:32students who study more than a certain number of hours,
00:26:34and apply it to last year's students,
00:26:37the accuracy might be lower than the hyper-specific rules.
00:26:39However, even though the accuracy is a bit lower,
00:26:42there's a high probability it will still be just as accurate
00:26:44when applied to this year's students.
00:26:45So, how can we mitigate this over-optimization problem?
00:26:47Every backtest has some degree of over-optimization,
00:26:49and it's impossible to eliminate it entirely.
00:26:53For instance, how do we know if a strategy that performed well
00:26:56over the last 5 years will be valid for the next 3 years?
00:27:00The perfect answer to that question
00:27:01is to actually trade it for 3 years.
00:27:06But that's after the fact.
00:27:08If you trade for 3 years and lose money,
00:27:11the test was pointless, right?
00:27:12One method is using “Out of Sample” data.
00:27:15This involves using data outside of your initial sample.
00:27:17It's commonly referred to as OOS data.
00:27:17For example, finding a strategy that works well
00:27:19on 6 years of data from Sept 2015 to Sept 2021,
00:27:21and then starting to trade it in Oct 2021 is a bad idea.
00:27:23Instead of doing that,
00:27:25you use 6 years of data from Sept 2014 to Sept 2020
00:27:27to find a high-performing strategy.
00:27:28Then, you backtest that strategy one more time
00:27:31on the data from Oct 2020 to Sept 2021.
00:27:33In other words, you find the best strategy from the 6-year period,
00:27:34pretend you started trading it in Oct 2020,
00:27:38and backtest it for that one additional year.
00:27:39If those results are good,
00:27:42then you start live trading in Oct 2021.
00:27:44Of course, splitting the data like this
00:27:46creates other problems,
00:27:49but we'll deal with those in a bit.
00:27:52The point I'm trying to convey right now is,
00:27:55if you have this much sample data,
00:27:57you set aside a portion of it.
00:28:02You set it aside,
00:28:04use the rest of the data to find a strategy,
00:28:06run many backtests, and optimize it.
00:28:09But instead of going straight to live trading,
00:28:10you take that data you didn't use to find the strategy,
00:28:12imagine it's the real world, and test it there.
00:28:13That is what we call using out-of-sample (OOS) data.
00:28:16In data science, terms like training data, validation data,
00:28:18test data, or development data are used.
00:28:19The terminology itself isn't that important.
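As a rough sketch of this split (the dates are the ones from the example above; the daily series itself is just a synthetic calendar of trading days):

```python
from datetime import date, timedelta

# Split point from the example: train through Sept 2020, hold out the final year.
start, split, end = date(2014, 9, 1), date(2020, 10, 1), date(2021, 9, 30)

trading_days = []
d = start
while d <= end:
    if d.weekday() < 5:            # skip weekends
        trading_days.append(d)
    d += timedelta(days=1)

# Training data: Sept 2014 through Sept 2020 — used to find the strategy.
train = [d for d in trading_days if d < split]
# Out-of-sample data: Oct 2020 through Sept 2021 — touched only at the end.
out_of_sample = [d for d in trading_days if d >= split]

print(len(train), "training days,", len(out_of_sample), "OOS days")
```

The key property is that the two sets never overlap, and the OOS set lies strictly after the training set in time.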
00:28:21Point number 4 follows from point number 3:
00:28:23“The opportunity for validation happens only once.”
00:28:24This is incredibly, incredibly important.
00:28:26I really cannot emphasize this enough.
00:28:28It is such a critical concept.
00:28:30Let's dive deeper into this out-of-sample testing.
00:28:31Regarding sample data and out-of-sample data,
00:28:33there are various names for them,
00:28:34but for this video,
00:28:35I will stick to “training data” and “validation data.”
00:28:38In the previous example,
00:28:39the data from 2014 to 2020 is the training data.
00:28:41Training data is the data used to find the strategy.
00:28:42After the strategy is found,
00:28:44we validate it.
00:28:45So, we'll call that one year of backtesting
00:28:46the “validation data.”
00:28:48Now, what this graph shows
00:28:50is the complexity of the rules or the model.
00:28:53As you move to the right,
00:28:58the model becomes much more complex.
00:29:01Like defining a rule for a range
00:29:03of exactly 173cm to 173.25cm.
00:29:04The more you do that,
00:29:06the higher the complexity goes.
00:29:08Then, this axis is the prediction error.
00:29:09It represents how much error occurs
00:29:11when put into actual practice.
00:29:12As you can see,
00:29:13in the training sample (the training data),
00:29:16the more complex the model,
00:29:18the more the error decreases.
00:29:19Like the example where we had dots
00:29:20and used a squiggly, complex line.
00:29:22By making it complex,
00:29:24we could eliminate the error entirely within that sample.
00:29:26So, if you make a model incredibly complex,
00:29:28the error converges toward zero.
00:29:30However, if you take that trained model
00:29:32and test it on the validation data we set aside,
00:29:35what happens to the error?
00:29:36Initially, when the model is very simple,
00:29:38like a straight line,
00:29:40or when it's underfitted,
00:29:42the errors are similar.
00:29:44But as the model or rules become more complex,
00:29:45while the error in the training data
00:29:47continues to decrease,
00:29:49the error in the validation data
00:29:50hits a floor and then starts to increase
00:29:52as soon as it becomes overly complex.
00:29:53To use an analogy with backtesting,
00:29:54if you run countless backtests,
00:29:55set very detailed rules,
00:29:58test them over and over,
00:29:59and fine-tune
00:30:02parameters very precisely,
00:30:03like setting a specific PER (price-to-earnings ratio) value,
00:30:05the more complex you make it,
00:30:06the higher the returns in the past data will be.
00:30:08Since this is an error graph, lower is better.
00:30:12Basically, a backtest that is fitted to past data
00:30:14will show better returns the more you fit it.
00:30:16But when you apply this to reality,
00:30:18if you've made it excessively complex,
00:30:19there comes a point where a more complex rule
00:30:21leads to lower returns in practice.
00:30:23That's how it works.
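That U-shaped validation curve can be reproduced in a few lines of Python. The data here is a synthetic linear relationship plus noise, not market data, and the polynomial degrees stand in for "model complexity":

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "past data": an underlying linear relationship plus noise.
x = np.linspace(0.0, 1.0, 24)
y = 2.0 * x + rng.normal(0.0, 0.3, size=x.size)

# Interleave the points into a training half and a validation half.
x_tr, y_tr = x[0::2], y[0::2]
x_va, y_va = x[1::2], y[1::2]

def errors(degree):
    # Mean squared error on training and validation data for one model size.
    coefs = np.polyfit(x_tr, y_tr, degree)
    train_err = float(np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2))
    valid_err = float(np.mean((np.polyval(coefs, x_va) - y_va) ** 2))
    return train_err, valid_err

for degree in (1, 3, 9):
    tr, va = errors(degree)
    print(f"degree {degree}: train error {tr:.4f}, validation error {va:.4f}")
```

As complexity grows, the training error keeps shrinking, while the error on the held-out points eventually gets worse, exactly the pattern on the graph.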
00:30:24By the way, I equated lower error
00:30:26with better returns,
00:30:28and higher error with worse returns.
00:30:31Strictly speaking,
00:30:33a larger error is slightly different
00:30:35from lower returns.
00:30:37The more you overfit a backtest,
00:30:40the more the gap between the backtested return
00:30:42and the future return, which is the error,
00:30:45will grow.
00:30:47That error could randomly be
00:30:51either higher
00:30:52or lower.
00:30:55But generally, when such an error occurs,
00:30:56live returns tend to be worse.
00:30:59Because when you were fitting it to past data,
00:31:02you were fitting it to push returns
00:31:05as high as possible.
00:31:08So if there is an error relative to that return,
00:31:12it will likely be on the downside.
00:32:03Then, how should we split the data
00:32:06into training and validation sets for backtesting?
00:32:08For example, taking 11 years of data from 2011 to 2021,
00:32:11training on it, and applying it starting next year—
00:32:15that means you aren't using a separate validation set.
00:32:18You're using everything as training data, then going live,
00:32:21which is not recommended.
00:32:22The splitting method I mentioned earlier
00:32:25would be taking 10 years as training data,
00:32:28using the final year, 2021, for validation,
00:32:31and then applying the strategy from 2022.
00:32:34But as I'll explain in a moment,
00:32:36this isn't necessarily the best way either.
00:32:38So what are some improved methods?
00:32:40There is a method called Walk-Forward Testing.
00:32:43What this does is,
00:32:44for instance, you take 3 years starting from '99,
00:32:46train and optimize your parameters there,
00:32:49validate the results over the following year,
00:32:52and then roll that window forward.
00:32:55If you establish a strategy using this method,
00:32:58even with a very simple model—
00:33:01though I think backtesting based solely on PER
00:33:04is quite nonsensical—
00:33:05let's assume a strategy of buying stocks below a certain PER.
00:33:08Based on 10 years of historical data,
00:33:11if you optimize the PER threshold,
00:33:13the ideal criteria would differ for every single year,
00:33:17so you'd end up picking an average value that works okay.
00:33:20But if you narrow the scope,
00:33:22you can set the PER value based on the last 3 years
00:33:26and trade accordingly.
00:33:28By testing this way, you can adjust
00:33:30the parameters more flexibly over time.
00:33:32That's how this type of testing works.
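A toy version of that rolling loop might look like this. The `yearly_return` function is a made-up stand-in for a real PER-threshold backtest (its "optimal" threshold drifts over time so the example is self-contained); the 3-year window and the threshold grid are assumptions, not recommendations.

```python
import statistics

def yearly_return(year, threshold):
    # Stand-in for a real backtest of "buy below this PER" in a given year.
    best_threshold = 10 + (year % 3)          # the optimum drifts between 10 and 12
    return -abs(threshold - best_threshold)   # 0 at the optimum, worse further away

years = list(range(1999, 2011))
oos_returns = []

for i in range(len(years) - 3):
    train_window = years[i : i + 3]           # optimize on 3 years...
    test_year = years[i + 3]                  # ...validate on the following year
    best = max(range(5, 16),
               key=lambda t: sum(yearly_return(y, t) for y in train_window))
    oos_returns.append(yearly_return(test_year, best))
    # ...then the window rolls forward by one year and we repeat.

print("walk-forward OOS returns:", oos_returns)
print("average:", statistics.mean(oos_returns))
```

Each parameter choice is made using only data available before the year it is tested on, which is the whole point of walking the window forward.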
00:33:35You can use that approach,
00:33:37or there's K-Fold CV,
00:33:38which stands for Cross-Validation.
00:33:39How this works is,
00:33:41the 'K' refers to the number of groups you divide the data into.
00:33:45Looking at the diagram, let's say K is 5.
00:33:47If you set K to 5, you split the data into 5 equal parts.
00:33:50You train on 4 years worth of data,
00:33:53then check the returns on the remaining 1 year of validation data.
00:33:56Then you train on a different set of 4 years
00:33:59and validate it against the remaining year.
00:34:01You repeat this for all five folds
00:34:05and then average the five validation returns.
00:34:09The idea is that this average represents
00:34:12the return you can actually expect.
00:34:13Alternatively, if you're using 10 years of data,
00:34:16some people train on even-numbered years
00:34:19and validate on odd-numbered years.
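As a sketch, the 5-fold averaging could look like the following. The `backtest` function is a placeholder with made-up returns (it pretends 2020 was a losing year), standing in for training on four years and measuring the fifth:

```python
# Five one-year blocks, so K = 5.
years = [2017, 2018, 2019, 2020, 2021]

def backtest(train_years, held_out_year):
    # Placeholder: pretend we tuned on train_years and measured held_out_year.
    return -0.12 if held_out_year == 2020 else 0.08

fold_returns = []
for held_out in years:
    train = [y for y in years if y != held_out]   # train on the other four years
    fold_returns.append(backtest(train, held_out))

expected_return = sum(fold_returns) / len(fold_returns)
print(f"average validation return across the 5 folds: {expected_return:.3f}")
```

Each year serves as validation data exactly once, and the average over the folds is the return estimate.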
00:34:22All these methods have their pros and cons,
00:34:23but a major advantage of these approaches
00:34:26is that the parameters stay stable during market regime changes.
00:34:30What I mean by that is,
00:34:31when a financial crisis or COVID-19 hits,
00:34:33the fundamental nature of the market changes.
00:34:35For example, when the 2008 financial crisis hit,
00:34:39a strategy trained only on data from 1998 to 2007
00:34:43to find the best returns
00:34:45wouldn't have worked in 2008,
00:34:46because the market's nature had shifted.
00:34:49The distribution of data changes,
00:34:51and the patterns from the past
00:34:52won't reflect the new market environment.
00:34:55So, by splitting the data in these ways,
00:34:57even when major events occur
00:35:00and change market properties and patterns,
00:35:02you can validate your strategy more reliably.
00:35:06That's why these methods are used,
00:35:08but you must be careful about “looking into the future.”
00:35:11You have to be very cautious about that.
00:35:13It depends on your trading frequency,
00:35:16but if you're trading on a monthly basis,
00:35:18and the training data
00:35:19includes info from 2014,
00:35:22depending on what rules or data you used in 2013,
00:35:26things that wouldn't be known until 2014
00:35:28could leak into the validation data.
00:35:30Then the returns for that validation data would be inflated.
00:35:34Because you've already trained by looking into the future.
00:35:36You must be extremely careful with this part.
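Here is a minimal illustration of that leak with monthly data. The figures and the one-month reporting lag are assumptions for illustration only:

```python
months   = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
earnings = [5, 6, 4, 7, 8, 9]   # figures that only become public one month later

# WRONG: trading in month t on earnings[t] — in real time that number
# was not yet published, so the backtest silently looks into the future.
leaky_signal = earnings

# RIGHT: lag each figure by its reporting delay before forming the signal.
safe_signal = [None] + earnings[:-1]

for m, leak, safe in zip(months, leaky_signal, safe_signal):
    print(m, "leaky:", leak, "safe:", safe)
```

The leaky version will show inflated validation returns because the model was effectively told the future; the lagged version only ever uses what was knowable at trade time.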
00:35:39I've been speaking quite broadly,
00:35:41but in fields like Machine Learning,
00:35:44there's a concept called hyperparameters.
00:35:46Generally, parameters are things adjusted by the model itself
00:35:50to reduce the error in the sample data,
00:35:54whereas hyperparameters are things a person must decide.
00:35:57For example, in regression analysis,
00:35:59you decide whether to use a straight line or a curve.
00:36:03Basically, how complex the formula
00:36:07or the model will be—
00:36:09that is a human decision.
00:36:11So the number of parameters and such are hyperparameters.
00:36:15Once those are set,
00:36:18the model fits the line
00:36:22in a way that optimizes the data's error.
00:36:23Things like the slope or the intercept
00:36:28are what the model learns, and these are called parameters.
00:36:33So you have to try various hyperparameters as well.
00:36:36Instead of just splitting into train and test data,
00:36:40we often add another split called 'dev data'.
00:36:42You perform your optimization there—
00:36:45you optimize the hyperparameters on the dev set
00:36:48and then validate with the test data.
00:36:51Those familiar with machine learning will already understand this,
00:36:55and if you don't know it, this brief explanation won't be enough,
00:36:58so I'll just move on.
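For those who do want to see the structure anyway, here is a compact sketch of the three-way split on synthetic data (the split proportions and the candidate degrees are arbitrary choices): the model fits the coefficients (the parameters) on the training set, we pick the degree (the hyperparameter) on the dev set, and the test set is scored exactly once at the end.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic series, split chronologically into train / dev / test.
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.size)

x_tr, y_tr = x[:36], y[:36]        # training: fit the parameters (coefficients)
x_dev, y_dev = x[36:48], y[36:48]  # dev: choose the hyperparameter (degree)
x_te, y_te = x[48:], y[48:]        # test: looked at exactly once at the end

def mse(coefs, xs, ys):
    return float(np.mean((np.polyval(coefs, xs) - ys) ** 2))

# Fit the coefficients on training data for each candidate degree...
fits = {d: np.polyfit(x_tr, y_tr, d) for d in range(1, 8)}
# ...then pick the degree on the dev set, never touching the test set.
best_degree = min(fits, key=lambda d: mse(fits[d], x_dev, y_dev))

print("degree chosen on dev data:", best_degree)
print("one-shot test error:", round(mse(fits[best_degree], x_te, y_te), 4))
```

The test error is reported once and is never used to go back and re-pick the degree.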
00:37:00However, when doing this work, there is one thing
00:37:04that is so important it can't be overemphasized.
00:37:08It's about the validation data.
00:37:10You must NEVER, EVER look at the validation data twice.
00:37:15Specifically, the results.
00:37:16You train on the training set and backtest many times to find a high-return strategy, right?
00:37:22At that point, you've found something that performs well on training data,
00:37:26but to see if it will actually be good in reality,
00:37:31you run it against a period or dataset that was never used for training.
00:37:38You must never run this more than once.
00:37:41Run it exactly once, and if the returns are bad,
00:37:45no matter how many years you worked or how much effort you put in,
00:37:50you must scrap the entire strategy.
00:37:52Why? Because in the real world, you only get one shot at profit or loss.
00:37:57You can't turn back time.
00:37:58Despite this, people feel bad that the validation results were poor,
00:38:03so they go back to the training data, tweak the parameters,
00:38:07and run it again until the validation returns look good.
00:38:10The moment you do that, it's no longer validation data;
00:38:14it has effectively become training data.
00:38:16You've optimized the parameters including the validation set.
00:38:21Consequently, for this strategy,
00:38:26we can't guarantee how it will perform in the real world.
00:38:29That's why this point is so critical.
00:38:31Another important thing in backtesting—related to this—
00:38:34is the concept of 'Market Regime' and how times change.
00:38:37Let me ask you a question.
00:38:39Between 20 years of backtesting and 3 years,
00:38:42which one is more meaningful?
00:38:44I've already given away the answer in the title,
00:38:47but many beginners think that the longer the backtest,
00:38:50and the more data you have, the better.
00:38:54But for me, between these two,
00:38:57it depends on the time horizon and trading frequency,
00:39:00but generally,
00:39:01I would choose the 3-year backtest.
00:39:03Having more data is generally better,
00:39:06but it must come from the same distribution.
00:39:09More data doesn't help
00:39:11if it's mixed with data from an environment that has already changed.
00:39:17The problem with long backtests
00:39:20is that the nature of the market changes.
00:39:22I think this graph is real returns...
00:39:26anyway, it's a graph related to interest rates.
00:39:28As you can see, the accepted “fair interest rate”
00:39:33fluctuates within a regime,
00:39:34but the baseline level shifts drastically between regimes.
00:39:38At this point, it was here—maybe the oil shock?
00:39:41Anyway, after that period, it was here,
00:39:45and then after the 1980s,
00:39:47this became the generally accepted interest rate level.
00:39:51Now, imagine you're trading bonds,
00:39:53and you train a strategy within this period
00:39:57to use it over here.
00:39:59If the market regime has changed,
00:40:02the profitable strategy you built on that training data
00:40:07won't work here.
00:40:08That's what we call a Market Regime Change.
00:40:11A shift in the market's nature or system.
00:40:14Market shifts can happen
00:40:17due to changes in the market players.
00:40:20For example, after COVID, there was a massive influx of retail investors,
00:40:23leading to events like the GameStop saga.
00:40:25Before COVID,
00:40:27short-selling strategies—
00:40:30there are hedge funds that specialize in short-selling—
00:40:32used to work very well.
00:40:34But with the sudden change in market nature,
00:40:37some were even driven to bankruptcy.
00:40:39Then there are changes in policy and regulation. After the financial crisis,
00:40:43proprietary trading was banned for investment banks,
00:40:45and various regulations changed the derivatives market.
00:40:49Strategies trained on data
00:40:50from before the financial crisis
00:40:52likely wouldn't work well afterward.
00:40:54There are also exogenous events,
00:40:55like the oil shock, which are massive,
00:40:57transformative events for the market,
00:40:59macroeconomic in nature.
00:41:01Then there are other macroeconomic shifts.
00:41:03As debt ratios steadily climbed,
00:41:06interest rates that used to be at a certain level
00:41:08transitioned into an era of ultra-low rates.
00:41:11and quantitative easing also played a role
00:41:13in contributing to these low interest rates,
00:41:15causing growth stocks to suddenly outperform
00:41:17massively over the past 10 years.
00:41:19But if you found a profitable strategy
00:41:22using training data from before quantitative easing,
00:41:24it might involve buying things like value stocks.
00:41:25Then, naturally, over the next 10 years,
00:41:27the performance would have been very poor.
00:41:28Other factors include the emergence of new technologies
00:41:30or changes in industrial structure,
00:41:32things of that nature.
00:41:33So, when backtesting for 20 years,
00:41:35is data from 2001 really meaningful?
00:41:38Of course, whether a market regime change matters
00:41:40depends on which factors you are looking at.
00:41:43Ultimately, it depends on the logic,
00:41:45the rules of the strategy, or which elements
00:41:47and data the model
00:41:49is observing and using.
00:41:51Based on those factors,
00:41:52you have to see how the regime
00:41:53of that data changes.
00:41:55For some data,
00:41:56the characteristics change very quickly,
00:41:58even on a monthly basis.
00:41:59Others might remain
00:42:01quite stable for 10 or 15 years.
00:42:03Since the cycles for each are different,
00:42:05generally speaking,
00:42:07it doesn't mean that just because COVID-19 happened,
00:42:09all previous patterns
00:42:09become completely meaningless.
00:42:12However, if you use 20 years worth
00:42:14of data like that,
00:42:15there will definitely be some issues.
00:42:17You can look at it that way.
00:42:18And if you use very old data to make inferences,
00:42:20keep in mind that even if the market regime
00:42:22changed at some point,
00:42:23regimes can cycle back again.
00:42:25So data from the distant past
00:42:29that resembles the current environment
00:42:30might actually be usable.
00:42:32That is why some people say
00:42:33the 1940s and the present day are similar.
00:42:35But that's just a side note.
00:42:37Anyway, quant trading
00:42:38has become very common,
00:42:41and even individuals do it now.
00:42:42But when it comes to long-term investing,
00:42:44the pitfall of quant investing is that
00:42:45when applying these quantitative techniques
00:42:47to long-term investments,
00:42:49it is very difficult to avoid regime changes
00:42:51while trying to secure a lot of data.
00:42:53For example, let's say there is an algorithmic
00:42:55trading strategy that uses minute-by-minute data.
00:42:57In one hour,
00:42:59there are 60 data points.
00:43:01Since there are 60 minutes,
00:43:02you get 60 data points,
00:43:03and let's say it's a futures contract
00:43:04that trades 24 hours a day.
00:43:05If you multiply that by 24,
00:43:10you get 1,440 data points per day.
00:43:10So with 1,440 points a day,
00:43:12and roughly 250 trading days a year,
00:43:15you secure about 360,000 data points
00:43:17in just one year.
00:43:23Because you can gather that many data points in a year,
00:43:26you have enough significant data
00:43:29to perform validation
00:43:32and even use more complex models.
00:43:33You can do that.
00:43:35But consider a rebalancing strategy
00:43:36that trades on a monthly basis.
00:43:37You only get 12 data points a year.
00:43:39Even over 20 years,
00:43:41that's only 240 points.
00:43:42Since you can't increase the data count on the time axis,
00:43:44you try to secure significance
00:43:47by expanding the scope
00:43:49to include various individual stocks.
00:43:51But ultimately, on the time axis,
00:43:53it is difficult to avoid regime changes.
00:43:54These aspects are extremely challenging.
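For reference, the sample-count arithmetic behind this comparison, written out:

```python
# Minute bars on a 24-hour futures market, assuming ~250 trading days a year.
minutes_per_day = 60 * 24
trading_days_per_year = 250

minute_bars_per_year = minutes_per_day * trading_days_per_year
monthly_points_in_20_years = 12 * 20

print(minute_bars_per_year)            # minute bars in a single year
print(monthly_points_in_20_years)      # monthly observations across 20 years
```

One year of minute data gives 360,000 observations; twenty years of monthly rebalancing gives only 240, which is why the long-horizon strategy cannot buy statistical significance on the time axis without crossing regime changes.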
00:43:57That is why, after COVID-19 hit,
00:43:58many quants—specifically,
00:44:00a person named Inigo Fraser-Jenkins,
00:44:02who I believe is the Head of Quant at a very famous firm—
00:44:05explained “Why I am no longer a quant.”
00:44:09The gist of his message was that
00:44:11a quant's job is to predict the future based on past patterns,
00:44:13but when something like COVID-19 happens,
00:44:15past patterns become useless.
00:44:19When a market regime change occurs,
00:44:20there is very little a quant can do.
00:44:23People even talk about
00:44:25an “existential crisis” for quants.
00:44:28And quants had a very rough time last year.
00:44:30While some did well,
00:44:31on average, it was very, very bad.
00:44:34I think we are about halfway through now.
00:44:36An hour and a half has already passed,
00:44:38so we will wrap up Part 1 here.
00:44:40Tomorrow, in Part 2, we will cover items 6 through 10,
00:44:43discussing strengths and limitations,
00:44:45and then a curriculum for studying quant finance.
00:44:49We will cover those topics.
00:44:50I will see you in Part 2.
00:44:52Thank you.