[Wall St. Uncle] The Truth About Quant Investing - Part 1: Why a Half-Baked Approach is Poisonous

Wall Street Uncle's Scientific Investing

Transcript

00:00:00But whether it was over the last 10,
00:00:0320, or 30 years,
00:00:04how many profitable patterns do you think existed during that time?
00:00:09The answer is infinite. And I can actually prove it.
00:00:12Hello, everyone. This is Wall Street Guy.
00:00:21Today, we're going to dive into the world of quant investing.
00:00:24I've received many comments and emails
00:00:26asking me to explain quant trading.
00:00:29Given my future goals and the channel's curriculum,
00:00:33I didn't plan on covering algorithmic trading
00:00:36or quant investing anytime soon.
00:00:38However, I noticed that about 15% of my subscribers
00:00:40are already practicing quant trading,
00:00:44and I've recently developed some concerns about the field,
00:00:48which is why I decided to make this video.
00:00:50Today, we'll start with a general overview
00:00:53of quant classifications and principles,
00:00:56and then discuss 10 things to watch out for in quant trading.
00:00:59If you keep these 10 points in mind,
00:01:04you can avoid wasting a significant amount of time
00:01:06and protect yourself from flawed backtesting methodologies
00:01:09that could lead to massive losses.
00:01:15These points are the absolute basics, yet it seems
00:01:18that many expensive paid courses on the market
00:01:21don't actually cover them in detail.
00:01:24In fact, many of those courses tend to over-glamorize
00:01:27backtesting and quant investing.
00:01:31By remembering these 10 things, you'll be able to stay safe
00:01:35and protect yourself regardless of what information you hear
00:01:39or which services you use.
00:01:43Before we begin, I must admit I made a mistake
00:01:47in a short, somewhat heated post I wrote a few days ago.
00:01:49I think I overreacted a bit.
00:01:51I wrote that the viral marketing and exaggerated ads
00:01:54for quant investing have gone too far.
00:01:55But then, some people started leaving malicious comments
00:01:59targeting specific individuals or companies.
00:02:02I deleted the post because I didn't want to cause any trouble.
00:02:06To be honest, what they're doing isn't illegal like
00:02:09unauthorized signal rooms or shady brokerage accounts.
00:02:12So, I might just be meddling in others' business.
00:02:16However, with illegal accounts, people usually know
00:02:19it's wrong but get involved out of greed.
00:02:22They bear some of that responsibility themselves.
00:02:25But the current discourse around quant investing
00:02:28worries me because it targets everyday people
00:02:33who are simply trying to work hard and improve their finances.
00:02:35They enter the world of quant investing with good intentions,
00:02:37but they could end up getting hurt.
00:02:40Because “quant” is associated with keywords like “science” and “statistics,”
00:02:46it can mislead people into seeing it as foolproof, even when it's not.
00:02:51Illegal rooms and accounts are obviously shady,
00:02:55so you can avoid them if you want to.
00:02:56But here, well-meaning people who are trying their best can become victims.
00:03:01Claims like “Anyone can become a quant expert in a few days,”
00:03:04or “This strategy is proven by decades of data,”
00:03:08or implying that a 20% compound annual return over 10 years
00:03:11guarantees the same performance in the future...
00:03:14While these might just be offhand remarks without malicious intent,
00:03:18novice investors can easily be misled.
00:03:20They might mistake these claims for absolute truth,
00:03:23waste countless hours on backtesting,
00:03:25and eventually suffer significant financial losses.
00:03:27This happens when you have blind faith in backtesting results.
00:03:32In fact, under SEC regulations in the U.S.,
00:03:35marketing a fund that way is strictly illegal.
00:03:38I wish those discussing quant investing
00:03:41would feel the weight of other people's money a bit more.
00:03:45I don't know how long I'll be doing YouTube,
00:03:47but I'm not saying this to play the “good guy.”
00:03:51I'm saying this because I struggled a lot with money
00:03:52in my mid-twenties, so I know how it feels.
00:03:56Since I often talk about managing the psychology of loss
00:03:58and share stories about my own failures in my youth,
00:04:01I seem to receive a lot of related inquiries.
00:04:05Every single week, I receive several emails
00:04:09from subscribers who have lost hundreds of thousands of dollars
00:04:14asking for advice.
00:04:16I believe YouTubers covering finance, stocks, and real estate
00:04:20should practice some self-reflection at times.
00:04:24Lately, while doing my “80-Day Investment Journey,”
00:04:26I felt like I was starting to sound like a “signal room” leader.
00:04:29I thought I should return to my original purpose once this downturn ends.
00:04:33Anyway, that's why I'm making this video.
00:04:37I'm not trying to attack any specific person or company.
00:04:40Those marketing quant investment
00:04:43might be unaware of certain aspects themselves.
00:04:46My goal is for all of us to recognize these issues
00:04:49and work toward improving them.
00:04:51So, I hope viewers refrain from mentioning specific names
00:04:55or turning this into a call-out in the comments.
00:04:57That was a long introduction, so let's get into
00:04:58the classifications of quant.
00:05:01The term “quant” is defined very broadly.
00:05:04For convenience, if we categorize it by time horizons,
00:05:07first, there is High-Frequency Trading (HFT).
00:05:10Specifically, what we call “Ultra HFT.”
00:05:12This involves co-locating servers near the exchange,
00:05:14coding at the machine language level,
00:05:19and focusing heavily on hardware.
00:05:20That's the level we're talking about there.
00:05:22Next, with a slightly longer time horizon,
00:05:24is algorithmic trading.
00:05:28This involves using technical indicators or rule-based systems.
00:05:29This is popular among individual investors
00:05:33and is becoming more accessible through backtesting platforms.
00:05:35Then we have Statistical Arbitrage,
00:05:39including things like pair trading.
00:05:41This uses statistical models and techniques
00:05:42to identify historical patterns,
00:05:44operating on the assumption of mean reversion.
00:05:46Next is Factor Investing.
00:05:48This has a longer-term outlook and involves factors like
00:05:50momentum, value, and carry.
00:05:52It seeks to identify factors that drive prices and find alpha.
00:05:54And a hot topic in recent years
00:05:59is “Quantamental.”
00:06:01This involves quantifying and automating fundamental analysis,
00:06:03incorporating various data analysis and alternative data
00:06:06for long-term investment strategies.
00:06:07In the same vein, things like machine learning,
00:06:10Big Data, and alternative data
00:06:12are expanding into various fields.
00:06:16These categories are just for convenience,
00:06:18and the boundaries are often quite blurred.
00:06:20Some people might refer to this whole spectrum
00:06:23simply as algorithmic trading.
00:06:26I'll be discussing general quant trading
00:06:28by grouping these all together.
00:06:30The fundamental principles of quant trading are:
00:06:31First, you need an investment idea or hypothesis.
00:06:33Second, you perform backtesting.
00:06:35This means testing your idea or hypothesis
00:06:37against historical data.
00:06:40You think, “If I do this, I can make money,”
00:06:42so you check if that specific approach
00:06:44actually worked in the past.
00:06:47If backtesting yields good returns,
00:06:50you move to live trading
00:06:51while implementing risk management.
00:06:54The process follows these four stages.
00:06:56Until the mid-2010s, quant trading
00:06:57was essentially the exclusive domain of institutions,
00:07:00specifically quant funds staffed with PhDs in STEM fields.
00:07:01But since then, quant methods have spread across the industry,
00:07:03in execution and other areas.
00:07:06Moreover, services like Quantopian in the U.S.
00:07:09made backtesting much easier,
00:07:13allowing individuals to easily access quant trading.
00:07:16It's a growing trend.
00:07:18However, misunderstandings about quant investing
00:07:21are also on the rise.
00:07:23For example, someone might say,
00:07:25“Over the last 15 years, investing in companies with a PBR under 0.9
00:07:28and rising prices over the last 12 months yielded 20.2% annually.”
00:07:30Then they tweak the PBR slightly
00:07:33and see returns of 14% or 17.8%.
00:07:35So, looking at the backtesting results,
00:07:38they conclude that because the first one was best,
00:07:40they should invest using those specific rules.
00:07:42I see this kind of conclusion quite often.
00:07:46But this is actually a bad example.
00:07:48If you think about it carefully,
00:07:51the backtesting process is based on the unproven assumption
00:07:53that past patterns will repeat in the future.
00:07:56It's just finding patterns that were profitable in the past.
00:07:58But whether it was over the last 10,
00:08:0120, or 30 years,
00:08:03how many profitable patterns do you think
00:08:04existed during that time?
00:08:07If you pause the video and think about it,
00:08:09the answer is infinite.
00:08:12And that's actually provable.
00:08:14Because the parameters for various strategies are continuous,
00:08:16there are effectively an infinite number of profitable strategies.
00:08:18But the real question is: how many will remain profitable in the future?
00:08:21This is the true “holy grail” of quant.
00:08:24Anyone can find a pattern that worked in the past
00:08:26if they have the right backtesting tools.
00:08:29But finding something that worked in the past
00:08:30AND will continue to work in the future
00:08:32is incredibly difficult.
00:08:34It's like finding a needle in a haystack.
00:08:36When I looked through various Korean blogs and sites,
00:08:38I noticed that Joel Greenblatt's “Magic Formula”
00:08:42is very famous.
00:08:46He wrote about a very simple formula
00:08:50that selects stocks based on things like
00:08:52market cap and other filters.
00:08:55This Magic Formula became a huge hit
00:08:56and became well-known among individual investors.
00:09:00Now, this person is legendary in the hedge fund world.
00:09:02He's been investing since the 1980s,
00:09:04and during that period, he actually recorded
00:09:07higher returns than Warren Buffett.
00:09:09That's why the Magic Formula received so much attention.
00:09:12But let me give you the conclusion first.
00:09:13Even this legendary formula
00:09:15has its limitations in the real world.
00:09:42Greenblatt isn't actually a quant,
00:09:44and his hedge fund didn't invest using only the Magic Formula.
00:09:47Those high returns weren't solely from the Magic Formula.
00:09:50His fund focused on value investing,
00:09:52but it also engaged in “special situation” investing.
00:09:54That involves things like spin-offs,
00:09:57for example, when a company splits off from another,
00:09:59allowing him to identify price discrepancies
00:10:01and gain an “edge” in those specific areas.
00:10:04He combined those methodologies.
00:10:07And for the value investing portion, I doubt he used
00:10:10such a simplistic formula alone.
00:10:12Of course, it likely reflected that framework,
00:10:14but I don't believe he generated those returns
00:10:18by just mechanically buying based on the formula.
00:10:20If we backtest the returns of the Magic Formula
00:10:22since it was made public in 2005,
00:10:26the gray line is the S&P index,
00:10:28and the green line is the Magic Formula.
00:10:29As you can see, after some high volatility,
00:10:32it has consistently underperformed.
00:10:34These types of results are similar to
00:10:37what you'd see from systematic equity ETFs.
00:10:40You could say that as the market becomes more efficient,
00:10:42that specific edge has vanished.
00:10:44As we can see from the performance of such a famous formula,
00:10:48finding patterns that were profitable in the past is very easy.
00:10:50You can even write a book about them.
00:10:53However, finding a pattern that will be profitable in the future
00:10:56requires an immense amount of work.
00:11:00Strategies that yield 20% annual returns with just
00:11:03a few days of thought and a few clicks simply don't exist.
00:11:06Another example is Quantopian.
00:11:08Quantopian was a startup founded around 2011,
00:11:12and it was a platform that made backtesting very easy in the US.
00:11:16300,000 people ran 12 million backtests,
00:11:20testing and creating countless quant strategies there.
00:11:24Even Steve Cohen, the famous billionaire
00:11:27hedge fund manager, invested in it.
00:11:29The top-tier quants at Quantopian
00:11:32even published papers on
00:11:34which of these strategies would perform well in the future,
00:11:37investigating what criteria or statistical methods
00:11:40should be used to filter them.
00:11:41They researched this very intensively
00:11:44to select the best strategies
00:11:46with the idea of running a new hedge fund.
00:11:48That was the vision,
00:11:49but it failed miserably.
00:11:51In the end, it shut down last year.
00:11:53Why do these things happen?
00:11:55And for those of you looking to get into quant trading,
00:11:58how can you avoid such a result?
00:12:02Of course, you can't avoid it perfectly.
00:12:03And I believe it's an incredibly difficult task,
00:12:07but if you still want to take on the challenge,
00:12:10I want you to keep at least these 10 things in mind
00:12:12and be very cautious.
00:12:13I'll go through them one by one.
00:12:16If you just remember these 10 points,
00:12:17you should be able to avoid wasting time on bad backtests
00:12:22and potentially suffering losses.
00:12:24Though, doing a good backtest doesn't guarantee profit.
00:12:27First, you must always question your data.
00:12:31Some people use data from Google or Yahoo,
00:12:34but that data is often incredibly “dirty.”
00:12:37Those starting quant trading from scratch
00:12:41will face many obstacles regarding data.
00:12:45Free data is often messy and full of errors.
00:12:47When it comes to the task of
00:12:50cleaning that data,
00:12:51you might think you just need to find the errors,
00:12:54but it actually involves more subjective judgment
00:12:57and bias than you might expect.
00:12:59Let me give you an example.
00:13:01Let's say a stock was trading between $41 and $43
00:13:05and then the market closed.
00:13:06But right around the closing bell,
00:13:08a trader made an order error,
00:13:11and a single share was traded at $28.
00:13:14Technically speaking,
00:13:16the low for that day is $28.
00:13:18That trader took a big loss due to a mistake,
00:13:21but the low has to be recorded as $28.
00:13:24That's the fact.
00:13:25So, how do you set the high and the low?
00:13:28If you remove that and set $41 as the low,
00:13:31you are essentially deleting a trade
00:13:34and a low that actually occurred.
00:13:36But if you don't remove it,
00:13:38let's say you're testing a strategy
00:13:40that places a buy order if the price
00:13:44drops more than 5% within 5 minutes.
00:13:45In a backtest,
00:13:47it might recognize that
00:13:48you bought the stock at $28.
00:13:51Then, it assumes you bought at $28
00:13:53and sold at the closing price of $42,
00:13:55recording an immediate profit.
00:13:58This could lead to the returns of the strategy
00:13:59being massively inflated.
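To see why that single print matters, here is a minimal sketch (made-up prices, not real data) of how a backtest can "fill" an order at a price no one could actually have traded at:

```python
# Sketch with invented one-minute prices: one erroneous $28 print
# makes a "buy on a sharp drop" rule look instantly profitable.
prices = [42.0, 41.5, 42.2, 28.0, 42.0]  # $28 is a fat-finger trade

buy_price = None
for prev, cur in zip(prices, prices[1:]):
    # 5% bar-to-bar drop as a stand-in for the 5-minute rule
    if buy_price is None and cur <= prev * 0.95:
        buy_price = cur  # the backtest "fills" at the bad print

close = prices[-1]
if buy_price is not None:
    print(f"bought {buy_price}, sold at close {close}: "
          f"{(close / buy_price - 1):+.1%}")  # +50.0%, but never fillable
```

Whether you delete the bad tick or keep it, either choice changes results like this one, which is why data cleaning is a judgment call and not just error hunting.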
00:14:01Of course, since it was only a single share, you could just delete it.
00:14:03But what if that trader's mistake
00:14:06was for 10 shares, 100 shares, or even 10,000?
00:14:09Cases like that actually happen,
00:14:11and fairly regularly.
00:14:14There have been massive cases where
00:14:17tens of millions of dollars were lost,
00:14:20but smaller mistakes of 100 or 1,000 shares
00:14:21are more common than you'd think.
00:14:23Of course, in recent years,
00:14:24since algorithms
00:14:25handle most executions,
00:14:27safeguards have been put in place.
00:14:29So it's not as frequent as it used to be,
00:14:31but when you look at backtesting data
00:14:33from before algorithmic execution was common,
00:14:36like 2005 or 2011,
00:14:37if you go back that far,
00:14:39you'll see these cases quite often.
00:14:41So, how are you going to handle that?
00:14:43Also, there are products traded
00:14:44on multiple exchanges.
00:14:45In those cases,
00:14:47you need to know if the data
00:14:49from all those various exchanges
00:14:50has been cleanly consolidated
00:14:52for the highs, lows, and volume.
00:14:53Or are you backtesting
00:14:56with incomplete trading data
00:14:57pulled from only a few exchanges?
00:14:59If the data cost is cheap,
00:15:01that's a distinct possibility.
00:15:02Also, when calculating MDD (maximum drawdown),
00:15:04do you use the low price or the closing price?
00:15:05For example, when backtesting
00:15:07a monthly rebalancing strategy,
00:15:09some use daily data
00:15:11but only look at the closing price.
00:15:13But in reality,
00:15:14to calculate true drawdowns,
00:15:15you also have to look at
00:15:17the intraday drawdown.
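As a rough sketch of the difference (invented daily bars, not real data), the same drawdown helper gives a milder answer from closes only than from intraday lows:

```python
# Made-up daily bars: drawdown computed from closes alone
# understates the true intraday drawdown.
closes = [100, 102, 101, 104, 103]
lows   = [ 99,  98,  93, 100, 101]

def max_drawdown(peaks, troughs):
    """Worst peak-to-trough decline; running peak tracked on `peaks`."""
    peak, mdd = float("-inf"), 0.0
    for p, t in zip(peaks, troughs):
        peak = max(peak, p)
        mdd = min(mdd, t / peak - 1)
    return mdd

print(f"close-only MDD: {max_drawdown(closes, closes):.1%}")
print(f"intraday  MDD: {max_drawdown(closes, lows):.1%}")
```

Here the close-only figure is about -1%, while the low on day 3 implies a drawdown near -9%.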
00:15:18These small details matter,
00:15:20like when backtesting with futures,
00:15:21how do you handle the rollover
00:15:22for products with expiration dates?
00:15:24Or in many backtests,
00:15:26they create a continuous futures dataset
00:15:27to run the test,
00:15:29but how is the rollover
00:15:31being treated?
00:15:33There are so many issues
00:15:34beyond just these.
00:15:35Have you really thought about
00:15:37these data problems?
00:15:38If you're using a backtesting service,
00:15:39are you just trusting that
00:15:40the provider did a good job with the data?
00:15:42You need to verify these things,
00:15:44because data issues cause
00:15:47far more errors than you'd think,
00:15:51often distorting the backtesting results.
00:15:53Another major data-related issue
00:15:57is survivorship bias.
00:15:59It's one of the most representative errors in backtesting.
00:16:01Look at this illustration—
00:16:04I'm not sure if it's WWI or WWII,
00:16:06but the Air Force wanted to reinforce their planes.
00:16:08They wanted to figure out
00:16:10where to add extra armor plating.
00:16:12To determine this,
00:16:16engineers examined all the planes
00:16:18that returned from dogfights
00:16:20to see where they had been shot the most.
00:16:21They found that certain areas
00:16:24took the most hits,
00:16:26so they concluded that
00:16:28they should make the armor thicker there.
00:16:29But that was a huge mistake.
00:16:33Because the planes that were hit
00:16:34in the other areas—
00:16:36like the engine or the cockpit—
00:16:38all crashed and never returned.
00:16:40It's a great example of how dangerous
00:16:42it is to draw conclusions
00:16:42based only on the data you have.
00:16:44In stock investing, survivorship bias means,
00:16:46for example, looking back and thinking,
00:16:49“I'd be rich if I bought Apple and Microsoft in the 80s.”
00:17:03So, with that thought,
00:17:05let's say you build a strategy to buy those kinds of tech stocks.
00:17:08But actually, back in the 80s,
00:17:10there were more than 30 companies
00:17:13just as promising as Apple or Microsoft.
00:17:14And 28 of them ended up disappearing.
00:17:17Only two of them survived.
00:17:19Even though only these two made it,
00:17:22people look at them and think,
00:17:23“If I invest like that now, I'll strike it rich.”
00:17:27So, if you use only currently surviving companies
00:17:30as your subjects for backtesting,
00:17:32the returns will inevitably be inflated.
00:17:35And this obviously becomes a bigger problem
00:17:38as the backtesting period gets longer.
00:17:40Because over that long period of time,
00:17:41there must have been many companies that existed at the start
00:17:43but have since vanished.
00:17:45However, a lot of novice investors,
00:17:47when they start backtesting,
00:17:48first define their stock universe.
00:17:51When they decide which stocks
00:17:54they are going to test,
00:17:55they populate it with companies that exist today.
00:17:58Then, within that pool,
00:17:59they backtest with various criteria
00:18:02to judge how to pick
00:18:05the “good” companies.
00:18:07But if you do it that way,
00:18:08from the start of the backtest until now,
00:18:11all the companies that went bankrupt are excluded.
00:18:13The backtesting is done while assuming
00:18:16that you have some sort of god-like foresight.
00:18:18Naturally, the returns will be higher than reality.
00:18:21So, when backtesting,
00:18:23if you are testing a 20-year period,
00:18:25you should start with the companies that existed in 2001
00:18:29and use them
00:18:30as your initial scope.
00:18:32That's what I wanted to mention.
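A minimal sketch of that idea, using a hypothetical listing table (all tickers and dates invented): the universe should come from what was listed at the start of the window, not from today's survivors:

```python
# Build the backtest universe from companies listed at the START
# of the test window, not from the companies that exist today.
listings = {  # hypothetical {ticker: (listed_year, delisted_year_or_None)}
    "AAA": (1995, None),   # still listed
    "BBB": (1998, 2009),   # went bankrupt mid-test
    "CCC": (2005, None),   # listed after the start
}

start_year = 2001

def universe_at(year):
    """Tickers actually tradable in the given year."""
    return {t for t, (on, off) in listings.items()
            if on <= year and (off is None or off > year)}

u = universe_at(start_year)
print(sorted(u))  # ['AAA', 'BBB'] -- includes the future failure

survivors = {t for t, (_, off) in listings.items() if off is None}
print(sorted(survivors))  # ['AAA', 'CCC'] -- the biased, survivor-only universe
```

Backtesting on `survivors` silently excludes every company that went bankrupt along the way, which is exactly the bias described above.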
00:18:33As a side note,
00:18:34the “super ants” (star retail investors) you see on YouTube
00:18:37might also be subject to survivor bias.
00:18:40While some became super ants through pure skill,
00:18:43others might have taken massive risks,
00:18:45buying a huge stake in a single stock,
00:18:48and if that stock took off,
00:18:49they became a super ant.
00:18:51But there were probably
00:18:5330 or 50 other people who did the exact same thing.
00:18:55Out of those 50 people who took high risks,
00:18:58only one survived,
00:18:59and viewers are only looking at that one person.
00:19:02This could also be a matter of survivor bias.
00:19:05So, if you look at them now
00:19:06and think, “I should be like that too,”
00:19:08and dive into extremely high-risk investments,
00:19:11it's not a guaranteed path to success;
00:19:13you'd have to be that lucky 1 out of 50.
00:19:17Simply being aware of these biases
00:19:20allows for more rational and wise investing.
00:19:22When using backtesting platforms,
00:19:24you are essentially delegating the data issues
00:19:27and survivor bias problems I mentioned
00:19:28entirely to that company.
00:19:31Quite naively.
00:19:32But you have to wonder if that company
00:19:33really addressed these issues
00:19:35with extreme rigor,
00:19:37truly worrying about the users' actual returns
00:19:39in the real world,
00:19:41and invested significant capital
00:19:43to clean up the data.
00:19:45You definitely need to verify those points.
00:19:48The second thing to watch out for
00:19:50is look-ahead bias,
00:19:52which means using information from the future that you couldn't have had at the time.
00:19:54If I were to give it a rough name,
00:19:57maybe “future-sight bias”?
00:19:58That's one way to interpret it.
00:20:00Information that was unattainable at the time of a trade—
00:20:03since backtesting uses past data,
00:20:05chronologically speaking,
00:20:07it's information that didn't exist last year.
00:20:09But it's quite common to find cases
00:20:12where the logic is built to trade last year
00:20:14while referencing that future information.
00:20:15That is what we call look-ahead bias.
00:20:18A representative mistake of this kind would be,
00:20:21for example, as of this month, September 2021,
00:20:24it's hard to backtest all Korean stocks,
00:20:27so let's just do 100.
00:20:29That's what a user might think.
00:20:30So they pick the top 100 KOSPI companies by market cap
00:20:34and run a backtest on them.
00:20:35Say, a strategy to buy if the PER is at a certain level.
00:20:38They do that,
00:20:39and after backtesting for 10 years,
00:20:41the returns look fantastic.
00:20:42But what went wrong?
00:20:44You picked the top 100 KOSPI stocks as of September 2021.
00:20:50You only selected those stocks,
00:20:51but if you backtest for 10 years starting from 2011,
00:20:55it's like you already knew in 2011
00:20:59which companies would be in the top 100 in 2021.
00:21:01Being in the top tier of market cap
00:21:03essentially means that the stock price has risen steadily.
00:21:06Even if people are careful about other things,
00:21:08they often overlook this when they decide
00:21:11to just pick a few hundred stocks by market cap.
00:21:12They think that way
00:21:14and make a lot of mistakes.
00:21:15Another example is
00:21:17when backtesting with fundamental financial data.
00:21:21Each company releases its quarterly earnings
00:21:24on different dates.
00:21:26But consider whether rebalancing
00:21:29or trading occurs
00:21:31after those reports are actually released.
00:21:33A company might report its earnings early the following month,
00:21:36but you rebalanced at the end of the previous month
00:21:40already knowing that information.
00:21:41You're trading while already knowing the future.
00:21:44That kind of thing can slip into a backtest.
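One way to guard against this (field names are illustrative, not from any real data vendor) is to admit a quarterly report only if its actual release date precedes the rebalance date:

```python
from datetime import date

# Hypothetical rows: (ticker, fiscal quarter end, actual release date, PER)
reports = [
    ("AAA", date(2021, 6, 30), date(2021, 8, 5), 9.0),
    ("BBB", date(2021, 6, 30), date(2021, 7, 20), 11.0),
]

rebalance_day = date(2021, 7, 31)

# Only reports already public on the rebalance day may be used.
usable = [t for t, _, released, _ in reports if released <= rebalance_day]
print(usable)  # ['BBB'] -- AAA's Q2 numbers were not public yet
```

Keying fundamentals by fiscal quarter end instead of release date is exactly how this bias slips into a backtest.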
00:21:46One more example would be,
00:21:48say you're trading based on closing prices.
00:21:50You assume that
00:21:52and do daily rebalancing,
00:21:54but the closing price is info you only get after the day is over.
00:21:57Yet, if you set the backtest
00:22:00to execute the order 5 minutes before the market closes,
00:22:03in that way, in terms of timing,
00:22:05you're gaining knowledge of the future,
00:22:07and a bias can occur.
00:22:09The third point is extremely important.
00:22:11Avoiding overfitting.
00:22:13I cannot overemphasize this.
00:22:16Overfitting is when
00:22:18you make a model perform excessively well
00:22:19only on the given sample data.
00:22:23For example, here is a sample.
00:22:25What we really want to know
00:22:27is the population behind it.
00:22:29We want to estimate
00:22:32the actual overall population,
00:22:34and in case some of you
00:22:36don't know what a population is,
00:22:38to explain it briefly,
00:22:40let's say we're doing a poll
00:22:41on an election result.
00:22:44If we survey every single citizen,
00:22:46that would be a perfect poll with 100% accuracy.
00:22:48But since we can't survey everyone,
00:22:50we take a sample from the population.
00:22:53We select a portion of the people and assume that sample represents
00:22:58the population behind it.
00:22:59We assume it's representative and make an estimation.
00:23:02So, the actual population data behind this
00:23:06would have a certain distribution,
00:23:08and we pull a few samples from that
00:23:10to estimate what the population might look like.
00:23:16This is an attempt to fit a model to that shape,
00:23:20but fitting a model means
00:23:22finding a trend line where the error
00:23:25between the sample and the model is minimized.
00:23:30Lines like these.
00:23:30But as you can see, if you fit a very wiggly,
00:23:34complex model like this,
00:23:37the error on the sample data is zero.
00:23:39It touches every single sample point.
00:23:41So, for this sample, it's a perfect,
00:23:44zero-error model.
00:23:47But is this a model that accurately represents the population?
00:23:51Probably not.
00:23:51If you pull a new sample, the error will be quite large.
00:23:54So you have to fit it appropriately
00:23:58so that when new samples come in,
00:24:00the sum of those errors remains small.
00:24:03On the other hand, if you fit
00:24:06an overly simple straight line,
00:24:08that's an “underfit,” meaning it's under-optimized.
00:24:10In that case, the error is large even on the sample.
00:24:13So, the most important thing in any modeling
00:24:16is to optimize it just right,
00:24:18but when many people backtest,
00:24:20they treat past data as their sample data.
00:24:24And on that sample data,
00:24:26they try to maximize returns within that specific sample
00:24:29by throwing in all sorts of rules
00:24:32to drive the returns as high as possible.
00:24:35For example, backtesting data from 2015 to 2021 might show
00:24:39that if PER is between 13.75 and 17.23,
00:24:43market cap is between 51.7 and 62.3 billion won,
00:24:46and buying stocks with a PBR of 1.17 or less,
00:24:50an annual return of 70% is possible.”
00:24:52This is the kind of backtesting result you might get.
00:24:54As you can tell, this is a clear case of overfitting.
00:24:57It is over-optimized.
00:24:58Perhaps a company with a PER of 17.24 that performed poorly
00:25:04was included in this specific dataset,
00:25:05or maybe there was a company with a market cap of 51.5 billion won
00:25:09that was a bad example, so the parameters were set this way.
00:25:12When you look only at sample or past data and try to be that specific
00:25:16just to maximize the returns at any cost,
00:25:19you end up with a model like this.
00:25:21Then, when actual data from that distribution appears in the future,
00:25:25the margin of error becomes massive.
00:25:27That is the point here.
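As an extreme caricature of this (not any real strategy), a model that simply memorizes the sample achieves zero in-sample error yet fails badly on fresh draws from the same population:

```python
import random
random.seed(0)

def population(x):          # true relationship: y = 2x + noise
    return 2 * x + random.gauss(0, 1)

train = [(x, population(x)) for x in range(10)]          # "past data"
test  = [(x + 0.5, population(x + 0.5)) for x in range(10)]  # "the future"

memorized = dict(train)                       # overfit: a lookup table
def overfit(x): return memorized.get(x, 0)    # clueless off-sample
def simple(x):  return 2 * x                  # the underlying trend

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(f"overfit: train {mse(overfit, train):.2f}, test {mse(overfit, test):.2f}")
print(f"simple : train {mse(simple,  train):.2f}, test {mse(simple,  test):.2f}")
```

The memorizer is "perfect" on the sample and useless out of sample; the plain trend line has some in-sample error but generalizes, which mirrors the squiggly green line versus the black line.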
00:25:28Let's take a look in more detail.
00:25:29This is another example of over-optimization.
00:25:31Suppose we want to find a line
00:25:34that separates the red dots from the blue dots.
00:25:36That's our model.
00:25:37Now, the black line represents a well-learned model,
00:25:40but that squiggly green line...
00:25:42based on the blue and red dots you see right now,
00:25:46it separates them perfectly.
00:25:48So, within this specific sample data,
00:25:50it's a perfect line with zero error.
00:25:52However, in the actual underlying population,
00:25:55blue dots might appear around here,
00:25:57and red dots might start appearing over there.
00:25:59When new data comes in in the future,
00:26:03we can assume this green line will have a lot of errors.
00:26:05That's a fair assumption.
00:26:07So, if you fit your model too closely to past data,
00:26:10it won't work in the future.
00:26:11Here is a similar example.
00:26:13Suppose detailed personal data was collected on last year's 100 students.
00:26:15The goal is to use that data to predict which of this year's 100 students
00:26:16will have the best grades.
00:26:19If you look at last year's top students and see things like
00:26:20their last name is Jung, or their height is in a certain range,
00:26:22and you over-optimize your identification rules
00:26:23based on those specific details from last year,
00:26:26and then apply that to this year's students,
00:26:28it could turn out to be completely absurd.
00:26:30Instead, if you set a rule based on
00:26:32students who study more than a certain number of hours,
00:26:34and apply it to last year's students,
00:26:37the accuracy might be lower than the hyper-specific rules.
00:26:39However, even though the accuracy is a bit lower,
00:26:42there's a high probability it will still be just as accurate
00:26:44when applied to this year's students.
00:26:45So, how can we mitigate this over-optimization problem?
00:26:47Every backtest has some degree of over-optimization,
00:26:49and it's impossible to eliminate it entirely.
00:26:53For instance, how do we know if a strategy that performed well
00:26:56over the last 5 years will be valid for the next 3 years?
00:27:00The perfect answer to that question
00:27:01is to actually trade it for 3 years.
00:27:06But that's after the fact.
00:27:08If you trade for 3 years and lose money,
00:27:11the test was pointless, right?
00:27:12One method is using “Out of Sample” data.
00:27:15This involves using data outside of your initial sample.
00:27:17It's commonly referred to as OOS data.
00:27:17For example, finding a strategy that works well
00:27:19on 6 years of data from Sept 2015 to Sept 2021,
00:27:21and then starting to trade it in Oct 2021 is a bad idea.
00:27:23Instead of doing that,
00:27:25you use 6 years of data from Sept 2014 to Sept 2020
00:27:27to find a high-performing strategy.
00:27:28Then, you backtest that strategy one more time
00:27:31on the data from Oct 2020 to Sept 2021.
00:27:33In other words, you find the best strategy from the 6-year period,
00:27:34pretend you started trading it in Oct 2020,
00:27:38and backtest it for that one additional year.
00:27:39If those results are good,
00:27:42then you start live trading in Oct 2021.
00:27:44Of course, splitting the data like this
00:27:46creates other problems,
00:27:49but we'll deal with those in a bit.
00:27:52The point I'm trying to convey right now is,
00:27:55if you have this much sample data,
00:27:57you set aside a portion of it.
00:28:02You set it aside,
00:28:04use the rest of the data to find a strategy,
00:28:06run many backtests, and optimize it.
00:28:09But instead of going straight to live trading,
00:28:10you take that data you didn't use to find the strategy,
00:28:12imagine it's the real world, and test it there.
00:28:13That is what we call using out-of-sample (OOS) data.
00:28:16In data science, terms like training data, validation data,
00:28:18test data, or development data are used.
00:28:19The terminology itself isn't that important.
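As a concrete sketch of this split, here is a minimal Python example. The data is synthetic and the one-parameter dip-buying rule is a hypothetical stand-in (neither comes from the video): the parameter is optimized on the first six "years" only, then scored once on the held-out final year.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns standing in for ~7 years of market data (hypothetical).
returns = rng.normal(0.0003, 0.01, size=7 * 252)

# Hold out the final "year" as out-of-sample (OOS) validation data.
split = 6 * 252
train, validation = returns[:split], returns[split:]

def strategy_pnl(rets, threshold):
    """Toy rule: go long the day after a down move bigger than `threshold`."""
    signal = rets[:-1] < -threshold          # decided with data through day t-1
    return (signal * rets[1:]).sum()         # applied to day t's return

# Optimize the parameter on training data only...
thresholds = np.linspace(0.0, 0.03, 31)
best = max(thresholds, key=lambda t: strategy_pnl(train, t))

# ...then score the chosen rule ONCE on the untouched validation year.
print(f"best threshold={best:.3f}, OOS P&L={strategy_pnl(validation, best):.4f}")
```

The point is the structure, not the rule: the validation slice never influences which threshold gets picked.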
00:28:21Point number 4 follows from point number 3:
00:28:23“The opportunity for validation happens only once.”
00:28:24This is incredibly, incredibly important.
00:28:26I really cannot emphasize this enough.
00:28:28It is such a critical concept.
00:28:30Let's dive deeper into this out-of-sample testing.
00:28:31Regarding sample data and out-of-sample data,
00:28:33there are various names for them,
00:28:34but for this video,
00:28:35I will stick to “training data” and “validation data.”
00:28:38In the previous example,
00:28:39the data from 2014 to 2020 is the training data.
00:28:41Training data is the data used to find the strategy.
00:28:42After the strategy is found,
00:28:44we validate it.
00:28:45So, we'll call that one year of backtesting
00:28:46the “validation data.”
00:28:48Now, what this graph shows
00:28:50is the complexity of the rules or the model.
00:28:53As you move to the right,
00:28:58the model becomes much more complex.
00:29:01Like defining a rule for a range
00:29:03of exactly 173cm to 173.25cm.
00:29:04The more you do that,
00:29:06the higher the complexity goes.
00:29:08Then, this axis is the prediction error.
00:29:09It represents how much error occurs
00:29:11when put into actual practice.
00:29:12As you can see,
00:29:13in the training sample (the training data),
00:29:16the more complex the model,
00:29:18the more the error decreases.
00:29:19Like the example where we had dots
00:29:20and used a squiggly, complex line.
00:29:22By making it complex,
00:29:24we could eliminate the error entirely within that sample.
00:29:26So, if you make a model incredibly complex,
00:29:28the error converges toward zero.
00:29:30However, if you take that trained model
00:29:32and test it on the validation data we set aside,
00:29:35what happens to the error?
00:29:36Initially, when the model is very simple,
00:29:38like a straight line,
00:29:40or when it's underfitted,
00:29:42the errors are similar.
00:29:44But as the model or rules become more complex,
00:29:45while the error in the training data
00:29:47continues to decrease,
00:29:49the error in the validation data
00:29:50hits a floor and then starts to increase
00:29:52as soon as it becomes overly complex.
00:29:53To use an analogy with backtesting,
00:29:54if you run countless backtests,
00:29:55set very detailed rules,
00:29:58test them over and over,
00:29:59and fine-tune
00:30:02parameters very precisely,
00:30:03like setting a specific PER value,
00:30:05the more complex you make it,
00:30:06the higher the returns in the past data will be.
00:30:08Since this is an error graph, lower is better.
00:30:12Basically, a backtest that is fitted to past data
00:30:14will show better returns the more you fit it.
00:30:16But when you apply this to reality,
00:30:18if you've made it excessively complex,
00:30:19there comes a point where a more complex rule
00:30:21leads to lower returns in practice.
00:30:23That's how it works.
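The training-versus-validation error curve described here can be reproduced with a small experiment: fitting polynomials of increasing degree (a stand-in for increasingly complex trading rules) to noisy synthetic data. The data and model are illustrative assumptions, not anything from the video.

```python
import numpy as np

rng = np.random.default_rng(42)

# Noisy quadratic data, split into interleaved training and validation halves.
x = np.linspace(-1, 1, 60)
y = x ** 2 + rng.normal(0, 0.1, size=60)
x_tr, y_tr = x[::2], y[::2]     # "past data" used for fitting
x_va, y_va = x[1::2], y[1::2]   # held-out data standing in for the future

def fit_error(degree):
    """Fit a polynomial of the given complexity on the training half,
    return (training error, validation error) as mean squared errors."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    mse = lambda xs, ys: float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))
    return mse(x_tr, y_tr), mse(x_va, y_va)

for d in (1, 2, 9, 15):
    tr, va = fit_error(d)
    print(f"degree {d:2d}: train error {tr:.4f}, validation error {va:.4f}")
```

Training error can only shrink as the degree grows, while validation error bottoms out and then deteriorates once the fit starts chasing noise.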
00:30:24By the way, I equated lower error
00:30:26with better returns,
00:30:28and higher error with worse returns.
00:30:31Strictly speaking,
00:30:33a larger error is slightly different
00:30:35from lower returns.
00:30:37The more you overfit a backtest,
00:30:42the larger the gap between the backtest return and the future return,
00:30:45and that gap is the error.
00:30:47In principle, that error could randomly fall
00:30:51either above
00:30:52or below the backtest return.
00:30:55But generally, when such an error occurs,
00:30:56live returns tend to be worse.
00:30:59Because when you were fitting it to past data,
00:31:02you were fitting it to push returns
00:31:05as high as possible.
00:31:08So if there is an error relative to that return,
00:31:12it will likely be on the downside.
00:32:03Then, how should we split the data
00:32:06into training and validation sets for backtesting?
00:32:08For example, taking 11 years of data from 2011 to 2021,
00:32:11training on it, and applying it starting next year—
00:32:15that means you aren't using a separate validation set.
00:32:18You're using everything as training data, then going live,
00:32:21which is not recommended.
00:32:22The splitting method I mentioned earlier
00:32:25would be taking 10 years as training data,
00:32:28using the final year, 2021, for validation,
00:32:31and then applying the strategy from 2022.
00:32:34But as I'll explain in a moment,
00:32:36this isn't necessarily the best way either.
00:32:38So what are some improved methods?
00:32:40There is a method called Walk-Forward Testing.
00:32:43What this does is,
00:32:44for instance, you take 3 years starting from '99,
00:32:46train and optimize your parameters there,
00:32:49validate the results over the following year,
00:32:52and then roll that window forward.
00:32:55If you establish a strategy using this method,
00:32:58even with a very simple model—
00:33:01though I think backtesting based solely on PER
00:33:04is quite nonsensical—
00:33:05let's assume a strategy of buying stocks below a certain PER.
00:33:08Based on 10 years of historical data,
00:33:11if you optimize the PER threshold,
00:33:13the ideal criteria would differ for every single year,
00:33:17so you'd end up picking an average value that works okay.
00:33:20But if you narrow the scope,
00:33:22you can set the PER value based on the last 3 years
00:33:26and trade accordingly.
00:33:28By testing this way, you can adjust
00:33:30the parameters more flexibly over time.
00:33:32That's how this type of testing works.
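Walk-forward testing can be sketched in the same toy setup as before (synthetic yearly return blocks and a hypothetical threshold rule, both illustrative assumptions): optimize on a rolling 3-year window, validate on the following year, then roll the window forward.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ten synthetic "years" of daily returns, keyed by year (hypothetical data).
data = {y: rng.normal(0.0005, 0.01, size=252) for y in range(1999, 2009)}

def pnl(rets, threshold):
    """Toy dip-buying rule used throughout; purely illustrative."""
    return ((rets[:-1] < -threshold) * rets[1:]).sum()

thresholds = np.linspace(0.0, 0.02, 21)
results = []
# Walk forward: optimize on a rolling 3-year window, validate on the next year.
for start in range(1999, 2006):
    window = np.concatenate([data[y] for y in range(start, start + 3)])
    param = max(thresholds, key=lambda t: pnl(window, t))
    results.append((start, param, pnl(data[start + 3], param)))

for start, param, oos in results:
    print(f"train {start}-{start + 2}, validate {start + 3}: "
          f"threshold={param:.3f}, OOS P&L={oos:.4f}")
```

Each year's validation uses only a parameter chosen from the three years before it, so the parameter is allowed to drift as conditions change.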
00:33:35You can use that approach,
00:33:37or there's K-Fold CV,
00:33:38which stands for Cross-Validation.
00:33:39How this works is,
00:33:41the 'K' refers to the number of groups you divide the data into.
00:33:45Looking at the diagram, let's say K is 5.
00:33:47If you set K to 5, you split the data into 5 equal parts.
00:33:50You train on 4 years worth of data,
00:33:53then check the returns on the remaining 1 year of validation data.
00:33:56Then you train on a different set of 4 years
00:33:59and validate it against the remaining year.
00:34:01You repeat this for all five folds
00:34:05and average the five validation returns.
00:34:09The idea is that this average represents
00:34:12the return you can actually expect.
00:34:13Alternatively, if you're using 10 years of data,
00:34:16some people train on even-numbered years
00:34:19and validate on odd-numbered years.
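K-fold cross-validation as described can be sketched as follows, again with synthetic data and a hypothetical rule. Note that this naive version mixes time periods freely, which is exactly why the look-ahead caution the speaker raises matters for time-ordered data.

```python
import numpy as np

rng = np.random.default_rng(7)

# Five synthetic "years" of daily returns, one fold per year (hypothetical).
folds = [rng.normal(0.0004, 0.01, size=252) for _ in range(5)]

def pnl(rets, threshold):
    """Same toy dip-buying rule as before; purely illustrative."""
    return ((rets[:-1] < -threshold) * rets[1:]).sum()

thresholds = np.linspace(0.0, 0.02, 21)
fold_results = []
for k in range(5):
    # Train on the other four folds, validate on fold k.
    train = np.concatenate([f for i, f in enumerate(folds) if i != k])
    best = max(thresholds, key=lambda t: pnl(train, t))
    fold_results.append(pnl(folds[k], best))

# The average across folds estimates the return you could actually expect.
print(f"mean OOS P&L across 5 folds: {np.mean(fold_results):.4f}")
```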
00:34:22All these methods have their pros and cons,
00:34:23but a major advantage of these approaches
00:34:26is that the parameters stay stable during market regime changes.
00:34:30What I mean by that is,
00:34:31when a financial crisis or COVID-19 hits,
00:34:33the fundamental nature of the market changes.
00:34:35For example, when the 2008 financial crisis happened,
00:34:39a strategy trained only on data from 1998 to 2007
00:34:43to find the best returns
00:34:45wouldn't have worked afterward,
00:34:46because the market's nature had shifted.
00:34:49The distribution of data changes,
00:34:51and the patterns from the past
00:34:52won't reflect the new market environment.
00:34:55So, by splitting the data in these ways,
00:34:57even when major events occur
00:35:00and change market properties and patterns,
00:35:02you can validate your strategy more reliably.
00:35:06That's why these methods are used,
00:35:08but you must be careful about “looking into the future.”
00:35:11You have to be very cautious about that.
00:39:13It depends on your trading frequency,
00:39:16but say you trade monthly,
00:39:18your training data includes 2014,
00:39:19and you validate on 2013:
00:39:22information that would not have been known until 2014
00:39:26can leak into the 2013 validation period,
00:39:28inflating its returns,
00:39:34because you have effectively trained by looking into the future.
00:35:36You must be extremely careful with this part.
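The kind of leakage the speaker warns about can be demonstrated in a few lines: compare a signal that "peeks" at the same day's return with a properly lagged one. The data is synthetic and the signal is deliberately silly; the point is how much look-ahead inflates a backtest.

```python
import numpy as np

rng = np.random.default_rng(3)

# A synthetic random-walk price series (hypothetical data).
prices = 100 * np.cumprod(1 + rng.normal(0.0003, 0.01, size=1000))
rets = np.diff(prices) / prices[:-1]

# WRONG: "predicting" day t with a signal computed from day t itself.
leaky_signal = rets > 0
leaky_pnl = (leaky_signal * rets).sum()     # wildly inflated by look-ahead

# RIGHT: lag the signal so day t's trade uses only data through day t-1.
lagged_signal = np.concatenate([[False], leaky_signal[:-1]])
honest_pnl = (lagged_signal * rets).sum()

print(f"look-ahead P&L: {leaky_pnl:.2f}  honest P&L: {honest_pnl:.2f}")
```

The leaky version simply collects every positive return, a result no live trader could ever achieve.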
00:35:39I've been speaking quite broadly,
00:35:41but in fields like Machine Learning,
00:35:44there's a concept called hyperparameters.
00:35:46Generally, parameters are things adjusted by the model itself
00:35:50to reduce the error in the sample data,
00:35:54whereas hyperparameters are things a person must decide.
00:35:57For example, in regression analysis,
00:35:59you decide whether to use a straight line or a curve.
00:36:03Basically, how complex the formula
00:36:07or the model will be—
00:36:09that is a human decision.
00:36:11So the number of parameters and such are hyperparameters.
00:36:15Once those are set,
00:36:18the model fits the line
00:36:22in a way that optimizes the data's error.
00:36:23Things like the slope or the intercept
00:36:28are what the model learns, and these are called parameters.
00:36:33So you have to try various hyperparameters as well.
00:36:36Instead of just splitting into train and test data,
00:36:40we often add another split called 'dev data'.
00:36:42You perform your optimization there—
00:36:45you optimize the hyperparameters on the dev set
00:36:48and then validate with the test data.
00:36:51Those familiar with machine learning will already understand this,
00:36:55and if you don't know it, this brief explanation won't be enough,
00:36:58so I'll just move on.
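One way to picture the train/dev/test workflow is the sketch below, with an admittedly contrived "hyperparameter" (how finely to grid-search the threshold); all names and data here are illustrative assumptions, not the speaker's method.

```python
import numpy as np

rng = np.random.default_rng(5)

# Eight synthetic "years" of daily returns (hypothetical data).
rets = rng.normal(0.0003, 0.01, size=8 * 252)

# Three-way split: train / dev (hyperparameter tuning) / test (touched ONCE).
train, dev, test = rets[:6 * 252], rets[6 * 252:7 * 252], rets[7 * 252:]

def pnl(data, threshold):
    """Same toy dip-buying rule; purely illustrative."""
    return ((data[:-1] < -threshold) * data[1:]).sum()

# A stand-in "hyperparameter": how finely to grid-search the threshold.
candidates = {"coarse": np.linspace(0, 0.02, 5),
              "fine": np.linspace(0, 0.02, 41)}

scores = {}
for name, grid in candidates.items():
    best = max(grid, key=lambda t: pnl(train, t))   # parameter fit on train
    scores[name] = (best, pnl(dev, best))           # hyperparameter scored on dev

chosen = max(scores, key=lambda name: scores[name][1])
final_threshold = scores[chosen][0]
# Only now, exactly once, do we look at the test set.
print(f"chosen grid: {chosen}, test P&L: {pnl(test, final_threshold):.4f}")
```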
00:37:00However, when doing this work, there is one thing
00:37:04that is so important it can't be overemphasized.
00:37:08It's about the validation data.
00:37:10You must NEVER, EVER look at the validation data twice.
00:37:15Specifically, the results.
00:37:16You train on the training set and backtest many times to find a high-return strategy, right?
00:37:22At that point, you've found something that performs well on training data,
00:37:26but to see if it will actually be good in reality,
00:37:31you run it against a period or dataset that was never used for training.
00:37:38You must never run this more than once.
00:37:41Run it exactly once, and if the returns are bad,
00:37:45no matter how many years you worked or how much effort you put in,
00:37:50you must scrap the entire strategy.
00:37:52Why? Because in the real world, you only get one shot at profit or loss.
00:37:57You can't turn back time.
00:37:58Despite this, people feel bad that the validation results were poor,
00:38:03so they go back to the training data, tweak the parameters,
00:38:07and run it again until the validation returns look good.
00:38:10The moment you do that, it's no longer validation data;
00:38:14it has effectively become training data.
00:38:16You've optimized the parameters including the validation set.
00:38:21Consequently, for this strategy,
00:38:26we can't guarantee how it will perform in the real world.
00:38:29That's why this point is so critical.
00:38:31Another important thing in backtesting—related to this—
00:38:34is the concept of 'Market Regime' and how times change.
00:38:37Let me ask you a question.
00:38:39Between 20 years of backtesting and 3 years,
00:38:42which one is more meaningful?
00:38:44I've already given away the answer in the title,
00:38:47but many beginners think that the longer the backtest,
00:38:50and the more data you have, the better.
00:38:54But for me, between these two,
00:38:57it depends on the time horizon and trading frequency,
00:39:00but generally,
00:39:01I would choose the 3-year backtest.
00:40:03Having more data is generally better,
00:40:06but only if it comes from the same distribution;
00:40:09mixing in data from an environment
00:40:11that has already changed does more harm than good.
00:39:17The problem with long backtests
00:39:20is that the nature of the market changes.
00:39:22I think this graph is real returns...
00:39:26anyway, it's a graph related to interest rates.
00:39:28As you can see, the concept of a “fair interest rate”
00:39:33fluctuates like this,
00:39:34but the baseline level within a regime changes drastically.
00:39:38At this point, it was here—maybe the oil shock?
00:39:41Anyway, after that period, it was here,
00:39:45and then after the 1980s,
00:39:47this became the generally accepted interest rate level.
00:39:51Now, imagine you're trading bonds,
00:39:53and you train a strategy within this period
00:39:57to use it over here.
00:39:59If the market regime has changed,
00:40:02the profitable strategy you built on that training data
00:40:07won't work here.
00:40:08That's what we call a Market Regime Change.
00:40:11A shift in the market's nature or system.
00:40:14Market shifts can happen
00:40:17due to changes in the market players.
00:40:20For example, after COVID, there was a massive influx of retail investors,
00:40:23leading to events like the GameStop saga.
00:40:25Before COVID,
00:40:27short-selling strategies—
00:40:30there are hedge funds that specialize in short-selling—
00:40:32used to work very well.
00:40:34But with the sudden change in market nature,
00:40:37some were even driven to bankruptcy.
00:40:39Then there are changes in policy and regulation. After the financial crisis,
00:40:43proprietary trading was banned for investment banks,
00:40:45and various regulations changed the derivatives market.
00:40:49Strategies trained on data
00:40:50from before the financial crisis
00:40:52likely wouldn't work well afterward.
00:40:54There are also exogenous events,
00:40:55like the oil shock, which are massive,
00:40:57transformative events for the market,
00:40:59macroeconomic in nature.
00:41:01Then there are other macroeconomic shifts.
00:41:03As debt ratios steadily climbed,
00:41:06interest rates that used to be at a certain level
00:41:08transitioned into an era of ultra-low rates.
00:41:11and quantitative easing also played a role
00:41:13in contributing to these low interest rates,
00:41:15causing growth stocks to suddenly outperform
00:41:17massively over the past 10 years.
00:41:19But if you found a profitable strategy
00:41:22using training data from before quantitative easing,
00:41:24it might involve buying things like value stocks.
00:41:25Then, naturally, over the next 10 years,
00:41:27the performance would have been very poor.
00:41:28Other factors include the emergence of new technologies
00:41:30or changes in industrial structure,
00:41:32things of that nature.
00:41:33So, when backtesting for 20 years,
00:41:35is data from 2001 really meaningful?
00:41:38Of course, whether a market regime change matters
00:41:40depends on which factors you are looking at.
00:41:43Ultimately, it depends on the logic,
00:41:45the rules of the strategy, or which elements
00:41:47and data the model
00:41:49is observing and using.
00:41:51Based on those factors,
00:41:52you have to see how the regime
00:41:53of that data changes.
00:41:55For some data,
00:41:56the characteristics change very quickly,
00:41:58even on a monthly basis.
00:41:59Others might remain
00:42:01quite stable for 10 or 15 years.
00:42:03Since the cycles for each are different,
00:42:05generally speaking,
00:42:07it doesn't mean that just because COVID-19 happened,
00:42:09all previous patterns
00:42:09become completely meaningless.
00:42:12However, if you use 20 years worth
00:42:14of data like that,
00:42:15there will definitely be some issues.
00:42:17You can look at it that way.
00:42:18And regimes can also cycle back:
00:42:20even if the regime changed somewhere in the middle,
00:42:23it can change yet again,
00:42:25so data from the distant past
00:42:29that resembles the current environment
00:42:30might actually be usable.
00:42:32That is why some people say
00:42:33the 1940s and the present day are similar.
00:42:35But that's just a side note.
00:42:37Anyway, quant trading
00:42:38has become very common,
00:42:41and even individuals do it now.
00:42:42But when it comes to long-term investing,
00:42:44the pitfall of quant investing is that
00:42:45when applying these quantitative techniques
00:42:47to long-term investments,
00:42:49it is very difficult to avoid regime changes
00:42:51while trying to secure a lot of data.
00:42:53For example, let's say there is an algorithmic
00:42:55trading strategy that uses minute-by-minute data.
00:42:57Since there are 60 minutes in an hour,
00:42:59you get 60 data points per hour,
00:43:03and let's say it's a futures contract
00:43:04that trades 24 hours a day.
00:43:05Multiply that by 24,
00:43:10and you get 1,440 data points per day.
00:43:10So with 1,440 points a day
00:43:12and roughly 250 trading days a year,
00:43:17you secure about 360,000 data points
00:43:21in just one year.
00:43:23Because you can gather
00:43:25over 300,000 data points in a year,
00:43:26you have enough significant data
00:43:29to perform validation
00:43:32and even use more complex models.
00:43:33You can do that.
00:43:35But consider a rebalancing strategy
00:43:36that trades on a monthly basis.
00:43:37You only get 12 data points a year.
00:43:39Even over 20 years,
00:43:41that's only 240 points.
00:43:42Since you can't increase the data count on the time axis,
00:43:44you try to secure significance
00:43:47by expanding the scope
00:43:49to include various individual stocks.
00:43:51But ultimately, on the time axis,
00:43:53it is difficult to avoid regime changes.
00:43:54These aspects are extremely challenging.
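The data-quantity gap the speaker describes is just arithmetic, written out:

```python
# Sample counts at different trading frequencies (simple arithmetic).
minute_bars_per_year = 60 * 24 * 250   # 24h futures session, ~250 trading days
monthly_points_20y = 12 * 20           # monthly rebalancing over 20 years

print(minute_bars_per_year)  # 360000 minute bars in a single year
print(monthly_points_20y)    # only 240 monthly observations in 20 years
```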
00:43:57That is why, after COVID-19 hit,
00:43:58many quants—specifically,
00:44:00a person named Inigo Fraser-Jenkins,
00:44:02who I believe is the Head of Quant at a very famous firm—
00:44:05explained “Why I am no longer a quant.”
00:44:09The gist of his message was that
00:44:11a quant's job is to predict the future based on past patterns,
00:44:13but when something like COVID-19 happens,
00:44:15past patterns become useless.
00:44:19When a market regime change occurs,
00:44:20there is very little a quant can do.
00:44:23People even talk about
00:44:25an “existential crisis” for quants.
00:44:28And quants had a very rough time last year.
00:44:30While some did well,
00:44:31on average, it was very, very bad.
00:44:34I think we are about halfway through now.
00:44:36An hour and a half has already passed,
00:44:38so we will wrap up Part 1 here.
00:44:40Tomorrow, in Part 2, we will cover items 6 through 10,
00:44:43discussing strengths and limitations,
00:44:45and then a curriculum for studying quant finance.
00:44:49We will cover those topics.
00:44:50I will see you in Part 2.
00:44:52Thank you.

Key Takeaway

Quant investing requires rigorous scientific discipline to avoid the 'poisonous' traps of over-optimized backtesting, biased data, and the failure to account for shifting market regimes.

Highlights

Quant investing classifications range from Ultra-High Frequency Trading (HFT) and algorithmic trading to Statistical Arbitrage, Factor Investing, and 'Quantamental' approaches.

The 'Holy Grail' of quant trading is not just finding past profitable patterns, which are infinite, but identifying those that will persist in the future.

Data integrity is the most overlooked foundation, as 'dirty' free data, consolidated exchange feeds, and rollover handling can catastrophically distort backtesting results.

Survivorship bias and look-ahead bias are common pitfalls where investors unknowingly use 'future-sight' or exclude failed companies, leading to inflated, unrealistic returns.

Overfitting occurs when a model is made too complex to eliminate errors in sample data, making it lose predictive power for the actual underlying population.

The 'One-Shot Validation' rule dictates that validation data must never be used more than once; otherwise, it effectively becomes training data and loses its objective value.

Market Regime Changes, such as the 2008 financial crisis or the COVID-19 pandemic, can render historical patterns obsolete and trigger existential crises for quants.

Timeline

Introduction and the Dangers of Misleading Quant Marketing

The speaker introduces the world of quant investing while expressing deep concern over current marketing trends that target everyday investors with over-glamorized claims. He notes that roughly 15% of his subscribers are already practicing quant trading, yet many are being misled by paid courses promising easy success through backtesting. These courses often fail to mention the high risks of financial loss or the fact that marketing funds with guaranteed past returns is illegal under SEC regulations. The speaker shares his personal motivation for this warning, citing his own past struggles with money and the frequent emails he receives from subscribers facing massive losses. He emphasizes that while 'quant' sounds scientific, it is not foolproof and requires a grounded understanding of the weight of managing capital.

Classifying Quant Trading and the Four Fundamental Stages

This section provides a broad classification of quant trading based on time horizons, starting from Ultra-High Frequency Trading (HFT) using machine language and hardware co-location to longer-term Factor Investing and 'Quantamental' analysis. The speaker explains the four fundamental stages of the quant process: developing an investment hypothesis, performing backtesting against historical data, implementing live trading, and maintaining strict risk management. While this field was once the exclusive domain of PhDs and institutions, platforms like Quantopian have democratized access for individual investors. However, this accessibility has also led to a rise in misunderstandings regarding how these strategies actually generate alpha. The speaker clarifies that the boundaries between these categories are often blurred, but the core principles remain consistent across the spectrum.

The Backtesting Paradox: Infinite Patterns vs. Future Profitability

The speaker critiques the common practice of 'tweaking' parameters to find the highest historical return, arguing that this approach is fundamentally flawed. He proves that an infinite number of profitable patterns exist in past data due to continuous parameters, but finding the one that remains profitable in the future is the true challenge. The legendary 'Magic Formula' by Joel Greenblatt is used as an example; while it shows massive historical success, it has consistently underperformed the S&P 500 since becoming public in 2005. This demonstrates how a specific 'edge' can vanish as the market becomes more efficient and everyone begins to use the same logic. Even professional platforms like Quantopian failed despite running 12 million backtests, proving that historical success does not guarantee future results.

Critical Data Issues: Dirty Data and Survivorship Bias

The speaker details the first major pitfall in quant trading: the reliability of the data itself. He warns against using free, 'dirty' data from sources like Google or Yahoo, which can contain order errors like single-share trades at outlier prices that massively inflate backtest returns. A significant portion of this section is dedicated to survivorship bias, illustrated by the famous WWI/WWII aircraft armor analogy. In the context of stocks, many investors mistakenly backtest using only companies that exist today, completely ignoring the 'corpses' of companies that went bankrupt over the testing period. This creates a 'god-like foresight' effect that produces returns far higher than what is possible in reality. He urges users to verify if their backtesting platforms are actually cleaning data with extreme rigor.

The Trap of Look-Ahead Bias and Future-Sight Errors

Look-ahead bias is explored as a subtle but devastating error where information that was unattainable at the time of a trade is used in a backtest. For example, selecting the 'Top 100 Stocks of 2021' to run a backtest starting in 2011 is a massive mistake because you are selecting winners with the benefit of hindsight. Other common errors include using fundamental financial data that hadn't actually been released to the public yet or assuming a trade occurs at a closing price before the day is over. These timing errors give the model 'future-sight' that is impossible to replicate in live trading. The speaker warns that even if investors are careful with other aspects, these logical slips frequently occur when defining the stock universe or rebalancing schedules. Such biases lead to an overestimation of a strategy's efficacy and inevitable disappointment during execution.

Combating Overfitting and the Importance of Out-of-Sample (OOS) Testing

The speaker delves into the concept of overfitting, where a model is made excessively complex to fit every 'wiggle' in the sample data. While a complex model might have zero error on past data, it fails miserably when exposed to the actual population or future data. He uses the analogy of identifying top students; rules based on hard work are more likely to stay accurate than hyper-specific rules based on height or last names. To mitigate this, he introduces Out-of-Sample (OOS) data, where a portion of historical data is set aside and never used during the strategy's optimization phase. This hidden data serves as a 'proxy' for the real world to see if the strategy can actually perform in an unknown environment. This rigorous separation of training and testing is a cornerstone of professional data science and machine learning.

Advanced Validation Techniques: Walk-Forward and K-Fold Cross-Validation

Building on the concept of validation, the speaker introduces more sophisticated methods like Walk-Forward Analysis and K-Fold Cross-Validation. Walk-Forward Testing involves moving a training window forward step-by-step to see how a strategy holds up across different time periods. K-Fold Cross-Validation splits data into several groups, averaging the returns across different 'folds' to find a more reliable expected return. He distinguishes between 'parameters' learned by the model and 'hyperparameters' decided by the human researcher, emphasizing the need for a 'dev set' for optimization. However, the most critical rule is that the final validation set must only be checked once. If you look at the results and go back to tweak your strategy, you have 'poisoned' the data and ruined the validity of the test.

Market Regime Changes and the Existential Crisis of Quants

The final section discusses the impact of Market Regime Changes, which are shifts in the market's fundamental nature due to policy, macroeconomics, or technology. The speaker argues that a 3-year backtest in the current regime can be more meaningful than a 20-year backtest that includes data from a completely different economic environment. Events like the 2008 crisis or COVID-19 influxes of retail investors change the 'distribution' of data, making past patterns useless. This creates a paradox for quants who need large amounts of data but find that long-term data often spans multiple, incompatible regimes. He concludes by mentioning a famous 'Head of Quant' who famously questioned the future of the profession because patterns break during extreme events. Part 1 ends with a promise to cover the remaining five pitfalls and a study curriculum in the next video.
