### Stress Tests & Robustness Tests

##### Multimarket performance evaluation

In an ideal world, a futures strategy you developed would perform well on all markets (metals, energies, currencies, bonds, stock indices, grains, softs). From our experience, however, that is a tall order. You would be happy if it worked on markets from the same segment.

Understanding stress tests and robustness tests is crucial for choosing a strategy for live trading.

So how about EXCELSIOR-RUS2000 applied to other stock indexes?

How does this strategy perform, for example, on the E-mini NASDAQ 100 market (Figure 1)?

Figure 1: EXCELSIOR-RUS2000 on E-mini NASDAQ 100

What about E-mini S&P MidCap 400 (Figure 2)?

Figure 2: EXCELSIOR-RUS2000 on E-mini S&P MidCap 400

And E-mini S&P 500 (Figure 3)?

Figure 3: EXCELSIOR-RUS2000 on E-mini S&P 500

As you can see, this strategy works very well on other stock indices. That is a good indication of potential robustness and the first kind of stress test. However, it is far too soon to say that the strategy is robust and appropriate for live trading. We need to perform further stress and robustness tests.

#### Missing out-of-sample tests

A trader needs to fully understand the principle of testing and validation on Out-of-Sample data. First, you have to understand the difference between In-Sample and Out-of-Sample historical data. When you back-test on historical data, it is always good to set apart a certain period to test the results’ validity.

So if we have a 10-year set of historical data, we split it by a specific ratio. Many traders use, for example, 70% for In-Sample data and 30% for Out-of-Sample data. With 10 years of history, that gives 2 parts (Figure 4):

70% = 7 years of In-Sample historical data

30% = 3 years of Out-of-Sample historical data

Figure 4: In-Sample and Out-of-Sample testing
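The 70/30 split described above can be sketched in a few lines of Python. This is a minimal illustration, not a platform feature: the function name `split_in_out_of_sample` and the year labels are hypothetical, and the only rule taken from the text is that the earlier 70% is In-Sample and the later 30% is Out-of-Sample.

```python
def split_in_out_of_sample(periods, in_sample_pct=70):
    """Split a chronologically sorted list into in-sample and out-of-sample parts.

    in_sample_pct is an integer percentage, so the cut point is computed
    with exact integer arithmetic.
    """
    if not 0 < in_sample_pct < 100:
        raise ValueError("in_sample_pct must be between 1 and 99")
    cut = len(periods) * in_sample_pct // 100
    return periods[:cut], periods[cut:]


# Ten hypothetical yearly labels, used only for illustration.
years = list(range(2011, 2021))
in_sample, out_of_sample = split_in_out_of_sample(years, 70)
print("In-Sample:    ", in_sample)
print("Out-of-Sample:", out_of_sample)
```

The split is always chronological: shuffling the data before splitting would leak future information into the In-Sample part.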

We have set the ratio of our In-Sample and Out-of-Sample data. The basic principle is to always use the in-sample part to “develop and train a trading strategy”, in other words, to find optimal trading rules and inputs (e.g., the length of moving averages). What “optimal” means is always quantified by the fitness function.

For example, the fitness function can be maximizing the Net Profit or Sharpe Ratio of a strategy or minimizing the Maximum Drawdown or a combination of both.
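To make the idea of a fitness function concrete, here is a small sketch that computes the three candidates mentioned above from a list of per-trade profits. The function names and the sample numbers are illustrative assumptions; platforms such as TradeStation compute these metrics internally and may define drawdown slightly differently.

```python
def net_profit(trade_pnls):
    """Sum of all trade results."""
    return sum(trade_pnls)


def max_drawdown(trade_pnls):
    """Largest peak-to-valley drop of the cumulative P&L (as a positive number)."""
    equity = peak = worst = 0.0
    for pnl in trade_pnls:
        equity += pnl
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def profit_to_drawdown(trade_pnls):
    """Combined fitness: reward profit, penalize drawdown."""
    profit = net_profit(trade_pnls)
    drawdown = max_drawdown(trade_pnls)
    if drawdown == 0:
        return float("inf") if profit > 0 else 0.0
    return profit / drawdown


# Hypothetical per-trade results in dollars.
pnls = [100, -50, 200, -300, 150]
print(net_profit(pnls), max_drawdown(pnls), round(profit_to_drawdown(pnls), 2))
```

An optimizer would then rank every parameter combination by whichever of these functions was chosen as the fitness function.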

If fitting on the In-Sample data gives stable and profitable results, we check the performance on the Out-of-Sample data. If the trading strategy is stable and profitable on the Out-of-Sample data as well, we have validated its potential viability.

In other words, the out-of-sample test is an essential step in evaluating the real profit potential of a trading strategy, because the strategy was not fit to this data. Forget about in-sample results: in most cases, they are too optimistic.

##### So how to evaluate that out-of-sample tests are successful?

One of the testing criteria can be, for example, the condition that the Out-of-Sample (OOS) results are equally profitable or stable as the In-Sample (IS) results.

Because the In-Sample and Out-of-Sample parts differ in length (e.g., 70% IS and 30% OOS, 60% IS and 40% OOS, or 80% IS and 20% OOS), we have to standardize the strategy’s Net Profit by calculating the profitability per day. Comparing the standardized Net Profit per day on out-of-sample data with that on in-sample data gives a measure called Efficiency.

Efficiency (the Robustness Value Index in TradeStation Optimization Tests) is an important measure. It tells you whether the in-sample fitting is meaningful. If the profitability per day on out-of-sample data is very low compared to in-sample data, you can conclude with very high probability that the strategy is overfitted.

Efficiency is based on the “Robustness Index,” which represents and compares the Out-of-Sample (OOS) and In-Sample (IS) results. In essence, it is a % measure of a strategy’s quality. Robustness Index is calculated as follows:

Robustness Index = (OOS profit * 365 / number of OOS days) / (IS profit * 365 / number of IS days) * 100

One of the test criteria we can use is the requirement that the Robustness Index is at least 50%. A Robustness Index higher than 50% is not that common. If you find a strategy with outstanding Out-of-Sample Efficiency, there is a light at the end of the tunnel. But we are still at the beginning.
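The Robustness Index formula above translates directly into code. The sketch below is only an illustration: the function name and the profit figures are hypothetical, and the guard against a non-positive In-Sample profit is an added assumption (the ratio is not meaningful in that case).

```python
def robustness_index(oos_profit, oos_days, is_profit, is_days):
    """Annualized OOS profit as a percentage of annualized IS profit."""
    if is_profit <= 0:
        raise ValueError("in-sample profit must be positive for the ratio to make sense")
    oos_annualized = oos_profit * 365 / oos_days
    is_annualized = is_profit * 365 / is_days
    return oos_annualized / is_annualized * 100


# Hypothetical numbers: 7 years in-sample, 3 years out-of-sample.
is_days, oos_days = 7 * 365, 3 * 365
good = robustness_index(oos_profit=18_000, oos_days=oos_days,
                        is_profit=70_000, is_days=is_days)
bad = robustness_index(oos_profit=-6_000, oos_days=oos_days,
                       is_profit=70_000, is_days=is_days)
print(f"profitable OOS: {good:.1f}%   losing OOS: {bad:.1f}%")
```

In the first case the strategy earns 10,000 per year In-Sample and 6,000 per year Out-of-Sample, so the index is 60% and passes the 50% criterion. In the second case the Out-of-Sample part loses money, so the index is negative.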

In conclusion, let’s have a look at two equity curves divided into In-Sample and Out-of-Sample data (Figure 5 and Figure 6).

The first is a trading strategy with a negative Robustness Index, because it is not profitable on out-of-sample data (Figure 5):

Figure 5: Equity curve: Good in-sample and bad out-of-sample result

In contrast, Figure 6 shows the equity curve of a trading strategy with good out-of-sample performance (green line) and a high Robustness Index value:

Figure 6: Equity curve: Good in-sample and out-of-sample result

Many inexperienced traders would conclude that this out of sample test (green line) is enough to confirm the strategy’s profitability. But remember: we focus on science, and we need to focus on the proof.

Why is one out-of-sample test not proof? Imagine a football player. In training (in-sample data) he performs exceptionally well. Then there is a single match (out-of-sample), and he plays well. Is that proof that he will perform well in every single game? It is not. To prove his quality, he needs to deliver exemplary performance in almost every match. The same holds for a trading strategy.

One out-of-sample test is simply not enough to prove the quality of a trading strategy. Fortunately, there is a solution to this issue: the Walk Forward Out of Sample Test. To understand the concept of walk-forward testing, we need to go through optimization tests first.

#### Missing optimization tests

A backtest may be the first indication of a strategy’s quality. Still, by no means should you conclude from a single backtest that a strategy will also be profitable in live trading. You must treat the basic backtest as only one part of the entire strategy performance evaluation.

For example, I consider backtesting only as a test that shows me the potential profitability of a strategy on the historical data for the selected market. In the following example, I will explain why the basic backtest never sufficiently evaluates the quality of a trading strategy:

Imagine that you have created a strategy with simple entry and exit conditions: Crossings of two moving averages (MA). In the chart below (Figure 7), you can see two examples of such crossings:

Figure 7: Example of the crossing of two moving averages

The “Fast” moving average calculates and draws into the chart the arithmetic average of the last 15 CLOSE values (blue line). The “Slow” moving average draws the average of the last 30 CLOSE values (purple line).

On the left side, you can see that if the fast moving average (FMA) crosses above the slow moving average (SMA), the trading strategy enters a long position (LONG) with a BUY MARKET order at the bar’s CLOSE price (or, equivalently, at the OPEN of the next bar). In the opposite situation, if FMA crosses below SMA, the system closes the long position with a SELL MARKET order and, at the same time, opens a SHORT position at the CLOSE price of the bar. Thus, we are always in a position (a reversal trading strategy).
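The crossover rules can be written down as a short sketch. This is plain Python with made-up prices, not the platform’s own code; the function names are hypothetical, and the lengths 2 and 4 in the example are chosen only to keep the data short (the text uses 15 and 30).

```python
def simple_moving_average(values, length):
    """Arithmetic average of the last `length` values; None until enough bars exist."""
    averages, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= length:
            running -= values[i - length]
        averages.append(running / length if i >= length - 1 else None)
    return averages


def crossover_positions(closes, fast_len=15, slow_len=30):
    """Position after each bar: +1 long, -1 short, 0 before the first cross."""
    fast = simple_moving_average(closes, fast_len)
    slow = simple_moving_average(closes, slow_len)
    positions, position, prev_diff = [], 0, None
    for f, s in zip(fast, slow):
        if f is not None and s is not None:
            diff = f - s
            if prev_diff is not None:
                if prev_diff <= 0 < diff:      # fast crosses above slow: go long
                    position = 1
                elif prev_diff >= 0 > diff:    # fast crosses below slow: reverse to short
                    position = -1
            prev_diff = diff
        positions.append(position)
    return positions


# Hypothetical closing prices: a dip, a rally, then another decline.
closes = [10, 9, 8, 7, 8, 9, 10, 11, 10, 9, 8, 7]
print(crossover_positions(closes, fast_len=2, slow_len=4))
```

Once the first cross has happened, the position is never flat again, which matches the always-in-the-market behavior of a reversal strategy.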

Suppose the FMA 15 and SMA 30 setting seems profitable at first glance. We then need to test whether the system would still be profitable with other parameter values (for example, FMA 10 and SMA 40). We will try different combinations of the strategy’s parameter values on historical data.

For example:

FMA: 5 10 15 20 25 30 35 40 45 50

SMA: 55 60 65 70 75 80 85 90 95 100

Suppose the result shows that of the 100 combinations of FMA and SMA values, only 20 were profitable on the out-of-sample data, i.e., only 20% of the combinations, which is not sufficient. The aim is for most combinations within the optimization to be profitable. The presented example is very simplified and not applicable in practice. Many traders are not familiar with this vital part of trading strategy evaluation or do not understand it correctly.
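Counting the share of profitable combinations is simple once each combination has an out-of-sample result. In the sketch below, `toy_oos_net_profit` is a hypothetical stand-in for a real backtest; it is rigged so that exactly 20 of the 100 combinations are profitable, mirroring the example above.

```python
from itertools import product


def profitable_share(param_grid, oos_net_profit):
    """Return (profitable count, total count, profitable percentage) over a grid."""
    combos = list(product(*param_grid.values()))
    profitable = sum(1 for combo in combos if oos_net_profit(*combo) > 0)
    return profitable, len(combos), 100.0 * profitable / len(combos)


grid = {
    "fma": range(5, 55, 5),     # 5, 10, ..., 50
    "sma": range(55, 105, 5),   # 55, 60, ..., 100
}


def toy_oos_net_profit(fma, sma):
    """Placeholder for a real out-of-sample backtest (assumption for illustration)."""
    return 1.0 if fma <= 10 else -1.0


count, total, pct = profitable_share(grid, toy_oos_net_profit)
print(f"{count} of {total} combinations profitable ({pct:.0f}%)")
```

In practice, `oos_net_profit` would run the strategy with the given inputs on the out-of-sample period and return its net profit.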

Now let’s have a look at the EXCELSIOR-RUS2000 and its optimization test. In this strategy, we use two inputs (parameters) for optimization:

N1 from 1 to 25 with an increment 1

X1 from 0.1 to 5.5 with an increment 0.1

The optimization was applied to the E-mini Russell 2000 market over the date range 2008-2020. In-Sample data is from 2008-2014, and Out-of-Sample data is from 2015-2020. Let’s have a look at the optimization table from the TradeStation platform:

Figure 8: TradeStation Strategy Optimization Report for EXCELSIOR-RUS2000

In Figure 8 you can see ten randomly selected optimization tests. In total, we got 1375 combinations. Out of these 1375 combinations, only 5 were not profitable on out-of-sample data (2015-2020), i.e., only 0.36%.
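The combination count and the percentage quoted above can be checked with a few lines of Python. Only the two input ranges and the figure of 5 unprofitable combinations come from the text; the variable names are illustrative.

```python
from itertools import product

n1_values = range(1, 26)                       # N1: 1 to 25, increment 1
x1_values = [i / 10 for i in range(1, 56)]     # X1: 0.1 to 5.5, increment 0.1

combinations = list(product(n1_values, x1_values))
total = len(combinations)                      # 25 values times 55 values
unprofitable = 5                               # figure quoted in the text
unprofitable_pct = 100 * unprofitable / total

print(f"{total} combinations, {unprofitable_pct:.2f}% unprofitable")
```

Building the X1 values from integers avoids the rounding drift that repeated addition of 0.1 would introduce.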

That is a fantastic result that supports real robustness. A share of unprofitable combinations above 20% is suspicious to us, so a share of 0.36% is exceptional. The next critical metric is the Robustness Index. It compares the standardized profitability per day between In-Sample and Out-of-Sample data.

The Average Robustness Index from those 1375 tests is 65.5%, which is also an encouraging result. We expect the Average Robustness Index to be higher than 50%. Thanks to this good value, TradeStation recommends EXCELSIOR-RUS2000 for Walk Forward Out of Sample tests.

#### Missing Walk-Forward Out-of-Sample test

Walk-Forward Out of Sample testing is a statistical tool available in the TradeStation platform. It has two purposes:

- To provide proof that a strategy is robust by performing multiple forward out-of-sample tests. Remember, the football player needs to prove his ability to deliver quality performance in most matches.
- To determine the best possible values of strategy inputs (for example, the moving average lookback period) to use in a trading strategy.

The procedure works as follows. The trading strategy is optimized on in-sample data for a time window in the data series, and the remaining data is reserved for out-of-sample testing. A small portion of the reserved data immediately following the in-sample window is tested, and the results are recorded. The in-sample window is then shifted forward by the period covered by the out-of-sample test, and the process is repeated. The following scheme shows a series of In-Sample and Out-of-Sample tests on historical data with individual Walk-Forward OOS tests (8 tests in total).

Figure 9: Rolling Walk Forward Analysis

In Figure 9, you can see an example of WFA over 12 months. This WFA includes 8 In-Sample parts (blue fields) and 8 Out-of-Sample parts (green fields). Thus, from the fifth month onward, we simulate live trading conditions on unseen data.

We perform an Optimization test for all In-Sample runs. After identifying input parameter settings with the highest In-Sample Fitness Function (for example, Net Profit, Net Profit / Max Drawdown) we apply these settings to the Out-of-Sample data.

Forward testing is also known for its powerful ability to create a simulation of live trading. It tries to simulate the conditions when you change parameters over time to adjust to changing market conditions.

As an example, imagine that you have a strategy with two parameters (inputs): the lengths of a longer and a shorter moving average. For the longer moving average, we use a parameter range from 50 to 100 with increments of 10 (i.e., 6 values), and for the shorter moving average, a range from 5 to 45, also with increments of 10 (i.e., 5 values).

In total, this optimization setting gives 5 x 6 = 30 possible combinations of the two input parameters. We choose the highest Net Profit as the fitness function. In the first In-Sample run, i.e., the first to the fourth month, the combination with the highest Net Profit was 5 for the shorter moving average and 60 for the longer moving average.

Therefore, we apply the parameter settings 5 and 60 to the first Out-of-Sample test, i.e., the fifth month. In the second In-Sample run, i.e., the second to the fifth month, the combination with the highest Net Profit was 15 and 80.

We therefore apply the parameter settings 15 and 80 to the second Out-of-Sample test, i.e., the sixth month. We apply the same principle to the following runs, up to In-Sample run 8, whose settings are applied to the last Out-of-Sample data, i.e., the 12th month.
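The rolling scheme can be sketched as follows. The window sizes (4 months In-Sample, 1 month Out-of-Sample, 12 months in total) and the two parameter ranges come from the example above; the function names and `toy_fitness` are hypothetical. The toy fitness is deliberately trivial and always prefers the same pair, whereas a real run would backtest each pair on the given months.

```python
from itertools import product


def rolling_walk_forward(n_periods, in_sample_len, out_of_sample_len):
    """List of (in-sample periods, out-of-sample periods), numbered from 1."""
    runs, start = [], 1
    while start + in_sample_len + out_of_sample_len - 1 <= n_periods:
        oos_start = start + in_sample_len
        runs.append((list(range(start, oos_start)),
                     list(range(oos_start, oos_start + out_of_sample_len))))
        start += out_of_sample_len   # shift by the length of the OOS test
    return runs


def walk_forward(runs, candidates, fitness):
    """Optimize on each in-sample window, then score the winner out of sample."""
    results = []
    for is_periods, oos_periods in runs:
        best = max(candidates, key=lambda params: fitness(params, is_periods))
        results.append((best, fitness(best, oos_periods)))
    return results


def toy_fitness(params, periods):
    """Placeholder for a backtest's Net Profit (assumption for illustration)."""
    short_len, long_len = params
    return -abs(short_len - 15) - abs(long_len - 80)


runs = rolling_walk_forward(n_periods=12, in_sample_len=4, out_of_sample_len=1)
candidates = list(product(range(5, 46, 10), range(50, 101, 10)))  # 5 x 6 = 30
results = walk_forward(runs, candidates, toy_fitness)

for (is_periods, oos_periods), (best, _) in zip(runs, results):
    print(f"IS months {is_periods[0]}-{is_periods[-1]} -> OOS month {oos_periods[0]}: inputs {best}")
```

The list of winning inputs per run is also the raw material for checking how much the optimal parameters move from one In-Sample window to the next.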

This principle is called “Rolling” walk forward out-of-sample testing. The point is that we divide the historical data into smaller parts, which gives us more Out-of-Sample tests for evaluating the strategy’s robustness. It is important to watch the strategy parameters: if they do not change significantly with each new in-sample window, that is further evidence that the strategy is robust. If they change significantly with every in-sample window, they are simply being overfitted to each new training dataset.

After performing these tests, we assess the robustness potential by the predetermined test criteria. Those test criteria contain conditions for overall efficiency, profitability, stability, and minimized drawdown.

Let me show you an example of two real and profitable trading strategies. The first strategy is applied to Rough Rice (RR), a futures market (the yellow curve in Figure 10 represents the market price data). The second strategy is our well-known E-mini Russell 2000 strategy, which has already been presented in this ebook (Figure 11).

Each chart also shows an equity curve (green), which contains 5 out-of-sample tests:

Figure 10: Rough Rice Strategy – Walk Forward Analysis

Figure 11: E-mini Russell 2000 Strategy

The In-Sample to Out-of-Sample ratio is 70%/30%. The horizontal axis represents trades over time; the vertical axis represents Cumulative Net Profit. As you can see, the out-of-sample tests start at approximately the 650th trade for the Rough Rice strategy (Figure 10) and at the 240th trade for the E-mini Russell 2000 strategy (Figure 11).

From this trade, you get out of sample results – a simulation of live trading. But what about trades before? How should we approach this part of the data? How should we properly evaluate out-of-sample results? Are they good enough to conclude that we have a strategy that will be profitable in live trading?

Many traders do believe that they understand enough to interpret the results. But based on our experience with clients, the opposite is true. And if you don’t fully grasp the knowledge behind these advanced tests, you might never be able to identify robust strategies.

If you want a realistic scenario for your potential live trading results, walk forward analysis can be the right answer. To sum it up, which questions should a walk-forward analysis answer?

- Will a trading strategy be profitable on unseen data – multiple out of sample tests?
- Does a strategy perform well on out of sample data?
- Should we re-optimize the system’s input parameters (Inputs), and if yes, when and how often?

### Cluster Walk Forward analysis

Answers to all the questions mentioned above can be found with a well-known TradeStation tool. This tool goes beyond the possibilities offered by most trading software platforms and is called Cluster Walk Forward Analysis (“CWFA”).

CWFA helps us maximize the probability that a trading strategy is robust. Why? Because CWFA contains multiple walk forward analyses (WFAs) with different Out-of-Sample periods, expressed as percentages, and different numbers of runs. The more WFA tests a strategy passes, the higher the probability that you have a potentially profitable trading strategy in your pocket.

Figure 12: Successful Cluster Walk Forward Analysis

In Figure 12, you can see a classic example of CWFA that includes 30 WFAs with different Out-of-Sample periods (rows), namely 10%, 15%, 20%, 25%, and 30%, and with varying numbers of runs (columns), namely 5, 10, 15, 20, 25, and 30. In this example, all 30 WFAs have passed the various test criteria.

In other words, the strategy meets the conditions for being considered robust. In this case, the overall result for each WFA is a box saying “PASS”. In practice, however, we will more often see most of the 30 WFAs fail some of the test criteria, in which case the WFA’s result is a box saying “FAILED” (Figure 13):

Figure 13: Unsuccessful Cluster Walk Forward Analysis

As we are starting to deal with a number of concepts, it is best to show graphically (see Figure 14) how Walk Forward tests, Walk Forward Analysis, and Cluster Walk Forward Analysis relate to one another. CWFA includes many WFAs, and each WFA is built from individual Walk Forward tests.

Figure 14: The link between Walk Forward tests, Walk Forward Analysis, and Cluster Walk Forward Analysis


Now let’s get back to our results. A “FAILED” outcome is common because CWFA is a very demanding robustness test, and only a few strategies are robust enough to meet its strict criteria. The most crucial test criteria for each walk forward analysis are:

- Overall profitability of out-of-sample tests
- High efficiency of out-of-sample tests compared to in-sample data (more than 50%)
- Stable distribution of profits on out-of-sample data
- Acceptable maximum drawdown
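The four criteria above could be checked along the following lines. This is a rough sketch under stated assumptions, not TradeStation’s actual rules: only the 50% efficiency threshold comes from the text, while the drawdown limit, the definition of “stable distribution” (no single run contributing more than half of the total profit), and all names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class WfaResult:
    oos_run_profits: list    # net profit of each out-of-sample run
    efficiency_pct: float    # Robustness Index of the whole WFA
    max_drawdown: float      # worst out-of-sample drawdown, as a positive number


def wfa_passes(result, min_efficiency=50.0, max_allowed_drawdown=10_000.0,
               max_single_run_share=0.5):
    total = sum(result.oos_run_profits)
    if total <= 0:                                   # overall OOS profitability
        return False
    if result.efficiency_pct < min_efficiency:       # efficiency vs. in-sample
        return False
    if max(result.oos_run_profits) / total > max_single_run_share:
        return False                                 # profit too concentrated (assumed rule)
    return result.max_drawdown <= max_allowed_drawdown


# The cluster grid described in the text: 5 OOS percentages x 6 run counts.
cluster = list(product([10, 15, 20, 25, 30], [5, 10, 15, 20, 25, 30]))
print(len(cluster), "walk forward analyses in the cluster")

# Hypothetical results for two analyses.
solid = WfaResult([1000, 800, 1200, 900, 1100], efficiency_pct=65.5, max_drawdown=4000)
weak = WfaResult([1000, 800, 1200, 900, 1100], efficiency_pct=30.0, max_drawdown=4000)
print(wfa_passes(solid), wfa_passes(weak))
```

A cluster-level verdict would then be derived from how many of the 30 cells pass.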

Figure 16 shows a perfect example of a strategy that is potentially very robust for the selected market and timeframe. Such a result is relatively rare: searching for genuinely robust strategies can be likened to searching for a needle in a haystack.

So what about the result of EXCELSIOR-RUS2000 for E-mini Russell 2000 (Figure 15)?

Figure 15: Cluster Walk Forward Analysis for E-mini Russell 2000 strategy

As you can see, this strategy is very robust. Let’s see whether it also passes the other essential stress tests.

### Stress tests

Most beginners don’t understand how important it is to stress-test strategies. The stress tests are performed to test the robustness of a trading strategy. Robustness is an insensitivity to variations in the data on which the strategy is based.

The more robust a trading strategy is, the more likely it is to perform well in live trading. Testing a trading system for robustness is often called parameter sensitivity analysis. The basic idea is to evaluate what happens when changes are made to the strategy inputs (parameters), price data, or other essential strategy elements. A robust strategy does not react too much to small changes.

In contrast, a strategy that is not robust will react disproportionately when small modifications are made to its inputs or environment. So why is the concept of robustness so crucial? Because markets change over time and are very unstable. Consider a strategy input such as the lookback length of a moving average: some value might be optimal over the backtesting period.

Still, on out-of-sample data, different values might give better performance. The ultimate goal of a robustness test is to evaluate how the strategy copes with changing market conditions when its inputs are no longer optimal.

For illustration purposes, we will analyze the EXCELSIOR-RUS2000 for all necessary stress tests.

#### Stress test: Changing the strategy inputs

One way to assess the stability of the parameters is to see how the results change when the input values are changed. Inputs can be lookback periods of indicators, parameters of price patterns, etc.

We now know that the EXCELSIOR-RUS2000 strategy has real potential, as it passed the Cluster Walk Forward Analysis. But how about a stress test where we vary the strategy inputs? This strategy has two inputs: N1 and X1.

We will vary the inputs over these values: N1: 3, 4, 5, 6, 7, 8, 9, 10 and X1: 2.5, 3, 3.5, 3.75, 4, 4.25, 4.5, 5

Combining the modified input values gives 72 combinations in total. The strategy was developed on the in-sample data (2003-2015). The rest of the data is a simulation of live trading, a pure out-of-sample test. It makes no sense to apply stress tests to in-sample data, where the strategy was fit. Only out-of-sample data gives an accurate picture of how insensitive a trading strategy is.

Figure 16: 72 iterations of strategy input modification (EXCELSIOR-RUS2000)

As you can see from the chart (Figure 16), those 72 modifications make insignificant changes to the strategy’s overall performance. And that is a good sign.

#### Stress test: Making small changes to individual prices

Another way to identify an over-fit strategy is to make small changes to the individual Open, High, Low, and Close prices. Imagine the strategy enters at the low price of the day. What would the strategy’s performance look like if the low had been one tick lower on those days?

If such an insignificant change threatens the performance, the strategy is clearly not robust and almost certainly won’t be profitable in live trading. In this test, we change the open, high, low, and close of each bar with 20% probability.

First, we randomize HIGH and LOW, and then CLOSE and OPEN, keeping them inside the HIGH-LOW interval. In Figure 17 below, you can see that all equity curves are close to each other. It therefore seems that the strategy reacts well to small changes in individual prices and is not overfitted to a single historical price series.
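A minimal sketch of this price randomization is shown below. The 20% probability and the order of operations (HIGH and LOW first, then OPEN and CLOSE clamped into the new range) follow the description above. The size of each change is not stated in the text, so a move of exactly one tick up or down is an assumption, as are the function name, the tick size, and the sample bars.

```python
import random


def randomize_bar(bar, tick, prob=0.2, rng=random):
    """Return a perturbed (open, high, low, close) tuple.

    Each price moves by one tick up or down with probability `prob`
    (the one-tick size is an assumption, not taken from the source).
    """
    open_, high, low, close = bar

    def nudge(price):
        if rng.random() < prob:
            return price + rng.choice((-tick, tick))
        return price

    # Step 1: randomize HIGH and LOW, keeping high >= low.
    high, low = nudge(high), nudge(low)
    if high < low:
        high, low = low, high

    # Step 2: randomize OPEN and CLOSE, then clamp them into [low, high].
    open_ = min(max(nudge(open_), low), high)
    close = min(max(nudge(close), low), high)
    return (open_, high, low, close)


# Hypothetical bars with a 0.25 tick size.
rng = random.Random(42)
bars = [(1500.00, 1502.50, 1499.25, 1501.75),
        (1501.75, 1503.00, 1500.50, 1500.75)]
for original in bars:
    print(original, "->", randomize_bar(original, tick=0.25, prob=0.2, rng=rng))
```

Running the strategy on many such perturbed series, each with a different seed, produces the family of equity curves that the test compares.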

Figure 17: EXCELSIOR-RUS2000 on randomized prices

#### Stress test: Changing the starting bar

A good strategy should not underperform when you start the backtest on a different bar (that is, at a different point in history). Imagine a trading strategy that enters long on an exponential moving average crossover. This strategy then holds the trade for exactly five bars before exiting with a market order.

Now consider what the trade history might look like on a price chart. Suppose a short-term moving average crosses above a long-term moving average. In that case, it’s possible that in a sustained up-trend, the entry condition could be valid for an extended period. Now consider what would happen if the starting bar were changed.

Some trades would possibly be much more profitable than others, depending on how the trades aligned with any underlying five-bar trend cycle that existed. So, depending on the starting bar, the strategy might be highly profitable or unprofitable because of where the trades started and ended. During deployment, it might not be evident that the strategy logic had this type of dependency on the starting bar.
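The dependency on the starting bar can be demonstrated with a toy example. Everything here is an illustrative assumption: the price series is an artificial 10-bar cycle, the entry signal is simply “always valid” (standing in for a sustained up-trend), and the function names are hypothetical. Only the five-bar holding period comes from the text.

```python
def fixed_hold_backtest(closes, signal, hold_bars=5, start_bar=0):
    """Go long when flat and signal[i] is True; exit exactly hold_bars later."""
    pnl, i, last = 0.0, start_bar, len(closes) - 1
    while i + hold_bars <= last:
        if signal[i]:
            pnl += closes[i + hold_bars] - closes[i]
            i += hold_bars       # flat again only on the exit bar
        else:
            i += 1
    return pnl


def start_bar_sensitivity(closes, signal, hold_bars=5, max_shift=5):
    """Net result of the same rules for each starting bar 0..max_shift-1."""
    return [fixed_hold_backtest(closes, signal, hold_bars, shift)
            for shift in range(max_shift)]


# Artificial prices: five bars up, five bars down, repeated.
cycle = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1]
closes = [100 + cycle[i % 10] for i in range(41)]
signal = [True] * len(closes)    # entry condition valid on every bar

for shift, pnl in enumerate(start_bar_sensitivity(closes, signal)):
    print(f"start bar {shift}: net result {pnl:+.1f}")
```

The rules and the data are identical in every run, yet the net result changes sign depending on where the first trade happens to start.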

#### Stress test: Changing the size of the bars and custom trading sessions

To check that the EXCELSIOR-RUS2000 strategy is not overfitted for the given bar size and custom trading session, we would like to know how the equity changes if we use slightly different bars and sessions.

Most important is local randomization, where we look only at small changes (hence “local” or “neighborhood”). We used five bar lengths (65, 68, 70, 72, and 75 minutes) and eight sessions, each shifted by 7 minutes, which gives 40 basic settings in total. Since futures trade all day, we cut the sessions to fit into classical market hours, 9 am to 4 pm. The shifts are easy to illustrate:

- 65 min bar, shift = 0 min, session: 9:45-15:10
- 65 min bar, shift = 7 min, session: 9:52-15:17
- 65 min bar, shift = 14 min, session: 9:59-15:24
- 65 min bar, shift = 21 min, session: 9:01-15:31
- …
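The listed sessions can be reproduced with a small helper. The rule used here is inferred from the four examples above and is an assumption: bars are aligned to 9:45 plus the shift, and every whole bar that fits between 9:00 and 16:00 is kept. The function names are hypothetical, and the 9:45 anchor is taken from the first listed session, so it is only confirmed for 65-minute bars.

```python
def fit_session(anchor_min, bar_len, day_open=9 * 60, day_close=16 * 60):
    """Align bars to anchor_min and keep all whole bars inside [day_open, day_close].

    Times are minutes since midnight. Returns (session start, session end).
    """
    start = day_open + (anchor_min - day_open) % bar_len
    n_bars = (day_close - start) // bar_len
    return start, start + n_bars * bar_len


def as_clock(minutes):
    return f"{minutes // 60}:{minutes % 60:02d}"


base_anchor = 9 * 60 + 45    # 9:45, from the first listed session
sessions = []
for shift in (0, 7, 14, 21):
    start, end = fit_session(base_anchor + shift, bar_len=65)
    sessions.append(f"{as_clock(start)}-{as_clock(end)}")
    print(f"65 min bar, shift = {shift} min, session: {sessions[-1]}")
```

With a 21-minute shift, the first bar boundary at 10:06 leaves room for one more whole bar before it, which is why that session starts at 9:01.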

For each setting, we get a different session. By the way, don’t be afraid of using different bar lengths and sessions. What works on 5 min bars should work on 4 or 6 min bars and shifted sessions. If not, it is just an overfit.

Figure 18: EXCELSIOR-RUS2000 on different bars’ length and sessions

Looking at the equity curves, even without any statistical analysis, this strategy appears quite robust (Figure 18). A tiny change in the sessions and bars does not change the result significantly.

In some cases the overall performance is quite different. Still, returns are more than 60% correlated, so the equity curves follow more or less the same path.

We could not expect almost identical equity curves, but this stable distribution is a good sign. We have seen strategies where a small change in bars or sessions made perfectly profitable strategies unprofitable; that is behavior we don’t want to see.

In the chart, we created a color map that goes from light colors for 65 min bars (yellow), through 68 (orange), 70 (red), to 72 (purple), and finally 75 (black). We can see that the session setting can also make a huge difference for some bar settings, but the most stable bars for sessions are the red and purple ones (70 and 72).

Changing the time session makes the difference because, with different sessions, we get different daily bars, and inside the algorithm, we use the values for actual highs/lows of the day and close price from the previous day.

Figure 19: EXCELSIOR-RUS2000 on different bars’ length and sessions (broader neighborhood)

We also created a more extensive test, where we used a wider range of bar lengths (50, 60, 70, 80, and 90 minutes) with four different sessions, each shifted by 14 minutes (Figure 19). To our surprise, even the broader neighborhood works very well. We use similar coloring: the darker the color, the longer the bar. The 90-minute bars are too long, because they cause the highest variation across sessions. This analysis shows that the price pattern used is very robust, and that is what we want to see from an algo-trading strategy.

We used 1-minute data from TradeStation as a continuous contract (with automatic rollovers provided by TradeStation). From these data, we created our bars and sessions. Unfortunately, the saying “the only correct data are those you collect yourself” (or construct yourself from high-quality tick data) always holds.

These data have some errors, such as a few missing minute bars or occasionally missing parts of a day. Also, because of TradeStation’s automated rolling, the plot shows a few larger drops or gains within one day that were reversed the next day. Handling contract rollovers is very important when trading futures. In this case, these data errors, which occur every three months, do not affect the overall performance.

#### Randomization Test

One of the most crucial tests is testing against randomness. That means your strategy has to beat random strategies with the same properties. If your strategy doesn’t beat randomness, the conclusion is straightforward: trading at random can give better results, so your strategy is useless.

In this case, we are not comparing our strategy to something entirely random but to a random strategy with similar properties: the number of trades, the average length of a position, the proportion of long/short trades, the average pause between trades, and so on. Simply put, only the entry and exit dates are randomized.

We will show you a simple example of randomizing EXCELSIOR-RUS2000. We do two types of these tests, depending on the complexity of a strategy. If the strategy trades only one market, it is easy to use the actual distribution of trades and randomize just dates (this is our case for this example).

On the other hand, if the strategy is more complex or trades many markets (we consider each stock as one market), we calculate the mathematical properties of the trades’ distribution from the backtest. We then create random samples so that the resulting random trades come from the same distribution: not in returns, but in the number of trades, the maximum, minimum, and average trade duration, the proportion of long/short positions, etc.

Since this strategy uses only one time series, it is easy to construct random trades with exactly the same properties in Python. This strategy’s out-of-sample period is from 2015 to 2020, so we will use this period for the randomization test. Basic features of the backtested trades:

- Number of trades: 274, longs 118, shorts 156
- Average trade duration: 28 bars or 7 days (calendar)
- Median trade duration: 8 bars or 3 days (calendar)
- Short-term (up to 10 bars): 158 trades, only 18 longs, 140 shorts
- Medium-term trades (11-100 bars): 94 trades, 78 longs, and only 16 shorts
- Long-term trades (101 bars and more): 22 trades, all longs

It is essential to look at how shorts and longs are distributed over trade lengths. Because of the long bias of the stock market, this strategy does not take long-term shorts. If you don’t account for this fact in the randomization, you are comparing different things and could come to false conclusions.

We will do the type 1 randomization and use all the trades. We will only shuffle the start and end dates (and, accordingly, the entry and exit prices). We work with trades only (daily returns are not necessary). Slippage and all costs are included, and we always use one contract of RTY (E-mini Russell 2000).

Let’s look at the results; we plot equity curves for the backtest and all random strategies. Usually, the results are directly visible from the plot.

Figure 20: Randomization test for EXCELSIOR-RUS2000 strategy

In Figure 20, we can see that EXCELSIOR-RUS2000 beats randomness (the green line is our current backtest; the grey lines are random strategies on RTY). The final percentile is 100%, and over time it stays above 95% (except for a few parts of 2015).
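The core of such a randomization test can be sketched as follows. This is a simplified illustration, not the code behind Figure 20: it keeps each trade’s duration and direction and randomizes only the entry bar, but it ignores costs and slippage and allows random trades to overlap. The function names, the artificial price series, and the “perfect” trade list are all hypothetical.

```python
import random


def trade_pnl(closes, entry, duration, direction):
    """Result of one trade; direction is +1 for long, -1 for short."""
    return direction * (closes[entry + duration] - closes[entry])


def random_strategy_pnl(closes, trade_shapes, rng):
    """Same durations and directions as the backtest, random entry bars."""
    total = 0.0
    for duration, direction in trade_shapes:
        entry = rng.randrange(0, len(closes) - duration)
        total += trade_pnl(closes, entry, duration, direction)
    return total


def percentile_vs_random(closes, actual_trades, n_random=1000, seed=0):
    """Percentage of random strategies that the actual backtest beats."""
    rng = random.Random(seed)
    actual = sum(trade_pnl(closes, e, d, s) for e, d, s in actual_trades)
    shapes = [(d, s) for _, d, s in actual_trades]
    beaten = sum(1 for _ in range(n_random)
                 if random_strategy_pnl(closes, shapes, rng) < actual)
    return 100.0 * beaten / n_random


# Artificial triangle-wave prices: 10 bars up, 10 bars down, repeated.
closes = [100 + (i % 20 if i % 20 <= 10 else 20 - i % 20) for i in range(201)]

# A deliberately perfect strategy: long every upswing, short every downswing.
actual_trades = []
for k in range(10):
    actual_trades.append((20 * k, 10, +1))
    actual_trades.append((20 * k + 10, 10, -1))

percentile = percentile_vs_random(closes, actual_trades)
print(f"backtest beats {percentile:.1f}% of random strategies")
```

The same percentile can be computed per calendar year by restricting both the actual and the random trades to that year.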

We can also look at percentiles of yearly returns to see how the strategy is doing within the plot (usually, we don’t beat randomness every month, but an annual edge should be visible). As stated before, we calculated equity from trades, not from daily returns; that’s why we can see jumps in the equity curves.

| Year | Percentile of Return |
| --- | --- |
| 2015 | 83.63 % |
| 2016 | 99.45 % |
| 2017 | 82.17 % |
| 2018 | 95.64 % |
| 2019 | 64.45 % |
| 2020 | 99.95 % (through August 2020) |

Table 1: EXCELSIOR-RUS2000 vs. Randomness: Percentile of Return Year by Year

This analysis is simple and gives straightforward results, but it can sometimes be hard to compute. We usually want to see the strategy beat randomness overall on the plot, with a percentile over 90 % or 95 %. Within individual years, it is good to reach at least the 80th percentile; if the overall result is exceptionally good, we can accept lower percentiles for some years. We can see that 2019 was weaker than the other years (Table 1).

Randomization tests serve as a confirmation that you beat random strategies. Thus, you get a real edge in the market. It is necessary to apply these tests to real out-of-sample data. On in-sample data where the strategy is fit, the results can be too optimistic. Be careful with that.

#### SUMMARY

As you can see, it is essential to understand that robustness is related to strategy over-fitting. We need to be sure that the strategy has not been fit so tightly to the market during the strategy building process that it can’t withstand reasonable changes to the market.

A trading strategy that does not hold up under relatively small changes is not robust and is likely to be over-fit. Such a strategy is unlikely to perform well in live trading, and if it does, it is a matter of sheer luck.

Varying the variables randomly over a large number of iterations can provide you a much better insight into potential performance in live trading.

The crucial questions that can be related to stress tests are:

- How should you approach and evaluate the stress tests applied to in-sample, out-of-sample, and validation data? Are there any issues with it?
- How should you approach the setup of these stress tests to increase the probability of detecting overfit strategies?
- Is there an ideal way of combining different types of stress tests, and is it necessary?
- Should you apply stress tests to the building process of the strategies or only after the build process?
