Testing Statistical Significance in Financial Data with Python
Statistical significance tests help determine whether two or more datasets are different and how meaningful their difference is. These tests evaluate whether observed discrepancies, such as a trading strategy outperforming the S&P 500 benchmark or a particular student group achieving higher exam scores than another, can be attributed to genuine effects rather than mere chance. Essentially, the question at the heart of these tests is whether the observed differences are a product of "luck" or if they would likely recur under the same experimental or study conditions, thereby indicating consistent variations in the datasets or groups.
Applications of statistical significance tests are wide and varied. Examples include:
- Evaluating trading strategies by comparing the returns of different strategies or portfolios against a benchmark or each other.
- Evaluating whether an event had a statistically significant impact on stock prices by using event study methodologies combined with significance tests like the t-test.
- Assessing whether there are patterns in returns indicating market inefficiencies, using statistical tests like serial correlation tests or runs tests.
- Investigating the relationships between different financial variables, such as interest rates and stock market returns, using correlation and regression analysis.
- Backtesting Value at Risk (VaR) models by comparing the predicted maximum losses to actual losses over a period. Significance tests help determine whether discrepancies between predicted and actual losses are statistically significant.
Statistical testing revolves around hypothesis testing, where a claim is evaluated against data. Selecting an appropriate statistical test depends on the data and the hypothesis. Common tests include Student's t-test for comparing means, the chi-square test for categorical data, the Kruskal-Wallis test for comparing more than two groups, and correlation tests to examine the relationships between variables.
A test is used to calculate a test statistic from your data. This numerical value represents the degree of difference or association observed in your data. For instance, in a Student's t-test, the test statistic is the t-value, which measures the difference in means between two groups relative to the variability within those groups. Note that the terms "t-value," "t-statistic," and "t-stat" are used interchangeably.
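To make this concrete, here is a minimal sketch (with made-up numbers) of what an independent two-sample t-test computes: the difference in group means scaled by the pooled variability, checked against scipy's implementation.
import numpy as np
import scipy.stats as stats
a = np.array([2.1, 2.5, 2.8, 3.0, 2.4])
b = np.array([3.1, 3.4, 2.9, 3.6, 3.3])
n1, n2 = len(a), len(b)
# Pooled variance: weighted average of the two sample variances (ddof=1)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_scipy, _ = stats.ttest_ind(a, b)  # assumes equal variances by default
print(t_manual, t_scipy)  # both values should match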
Each statistical test assumes certain conditions about the data. If these assumptions are not met, the test statistic might be irrelevant, highlighting the importance of choosing the appropriate test for your data and hypothesis.
Parametric vs Non-Parametric Tests
Statistical tests are broadly categorized into parametric and non-parametric tests, each suited for different types of data and assumptions about their distribution.
Parametric tests are applied when data adhere to specific criteria, such as normally distributed data and equal variance. However, real-world data often deviate from these ideal conditions, limiting the applicability of parametric tests.
Parametric tests include:
- Student's t-test: Comparing the means of two independent groups, with variants like the one-sample t-test, independent samples t-test, paired samples t-test.
- Welch's t-test: Specifically designed for situations where two groups do not share equal variances or sample sizes, offering a more flexible approach than the standard t-test.
- ANOVA (Analysis of Variance): Comparing the means across three or more independent groups, with variations such as one-way ANOVA, two-way ANOVA, and repeated measures ANOVA to address different research questions.
- Pearson correlation: Assesses the linear relationship between two continuous variables, quantifying both the direction and strength of this relationship.
- Linear regression: Investigates how a dependent variable changes in relation to one (simple linear regression) or more (multiple linear regression) independent variables.
- Logistic regression: Models the probability of a binary outcome based on one or more predictor variables.
Non-parametric tests, or distribution-free tests, do not assume a specific underlying distribution for the data. They are particularly valuable when data do not meet the assumptions required for parametric testing.
Non-parametric tests include:
- Mann-Whitney U test (also known as the Wilcoxon rank-sum test): Compares the distributions of two independent groups.
- Wilcoxon signed-rank test: Used for comparing two paired or related samples to assess differences in their median values.
- Kruskal-Wallis H test: An extension of the Mann-Whitney U test for comparing more than two independent groups, assessing differences in their distributions.
- Spearman's rank correlation: Measures the strength and direction of the monotonic relationship between two variables, which can be linear or nonlinear, such as the correlation between the square footage of homes and their prices.
- Chi-squared test: Determines whether there is a significant association between categorical variables.
Choosing between parametric and non-parametric tests depends on the nature of the data and the specific research question, with non-parametric tests offering an alternative when data do not meet the assumptions of parametric tests.
How to choose a statistical test
Choosing the appropriate statistical test for your data involves a series of steps to assess the characteristics of your data and assumptions underlying various tests. Here is a simplified guide to help you make an informed decision.
1. Define Your Research Question and Hypothesis
Clearly specify what you aim to investigate. This determines whether you need a test for comparing means (e.g., t-tests, ANOVA), analyzing relationships (e.g., Pearson, Spearman correlation), or something else (e.g., chi-square test for categorical data).
2. Assess the Type of Data You Have
Identify whether your data are categorical or continuous, and whether you are comparing groups or looking at relationships between variables. This helps narrow down your choice of tests.
3. Check for Normality
Many parametric tests assume that the data are normally distributed. Use Q-Q plots and the Shapiro-Wilk test to assess normality.
- Q-Q Plots: A visual method to compare the distribution of your data against a normal distribution. Deviations from the line indicate departures from normality.
- Shapiro-Wilk Test: Provides a p-value to statistically assess normality. A p-value below a certain threshold (commonly 0.05) suggests that the data do not follow a normal distribution.
4. Test for Equal Variances
Some tests assume that variances are equal across groups (homoscedasticity).
- Levene's Test: A robust test for equal variances that is less sensitive to departures from normality.
- F-test (ANOVA assumption check): More sensitive to normality, used primarily when comparing means across more than two groups and assuming normality.
5. Choose the Test Based on Your Findings
- Normal Distribution and Equal Variances:
- Use parametric tests such as the independent samples t-test for comparing two means or ANOVA for more than two groups.
- Non-Normal Distribution or Unequal Variances:
- Opt for non-parametric alternatives like the Mann-Whitney U test (instead of the t-test) or the Kruskal-Wallis test (instead of ANOVA) for comparing groups. Use Welch's t-test if you suspect unequal variances between two groups but still consider a t-test appropriate.
- Relationships Between Variables:
- For normally distributed data, Pearson's correlation can assess linear relationships. If data are non-normal or you're interested in monotonic relationships, use Spearman's rank correlation.
6. Conduct the Test and Interpret Results
Perform the chosen statistical test using appropriate software or programming libraries, such as Python. Interpret the results in the context of your research question, focusing on p-values to assess statistical significance and effect sizes to understand the magnitude of any observed effects.
This process is iterative and might require adjustments based on your findings at each step. It's essential to clearly understand the assumptions of each test and ensure your data meet these assumptions for the results to be valid.
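The workflow above can be sketched in code. The following helper is illustrative only (the alpha threshold and the decision rules are simplifications, not prescriptions): it checks normality and equal variances for two independent samples and picks a test accordingly.
import numpy as np
import scipy.stats as stats
def compare_two_groups(a, b, alpha=0.05):
    """Illustrative helper: pick a two-sample test based on assumption checks."""
    normal = (stats.shapiro(a).pvalue > alpha) and (stats.shapiro(b).pvalue > alpha)
    equal_var = stats.levene(a, b).pvalue > alpha
    if normal and equal_var:
        name, res = "Student's t-test", stats.ttest_ind(a, b)
    elif normal:
        name, res = "Welch's t-test", stats.ttest_ind(a, b, equal_var=False)
    else:
        name, res = "Mann-Whitney U test", stats.mannwhitneyu(a, b, alternative="two-sided")
    return name, res.statistic, res.pvalue
rng = np.random.default_rng(42)
name, stat, p = compare_two_groups(rng.normal(0, 1, 50), rng.normal(0.5, 1, 50))
print(f"{name}: statistic={stat:.3f}, p={p:.4f}")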
Simple Example of a Statistical Significance Test (T-Test)
A t-test is a statistical method used to determine if there's a significant difference between the means (averages) of two groups. It's especially useful when dealing with small sample sizes and when the data is approximately normally distributed.
There are three main types of t-tests:
- One-sample t-test: Compares the mean of a single group to a known average.
- Two-sample t-test (independent): Compares the means of two separate groups to see if they're significantly different.
- Paired t-test: Compares the means of the same group at two different times (e.g., before and after a treatment).
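In scipy, these three variants map directly onto three functions. A quick sketch with made-up scores:
import scipy.stats as stats
before = [72, 75, 78, 71, 74, 77]
after = [75, 78, 80, 74, 77, 79]
other = [70, 73, 69, 72, 71, 74]
# One-sample: does `before` differ from a known mean of 70?
print(stats.ttest_1samp(before, 70))
# Independent two-sample: do `before` and `other` differ?
print(stats.ttest_ind(before, other))
# Paired: did the same group change between `before` and `after`?
print(stats.ttest_rel(before, after))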
When you run a t-test, you get a t-value. This t-value, along with the sample size, helps determine if the difference between the groups is likely due to chance or if it's statistically significant. In short, a t-test helps you figure out if two groups are genuinely different or if the observed differences could just be random.
The t-test makes several assumptions, including:
- The data is normally distributed (or approximately so).
- The variances of the two populations are equal (for a two-sample t-test), unless you're conducting a version of the test that doesn't assume equal variances (Welch's t-test).
- Observations are independent of each other.
If the data doesn't meet these assumptions, other statistical tests might be more appropriate.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use("default")
params = {
    "axes.labelsize": 8,
    "font.size": 8,
    "legend.fontsize": 8,
    "xtick.labelsize": 8,
    "ytick.labelsize": 8,
    "text.usetex": False,
    "font.family": "sans-serif",
    "axes.spines.top": False,
    "axes.spines.right": False,
    "grid.color": "grey",
    "axes.grid": True,
    "grid.alpha": 0.5,
    "grid.linestyle": ":",
}
plt.rcParams.update(params)
Scenario
A school wants to determine if a new teaching technique for mathematics is more effective than the traditional method. They decide to conduct an experiment.
Procedure
- The school selects 30 students at random.
- These students are split into two groups of 15 each.
- Group A is taught using the traditional method for one month.
- Group B is taught using the new teaching technique for the same duration.
- At the end of the month, all students take the same math test.
The school now wants to know:
- Is the difference in test scores between the two groups statistically significant?
- Is the new technique genuinely better, or could this difference just be due to random chance?
# Hypothetical test scores
group_A = [85, 88, 90, 80, 87, 86, 84, 82, 88, 85, 83, 89, 85, 84, 87]
group_B = [87, 93, 92, 90, 95, 91, 92, 90, 93, 91, 94, 95, 92, 91, 94]
# Plot test scores of both groups
plt.figure(figsize=(4, 2.5))
plt.hist(group_A, alpha=0.5, label='Group A', edgecolor='black')
plt.hist(group_B, alpha=0.5, label='Group B', edgecolor='black')
plt.xlabel('Test Scores'); plt.ylabel('Frequency'); plt.legend(loc='upper left'); plt.tight_layout(); plt.show()
# Two-sample t-test
t_stat, p_value = stats.ttest_ind(group_A, group_B)
print(f"t-statistic:\t{t_stat:.3f}")
print(f"p-value:\t{p_value}")
if p_value < 0.05:
    print("p < 0.05: The difference between the two groups is significant")
else:
    print("p > 0.05: The difference between the two groups is not significant")
t-statistic: -7.236
p-value: 7.080650095530462e-08
p < 0.05: The difference between the two groups is significant
The results of the t-test are as follows:
- t-statistic: -7.236
- p-value: 0.0000000708 (7.08 × 10^-8)
The negative t-statistic indicates that Group B (New Technique) has a higher mean score than Group A (Traditional). The extremely small p-value (much less than 0.05) suggests that the difference between the two groups is statistically significant. In other words, it's very unlikely that the observed difference in test scores is due to random chance. Thus, based on this data, we can conclude that the new teaching technique appears to be more effective than the traditional method.
T-Distribution Chart
The t-distribution, also known as Student's t-distribution, is a type of probability distribution that is symmetric and bell-shaped, similar to the normal distribution but with heavier tails. It arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown. The shape of the t-distribution is determined by a parameter known as degrees of freedom (df). Degrees of freedom are typically related to the sample size (e.g., df = sample size - 1 for a single sample t-test). The larger the degrees of freedom, the closer the t-distribution is to the normal distribution.
Following is a chart of the t-distribution with 10 degrees of freedom. The horizontal axis represents different t-values, and the vertical axis shows the probability density associated with each t-value. This chart illustrates how the probability density decreases as you move away from the center (t-value = 0), indicating that more extreme t-values are less likely to occur by chance.
# Define the parameters for the t-distribution
df = 10 # degrees of freedom
x = np.linspace(-4, 4, 1000) # range of x values
y = stats.t.pdf(x, df) # probability density function for the t-distribution
plt.figure(figsize=(4, 2.5))
plt.plot(x, y, label=f'T-distribution (df={df})')
plt.title('T-distribution Chart'); plt.xlabel('T-value'); plt.ylabel('Probability Density')
plt.legend(); plt.grid(True); plt.tight_layout(); plt.show()
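To see how the degrees of freedom shape the curve, the following sketch overlays t-distributions for several df values against the standard normal distribution; as df grows, the curves become nearly indistinguishable.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
x = np.linspace(-4, 4, 1000)
plt.figure(figsize=(4, 2.5))
for df_i in [2, 5, 30]:
    plt.plot(x, stats.t.pdf(x, df_i), label=f't (df={df_i})')
plt.plot(x, stats.norm.pdf(x), 'k--', label='Normal')  # reference: standard normal
plt.xlabel('Value'); plt.ylabel('Probability Density')
plt.legend(); plt.tight_layout(); plt.show()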
p-value explained
A p-value gives us an idea of how likely it is to observe the results we did (or more extreme results) if we assume that there's no real effect present — essentially, if it's all just due to chance.
So, when you see a p-value of 0.05, it means there's a 5% chance that the differences or relationships you observed in your data could occur just by luck, even if there's actually no real underlying effect or difference. This is a way of saying, "If we were to repeat this experiment or study many times under the same conditions, we'd expect to see results like these (or more extreme) about 5% of the time just by random chance."
A lower p-value (like 0.01 or 1%) suggests that the observed data would be very unlikely if there were no real effect, making us more confident that the effect we're seeing might be real. On the other hand, a higher p-value (like 0.10 or 10%) means there's a higher likelihood that the observed results could just be due to chance, making us less confident in the presence of a real effect.
It's important to note, however, that the p-value doesn't tell us the probability that our hypothesis is true or false. Instead, it's about the probability of the data given a specific assumption (no effect or difference).
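One way to build intuition for this "repeat the experiment" framing is a quick simulation: draw two groups from the same population many times, so the null hypothesis is true by construction, and count how often a t-test flags significance anyway. Roughly 5% of runs should, matching the 0.05 threshold. A minimal sketch:
import numpy as np
import scipy.stats as stats
rng = np.random.default_rng(0)
n_experiments = 10_000
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(0, 1, 30)  # both groups come from the same population,
    b = rng.normal(0, 1, 30)  # so any "significant" result is a false positive
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1
print(f"False positive rate: {false_positives / n_experiments:.3f}")  # expect roughly 0.05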
The p-value is short for probability value. It should not be confused with the significance level (alpha), which is the decision threshold chosen in advance against which the p-value is compared.
Here's where the actual p-value calculation comes into play:
- The test statistic is compared against a distribution (e.g., t-distribution for a t-test) that represents what we would expect if there were no real effect or difference (the baseline assumption).
- The p-value is then determined by finding the probability of obtaining a test statistic as extreme as (or more extreme than) the one calculated from your data, given that the baseline assumption is true.
- Essentially, the p-value is calculated by looking at the tails of the distribution. If your test statistic falls far out in the tail, it means that obtaining such a result would be very unlikely if the baseline assumption were true, leading to a low p-value.
Let's illustrate the calculation of a p-value by plotting a t-distribution and marking the area corresponding to a p-value in red. For this example, let's assume we have a t-value of 2.5 with 10 degrees of freedom.
t_value = 2.5
df = 10
# Define the range of x values for plotting
x = np.linspace(-4, 4, 1000)
y = stats.t.pdf(x, df)
# Calculate the p-value for the given t-value (for a two-tailed test)
# This involves finding the area in the tails beyond the t-value and its negative counterpart
p_value_area = stats.t.sf(np.abs(t_value), df) * 2 # sf is the survival function, equivalent to 1 - cdf
plt.figure(figsize=(4, 2.5))
plt.plot(x, y, label=f'T-distribution (df={df})')
# Highlight the area corresponding to the p-value
# This will be the area in the tails beyond the t-value and its negative counterpart
plt.fill_between(x, 0, y, where=(x >= t_value) | (x <= -t_value), color='red', alpha=0.5, label='p-value area')
plt.title('T-distribution with p-value Area Highlighted')
plt.xlabel('T-value'); plt.ylabel('Probability Density')
plt.legend(); plt.grid(True); plt.tight_layout(); plt.show()
print(f"p-value:\t{p_value_area}")
if p_value_area < 0.05:
    print("p < 0.05: The result is statistically significant")
else:
    print("p > 0.05: The result is not statistically significant")
p-value: 0.031446844236608776
p < 0.05: The result is statistically significant
In the chart above, we've plotted a t-distribution with 10 degrees of freedom and highlighted the area corresponding to a p-value in red. This red area represents the probability of observing a t-value as extreme as 2.5 (or more extreme) in both tails of the distribution, assuming there's no real effect or difference.
The calculated p-value for a t-value of 2.5 with 10 degrees of freedom is approximately 0.031, meaning there's about a 3.1% chance of observing such an extreme result (or more extreme) due to random chance alone. This area under the curve in the tails (marked in red) visually represents that probability.
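Note that the `stats.t.sf(...) * 2` in the code above reflects a two-tailed test, which counts extreme results in both directions. A one-tailed test, appropriate only when you have a directional hypothesis, uses a single tail and halves the p-value:
import scipy.stats as stats
t_value, df = 2.5, 10
p_two_tailed = stats.t.sf(abs(t_value), df) * 2  # extreme in either direction
p_one_tailed = stats.t.sf(t_value, df)  # extreme in one direction only
print(f"two-tailed: {p_two_tailed:.4f}, one-tailed: {p_one_tailed:.4f}")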
Null Hypothesis and Rejecting It
The null hypothesis is a statement that suggests there is no difference between datasets or variables. It's a default position that assumes any observed differences in data are due to chance rather than a specific effect or cause. The null hypothesis is often denoted as H_0.
When you perform statistical testing, you're essentially trying to find out whether there's enough evidence in your data to reject the null hypothesis. Put simply, you're attempting to show that two or more datasets are not the same, thereby establishing that there is a significant difference between them.
Setting Up Hypotheses
- Null Hypothesis (H_0): There is no effect, no difference, or no association between the datasets. For example, "The new trading strategy does not outperform the S&P 500 benchmark".
- Alternative Hypothesis (H_1 or H_a): Contrary to H_0, it suggests there is an effect, a difference, or an association. For example, "The new strategy outperforms the benchmark".
Collecting Data and Performing a Test: You collect data and perform a statistical test suited to your question. This test calculates a test statistic and a p-value.
Interpreting the p-value: The p-value tells you the probability of observing your data (or something more extreme) if the null hypothesis were true. A low p-value (typically <0.05) indicates that such an extreme result is unlikely under the null hypothesis, suggesting evidence against H_0.
Making a Decision: If the p-value is below a predefined threshold (like 0.05), you might reject the null hypothesis in favor of the alternative hypothesis, concluding that there's statistical evidence for an effect or difference. If the p-value is above the threshold, you do not have enough evidence to reject the null hypothesis, meaning you haven't found statistically significant evidence of an effect or difference.
In essence, by attempting to disprove or reject H_0, researchers use statistical tests to find evidence of a real effect, difference, or association that could be considered statistically significant.
Normality Test
A normality test is a statistical procedure used to determine if a dataset is well-modeled by a normal distribution or Gaussian distribution. Given the importance of the normal distribution in many statistical methods and tests (which often assume that the data follows this distribution), assessing the normality of data is a critical step before selecting a statistical test or model. There are several methods for testing normality, including graphical and numerical approaches:
Graphical Methods
- Histogram: Plotting a histogram of the data can provide a visual assessment of whether the data appears to be normally distributed, based on the bell curve shape.
- Q-Q (Quantile-Quantile) Plot: A graphical method that compares the quantiles of the data to the quantiles of the normal distribution. If the data are normally distributed, the points should fall approximately along a straight line.
- P-P (Probability-Probability) Plot: Similar to the Q-Q plot, but it compares the cumulative distribution functions of the data and a normal distribution.
Numerical Methods
- Shapiro-Wilk Test: A popular test that evaluates if a sample comes from a normally distributed population. It is suitable for small to moderate sample sizes.
- Kolmogorov-Smirnov Test: Compares the empirical distribution function of the sample data with the expected distribution function of a normal distribution. It can be used for larger samples but is less powerful than the Shapiro-Wilk test for small samples.
- Anderson-Darling Test: Another test that compares the sample distribution to a specified distribution (like the normal distribution). It gives more weight to the tails than the Kolmogorov-Smirnov test.
- D'Agostino's K^2 Test: Uses skewness and kurtosis (measures of asymmetry and tail heaviness, respectively) to assess the normality of the data.
These tests generate a p-value, which is used to decide whether to reject the null hypothesis that the data are normally distributed. A low p-value (typically <0.05) indicates that the data do not follow a normal distribution, while a higher p-value suggests that the data are not significantly different from a normal distribution.
It's essential to choose an appropriate normality test based on the data size and the specific analysis needs. Also, combining graphical and numerical methods can provide a more comprehensive assessment of normality.
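As a quick illustration before we turn to real market data, the sketch below shows how several of these numerical tests are called in scipy, using simulated heavy-tailed data; the Shapiro-Wilk test is demonstrated on actual returns later in this article.
import numpy as np
import scipy.stats as stats
rng = np.random.default_rng(1)
sample = rng.standard_t(df=5, size=1000)  # heavy-tailed by construction, so not normal
# Shapiro-Wilk
print("Shapiro-Wilk p:", stats.shapiro(sample).pvalue)
# Kolmogorov-Smirnov against the standard normal; note that standardizing with
# estimated mean/std makes this approximate (the Lilliefors correction would be exact)
z = (sample - sample.mean()) / sample.std(ddof=1)
print("Kolmogorov-Smirnov p:", stats.kstest(z, 'norm').pvalue)
# D'Agostino's K^2 (combines skewness and kurtosis)
print("D'Agostino K^2 p:", stats.normaltest(sample).pvalue)
# Anderson-Darling returns a statistic and critical values rather than a p-value
ad = stats.anderson(sample, dist='norm')
print("Anderson-Darling statistic:", ad.statistic, "critical values:", ad.critical_values)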
Histogram
Let's explore the distribution of daily stock returns for the S&P 500 index (ticker: SPY) from 2018 to 2023, a period of six years. In financial research, it is commonly assumed that stock returns follow a normal distribution. We aim to challenge this notion by first presenting the daily returns in a histogram. At first glance, the distribution of S&P 500 daily returns may appear to adhere to a normal distribution. However, further analysis using Q-Q plots and the Shapiro-Wilk test will reveal that this initial impression does not accurately reflect the true nature of the data.
import yfinance as yf
start_year = 2018
data = yf.download('SPY', start=f'{start_year}-01-01', end='2023-12-31', auto_adjust=False)  # keep the 'Adj Close' column (newer yfinance versions adjust prices by default)
data['daily_return'] = data['Adj Close'].pct_change() * 100
daily_returns = data['daily_return'].dropna()
plt.figure(figsize=(4, 2.5))
plt.hist(daily_returns, bins=100, alpha=0.5)
plt.title(f'SPY Daily Returns Histogram ({start_year}-2023)')
plt.xlabel('Daily Returns (%)'); plt.ylabel('Frequency'); plt.grid(True); plt.tight_layout(); plt.show()
descp_stats = pd.DataFrame(daily_returns.describe())
print("Descriptive Statistics for SPY Daily Returns:")
descp_stats.round(2)
Descriptive Statistics for SPY Daily Returns:
| | daily_return |
|---|---|
| count | 1508.00 |
| mean | 0.05 |
| std | 1.28 |
| min | -10.94 |
| 25% | -0.49 |
| 50% | 0.08 |
| 75% | 0.69 |
| max | 9.06 |
Q-Q Plot
Quantile-Quantile (Q-Q) plots are graphical tools used to assess if a dataset follows a particular distribution, such as the normal distribution. They work by plotting the quantiles of the data against the quantiles of the theoretical distribution. If the dataset follows the theoretical distribution, the points should lie approximately along a straight line (often called the line of identity).
In a Q-Q plot:
- The x-axis represents the theoretical quantiles from the normal distribution.
- The y-axis represents the ordered values (quantiles) from the dataset being tested.
We continue with creating the Q-Q plot for our SPY daily returns from 2018 to 2023.
# QQ plot of SPY daily returns
plt.figure(figsize=(4, 2.5))
stats.probplot(daily_returns, dist="norm", plot=plt)
plt.title(f'Q-Q Plot of SPY Daily Returns ({start_year} - 2023)'); plt.tight_layout(); plt.show()
The Q-Q plot compares the quantiles of our data (SPY daily returns) to the quantiles of a normal distribution. If the returns are normally distributed, the points should roughly lie on the 45-degree red line. Deviations from this line suggest departures from normality.
Looking at the generated Q-Q plot of SPY daily returns:
- The blue points represent the actual quantiles of the SPY daily returns plotted against the theoretical quantiles of a standard normal distribution.
- The red line represents the ideal situation where the sample quantiles perfectly match the theoretical quantiles of a normal distribution.
The plot indicates that while the central part of the dataset (between about -2 and 2 on the theoretical quantiles) appears to follow the red line quite closely, suggesting normal distribution characteristics in that range, the tails of the distribution (the ends of the dataset) deviate from the line significantly. The lower tail (left side) shows that the smallest values are more extreme than what would be expected for a normal distribution, and the upper tail (right side) shows that the largest values are also more extreme than expected.
The Q-Q plot tells us that the daily returns of the SPY index exhibit heavier tails than a normal distribution, which is a common characteristic of financial return data known as "fat tails" or "leptokurtosis." These fat tails indicate a higher occurrence of extreme events (both negative and positive returns) than would be predicted by a normal distribution. This finding challenges the assumption that stock returns are normally distributed, especially in terms of the behavior of extreme values, and suggests that the distribution of daily returns is likely to be leptokurtic.
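We can quantify these fat tails directly. Continuing with the daily_returns series from above, the skewness and excess kurtosis (both near 0 for a normal distribution) make the leptokurtosis explicit:
# Skewness and excess kurtosis of the returns; a normal distribution has both near 0
print(f"Skewness:\t\t{stats.skew(daily_returns):.2f}")
print(f"Excess kurtosis:\t{stats.kurtosis(daily_returns):.2f}")  # > 0 indicates fatter tails than normal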
Shapiro-Wilk Test
The Shapiro-Wilk test is used to test if a sample comes from a normally distributed population. It is one of the most powerful tests for normality, especially for small sample sizes, although it can be used for moderately large sample sizes up to 5,000 datapoints.
The test returns a W statistic and a p-value.
- The W statistic is the test statistic used in the Shapiro-Wilk test, which measures how closely the data correspond to a normal distribution. A W value close to 1 indicates that the data are likely normally distributed.
- A high p-value suggests the data do not significantly diverge from normality, meaning there's no strong reason to reject the null hypothesis that the data are normal. Conversely, a low p-value indicates a significant deviation from normality, leading to the rejection of the null hypothesis. In practice, if the p-value falls below a predefined threshold (often 0.05), it is taken as evidence against the data's normality. Not rejecting the null hypothesis doesn't definitively confirm normality - it merely implies that the data does not show strong enough evidence of non-normality.
shapiro_stat, shapiro_p = stats.shapiro(daily_returns)
print(f"Shapiro-Wilk Test for SPY Daily Returns ({start_year} - 2023):")
print(f"W statistic:\t{shapiro_stat:.4f}\np-value:\t{shapiro_p}")
if shapiro_p < 0.01:
    print("p < 0.01: data does not originate from normally distributed population")
else:
    print("p >= 0.01: no significant deviation from a normally distributed population")
Shapiro-Wilk Test for SPY Daily Returns (2018 - 2023):
W statistic: 0.8907
p-value: 2.156638108052502e-31
p < 0.01: data does not originate from normally distributed population
- Test Statistic (0.8907): This value indicates the degree of fit between the data and a normal distribution; values closer to 1 suggest a closer fit. A value of 0.8907 suggests some deviation from normality.
- p-value (2.16e-31): This extremely small p-value (practically 0 when considering numerical precision) provides very strong evidence against the null hypothesis that the data are normally distributed. Therefore, we conclude that the SPY daily returns from 2018 to 2023 do not come from a normally distributed population. The data are significantly different from a normal distribution.
Let's examine the implications of overlooking our discovery that SPY returns are not normally distributed. We apply a one-sample t-test to determine whether the average SPY daily return deviates significantly from 0; in other words, we use a test that presupposes normally distributed data on a dataset that does not meet that assumption.
# One-sample t-test on the daily returns
t_stat, t_p = stats.ttest_1samp(daily_returns, 0)
print(f"
One-sample t-test on SPY Daily Returns ({start_year} - 2023):")
print(f"t-statistic:\t{t_stat:.4f}
p-value:\t{t_p}")
One-sample t-test on SPY Daily Returns (2018 - 2023):
t-statistic: 1.5954
p-value: 0.11084172033526497
# Calculate total return of SPY over the period
total_return = (data['Adj Close'].iloc[-1] / data['Adj Close'].iloc[0] - 1) * 100
print(f"\nSPY total return (2018 to 2023): {total_return:.2f}%")
SPY total return (2018 to 2023): 95.54%
- t-statistic (1.5954): This positive value indicates that the sample mean of the SPY daily returns is above the hypothesized value of 0, but the magnitude of the t-statistic suggests only a moderate difference.
- p-value (0.1108): This p-value is greater than the common alpha levels (e.g., 0.05, 0.01), indicating that there is not enough statistical evidence to reject the null hypothesis at these significance levels.
Based on this test, we cannot conclude that the average daily return significantly deviates from 0. We would incorrectly conclude that the market, as represented by SPY, has not consistently gained or lost value over this period based on daily returns, at least not to a statistically significant degree. However, looking at the total return of the index from 2018 to 2023, we find a total return of 95.54%, indicating a substantial gain. This discrepancy highlights the importance of checking the assumptions of statistical tests before drawing conclusions. Given the non-normality of the data, the conclusions drawn from the t-test should be approached with caution. For data that do not meet the normality assumption, non-parametric tests or transformations could provide more reliable insights into the data's characteristics.
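As a sketch of one such non-parametric alternative, the Wilcoxon signed-rank test asks a related question without assuming normality: is the distribution of daily returns symmetric around 0? Note that it targets the median rather than the mean, so it is an analogue of, not a drop-in replacement for, the one-sample t-test.
# Wilcoxon signed-rank test: non-parametric one-sample analogue of the t-test,
# testing whether daily returns are symmetrically distributed around 0
w_stat, w_p = stats.wilcoxon(daily_returns)
print(f"Wilcoxon statistic:\t{w_stat:,.1f}")
print(f"p-value:\t{w_p}")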
Equal Variance
Equal variance, also known as homoscedasticity, describes a situation where different datasets have the same variance or spread. It is a common assumption in various statistical tests, including t-tests and analysis of variance (ANOVA).
When data exhibit equal variance, it means that the variability or dispersion around the mean is consistent across all groups being compared, regardless of the group means. This uniformity in variance is crucial for certain parametric tests because these tests rely on the assumption of equal variance to accurately estimate the sampling distribution and, consequently, to provide valid p-values and confidence intervals.
There are several statistical tests designed to assess whether the assumption of equal variance holds across groups, including:
- Levene's Test: Useful for testing equality of variances when the data may not be normally distributed.
- Brown-Forsythe Test: Similar to Levene's test but uses medians instead of means, making it robust to outliers.
- Bartlett's Test: More appropriate for data that are normally distributed but sensitive to deviations from normality.
- F-test: Assumes normally distributed data and compares the ratio of the variances of two samples to determine if they are significantly different.
If the assumption of equal variance is violated, alternative approaches or adjustments may be necessary, such as using different versions of statistical tests that do not assume equal variance (e.g., Welch's t-test instead of the standard t-test) or transforming the data to stabilize the variance.
# Crawl list of SP100 constituents from Wikipedia
import requests
response = requests.get("https://en.wikipedia.org/wiki/S%26P_100")
tables = pd.read_html(response.text)
sp100 = tables[2]  # the constituents table; its position may shift if the page layout changes
sp100['Symbol'] = sp100['Symbol'].str.replace(".", "-", regex=False)  # e.g. BRK.B -> BRK-B for yfinance
sp100
| | Symbol | Name | Sector |
|---|---|---|---|
| 0 | AAPL | Apple | Information Technology |
| 1 | ABBV | AbbVie | Health Care |
| 2 | ABT | Abbott | Health Care |
| 3 | ACN | Accenture | Information Technology |
| 4 | ADBE | Adobe | Information Technology |
| ... | ... | ... | ... |
| 96 | V | Visa | Information Technology |
| 97 | VZ | Verizon | Communication Services |
| 98 | WFC | Wells Fargo | Financials |
| 99 | WMT | Walmart | Consumer Staples |
| 100 | XOM | ExxonMobil | Energy |

101 rows × 3 columns
import yfinance as yf
ticker_list = sp100['Symbol'].to_list()
ticker_list.append('SPY')
start_year = 2018
price_data = yf.download(ticker_list, start=f'{start_year}-01-01', end='2023-12-31', auto_adjust=False)  # keep the 'Adj Close' column
price_data_daily_returns = price_data['Adj Close'].pct_change().dropna() * 100
fig, ax = plt.subplots(3, 2, figsize=(6, 6))
for i, stock in enumerate(['SPY', 'AAPL', 'WMT']):
    left, right = ax[i]
    left.hist(price_data_daily_returns[stock], bins=100, alpha=0.5)
    stats.probplot(price_data_daily_returns[stock], dist="norm", plot=right)
    right.get_lines()[0].set_markersize(2.0); right.get_lines()[0].set_color('tab:blue')
    left.set_title(f'{stock} - Daily Returns'); right.set_title(f'{stock} - Q-Q Plot')
    left.set_ylabel('Frequency'); left.set_xlabel('Daily Returns (%)')
fig.suptitle(f'Daily Returns Histogram & Q-Q Plots ({start_year}-2023)', y=1.0)
plt.tight_layout(); plt.show()
Levene's Test
Levene's test is used to assess the equality of variances across different groups. It's designed to test if all input samples come from populations with equal variances, without assuming a normal distribution for those populations.
aapl_daily_returns = price_data_daily_returns["AAPL"]
wmt_daily_returns = price_data_daily_returns["WMT"]
# Levene's test for equal variances
levene_stat, levene_p = stats.levene(aapl_daily_returns, wmt_daily_returns)
print("Levene's Test for Equal Variances:")
print(f"test statistic:\t{levene_stat:.3f}
p-value:\t{levene_p}")
if levene_p < 0.01:
print(f"p < 0.01: variances are significantly different.")
print(f"Interpretation: Use non-parametric tests or Welch's t-test for unequal variances")
else:
print(f"p > 0.01: variances are equal")
print(f"Interpretation: Use parametric tests for equal variances")
Levene's Test for Equal Variances:
test statistic: 106.564
p-value: 1.7836047743809876e-24
p < 0.01: variances are significantly different.
Interpretation: Use non-parametric tests or Welch's t-test for unequal variances
F-Test
The F-test for equality of variances, also known as the F-test for homogeneity of variances, is a statistical procedure used to determine if two populations have the same variance. It's important to note that the F-test assumes that both groups are sampled from normally distributed populations. If the normality assumption is violated, the results of the F-test may not be reliable. In such cases, other tests like Levene's test are preferred because they are less sensitive to departures from normality. Additionally, the F-test is typically used only for comparing two groups; for more than two groups, variance tests such as Bartlett's or Levene's are the appropriate way to assess equality of variances.
# Conducting an F-test for equal variances
var_aapl = np.var(aapl_daily_returns, ddof=1)  # sample variance of AAPL returns
var_wmt = np.var(wmt_daily_returns, ddof=1)  # sample variance of WMT returns
F = var_aapl / var_wmt
df1 = len(aapl_daily_returns) - 1  # degrees of freedom for the first sample
df2 = len(wmt_daily_returns) - 1  # degrees of freedom for the second sample
# One-tailed p-value: area of the F-distribution beyond the observed variance ratio
F_p_value = (
    1 - stats.f.cdf(F, df1, df2)
    if var_aapl > var_wmt
    else stats.f.cdf(F, df1, df2)
)
print("F-test for Equal Variances:")
print(f"test statistic:\t{F:.3f}\np-value:\t{F_p_value}")
if F_p_value < 0.01:
    print("p < 0.01: variances are significantly different")
    print("Interpretation: Use non-parametric tests or Welch's t-test for unequal variances")
else:
    print("p > 0.01: variances are not significantly different")
    print("Interpretation: Parametric tests assuming equal variances may be used")
F-test for Equal Variances:
test statistic: 2.075
p-value: 1.1102230246251565e-16
p < 0.01: variances are significantly different
Interpretation: Use non-parametric tests or Welch's t-test for unequal variances
# Set equal_var=False to indicate that we do not assume equal population variances
t_stat, p_value = stats.ttest_ind(aapl_daily_returns, wmt_daily_returns, equal_var=False)
print("Welch's t-test for AAPL and WMT Daily Returns:")
print(f"t-statistic:\t{t_stat:,.3f}
p-value:\t{p_value:.6f}1")
if p_value < 0.01:
print(f"p < 0.01: distributions are significantly different")
else:
print(f"p > 0.01: distributions are not significantly different")
Welch's t-test for AAPL and WMT Daily Returns:
t-statistic: 1.201
p-value: 0.230054
p > 0.01: distributions are not significantly different
# Mann-Whitney U test
# The alternative parameter specifies the hypothesis to test; 'two-sided' tests the hypothesis
# that the distributions are different (not specifying whether one is greater than the other).
# If you have a directional hypothesis, you can use 'greater' or 'less' instead.
u_statistic, p_value = stats.mannwhitneyu(
    aapl_daily_returns, wmt_daily_returns, alternative="two-sided"
)
print("Mann-Whitney U Test for AAPL and WMT Daily Returns:")
print(f"U statistic:\t{u_statistic:,.3f}\np-value:\t{p_value:.6f}")
if p_value < 0.01:
    print("p < 0.01: distributions are significantly different")
else:
    print("p > 0.01: distributions are not significantly different")
Mann-Whitney U Test for AAPL and WMT Daily Returns:
U statistic: 752,986.000
p-value: 0.098583
p > 0.01: distributions are not significantly different
# Calculate AAPL total return and WMT total return
aapl_total_return = (price_data['Adj Close']['AAPL'].iloc[-1] / price_data['Adj Close']['AAPL'].iloc[0] - 1) * 100
wmt_total_return = (price_data['Adj Close']['WMT'].iloc[-1] / price_data['Adj Close']['WMT'].iloc[0] - 1) * 100
print(f"AAPL total return (2018 to 2023): {aapl_total_return:.2f}%")
print(f"WMT total return (2018 to 2023): {wmt_total_return:.2f}%")
AAPL total return (2018 to 2023): 372.78%
WMT total return (2018 to 2023): 77.89%
# Calculate Pearson's correlation coefficient
correlation_coefficient, p_value = stats.pearsonr(aapl_daily_returns, wmt_daily_returns)
print("Pearson's Correlation Coefficient for AAPL & WMT Daily Returns:")
print(f"r:\t\t{correlation_coefficient:.3f}
p-value:\t{p_value}")
Pearson's Correlation Coefficient for AAPL & WMT Daily Returns:
r: 0.390
p-value: 4.503405572247887e-45
# Filter price_data_daily_returns to only include 5 stocks for each sector in SP100
sectors = sp100['Sector'].unique()
sector_stocks = []
for sector in sectors:
    sector_stocks.extend(sp100[sp100['Sector'] == sector]['Symbol'].to_list()[:5])
price_data_daily_returns_sector = price_data_daily_returns[sector_stocks]
import seaborn as sns
pearson_correlation_matrix = price_data_daily_returns_sector.corr(method='pearson')
plt.figure(figsize=(9, 7))
sns.heatmap(pearson_correlation_matrix, annot=False, cmap='coolwarm', cbar=True)
plt.title("Pearson's Correlation Matrix for Daily Stock Returns")
plt.yticks(rotation=0); plt.tight_layout(); plt.show()
The matrix shows the correlation coefficients between pairs of variables in the dataset. These coefficients range from -1 to 1, where:
- 1 indicates a perfect positive linear relationship,
- -1 indicates a perfect negative linear relationship, and
- 0 indicates no linear relationship.
The heatmap visualization makes it easy to identify the strength and direction of relationships between the variables. For example, you can see how AAPL correlates with WMT, XOM, and each of the other tickers in the matrix. The color scale aids in quickly identifying strong and weak correlations.
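Beyond eyeballing the heatmap, the correlation matrix can be queried directly. As a small sketch, the following ranks the most strongly correlated ticker pairs; keeping only pairs where the first ticker sorts before the second removes the diagonal and duplicate pairs.
# Flatten the correlation matrix into (ticker, ticker) pairs and rank them
corr_pairs = pearson_correlation_matrix.unstack()
corr_pairs = corr_pairs[corr_pairs.index.get_level_values(0) < corr_pairs.index.get_level_values(1)]
print("Most correlated pairs:")
print(corr_pairs.sort_values(ascending=False).head(5))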
# Calculate Spearman's rank correlation coefficient
rank_correlation_coefficient, rank_p_value = stats.spearmanr(aapl_daily_returns, wmt_daily_returns)
print("Spearman's Rank Correlation Coefficient for AAPL & WMT Daily Returns:")
print(f"r:\t\t{rank_correlation_coefficient:.3f}
p-value:\t{rank_p_value}")
Spearman's Rank Correlation Coefficient for AAPL & WMT Daily Returns:
r: 0.315
p-value: 4.3214224570231754e-29
spearman_correlation_matrix = price_data_daily_returns_sector.corr(method='spearman')
plt.figure(figsize=(9, 7))
sns.heatmap(spearman_correlation_matrix, annot=False, cmap='coolwarm', cbar=True)
plt.title("Spearman's Rank Correlation Matrix for Daily Stock Returns")
plt.yticks(rotation=0); plt.tight_layout(); plt.show()
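Finally, comparing the Pearson and Spearman matrices can flag pairs where outliers or nonlinearity distort the linear correlation. A quick sketch of the largest gaps:
# Large absolute gaps between Pearson and Spearman correlations often indicate
# pairs whose linear correlation is driven by outliers or nonlinearity
corr_gap = (pearson_correlation_matrix - spearman_correlation_matrix).unstack()
corr_gap = corr_gap[corr_gap.index.get_level_values(0) < corr_gap.index.get_level_values(1)]
print("Largest Pearson-Spearman gaps:")
print(corr_gap.abs().sort_values(ascending=False).head(5))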