What model predicts stocks best?

Overview

One of the most tantalizing challenges in modeling is predicting stock prices: the more accurately you can predict a normally unpredictable market, the better the trades you can make, and the greater your chance of financial success. Banks, brokerages, and other players in the stock market have spent years (especially now, in the age of robo-investing) trying to predict the markets accurately.

Key questions

  • Which model (linear regression, boosted tree, nearest neighbors) provides the most accurate predictions for the percent change in stock price across the S&P 500?
  • Which model (linear regression, boosted tree, nearest neighbors) provides the most accurate predictions for the percent change in stock price within each sector?

Setting up the data

To set up the data, I imported the stock data for the 500 individual companies (all_stocks_5yr.csv, which becomes sp500_companies) and the company details, such as name and sector (constituents.csv, which becomes company_names).

Figure: The distribution of the variable I’m focusing on, the percent change in stock price (density plot; one row with a non-finite value was removed).
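As a rough illustration of how a percent-change column could be derived from the raw price rows, here is a minimal Python sketch. It is not the code behind this post: the sample rows are made up, the column layout (date, open, high, low, close, volume, Name) is assumed to match all_stocks_5yr.csv, and defining the change as open-to-close within a day is an assumption, since the exact definition isn't spelled out here. The helper names are hypothetical.

```python
import csv
import io

# Made-up sample rows; the column layout is assumed to match all_stocks_5yr.csv.
SAMPLE = """date,open,high,low,close,volume,Name
2017-01-03,100.0,102.0,99.0,101.0,1000,AAA
2017-01-04,101.0,103.0,100.0,98.98,1200,AAA
2017-01-05,0,0,0,0,0,AAA
"""


def percent_change(open_price, close_price):
    """Open-to-close percent change (an assumed definition); None if undefined."""
    if open_price == 0:
        return None  # dividing by zero would produce a non-finite value
    return (close_price - open_price) / open_price * 100.0


def load_percent_changes(handle):
    """Read price rows and keep only those with a well-defined percent change."""
    rows = []
    for row in csv.DictReader(handle):
        change = percent_change(float(row["open"]), float(row["close"]))
        if change is not None:
            rows.append(
                {"Name": row["Name"], "date": row["date"], "pct_change": change}
            )
    return rows


changes = load_percent_changes(io.StringIO(SAMPLE))
```

To run this on the real file, the `io.StringIO(SAMPLE)` handle would be swapped for `open("all_stocks_5yr.csv", newline="")`.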

Setting up the models

The models I’m using for this EDA are:

  • A simple linear regression model, using the “lm” engine
  • A boosted tree model, using the “xgboost” engine
  • A k-nearest neighbors model, using the “kknn” engine

Setting up the function and for loop

To apply these three models to each of the 500 companies, we have to set up:

  • A function that takes a company and a model as inputs, then runs the best version of that model (based on the model’s RMSE value) on the company’s stock data. The function returns the mean absolute error: the average absolute difference between the predicted percent change and the company’s actual percent change in 2017.
  • A for loop that puts every company and model combination through the function above and outputs that combination’s mean absolute error.
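The function-plus-loop structure described above can be sketched in plain Python. This is a simplified stand-in, not the code behind this post: it uses a single made-up numeric predictor, hand-rolled linear regression and nearest-neighbors models, and leaves out the boosted tree entirely to stay dependency-free. The company names, data splits, and candidate values of k are all hypothetical.

```python
import math
from functools import partial


def rmse(actual, predicted):
    """Root mean squared error, used here to pick the best candidate."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mae(actual, predicted):
    """Mean absolute error, the score reported for each combination."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def fit_linear(xs, ys):
    """Ordinary least squares with one predictor; returns a predict function."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_knn(xs, ys, k):
    """k-nearest neighbors regression: average the k closest training targets."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


# Each model maps to its candidate "versions"; linear regression has only one.
MODELS = {
    "linear_regression": [fit_linear],
    "nearest_neighbors": [partial(fit_knn, k=k) for k in (1, 3, 5)],
}

# Hypothetical data: (predictor values, percent changes) for each split.
COMPANIES = {
    "AAA": {
        "train": ([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12]),
        "valid": ([7, 8], [14, 16]),
        "test": ([9, 10], [18, 20]),
    },
    "BBB": {
        "train": ([1, 2, 3, 7, 8, 9], [0, 0, 0, 5, 5, 5]),
        "valid": ([2.5, 7.5], [0, 5]),
        "test": ([1.5, 8.5], [0, 5]),
    },
}


def evaluate(series, model_name):
    """Keep the candidate with the lowest validation RMSE; return its test MAE."""
    train, valid, test = series["train"], series["valid"], series["test"]
    best = min(
        (fit(*train) for fit in MODELS[model_name]),
        key=lambda predict: rmse(valid[1], [predict(x) for x in valid[0]]),
    )
    return mae(test[1], [best(x) for x in test[0]])


# The loop: every company and model combination goes through the function.
results = []
for company, series in COMPANIES.items():
    for model_name in MODELS:
        results.append(
            {"company": company, "model": model_name,
             "mae": evaluate(series, model_name)}
        )
```

Selecting on RMSE but reporting MAE mirrors the description above: RMSE decides which version of a model is kept, while MAE is the number carried forward into the comparison.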

Cleaning up and presenting our results

To graph our results, we’ll import the already processed company details dataset, company_names.csv, along with the finished function and for loop output, compare_models.rds.
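One way to turn per-combination errors into the shares being graphed is to pick, for each company, the model with the lowest mean absolute error, then count wins index-wide and by sector. The Python sketch below assumes that "best model" means lowest MAE per company; the rows, tickers, sector assignments, and function names are all made up for illustration and are not taken from compare_models.rds.

```python
from collections import Counter, defaultdict

# Made-up rows in the shape the loop is described as producing.
results = [
    {"company": "AAA", "model": "linear_regression", "mae": 0.8},
    {"company": "AAA", "model": "boosted_tree", "mae": 1.1},
    {"company": "AAA", "model": "nearest_neighbors", "mae": 0.9},
    {"company": "BBB", "model": "linear_regression", "mae": 1.4},
    {"company": "BBB", "model": "boosted_tree", "mae": 1.3},
    {"company": "BBB", "model": "nearest_neighbors", "mae": 1.0},
    {"company": "CCC", "model": "linear_regression", "mae": 2.0},
    {"company": "CCC", "model": "boosted_tree", "mae": 1.5},
    {"company": "CCC", "model": "nearest_neighbors", "mae": 1.8},
    {"company": "DDD", "model": "linear_regression", "mae": 0.5},
    {"company": "DDD", "model": "boosted_tree", "mae": 0.7},
    {"company": "DDD", "model": "nearest_neighbors", "mae": 0.6},
]

# Made-up sector lookup, standing in for the company details dataset.
sectors = {
    "AAA": "Utilities",
    "BBB": "Utilities",
    "CCC": "Health Care",
    "DDD": "Health Care",
}


def best_model_per_company(rows):
    """Map each company to the model with its lowest mean absolute error."""
    best = {}
    for row in rows:
        current = best.get(row["company"])
        if current is None or row["mae"] < current["mae"]:
            best[row["company"]] = row
    return {company: row["model"] for company, row in best.items()}


def share_of_wins(winners):
    """Fraction of companies for which each model was the most accurate."""
    counts = Counter(winners.values())
    total = sum(counts.values())
    return {model: count / total for model, count in counts.items()}


def share_of_wins_by_sector(winners, sector_lookup):
    """Same share calculation, computed separately within each sector."""
    grouped = defaultdict(dict)
    for company, model in winners.items():
        grouped[sector_lookup[company]][company] = model
    return {sector: share_of_wins(group) for sector, group in grouped.items()}


winners = best_model_per_company(results)
index_wide = share_of_wins(winners)
by_sector = share_of_wins_by_sector(winners, sectors)
```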

The index-wide results
The sector-specific results

Interesting findings

  • The untuned linear regression model is surprisingly accurate for a lot of these companies. Though nearest neighbors is right behind, more than 40% of the companies on the S&P 500 were best modeled by a simple regression.
  • Boosted tree models completely flopped in the Communication Services sector (companies like Facebook, Google, and Disney). By contrast, the boosted tree models had their strongest foothold in Utilities, Health Care, and Information Technology.

Answering our key questions

  • Which model (linear regression, boosted tree, nearest neighbors) provides the most accurate predictions for the percent change in stock price across the S&P 500?

Making sense of our results

So what does this all mean?

What now?

To build on these results, I’d love some more recent stock data, specifically data from the start of the pandemic in March 2020 to the present. I’d love to know whether a specific model is more resilient than the others to a sea change like that.

Data citations

“S&P 500 stock data” by Cam Nugent, uploaded on kaggle.com: S&P 500 stock data | Kaggle

My Medium page. Check out my website: manband.one