A.I and the Stock Market: The Training Dilemma

As I create the embeddings I will use in my upcoming trading algorithm, I am faced with yet another question. When training a recurrent neural network on stock market data, is it best to train the network on specific stocks/ETFs (embeddings tailored to a single security, e.g. AAPL_embeddings), or to train it on all of the data available?
If I train my network on data from all stocks and ETFs, it would have more information to learn from, which should produce a more experienced neural network and more meaningful embeddings. That would give me more breadth in identifying patterns in the markets, since there would be more historical embeddings in my data set. However, it could also muddy the waters: some patterns may be specific to particular stocks or ETFs, and those patterns would get lost among thousands of other data points. If instead I train my network solely on historical AAPL prices, my trading algorithm would only be able to analyze current AAPL prices and try to match them to patterns in past AAPL prices. When a match is found, it should be more likely to be correct, because the embeddings are so specific to AAPL.
My opinion is that I should train my network on all of my data, since the market patterns I’m looking for are universal (e.g. debt-driven market cycles and short squeezes are not isolated to single stocks, so these patterns should appear in the behavior of all stocks and ETFs). The fact that I’m using percentage price shifts as data points should make the training even more universally applicable.
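As a rough sketch of what I mean (the bucket names and boundaries here are purely illustrative, not my final scheme), turning any ticker’s closing prices into percentage-shift “words” might look something like this:

```python
import pandas as pd

def to_pct_tokens(closes: pd.Series) -> list:
    """Turn daily closing prices into coarse percentage-change 'words',
    so sequences from different tickers share a single vocabulary."""
    pct = closes.pct_change().dropna() * 100
    # Illustrative buckets only -- the real boundaries would need tuning.
    bins = [-float("inf"), -5, -2, -0.5, 0.5, 2, 5, float("inf")]
    labels = ["crash", "big_down", "down", "flat", "up", "big_up", "surge"]
    return list(pd.cut(pct, bins=bins, labels=labels).astype(str))

# The same vocabulary works for AAPL, SPY, or any other ticker.
print(to_pct_tokens(pd.Series([150.0, 151.2, 149.0, 149.1, 156.8])))
# -> ['up', 'down', 'flat', 'surge']
```

Because every ticker is described with the same small vocabulary of relative moves, sequences from different stocks and ETFs become directly comparable.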
Essentially, there’s a trade-off between specificity and quantity of data. This question isn’t limited to stock market AI; it applies to all forms of artificial intelligence and neural networks, and it’s actually quite an interesting discussion. Some, such as Google’s Peter Norvig, argue that today’s algorithms are no better than the ones we had in the past; we simply have more data and better technology to train them with. Gottfried Leibniz detailed binary systems and logical deduction more than 300 years ago; he just didn’t have the technology to implement his systems. Nowadays, all modern computers use hardware based on his binary system.
In cases of supervised machine learning, you definitely want more specific data over more data, as supervised learning algorithms train on labelled examples which have to be carefully (and accurately) sorted. Consider an algorithm which aims to recognize the letter E. You would need to train it on thousands of pictures of the letter E and thousands of pictures of letters that aren’t E, and for each training example you need to tell the computer whether or not it is an E, so it knows what it is learning. Simply feeding millions more letters into your training set without carefully labelling them will only damage the accuracy of your system. Here’s one example of this, where Stanford researchers found that more data isn’t better for automatically classifying chest X-rays.
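To make the labelling requirement concrete, here is a toy sketch (random numbers stand in for real letter images, and the data and names are made up for illustration) of how a supervised classifier only ever learns from the labels you hand it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: flattened 8x8 glyph images with hand-checked labels.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 64))                 # stand-in for real letter images
y_train = rng.choice(["E", "not_E"], size=1000)  # every example needs a verified label

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)   # the model learns only from the labels it is given

# Piling on unlabelled (or sloppily labelled) images would not help here:
# fit() needs a y for every X, and noisy y's degrade the decision boundary.
print(clf.predict(rng.random((1, 64))))
```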
Thankfully, NLP A.I is almost always a form of unsupervised learning. Think of Word2Vec, which learns word relationships by reading through a corpus of text. It doesn’t matter which corpus you use, and you don’t need to label each word, since Word2Vec learns through context. This means that the more data (text) you train your Word2Vec model on, the more accurately your algorithm will predict which words appear in which contexts. And that’s not just me speculating:
[Graph: test accuracy of four different NLP algorithms plotted against training set size, each increasing as more words are added.]
This is taken from Scaling to Very Very Large Corpora for Natural Language Disambiguation, a 2001 paper published by Microsoft NLP researchers. In the paper, they compare the performance of various natural language processing algorithms, and in the graph above we see how the accuracy of four (drastically) different algorithms all increases (almost linearly!) as more words are added to the training set. Word2Vec shares this property, as is acknowledged in various papers, such as Improving the Accuracy of Pre-Trained Word Embeddings for Sentiment Analysis: “The accuracy of the Word2vec and Glove depends on text corpus size. Meaning, the accuracy increases with the growth of text corpus.”
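To see what learning “through context, with no labels” looks like in code, a minimal run with gensim’s Word2Vec is enough (I’m assuming the gensim 4.x API here, and a toy corpus far smaller than anything you’d actually use):

```python
from gensim.models import Word2Vec

# Tiny toy corpus; in practice this would be millions of sentences.
sentences = [
    ["the", "market", "rallied", "after", "the", "earnings", "report"],
    ["the", "market", "crashed", "after", "the", "rate", "hike"],
    ["traders", "sold", "shares", "after", "the", "earnings", "miss"],
]

# No labels anywhere: the model learns purely from which words co-occur.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2)
print(model.wv.most_similar("market", topn=3))
```

Nothing about this setup has to change as the corpus grows; you just feed it more sentences.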
Taking all of this into consideration, I still think it is best to train my Word2Vec embeddings on as much stock market data as possible, rather than creating separate embeddings that focus on one stock/ETF at a time. One thing I will mention is that I am shocked by how little literature there is on training RNNs on stock market data, with specific reference to this question of ‘more data vs. better data’. Once again, I think this comes down to two possibilities: either people aren’t training NLP machine learning models on long-term stock market data, or they just don’t want to share their findings. I still think it’s the latter.
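Concretely, the “train on everything” option might look roughly like this, reusing the illustrative to_pct_tokens() helper sketched earlier (gensim 4.x API assumed, with a two-ticker toy data set standing in for my full one):

```python
import pandas as pd
from gensim.models import Word2Vec

# Toy stand-in for the full data set: {ticker: series of daily closes}.
price_data = {
    "AAPL": pd.Series([150.0, 151.2, 149.0, 149.1, 156.8]),
    "SPY":  pd.Series([430.0, 428.5, 431.0, 433.2, 432.9]),
}

# One 'sentence' per ticker, all sharing the percentage-change vocabulary
# produced by the to_pct_tokens() helper from the earlier sketch.
all_sentences = [to_pct_tokens(closes) for closes in price_data.values()]

# A single embedding space learned from every stock/ETF at once.
market_model = Word2Vec(all_sentences, vector_size=16, window=2,
                        min_count=1, workers=2)
print(market_model.wv.index_to_key)   # shared vocabulary of move-sized 'words'

# The per-ticker alternative would instead be one model per symbol, e.g.:
# aapl_model = Word2Vec([to_pct_tokens(price_data["AAPL"])], vector_size=16,
#                       window=2, min_count=1)
```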