Stock Price Predictions with LSTM Neural Networks and Twitter Sentiment

Predicting the trend of stock prices is a central topic in financial engineering. Given the complexity and non-linearity of the underlying processes, we consider the use of neural networks in general and sentiment analysis in particular for the analysis of financial time series. As one of the biggest social media platforms with a worldwide user base, Twitter offers huge potential for such sentiment analysis. In fact, stocks themselves are a popular topic in Twitter discussions. Due to the real-time nature of the collective information, quasi-contemporaneous information can be harvested for the prediction of financial trends. In this study, we give an introduction to financial feature engineering as well as to the architecture of a Long Short-Term Memory (LSTM) network to tackle the highly nonlinear problem of forecasting stock prices. This paper presents a guide for collecting past tweets, processing them for sentiment analysis and combining them with technical financial indicators to forecast the stock price of Apple 30m and 60m ahead. An LSTM with the lagged close price is used as a baseline model. We are able to show that a combination of financial and Twitter features can outperform the baseline in all settings. The code to fully replicate our forecasting approach is available in the Appendix.


Introduction
"The most valuable commodity [...] is information." This famous quote by Gordon Gekko in Wall Street (1987) implies that, in the framework of the stock market, profits can be generated based on superior knowledge. This stands in contradiction to the Efficient Market Hypothesis (EMH), which "states that market price mirrors the assimilation of all the information available. As the new information enters the market the system immediately enters the unbalanced state and predicted correct change is eliminated by the new price. Hence given the information it is not possible to predict the future price of the stock." [2] However, in recent years the EMH has frequently been called into question as some researchers found evidence against it. Some recently published approaches show that the prediction of stock prices can be significantly improved by the use of knowledge from microblogs like Twitter. [8] achieves an accuracy of 87.6 % in predicting the daily movements of the Dow Jones Industrial Average (DJIA) by using a large-scale collection of tweets. Among others, [36] shows that this accuracy level is difficult to duplicate in different settings.
A crucial point in our framework is the proper feature engineering with tweets. There are many different possibilities to extract Twitter data and to create variables based on tweets. For example, one could use the data to assign tweets to different emotional categories or use it only for counting certain words of interest. Depending on the chosen method, the predictive power of the model might differ strongly. [8,36,11,55,38,59,44,42,43] Furthermore, the applied statistical method has a high impact on the achievement, as the prediction problem is more complicated than a simple linear regression issue due to the non-stationary stock price data. Therefore, [42], [47], [11], [40] and [14] use neural networks (NN), which are powerful tools to tackle highly nonlinear problems. More specifically, recurrent neural networks (RNN) are applied, as those networks are known to be a very good choice for time series data. In order to address the issue of exploding or vanishing gradients, the Long Short-Term Memory (LSTM) framework is used. [17] The structure of this paper is organized as follows: first, we provide a literature review regarding NN and Twitter data for stock price (index) predictions. Afterwards, in a section about feature engineering, we discuss various variables which will be relevant for the later empirical analysis. The focus will be especially on the Twitter covariates, which will be calculated partly with the help of the Python package TextBlob. Thirdly, we provide insights into the RNN and the LSTM approach, respectively. Moreover, we introduce Random Search as a sophisticated solution for optimizing hyperparameters. Finally, in our empirical analysis we forecast stock prices 30m and 60m ahead for Apple. This research contributes to the literature with a practical guide to forecast short-term stock prices with an LSTM including TextBlob-Twitter variables. The code to fully replicate our forecasting approach is available in the Appendix.

Neural Networks
As stated in the introduction, NN are powerful tools to model highly nonlinear problems and to capture complex relationships between variables, which allows for the flexible modelling of stock price time series. Various forms of NN exist, but during the past years two specific types have become established and are applied particularly frequently: firstly, the long short-term memory (LSTM), which belongs to the class of recurrent neural networks (RNN), and secondly the convolutional neural network (CNN).
Starting with the first-mentioned network, [42] analyzes various machine learning algorithms for daily stock market predictions and concludes that an approach based on LSTMs works best. Also, [47] uses an LSTM for predicting the National Stock Exchange Fifty (NIFTY 50). By trying different combinations of open, close, high and low prices as input parameters, they confirm the usefulness of the LSTM for stock price predictions. Further approval can be found in [11], where daily stock prices of Apple, Microsoft and Google are successfully predicted with an LSTM. The results of [40] for predicting the share prices of Apple, Google and Tesla show once again that an LSTM seems to be more accurate than traditional machine learning algorithms. Moreover, [14] compares the LSTM against a standard deep net and logistic regression for S&P 500 predictions from December 1992 until October 2015 and finds that the LSTM beats both approaches by a clear margin. Furthermore, [37] concludes that an LSTM performs better in predicting the close price of the iShares MSCI United Kingdom ETF than other machine learning algorithms. Finally, [53] compares several regression techniques for predicting stock prices in short intervals and concludes that the LSTM outperforms the other models by a large margin.
In recent years the usage of the class of CNN for modelling financial time series has received more attention in the literature. [13] compares different NN for daily S&P 500 predictions and derives that CNN architectures perform best for financial time series. [52] shows that CNNs have the best predictive qualities for stock prices. Additionally, [51] confirms with their study the high accuracy of CNNs in the prediction framework of three Thai stocks. Moreover, [22] uses various NNs to predict stock prices of five different companies from the National Stock Exchange (NSE) and the New York Stock Exchange (NYSE) and finds that the CNN outperforms the other models.
Most recently, the literature has shown a drive towards the combination of LSTM and CNN for stock price predictions. Two recent papers are [25] and [28] where both conclude that the hybrid form outperforms the competitors.

Twitter
In the following, we focus on publications that use Twitter data for stock price predictions. Considering the data collection, two main approaches can be found in the literature, both based on the Twitter API. On the one hand, in line with [11], [55] or [38], one can limit the extraction by defining certain search words, for example the company name and ticker symbol, to stream only tweets which are directly connected to the stock in question. [55] even analyzes the performance of different keywords and concludes that streaming tweets with the company ticker symbol provides better predictive performance for Apple stock prices. However, using this approach the total quantity of tweets per day is lower compared to streaming tweets with the full company name.
On the other hand, a further frequently employed method is the usage of geographical coordinates, e.g. just tweets from major cities in the United States, or emotional keywords, e.g. "I feel" or "I am", and to stream all tweets that are available. [8], [59] and [36] analyze Twitter text content and aggregate it into mood categories. Their results show a correlation between the Dow Jones Industrial Average and these categories. Further, their findings suggest that the accuracy of stock price index predictions can be significantly improved by using the mood categories.
In the following, we discuss the data processing needed to generate valuable features. The majority of publications directly focus on the evaluation of the sentimental content. However, only a few papers, like [33], use metadata beyond the expressed opinion of the user, such as the count of tweets per day. In the framework of the sentimental evaluation, there are two major ideas. Firstly, one frequently used approach is the classification of the tweets into multiple mood categories. In this context, probably the most well-known example for this method is the publication of [8]. By classifying tweets into the six mood categories calm, alert, sure, vital, kind, and happy, they obtain significant results especially for certain moods. The second popular method is a simpler binary classification of the tweets into positive and negative. [8] try this technique as well, but in contrast to the mood analysis do not find any predictive power for daily DJIA prices. [11] uses an extended version of the positive/negative classification by introducing seven different categories from very negative to very positive and confirms predictive power for daily stock prices of Apple, Microsoft and Google. Further confirmation of this finding can be found in [38], where tweets are only classified into three categories (positive, neutral and negative) and predictive power for daily up/down predictions for the Microsoft stock is concluded. [59] provides similar results with predictions for different tech companies based on positive and negative sentiment calculations. Moreover, a hybrid of both methods is used in [44], where a classification into positive and negative tweets is performed by counting emoticons or certain emotional keywords. [10] successfully predicts daily stock prices by using only two moods, i.e. happy and sad.
The happiness mood is also a promising variable in the analysis by [36] who calculates in total four different mood categories and is thereby able to improve daily predictions of the DJIA.
Only a few publications refute the predictive power of Twitter variables for stock price predictions. For example, [43] classifies tweets into eight mood categories and assesses the improvement of the daily predictions of the S&P 500 and DJIA as negligible. Additionally, [42] finds that only extreme polarities in tweets have a significant influence.

Feature Engineering
For our empirical analysis, we distinguish between "classical" financial variables that are accessible for instance via Yahoo Finance and "novel" covariates derived from Twitter.

Yahoo Finance Features
Starting with the first-mentioned feature category, we use five common variables provided by Yahoo Finance.
OHLC. Yahoo Finance offers the Open, High, Low and Close prices (OHLC) for each stock per time interval. Each type of price serves as a single input.
Volume. Additionally, we use the trading volume in a 30m interval. It reflects the current trading activities of a stock and therefore may help to improve predictions. The trading volume is used for example in the analysis by [1].

Twitter Features
Concerning the features based on tweets, we distinguish between two different kinds of covariate types. On the one hand, we want to capture the content of the Twitter data by calculating sentiments. On the other hand, we construct predictors which ignore the sentimental content of the tweets and only concentrate on descriptive statistics, e.g. number of tweets per time interval.
As shown in the literature review, many publications analyze the predictive power of sentiment-based variables. However, we contribute a more fundamental analysis by extracting more information from the available data. We assess whether simpler Twitter features could already improve the predictions of a NN; this is shown by [12], but only for a simple Support Vector Machine model.

Content Unrelated Variables
Firstly, we assess which content-unrelated information can be extracted from the data. We use descriptive statistics to capture different attributes of the data. Note that we introduce our variables regarding stock data in the interval of 30m, as this was the interval of our analyzed data. Let y i,d,t be the close price of company i on day d at time t with t ∈ {10.00 am, ..., 4.00 pm}. We then consider the subset of tweets for company i on day d posted from time t − 31 minutes until time t − 1 minute. Furthermore, let tw i,d,n be the n-th tweet on day d for company i with n = 1, ..., N, where N is the total number of tweets for the company on that day.
Frequency. One simple but plausible variable is the count of released tweets per 30m, which is shown in equation (1). In addition, we calculate the average number of released tweets per 1m during the 30m time horizon, illustrated in equation (2). It seems reasonable that the underlying information can offer predictive power, as one would expect the number of tweets to increase when there are positive or especially negative events for a company. [33] already observes this relationship.
Volatility. As shown in equation (3), we include the volatility of the released tweets as a predictor. We calculate the number of tweets per 1m and compute the volatility during the overall 30m time period. By using tweets per 1m we maintain a sufficient number of values for the volatility calculation. For instance, if one used the number of tweets per 15m, there would be only two values available for calculating the variance during the 30m time horizon of the stock data. In general, the idea behind this feature is similar to the frequency variable. We assume that on average the volatility should be comparable on a daily or weekly basis if nothing extraordinary happens to the company.
Max (Min) number of tweets. Additionally, we include the minimum and maximum number of tweets per 1m during the 30m stock data interval, as shown in equation (4). These features provide additional information for the model regarding the distribution of tweets during the 30m horizon. For example, if there is a sudden peak in the number of tweets per 1m, the average of tweets per 1m might not capture this information completely when all other values are very similar.
Length of tweets. The last feature is the average length of the tweets during the time interval. It is mainly motivated by the results of [12], who conclude that this variable offers less predictive power than, for example, the count of tweets. The general argument is that tweets tend to get longer when there is news regarding a company, as the users have new topics to discuss.
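As a minimal Python sketch, the descriptive features above can be computed from the raw tweets of one 30m window; the function name, the `(minute_offset, text)` input format and the use of the population variance are our own illustrative choices, not a literal rendering of equations (1)-(4):

```python
from statistics import pvariance

def content_unrelated_features(tweets):
    """Compute the descriptive Twitter features for one 30m window.

    `tweets` is a list of (minute_offset, text) pairs, where
    minute_offset is an int in 0..29 marking the minute within the
    window in which the tweet was posted (an illustrative format).
    """
    per_minute = [0] * 30
    for minute, _ in tweets:
        per_minute[minute] += 1

    n = len(tweets)
    return {
        "count_30m": n,                               # frequency, eq. (1)
        "avg_per_min": n / 30,                        # avg tweets per 1m, eq. (2)
        "volatility_per_min": pvariance(per_minute),  # variance of per-1m counts, eq. (3)
        "max_per_min": max(per_minute),               # eq. (4)
        "min_per_min": min(per_minute),
        "avg_length": sum(len(t) for _, t in tweets) / n if n else 0.0,
    }
```

Computing the counts per minute first makes the volatility, minimum and maximum features fall out of the same intermediate list.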

Content Related Variables
After determining content-unrelated variables, we focus on features that measure the content of a tweet. As shown in the literature review, two frequently used approaches are the assignment of multiple mood categories and a classification into positive or negative tweets. The first-mentioned categorization is mainly used for stock indices and the second for stock prices. Therefore, as the close prices of Apple are our prediction targets, we decided to use a sentiment classification. Some publications, for example [59] and [38], manually label the Twitter data. However, pretrained classifiers, such as those provided by the Python package TextBlob ([30]), already offer a very effective implementation for sentiment predictions. The calculations of the following variables are based on the analysis of [21].
Polarity. First, we calculate the average polarity based on the sentiment attribute of TextBlob for all tweets that are allocated to a 30m time interval. The polarity score ranges from -1 to 1. Scores greater than zero indicate positive content and values below zero negative content. If the score is equal to zero, the tweet is classified as neutral. In order to compute the polarity, we employ TextBlob's built-in dictionary approach, where each sentimental word, like "good" or "bad", has its own score, as illustrated in 1. If there are several sentimental words in one text, TextBlob in general calculates the simple average. However, deviating calculations can occur in the case of amplifications, e.g. very good, and negations, e.g. not good. This variable is motivated by the assumption that when the average sentiment is positive, the stock price should rise, and vice versa.
Subjectivity. Second, in a similar fashion we compute the average subjectivity. The subjectivity score ranges from 0 to 1. Scores close to zero indicate objective content and values close to one subjective content. The calculations are again based on a dictionary approach. An example similar to the one for the polarity can be found in 2. The main idea behind this variable is to have a counterpart to the polarity variable, as some tweets might be very positive, while the language used may indicate that the tweet is rather subjective.
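To illustrate the dictionary approach behind both scores, the following toy sketch averages word-level (polarity, subjectivity) pairs in the same spirit. The lexicon entries and scores below are purely illustrative and are not TextBlob's actual values; in practice one calls `TextBlob(text).sentiment` instead:

```python
# Toy stand-in for TextBlob's lexicon (the scores are illustrative,
# not TextBlob's actual values).
LEXICON = {
    "good": (0.7, 0.6),    # (polarity, subjectivity)
    "bad": (-0.7, 0.7),
    "great": (0.8, 0.8),
}

def score_tweet(text):
    """Average the (polarity, subjectivity) of all lexicon words in a tweet."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return 0.0, 0.0  # neutral / objective when no sentiment word occurs
    pol = sum(h[0] for h in hits) / len(hits)
    sub = sum(h[1] for h in hits) / len(hits)
    return pol, sub

def average_scores(tweets):
    """Average polarity and subjectivity over all tweets in a 30m window."""
    scores = [score_tweet(t) for t in tweets]
    n = len(scores)
    return (sum(p for p, _ in scores) / n,
            sum(s for _, s in scores) / n)
```

Note that this sketch omits the handling of amplifications and negations mentioned above, which TextBlob resolves with additional rules.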
Number of positive (negative) tweets. Third, we classify the single tweets into positive, negative or neutral groups based on their polarity scores. Afterwards, we count the number of positive and negative tweets in a 30m time interval. The cardinality of the subset of all tweets classified as positive, x i,10,d,t, is shown in equation (5). Analogously, x i,11,d,t is the cardinality of the subset of all tweets classified as negative. The idea is to have additional information on the amount of positive or negative classifications, since the previously described average polarity only ranges between -1 and 1.
Share of positive (negative) tweets. Finally, the previously described variables are extended by introducing percentage shares for positive and negative tweets, in comparison to the total number of tweets per time interval.
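A minimal sketch of the count and share features, given one polarity score per tweet (the function and key names are illustrative):

```python
def polarity_counts(polarities):
    """Count positive/negative tweets and their shares in one 30m window.

    `polarities` holds one polarity score per tweet, e.g. as returned
    by TextBlob's sentiment attribute.
    """
    n_pos = sum(1 for p in polarities if p > 0)   # positive: score > 0
    n_neg = sum(1 for p in polarities if p < 0)   # negative: score < 0
    total = len(polarities)
    return {
        "n_pos": n_pos,
        "n_neg": n_neg,
        "share_pos": n_pos / total if total else 0.0,
        "share_neg": n_neg / total if total else 0.0,
    }
```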

RNN
The usage of a classical NN is limited because it takes a fixed-size vector as input and produces a fixed-size vector as output. A RNN is an extension that is able to process sequential data. The RNN iterates through a sequence and produces output at each time step. During this process, the network keeps an additional state besides the output to save information about historic values. These memory units enable the RNN to model short-term dependencies. The model can take the ordering of the elements and the relations between the elements of a sequence into account. A basic illustration of the approach can be found in Figure 1. A basic RNN consists of six major components. At each time point t the model receives a new network input, denoted by x t. Then the internal network state at time t, i.e. s t, is generated based on the weighted network input (x t U) and the weighted previous internal state (s t−1 W). By combining them in an activation function ϕ s (x t U + s t−1 W) one receives s t. U defines the contribution of x t to s t and W the contribution of s t−1 to s t. Further, y t denotes the network output at time t and is generated by the weighted internal state at this time, transformed by a second activation function ϕ y (s t V). Therefore, the contribution of s t to y t is defined by V. Even though this approach is more sophisticated for processing time series data than classical NNs, it still has a major drawback known as the vanishing and exploding gradients problem. During the processing of longer input sequences an activation function can produce a very small gradient, which indicates low importance. Therefore, the RNN forgets about this step. [58]
Figure 2 shows the basic components of an LSTM cell. The current cell state is denoted by c t. This is the LSTM's memory, which carries information along the chain of cells. Based on the interaction with the gate structure, the information in the cell state is modified in each cell.
Every LSTM cell receives as input the previous hidden state h t−1 , the previous cell state c t−1 and the current input x t . The first sigmoid layer is also called the forget gate (6) because the output selects the amount of information to be included from the previous cell in the current cell.
The output is a number in the interval [0,1] which is pointwise multiplied with the previous cell state. If the gate outputs a zero, the previous cell state is completely discarded, while an output of one retains it fully. Through this process, the gate decides which information should be kept or forgotten, which gives it the name forget gate.
Like the first sigmoid layer, the second sigmoid layer (7) takes as input the hidden state of the previous cell and combines it with the current input.
To regulate the network, the previous hidden state and the current input also go into the tanh layer (8). Afterwards, the outputs of the tanh and sigmoid functions are multiplied pointwise, where the sigmoid output decides which information from the tanh output is important to keep. For this reason, the layer is also called the input gate (9).
The output of the current cell state (11) is also the result of a pointwise multiplication of a tanh and a sigmoid layer. The sigmoid layer determines which part of the cell state will be presented in the output, whereas the tanh layer shifts the output into the range [-1,1]. These gates determine which data goes into the next step and regulate the optimization of the weights. If the gradient vanishes during backpropagation, it will also vanish during forward propagation. Therefore, this weight will not impact the estimation and will not be optimized within backpropagation. [58]
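The gate logic described above can be sketched for the one-dimensional case; the weights in `p` are illustrative placeholders, and biases and vector dimensions are omitted for brevity:

```python
from math import exp, tanh

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def lstm_cell(x, h_prev, c_prev, p):
    """Scalar sketch of one LSTM step. `p` holds illustrative weight
    pairs (applied to h_prev and x respectively); real implementations
    use weight matrices and bias vectors."""
    f = sigmoid(p["w_f"][0] * h_prev + p["w_f"][1] * x)   # forget gate
    i = sigmoid(p["w_i"][0] * h_prev + p["w_i"][1] * x)   # input gate
    g = tanh(p["w_g"][0] * h_prev + p["w_g"][1] * x)      # candidate values
    o = sigmoid(p["w_o"][0] * h_prev + p["w_o"][1] * x)   # output gate
    c = f * c_prev + i * g        # new cell state: keep old part, add new part
    h = o * tanh(c)               # new hidden state / output
    return h, c
```

With zero inputs all gates open halfway (sigmoid(0) = 0.5), so half of the previous cell state is carried over, which illustrates how the forget gate scales the memory.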

Hyperparameters
As the optimization and selection of the hyperparameters play an important role in the performance of an LSTM, the following part gives an overview of how we chose the tuning parameters. Hyperparameters are manually set training variables that are determined before training the model. One major problem of NNs is over- and underfitting. On the one hand, if the model overfits, the NN performs very well on the training and validation datasets, but very poorly on out-of-sample data. Due to a high number of parameters and, therefore, complexity, the model has become so effective in explaining the specific characteristics of the training set that it has lost the ability to generalize. As a result, the model will not be able to make stable and consistent predictions for new data. On the other hand, if a model has a simpler structure, it tends to underfit the data because it overgeneralizes the given input. This issue is known as the bias-variance trade-off, where a high variance usually corresponds to overfitting and a higher bias to underfitting [45,49].
For finding a good balance between bias and variance in the framework of an LSTM, it is important to carefully select the main hyperparameters, which are the number of epochs, batch size, activation function, optimizer, and the number of hidden layers and neurons. The number of epochs determines how many times the learning algorithm works through the entire training dataset. Thus, the number of epochs can be any integer between one and infinity, where lower values tend to create underfitting and larger values lead to an overfitting problem, as the learning algorithm has too many rounds to over-optimize the weights of the LSTM. A training epoch can be further split into many iterations based on the batch size. It defines the number of samples to work through before the internal parameters of the model are updated. Therefore, the batch size has a huge impact on how quickly a model learns and on the stability of the learning process. Often smaller batch sizes are used because they are noisy, providing a regularizing effect and a decreased generalization error. [62] Due to this fact we decide to use a batch size of 32, which is also confirmed as a suitable default by [34] and by Bengio's practical recommendations for gradient-based training.
To increase the performance of the model, the right optimizer plays an important role as well. There are many variants that could be used in our framework. A common optimizer class is stochastic gradient descent (SGD). Variants of the vanilla SGD are momentum optimization, Nesterov accelerated gradient, and adaptive learning-rate methods, e.g. AdaGrad, Adam or RMSProp. We will not go into detail regarding the differences; a detailed description of the different approaches can be found in [17]. In our empirical analysis we use Adam (adaptive moment estimation), which works well in practice and outperforms other adaptive learning-rate algorithms. Accordingly, the optimizer is used, among others, in [13], [37] and [27]. A reason for this good performance is that Adam combines the benefits of both the AdaGrad and RMSProp algorithms. [29,48] Equation (12) shows the main steps of Adam. One can see that the Adam optimizer calculates an exponential moving average of the gradient of the cost function with respect to the weights, ∇ θ J(θ), captured by m, and of the squared gradient (∇ θ J(θ))², captured by s. These moving averages are estimates of the first and second moments of the gradient and are initialized as vectors of zeros. This leads to moment estimates which are biased during the initial time steps and when the decay rates are small. To counteract this bias, the bias-corrected estimates m̂ and ŝ are introduced. The parameters β 1 and β 2 control the decay rates of these moving averages, and t represents in this formula the iteration number, starting at one. The vector of weights θ is then updated by dividing m̂ by the sum of the square root of ŝ and a smoothing term ϵ, which is a very small number to prevent any division by zero. The quotient is afterwards weighted by the learning rate η [29,17,48]. Proceeding, a loss function has to be determined, which will be the minimization target of the optimizer.
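A scalar sketch of the Adam update just described (the default values follow commonly used settings; this is an illustration of equation (12), not our training code):

```python
def adam_step(theta, grad, m, s, t,
              eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update: exponential moving averages of the
    gradient (m) and squared gradient (s), bias correction, then the
    weight update. t is the iteration number, starting at one."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    s = beta2 * s + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected estimates
    s_hat = s / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (s_hat ** 0.5 + eps)
    return theta, m, s
```

At t = 1 the bias correction exactly undoes the (1 − β) factor of the first averaging step, which is why the very first update already has a sensible magnitude.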
We decide to use the MSE, which is shown in equation (13), as it is a popular choice for the evaluation of financial time series predictions. In this formula N stands for the total number of observations. The MSE is frequently used in other publications such as in [39], [57], [31], [56], [13], [37], [55], [4] and [54].
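For illustration, the MSE of equation (13) in a minimal form:

```python
def mse(y_true, y_pred):
    """Mean squared error over N observations, as in equation (13)."""
    n = len(y_true)
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n
```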
Moreover, the number of hidden layers and neurons can have a huge impact on the success of the predictions. A higher number for at least one of the parameters makes it possible to cover a larger complexity of the relationship between input and output variables. However, one has to pay attention that the model does not over- or underfit. Therefore, finding the right combination can be a challenging task. Among others, [17] recommends constructing complex models with many parameters and then applying regularization techniques. Regularization incorporates different approaches to enable neural networks to choose models that generalize well. The aim is to minimize the variance for more accurate predictions on new input sets, without increasing the bias due to a systematic failure [6, p. 284].
On the one hand, we decide to implement early stopping, a method that makes it possible to specify an arbitrarily large number of training epochs. It stops the training if no further significant improvements on the validation set are achieved and automatically remembers the round that led to the best result. [7,17,18,54] On the other hand, a second commonly used regularization method that we implement are dropout layers, which also help to reduce the problem of overfitting. While training a network, the dropout method randomly excludes input connections with a given probability, so that their data cannot be used for the next step. In Keras, when creating an LSTM layer, this is specified with a dropout argument. The dropout value is a percentage between 0 (no dropout) and 1. [61,54,16,41] Nevertheless, even though simpler models can potentially be ignored thanks to regularization, there remains an infinite number of more complex models that could be constructed. Due to this fact, random search offers a handy implementation for trying various tuning parameter combinations to receive the most promising architecture. This method is known to be more efficient than a grid search, as not all hyperparameters are equally important, which is illustrated in Figure 3.
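The early-stopping logic described above can be sketched as follows; this is a simplified illustration of what e.g. Keras' EarlyStopping callback (with restore_best_weights) provides, and the parameter names are illustrative:

```python
def early_stopping(val_losses, patience=5):
    """Stop once the validation loss has not improved for `patience`
    epochs, and report the epoch with the best result."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss       # remember best round
        elif epoch - best_epoch >= patience:
            break    # no improvement for `patience` epochs: stop training
    return best_epoch, best_loss
```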
Additionally, trying all constellations can be very time-consuming, especially for a high number of parameters with large ranges. In general, the user defines a flexible architecture with a predefined range for all tuning parameters and a certain number of trials. Afterwards, for each trial, random search decides independently on the parameter combination and fits the model several times. It makes sense to fit the same model several times, as the weights are randomly initialized at the beginning. [6,20,15] We use a random search with a maximum of 50 trials and 3 executions per trial. The first LSTM cell can have from 17 to 500 recurrent units and, as an activation function, the common options relu, tanh and sigmoid. For the number of hidden LSTM layers we define a range between 0 and 3, where each layer can have between 17 and 250 recurrent units. If the random search specifies a model with hidden layers, there is also a dropout per hidden layer with a minimum of 5% and a maximum of 95%.
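The search space just described can be sketched as follows; this is a simplified stand-in for the actual Keras Tuner run, where `evaluate` stands for fitting the model and returning its validation loss, and the names and structure are illustrative:

```python
import random

def sample_architecture(rng):
    """Draw one random hyperparameter combination from the search
    space described in the text."""
    n_hidden = rng.randint(0, 3)                      # 0-3 hidden LSTM layers
    return {
        "first_units": rng.randint(17, 500),          # first LSTM cell
        "activation": rng.choice(["relu", "tanh", "sigmoid"]),
        "hidden_layers": [
            {"units": rng.randint(17, 250),
             "dropout": rng.uniform(0.05, 0.95)}      # 5%-95% dropout
            for _ in range(n_hidden)
        ],
    }

def random_search(evaluate, max_trials=50, executions_per_trial=3, seed=0):
    """Fit each sampled architecture several times (the weights are
    randomly initialized) and keep the one with the best average loss."""
    rng = random.Random(seed)
    best, best_loss = None, float("inf")
    for _ in range(max_trials):
        arch = sample_architecture(rng)
        loss = sum(evaluate(arch)
                   for _ in range(executions_per_trial)) / executions_per_trial
        if loss < best_loss:
            best, best_loss = arch, loss
    return best, best_loss
```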

Data preparation
Two different data sources are used in this study. On the one hand, we stream the stock prices for Apple via the yfinance API in Python. On the other hand, we create Twitter Developer Accounts to stream past Twitter data via the tweepy ( [46]) API in Python. As shown in the literature review for stock price predictions based on Twitter variables, a frequently used approach to extract tweets for a company is to use the respective ticker symbol (e.g. AAPL) and cashtag (e.g. $AAPL). We use both as search words in separate queries.
As part of the data cleaning, we remove mentions, hashtags, retweets and URLs from the collected Twitter data. We decide against further preprocessing, since we use the Python package TextBlob ([30]), for which further preprocessing is not necessarily useful. For example, after stemming, TextBlob can no longer recognize some words, as it uses a dictionary approach based on full words. Lemmatization can be problematic as well, because it groups inflected forms of a word into the same expression. For example, "better" would be changed to "good". However, TextBlob has different polarity and subjectivity scores for both words, so we decide against lemmatization. We use the polarity and subjectivity attributes of TextBlob to calculate scores per cleaned tweet. Afterwards, we classify the tweets into positive (score > 0), negative (score < 0) and neutral (score == 0) based on the polarity.
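A minimal sketch of the cleaning step; the regular expressions are illustrative simplifications of our actual cleaning:

```python
import re

def clean_tweet(text):
    """Remove retweet markers, URLs, mentions and hashtags before
    the TextBlob scoring."""
    text = re.sub(r"\bRT\b", " ", text)          # retweet marker
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = re.sub(r"[@#]\w+", " ", text)         # mentions and hashtags
    return " ".join(text.split())                # normalize whitespace
```

Removing URLs before hashtags matters, since fragments of a stripped URL could otherwise survive.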
Since we analyze intervals of 30m, all further explanations refer to this period length. The tweets are assigned to the stock prices via a moving window approach. Every 30m we receive a new close price. The tweets of this time interval are used for the calculation of the Twitter features. We have to point out that the previously described procedure remains the same at the beginning of the week and of a trading day. As such, we lose tweets posted on the weekend or during the night.
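The assignment of tweets to 30m close prices can be sketched as follows. This is a simplified illustration: each tweet maps to the first 30m-aligned close that is at least one minute later, and the one-minute overlap at the window edges implied by the t − 31m to t − 1m definition is resolved towards the earlier close:

```python
from datetime import datetime, timedelta

def assign_close(ts):
    """Map a tweet timestamp to the 30m close price it feeds: the
    tweets used for the close at time t are those posted between
    t - 31m and t - 1m."""
    minutes = ts.hour * 60 + ts.minute + 1       # a tweet at t-1m still reaches the close at t
    slot = -(-minutes // 30)                     # ceiling division to the next 30m mark
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(minutes=30 * slot)
```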

Results
We use a Random Search with the same range of hyperparameters. We separate the collected data into training, validation and test sets. For the first two categories we use data from 30th November 2020 until 15th January 2021. Furthermore, we apply the very common training-validation split of 80/20. For the test set we take the remaining data from 16th January 2021 to 31st January 2021. An illustration of the number of tweets per day for the ticker symbol is shown in Figure 4 and for the cashtag in Figure 5. It can be seen that the daily tweet number for AAPL is higher than for $AAPL. However, both search words show a similar pattern over time. On the weekends the daily number decreases in comparison to weekdays. In addition, both figures display a peak a few days before Christmas and an increased volume for the last days in January. The latter could be explained by the discussions about WallStreetBets and GameStop on Twitter, where many tweets included various cashtags or ticker symbols. For the ticker symbol one further peak can be observed at the beginning of 2021, which could be related to the news that Apple will release an autonomous electric car in the next five to seven years. We start with our baseline model, which is an LSTM with the lagged close price as the only variable. Proceeding, we combine this baseline feature with every other lagged covariate separately. Afterwards, we take the variable combination which leads to the best result and combine this selection with all other remaining covariates, again separately. Subsequently, we take again the best combination and proceed in a similar fashion as before until no further improvement in the loss function is observable. We chose this procedure because we have 17 features in total, so that the analysis of all possible variable combinations would be too time-intensive.
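This greedy forward-selection procedure can be sketched as follows. In this simplified illustration, `evaluate` stands for fitting the LSTM on a feature combination and returning the validation MSE, and the feature names are placeholders:

```python
def forward_selection(features, evaluate, base=("lag_close",)):
    """Greedy forward selection: start from the baseline (lagged close
    price), add the single feature that most improves the loss, and
    repeat until no candidate improves it further."""
    selected = list(base)
    best_loss = evaluate(tuple(selected))
    improved = True
    while improved:
        improved = False
        for f in [f for f in features if f not in selected]:
            loss = evaluate(tuple(selected + [f]))   # try adding one feature
            if loss < best_loss:
                best_loss, best_feature, improved = loss, f, True
        if improved:
            selected.append(best_feature)            # keep the best addition
    return selected, best_loss
```

With 17 features this evaluates at most a few dozen combinations instead of all 2^17 subsets.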
On the other hand, we separately try models with all Twitter variables and with all Yahoo Finance features, but those combinations are not able to beat the baseline approach. Hence, we can confirm the findings of [60] that too many features might reduce the predictive power. The results for the Apple stock are illustrated in three different tables. The MSE summary for the data where the Twitter variables are based only on the ticker symbol can be seen in Table 1. Table 2 shows the results for the data set based only on the cashtag and Table 3 for the combined Twitter data of ticker symbol and cashtag. Table 4 in the Appendix lists the abbreviations used for all features. In all settings, we are able to beat the baseline approach for both forecasting horizons. For the ticker symbol data the combination of lagged close price, trading volume, and average polarity per tweet leads to the best result for the shorter forecasting horizon. In contrast, for the 60m period a more complex variable combination with lagged close price, number of positive tweets, average polarity per tweet, and volatility of the number of tweets per minute minimizes the MSE. For the cashtag the results for the 30m horizon again suggest a less complex variable combination: the best performing combination here was, in addition to the lagged close price, the number of negative tweets and the minimum number of tweets per minute. For the longer forecasting horizon the same variables as in the 30m horizon plus the volatility of the number of tweets per minute give the best performing model. Considering the combination of AAPL and $AAPL, the results in Table 3 confirm the previous findings. On the one hand, the shorter horizon stops improving sooner; in fact, only adding the feature for the number of negative tweets is valuable.
‡ Note that on 17th January 2021 the streaming did not work correctly, which is observable in both figures. However, this did not affect the analysis as it was a Sunday.
On the other hand, the number of negative tweets, the number of positive tweets, and the average subjectivity per tweet improve the results for the 60m forecasting horizon. Even though the variable combinations for each approach differ slightly, some features repeatedly lead to the best MSE. It could therefore be of interest to compare more stocks in future studies to determine whether some Twitter variables are consistently useful.
Comparing the best MSE for each of the three Apple settings, it seems that streaming Twitter data which is directly connected to the financial discussion increases the predictive power. We find that for both forecast horizons the smallest MSE is obtained if we use only the $AAPL tweets.

Conclusion
This paper provides a practical guide for stock price predictions with Twitter data by using a combination of an LSTM and sentiment classification. We offer a pathway to extend the modelling framework beyond common financial variables by using innovative variables based on Twitter data. On the one hand, we evaluate content-unrelated characteristics of the data, such as the average number of tweets per minute. On the other hand, we generate sentiment variables, such as average polarity and subjectivity. We show in a case study for the stock prices of Apple that these novel features improve the performance of our baseline LSTM model. Our study not only shows how to use innovative Twitter-based variables for forecasting purposes, but also indicates their potential to improve the modelling capacity.

Listing 2: TextBlob example for subjectivity calculation.

# last date for which tweets will be extracted
for interval in ["1m", "5m", "15m", "30m", "1h"]:
    stock_data = yf.download(ticker_list, start=start_date, end=last_date, interval=interval)
    stock_data.to_csv(path + filename + "_" + start_date + "_" + last_date + "_" + interval + ".csv")

# ############################################
# Twitter Data Stream
# ############################################
import tweepy as tw
import pandas as pd
import datetime as dt