Goal: predict prices of S&P 500 stocks from Yahoo Finance data using recurrent neural network models. I am attaching sample code written by someone else (https://github.com/lilianweng/stock-rnn); this is not ours and needs to be rewritten. We also need to incorporate sentiment analysis of Twitter, using either NLTK (https://www.nltk.org/api/nltk.sentiment.html) or the Stanford CoreNLP Python wrapper (https://github.com/smilli/py-corenlp/), and determine which percentage balance between the neural network forecast and the sentiment signal gives the most accurate prediction of the real price. The code must be in Python. For the sentiment component, we would simply search Twitter for the stock's ticker; for example, Apple stock would be #AAPL on Twitter, and we would measure the sentiment of those tweets.
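One way the requested blend could be wired up, sketched with NLTK's VADER analyzer. Everything here beyond the NLTK call is an illustrative assumption, not part of the spec: the `blended_forecast` weighting scheme, the ±2% sentiment price shift, and the numeric inputs are made up; a real run would fetch tweets for "#AAPL" via the Twitter API and take `rnn_forecast` from the trained model.

```python
def mean_sentiment(tweets):
    """Average VADER compound score over a list of tweet texts, in [-1, 1].
    Requires: pip install nltk, then nltk.download('vader_lexicon')."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()
    scores = [sia.polarity_scores(t)["compound"] for t in tweets]
    return sum(scores) / len(scores) if scores else 0.0


def blended_forecast(rnn_forecast, last_close, sentiment, alpha):
    """Blend the RNN's price forecast with a naive sentiment-shifted price.
    alpha=1.0 -> pure RNN; alpha=0.0 -> pure sentiment. The sentiment leg is
    a crude placeholder: shift the last close by up to +/-2% in proportion to
    the mean compound score. Sweeping alpha over [0, 1] against held-out real
    prices is one way to search for the "percent balance" the brief asks for.
    """
    sentiment_price = last_close * (1.0 + 0.02 * sentiment)
    return alpha * rnn_forecast + (1.0 - alpha) * sentiment_price


# Example with an assumed mildly positive #AAPL sentiment of 0.35
# (in practice, pass mean_sentiment(tweets) over tweets fetched for "#AAPL"):
print(blended_forecast(rnn_forecast=152.0, last_close=150.0,
                       sentiment=0.35, alpha=0.7))
```

The grid search over `alpha` would then be scored against realized closing prices (e.g. by RMSE) to pick the balance.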
stock-rnn/README.md

### Predict stock market prices using RNN

Check my blog post "Predict Stock Prices Using RNN": [Part 1](https://lilianweng.github.io/lil-log/2017/07/08/predict-stock-prices-using-RNN-part-1.html) and [Part 2](https://lilianweng.github.io/lil-log/2017/07/22/predict-stock-prices-using-RNN-part-2.html) for the associated tutorial.

One thing I would like to emphasize: because my motivation was more to demonstrate how to build and train an RNN model in TensorFlow, and less to solve the stock-prediction problem itself, I didn't try too hard to improve the prediction outcomes. You are more than welcome to take this repo as a reference point and add more stock-prediction ideas to improve it. Enjoy.

1. Make sure `tensorflow` has been installed.
2. First download the full S&P 500 data from [Yahoo! Finance ^GSPC](https://finance.yahoo.com/quote/%5EGSPC?p=^GSPC) (click the "Historical Data" tab and select the max time period), and save the .csv file to `data/SP500.csv`.
3. Run `python data_fetcher.py` to download the prices of the individual stocks in the S&P 500, each saved to `data/{{stock_abbreviation}}.csv`. (NOTE: the Google Finance API returns prices for at most 4000 days. If you are curious about data from even earlier times, try modifying the `data_fetcher.py` code to send multiple queries per stock. Here is the data archive ([stock-data-lilianweng.tar.gz](https://drive.google.com/open?id=1QKVkiwgCNJsdQMEsfoi6KpqoPgc4O6DD)) of stock prices I crawled up to Jul 2017. Please untar this file to replace the "data" folder in the repo for test runs.)
4. Run `python main.py --help` to check the available command-line args.
5. Run `python main.py` to train the model. For example:
   - Train a model only on SP500.csv; no embedding
     ```bash
     python main.py --stock_symbol=SP500 --train --input_size=1 --lstm_size=128 --max_epoch=50
     ```
   - Train a model on 100 stocks; with embedding of size 8
     ```bash
     python main.py --stock_count=100 --train --input_size=1 --lstm_size=128 --max_epoch=50 --embed_size=8
     ```
   - Start your TensorBoard
     ```bash
     cd stock-rnn
     mkdir logs
     tensorboard --logdir ./logs --port 1234 --debug
     ```

My python environment: Python version == 2.7

```
BeautifulSoup==3.2.1
numpy==1.13.1
pandas==0.16.2
scikit-learn==0.16.1
scipy==0.19.1
tensorflow==1.2.1
urllib3==1.8
```

stock-rnn/data_model.py

```python
import numpy as np
import os
import pandas as pd
import random
import time

random.seed(time.time())


class StockDataSet(object):
    def __init__(self, stock_sym, input_size=1, num_steps=30,
                 test_ratio=0.1, normalized=True, close_price_only=True):
        self.stock_sym = stock_sym
        self.input_size = input_size
        self.num_steps = num_steps
        self.test_ratio = test_ratio
        self.close_price_only = close_price_only
        self.normalized = normalized

        # Read csv file
        raw_df = pd.read_csv(os.path.join("data", "%s.csv" % stock_sym))

        # Merge into one sequence
        if close_price_only:
            self.raw_seq = raw_df['Close'].tolist()
        else:
            self.raw_seq = [price for tup in raw_df[['Open', 'Close']].values for price in tup]

        self.raw_seq = np.array(self.raw_seq)
        self.train_X, self.train_y, self.test_X, self.test_y = self._prepare_data(self.raw_seq)

    def info(self):
        return "StockDataSet [%s] train: %d test: %d" % (
            self.stock_sym, len(self.train_X), len(self.test_y))

    def _prepare_data(self, seq):
        # split into items of input_size
        seq = [np.array(seq[i * self.input_size: (i + 1) * self.input_size])
               for i in range(len(seq) // self.input_size)]

        if self.normalized:
            # normalize each item against the last price of the previous item
            seq = [seq[0] / seq[0][0] - 1.0] + [
                curr / seq[i][-1] - 1.0 for i, curr in enumerate(seq[1:])]

        # split into groups of num_steps
        X = np.array([seq[i: i + self.num_steps] for i in range(len(seq) - self.num_steps)])
        y = np.array([seq[i + self.num_steps] for i in range(len(seq) - self.num_steps)])

        train_size = int(len(X) * (1.0 - self.test_ratio))
        train_X, test_X = X[:train_size], X[train_size:]
        train_y, test_y = y[:train_size], y[train_size:]
        return train_X, train_y, test_X, test_y

    def generate_one_epoch(self, batch_size):
        num_batches = int(len(self.train_X)) // batch_size
        if batch_size * num_batches < len(self.train_X):
            num_batches += 1
        batch_indices = range(num_batches)
        random.shuffle(batch_indices)
        for j in batch_indices:
            batch_x = self.train_X[j * batch_size: (j + 1) * batch_size]
            batch_y = self.train_y[j * batch_size: (j + 1) * batch_size]
            assert set(map(len, batch_x)) == {self.num_steps}
            yield batch_x, batch_y
```

stock-rnn/.gitignore

```
*.*~
*.pyc
*.ipynb
data/*.tsv
data/*.csv
logs/*
models/*
checkpoints/*
images/*
.idea/
.ipynb_checkpoints/
tmp_data/
```

stock-rnn/scripts/config.py

```python
class RNNConfig():
    input_size = 1
    num_steps = 30
    lstm_size = 128
    num_layers = 1
    keep_prob = 0.8
    batch_size = 64
    init_learning_rate = 0.001
    learning_rate_decay = 0.99
    init_epoch = 5
    max_epoch = 50

    def to_dict(self):
        dct = self.__class__.__dict__
        return {k: v for k, v in dct.iteritems()
                if not k.startswith('__') and not callable(v)}

    def __str__(self):
        return str(self.to_dict())

    def __repr__(self):
        return str(self.to_dict())


DEFAULT_CONFIG = RNNConfig()
print "Default configuration:", DEFAULT_CONFIG.to_dict()

DATA_DIR = "data"
LOG_DIR = "logs"
MODEL_DIR = "models"
```

stock-rnn/scripts/train_model.py

```python
"""
Run the following command to check tensorboard:
$ tensorboard --logdir ./_logs
"""
import json
import os
import sys; sys.path.append("..")

import tensorflow as tf

from build_graph import build_lstm_graph_with_config
from config import DEFAULT_CONFIG, MODEL_DIR
from data_model import StockDataSet


def load_data(stock_name, input_size, num_steps):
    stock_dataset = StockDataSet(stock_name,
                                 input_size=input_size,
                                 num_steps=num_steps,
                                 test_ratio=0.1,
                                 close_price_only=True)
    print "Train data size:", len(stock_dataset.train_X)
    print "Test data size:", len(stock_dataset.test_X)
    return stock_dataset


def _compute_learning_rates(config=DEFAULT_CONFIG):
    learning_rates_to_use = [
        config.init_learning_rate * (
            config.learning_rate_decay ** max(float(i + 1 - config.init_epoch), 0.0)
        ) for i in range(config.max_epoch)]
    print "Middle learning rate:", learning_rates_to_use[len(learning_rates_to_use) // 2]
    return learning_rates_to_use


def train_lstm_graph(stock_name, lstm_graph, config=DEFAULT_CONFIG):
    """
    stock_name (str)
    lstm_graph (tf.Graph)
    """
    stock_dataset = load_data(stock_name,
                              input_size=config.input_size,
                              num_steps=config.num_steps)

    final_prediction = []
    final_loss = None

    graph_name = "%s_lr%.2f_lr_decay%.3f_lstm%d_step%d_input%d_batch%d_epoch%d" % (
        stock_name,
        config.init_learning_rate, config.learning_rate_decay,
        config.lstm_size, config.num_steps,
        config.input_size, config.batch_size, config.max_epoch)
    print "Graph name:", graph_name

    learning_rates_to_use = _compute_learning_rates(config)
    with tf.Session(graph=lstm_graph) as sess:
        merged_summary = tf.summary.merge_all()
        writer = tf.summary.FileWriter('_logs/' + graph_name, sess.graph)
        writer.add_graph(sess.graph)

        graph = tf.get_default_graph()
        tf.global_variables_initializer().run()

        inputs = graph.get_tensor_by_name('inputs:0')
        targets = graph.get_tensor_by_name('targets:0')
        learning_rate = graph.get_tensor_by_name('learning_rate:0')
        # (remainder of file not included in the attachment)
```
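The `_compute_learning_rates` schedule in `train_model.py` holds the learning rate flat for the first `init_epoch` epochs and then decays it geometrically by `learning_rate_decay` per epoch. A standalone sketch with the same constants as `RNNConfig`, useful as a reference when rewriting the training loop:

```python
def compute_learning_rates(init_lr=0.001, decay=0.99, init_epoch=5, max_epoch=50):
    """Flat at init_lr for the first init_epoch epochs, then
    init_lr * decay**k for the k-th epoch after the warm period."""
    return [init_lr * decay ** max(float(i + 1 - init_epoch), 0.0)
            for i in range(max_epoch)]


rates = compute_learning_rates()
# Epochs 0-4 use the initial rate; from epoch 5 on, it decays by 1% per epoch.
```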
sys;="" sys.path.append("..")="" import="" tensorflow="" as="" tf="" from="" build_graph="" import="" build_lstm_graph_with_config="" from="" config="" import="" default_config,="" model_dir="" from="" data_model="" import="" stockdataset="" def="" load_data(stock_name,="" input_size,="" num_steps):="" stock_dataset="StockDataSet(stock_name," input_size="input_size," num_steps="num_steps," test_ratio="0.1," close_price_only="True)" print="" "train="" data="" size:",="" len(stock_dataset.train_x)="" print="" "test="" data="" size:",="" len(stock_dataset.test_x)="" return="" stock_dataset="" def="" _compute_learning_rates(config="DEFAULT_CONFIG):" learning_rates_to_use="[" config.init_learning_rate="" *="" (="" config.learning_rate_decay="" **="" max(float(i="" +="" 1="" -="" config.init_epoch),="" 0.0)="" )="" for="" i="" in="" range(config.max_epoch)="" ]="" print="" "middle="" learning="" rate:",="" learning_rates_to_use[len(learning_rates_to_use)="" 2]="" return="" learning_rates_to_use="" def="" train_lstm_graph(stock_name,="" lstm_graph,="" config="DEFAULT_CONFIG):" """="" stock_name="" (str)="" lstm_graph="" (tf.graph)="" """="" stock_dataset="load_data(stock_name," input_size="config.input_size," num_steps="config.num_steps)" final_prediction="[]" final_loss="None" graph_name="%s_lr%.2f_lr_decay%.3f_lstm%d_step%d_input%d_batch%d_epoch%d" %="" (="" stock_name,="" config.init_learning_rate,="" config.learning_rate_decay,="" config.lstm_size,="" config.num_steps,="" config.input_size,="" config.batch_size,="" config.max_epoch)="" print="" "graph="" name:",="" graph_name="" learning_rates_to_use="_compute_learning_rates(config)" with="" tf.session(graph="lstm_graph)" as="" sess:="" merged_summary="tf.summary.merge_all()" writer="tf.summary.FileWriter('_logs/'" +="" graph_name,="" sess.graph)="" writer.add_graph(sess.graph)="" graph="tf.get_default_graph()" tf.global_variables_initializer().run()="" inputs="graph.get_tensor_by_name('inputs:0')" 
targets="graph.get_tensor_by_name('targets:0')" learning_rate=""> len(self.train_x): num_batches += 1 batch_indices = range(num_batches) random.shuffle(batch_indices) for j in batch_indices: batch_x = self.train_x[j * batch_size: (j + 1) * batch_size] batch_y = self.train_y[j * batch_size: (j + 1) * batch_size] assert set(map(len, batch_x)) == {self.num_steps} yield batch_x, batch_y __macosx/stock-rnn/._data_model.py stock-rnn/logs/.ds_store __macosx/stock-rnn/logs/._.ds_store stock-rnn/.gitignore *.*~ *.pyc *.ipynb data/*.tsv data/*.csv logs/* models/* checkpoints/* images/* .idea/ .ipynb_checkpoints/ tmp_data/ stock-rnn/data_model.pyc stock-rnn/scripts/config.py class rnnconfig(): input_size = 1 num_steps = 30 lstm_size = 128 num_layers = 1 keep_prob = 0.8 batch_size = 64 init_learning_rate = 0.001 learning_rate_decay = 0.99 init_epoch = 5 max_epoch = 50 def to_dict(self): dct = self.__class__.__dict__ return {k: v for k, v in dct.iteritems() if not k.startswith('__') and not callable(v)} def __str__(self): return str(self.to_dict()) def __repr__(self): return str(self.to_dict()) default_config = rnnconfig() print "default configuration:", default_config.to_dict() data_dir = "data" log_dir = "logs" model_dir = "models" stock-rnn/scripts/train_model.py """ run the following command to check tensorboard: $ tensorboard --logdir ./_logs """ import json import os import sys; sys.path.append("..") import tensorflow as tf from build_graph import build_lstm_graph_with_config from config import default_config, model_dir from data_model import stockdataset def load_data(stock_name, input_size, num_steps): stock_dataset = stockdataset(stock_name, input_size=input_size, num_steps=num_steps, test_ratio=0.1, close_price_only=true) print "train data size:", len(stock_dataset.train_x) print "test data size:", len(stock_dataset.test_x) return stock_dataset def _compute_learning_rates(config=default_config): learning_rates_to_use = [ config.init_learning_rate * ( 
config.learning_rate_decay ** max(float(i + 1 - config.init_epoch), 0.0) ) for i in range(config.max_epoch) ] print "middle learning rate:", learning_rates_to_use[len(learning_rates_to_use) // 2] return learning_rates_to_use def train_lstm_graph(stock_name, lstm_graph, config=default_config): """ stock_name (str) lstm_graph (tf.graph) """ stock_dataset = load_data(stock_name, input_size=config.input_size, num_steps=config.num_steps) final_prediction = [] final_loss = none graph_name = "%s_lr%.2f_lr_decay%.3f_lstm%d_step%d_input%d_batch%d_epoch%d" % ( stock_name, config.init_learning_rate, config.learning_rate_decay, config.lstm_size, config.num_steps, config.input_size, config.batch_size, config.max_epoch) print "graph name:", graph_name learning_rates_to_use = _compute_learning_rates(config) with tf.session(graph=lstm_graph) as sess: merged_summary = tf.summary.merge_all() writer = tf.summary.filewriter('_logs/' + graph_name, sess.graph) writer.add_graph(sess.graph) graph = tf.get_default_graph() tf.global_variables_initializer().run() inputs = graph.get_tensor_by_name('inputs:0') targets = graph.get_tensor_by_name('targets:0') learning_rate = graph.get_tensor_by_name('learning_rate:0')>
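For the rewrite, the key preprocessing idea in `StockDataSet._prepare_data` is worth isolating: prices are normalized to relative changes (each item divided by the last price of the previous item, minus 1), then sliced into sliding windows of `num_steps` inputs with the next item as the target. A minimal NumPy sketch of the same idea for `input_size=1` (the toy price list is made up):

```python
import numpy as np


def prepare_windows(prices, num_steps=3):
    """Normalize prices to relative changes, then build (X, y) sliding
    windows; mirrors stock-rnn's _prepare_data for input_size=1."""
    seq = [np.array([p]) for p in prices]
    # First item is scaled against itself (-> 0.0); each later item is
    # scaled against the previous item's last price.
    seq = [seq[0] / seq[0][0] - 1.0] + [
        curr / seq[i][-1] - 1.0 for i, curr in enumerate(seq[1:])]
    X = np.array([seq[i: i + num_steps] for i in range(len(seq) - num_steps)])
    y = np.array([seq[i + num_steps] for i in range(len(seq) - num_steps)])
    return X, y


X, y = prepare_windows([100.0, 101.0, 99.0, 102.0, 103.0], num_steps=3)
# 5 prices with a window of 3 -> 2 (X, y) pairs;
# each y is the next relative price change after its window.
```

Training on relative changes rather than raw prices keeps inputs in a comparable range across stocks and years, which is why the repo normalizes this way.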