FINANCE TECHNOLOGY (FIN 480J) – FINAL PROJECT
Instructions:
· Pick one company in the S&P 500 and write its name on the shared Google Sheet. Sign-up is on a first-come, first-served basis, and the same company cannot be chosen twice, so you must pick a company that none of your classmates has selected. You can click here to get access to the sign-up sheet.
· After finalizing your company, write a program in Python to do the following:
PART I
1. Write code that uses ChromeDriver to open the site https://finviz.com/ automatically. (5 points)
2. Send your company's ticker as the keyword to the search box. (5 points)
3. Scrape the information in the two tables right before the news section (metrics and analyst rating tables) (10 points).
4. Scrape all of the news article links and the sources of the news. Then, store that information in a DataFrame (10 points).
5. Using the links acquired in the previous question, write a for loop to scrape the full article text from the links that come from the same source (10 points).
6. Scrape the news titles (10 points)
7. Organize the news titles by the date they were published (10 points)
8. Use the “Sentiment Intensity Analyzer” and “Stanford CoreNLP” to calculate the sentiment score from the news titles by hour (20 points)
9. Search for another method to calculate the sentiment score (10 points)
10. Are the hourly sentiment scores sensitive to the approach used? (5 points)
11. Graph the sentiment score for your company over time using 3 different methods (10 points)
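The hourly aggregation asked for in items 7 to 10 can be sketched as follows. This is a minimal illustration only: the word lists, the `lexicon_score` and `hourly_sentiment` helpers, and the sample headlines are all made up for the example, and the toy word-count scorer is a stand-in, not the “Sentiment Intensity Analyzer” or Stanford CoreNLP. Any scorer that maps a title to a number could be swapped in without changing the grouping step.

```python
from collections import defaultdict
from datetime import datetime

# Toy word lists, purely illustrative; a real analysis would use an
# established lexicon or model instead.
POSITIVE = {"beats", "surges", "upgrade", "record", "gains"}
NEGATIVE = {"misses", "falls", "downgrade", "lawsuit", "losses"}


def lexicon_score(title):
    """Return (positive - negative) / total words, a value in [-1, 1]."""
    words = [w.strip(".,:;!?\"'()").lower() for w in title.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def hourly_sentiment(headlines):
    """Average the title scores within each (date, hour) bucket."""
    buckets = defaultdict(list)
    for timestamp, title in headlines:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(lexicon_score(title))
    return {hour: sum(s) / len(s) for hour, s in sorted(buckets.items())}


# Hypothetical headlines, made up for the example.
sample = [
    (datetime(2024, 1, 2, 9, 15), "Company beats estimates"),
    (datetime(2024, 1, 2, 9, 40), "Analyst downgrade follows lawsuit"),
    (datetime(2024, 1, 2, 14, 5), "Shares close at record"),
]
for hour, score in hourly_sentiment(sample).items():
    print(hour.isoformat(), round(score, 3))
```

Running the same `hourly_sentiment` grouping with each of the three scoring methods gives three comparable hourly series, which is one way to judge how sensitive the scores are to the method and to prepare the data for the graph.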
PART II (50 points)
1. Download the daily stock prices (Open, Close, Volume, High, Low) of your company for the past 5 years using the Quandl package in Python
2. Select at least 5 features that can potentially influence stock price movement. Justify your selections. At least two out of 5 features should not be derived from the stock prices and the volumes.
3. Use at least two different supervised learning algorithms to train your models, with the targets being the stock price and the direction of the stock price at the end of the day. Use the trained models to predict the future stock price
a. First, use the three features derived from the stock price and volume
b. Second, use all five features
c. Do the forecasted stock prices deviate significantly from each other? What are the potential reasons? Do your results improve when you include the two additional features that are not derived from stock prices and volumes?
Scenario | Feature | Label | Method
1 | Percentage change in open and close; Percentage change in high and low; Volume | Stock direction (a dummy variable indicating whether the stock price is up or down) | Logistic Regression
2 | Percentage change in open and close; Percentage change in high and low; Volume; Feature of your choice 1; Feature of your choice 2 | Stock direction (a dummy variable indicating whether the stock price is up or down) | Logistic Regression; also include GridSearchCV for this scenario to optimize your hyperparameter C
3 | Percentage change in open and close; Percentage change in high and low; Volume | Stock Value (Close Price) | Linear Regression
4 | Percentage change in open and close; Percentage change in high and low; Volume; Feature of your choice 1; Feature of your choice 2 | Stock Value (Close Price) | Linear Regression
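The three price-based features and the two labels in the table could be built along these lines. This is a sketch under stated assumptions: the handout does not define the formulas, so "percentage change in open and close" is taken here as the intraday move relative to the open, "percentage change in high and low" as the intraday range relative to the low, and "up" as a close above the previous day's close. The column names and the price values are invented for illustration.

```python
import pandas as pd

# Made-up prices for illustration only, not real market data.
prices = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 101.0, 105.0],
        "High": [103.0, 104.0, 106.0, 107.0],
        "Low": [99.0, 100.0, 100.0, 102.0],
        "Close": [102.0, 101.0, 105.0, 103.0],
        "Volume": [1_000_000, 1_200_000, 900_000, 1_500_000],
    },
    index=pd.date_range("2024-01-02", periods=4, freq="B"),
)

features = pd.DataFrame(index=prices.index)
# Assumed definition: intraday move relative to the open, in percent.
features["pct_open_close"] = (prices["Close"] - prices["Open"]) / prices["Open"] * 100
# Assumed definition: intraday range relative to the low, in percent.
features["pct_high_low"] = (prices["High"] - prices["Low"]) / prices["Low"] * 100
features["volume"] = prices["Volume"]

# Assumed label for Scenarios 1 and 2: 1 if the close is above the
# previous close, else 0. The first row has no previous close, so drop it.
direction = (prices["Close"].diff() > 0).astype(int).iloc[1:]
# Label for Scenarios 3 and 4: the close price itself.
close_price = prices["Close"]

print(features.round(2))
print(direction)
```

If a different definition of "up" is intended (for example, close above open), only the `direction` line needs to change.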
Notes:
- When selecting the features, make sure that there is some variation in the value of each feature through time. The variation can be daily, weekly, or monthly.
- The actual features in the training data should be lagged values of your features. The lag can be 1, 2, 5, etc., depending on your preference. You can also test how sensitive your model's performance is to the choice of lag.
- For Scenario 2, after performing the grid search, specify which value of C gives the model with the best performance (based on its accuracy). What does this value of C tell you about the data? (Hint: consider whether the training data is a good representation of the testing data or the population.)
- For Scenario 4, the two additional features that you select should help improve the goodness of fit (accuracy) of your model relative to Scenario 3.
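The lagging and grid-search notes above can be sketched together as follows. Everything here is an assumption for illustration: the data is synthetic random noise rather than real prices, the column names and the grid of C values are arbitrary, and the scaling step and `TimeSeriesSplit` are design choices of this sketch, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; real features would come from the price history.
rng = np.random.default_rng(0)
n_days = 300
data = pd.DataFrame(
    {
        "pct_open_close": rng.normal(0, 1, n_days),
        "pct_high_low": rng.normal(2, 0.5, n_days),
        "volume": rng.normal(1e6, 1e5, n_days),
    }
)
data["direction"] = (data["pct_open_close"] > 0).astype(int)

# Lag every feature so that row t only sees information from day t - lag.
lag = 1
feature_cols = ["pct_open_close", "pct_high_low", "volume"]
X = data[feature_cols].shift(lag).add_suffix(f"_lag{lag}")
y = data["direction"]
# The first `lag` rows have no lagged values, so drop them from X and y.
X, y = X.iloc[lag:], y.iloc[lag:]

# Keep chronological order: train on the first 80%, test on the rest.
split = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# In scikit-learn, a smaller C means stronger regularization.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    model, param_grid, cv=TimeSeriesSplit(n_splits=5), scoring="accuracy"
)
search.fit(X_train, y_train)

best_C = search.best_params_["logisticregression__C"]
test_accuracy = search.score(X_test, y_test)
print("best C:", best_C)
print("test accuracy:", round(test_accuracy, 3))
```

Because the synthetic label is unrelated to the lagged features, accuracy near 0.5 is expected here; the point is the mechanics. Changing `lag` to 2 or 5 and rerunning is one way to test sensitivity to the lag choice.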