


Financial Data Science Task-1 (Session 1, 2021)
Assignment on Financial Data Science
Total Marks: 100
Submission Deadline: 11:59pm, 6 April 2021

General Instructions
• This assignment has two parts.
• Part-I covers theoretical background and descriptive analysis; Part-II covers machine learning, specifically classification models.
• You have been assigned a company to work with (see ‘Assignment_afin8015_s1_2021.xlsx’ on iLearn); you must use the company listed against your name in your analysis.
• The data period used in the assessment is 1 July 2018 to 28 Feb 2021. (Inform the unit convenor ASAP if the data period or OHLC data is not available for the company allocated to you.)
• Both parts must be documented in one document, as Part-I and Part-II.
• The assignment requires submission of your working R code files:
– All data files used in the code must be submitted.
– The code must be included in the appendix of the document, and an R code file should be uploaded.
– You are not required to use R Markdown for this assignment, but you may use it if you choose to.
• Your individual paper must not exceed 12 A4 pages of 11pt font size with double spacing. This excludes any appendices, tables and lengthy R output you may elect to incorporate in the report.
• The word count mentioned in the questions is a maximum and excludes any figures and/or tables.
• Marks will be awarded for depth of coverage, quality of insight, succinctness and accuracy of answers.
• Marks will be deducted for poorly informed reports which lack proper formatting, referencing etc. The following deductions will apply:
– No references (in-text and end-text), including reference to the data source: −10
– No coversheet: −5
– Illegible presentation: −10
– Lack of informed research: −10
– Plagiarism will be dealt with according to the university policy, and a high similarity score will be penalized.
• The discussion must be informed by research, and the report must cite all sources.
• Both in-text and end-text citations are required. End-text references are excluded from the page limit. Use one citation style, either APA or Harvard.
• FACTSET is the preferred data source for the assignment, along with publicly available information from the company website, the ASX and the sources mentioned in this document.
• The assignment (document) must include a cover sheet. A sample coversheet is provided on iLearn; you may choose to use it.

Please contact your unit convenor well before the submission deadline for any clarifications you may need on the assignment instructions. You may also post your questions on the discussion forum.

Assignment Questions

Scenario: You are an intern data scientist at Shootingformars Corp. with data analysis and machine learning skills, particularly in the financial services sector. As it happens, your mentor Ms Fowler has just been approached by Mr Hofstadter, a new client who has recently started investing in the share market and has been using past information to make his daily trading decisions. Mr Hofstadter has also been researching in his spare time and has heard that modern Data Science methods such as Machine Learning can be used to predict the price direction for stocks and other financial assets. Unfortunately, he has limited understanding of the Data Science process and limited programming skills, though he does have some background in statistics. After the initial meeting with Mr Hofstadter, Ms Fowler decided to treat this as an educational/proof-of-concept project and brought you on board to conduct the analysis and prepare the documentation for the project. Ms Fowler has assigned you a publicly traded stock listed in ‘Assignment_afin8015_s1_2021.xlsx’ and given you a set of tasks as listed in Part-I and Part-II of this document.
Part-I is aimed at helping Mr Hofstadter develop a better understanding of Data Science and descriptive statistics using statistics and visualisation. Part-II of the task is to use the stock assigned to you and conduct a classification exercise for demonstration. The task requires you to create a professional-standard document to be presented to the client. You have been given a choice of either using a traditional workflow (creating a Word document and coding the methods separately in R, then bringing them together in one document) or using a reproducible method with an R Markdown file.

Part I. Data Science Concepts & Descriptive Analysis

Task-1 (3+7=10 Marks)
1. Explain the concept of Data Science. (3 marks)
2. Outline and explain the Life Cycle of a Data Science Project. Use example(s) from the financial services domain. (7 marks)
Go beyond the textbook and in-class resources to include recent developments, and explain the concept with the Financial Services Sector as the main domain. All references must be cited.

Task-2 (10 Marks)
1. Use FACTSET and download the daily Open, High, Low, Close (OHLC) Prices and Trading Volume for the company stock assigned to you from 01-July-2018 to 28-Feb-2021. (2 marks)
2. Use the closing prices and percentage logarithmic returns of the closing prices to generate descriptive statistics (including Skewness, Kurtosis and a Test for Normal Distribution). Present the statistics in the document and briefly discuss the range, distribution and tail behaviour of the price and return series. Keep the discussion brief and to the point; remember your client has some statistical background and an understanding of the stock market. (Word limit: 250 words) (8 marks)

Task-3 (10 Marks)
1. Plot and present the closing prices and log returns using the ggplot2 package in R. Hint: One way is to extract the dates, closing prices and returns into a data frame and convert it from wide to long. (2 marks)
2. Use the last 6 months of OHLC prices and the Volume data to plot the following charts:
(a) Line Chart
(b) Candlestick Chart
(c) Add the following technical indicators to the candlestick chart:
i. 5-Day Simple Moving Average
ii. 10-Day Exponential Moving Average
iii. 5-Day Momentum
iv. Bollinger Bands
(6 marks)
3. Comment on the trend and price direction based on the plots generated in 1 and 2 above. (Word limit: 150 words) (2 marks)

Part II. Classification Models & Application

Task-4 (10 Marks)
1. As Mr Hofstadter has limited exposure to Machine Learning (ML) and the various methods in ML, you are tasked to conduct a short review of ML and ML methods with a focus on classification models. Your review should also include the following:
(a) An overview of Machine Learning.
(b) Discussion of at least three different classification methods.
(c) As the modelling task requires a price-direction forecast exercise, the review should also include examples of previous research using ML for stock price movement/direction prediction.
Go beyond the textbook and in-class resources to include recent developments and research. All references must be cited. (Word limit: 300 words) (10 marks)

Task-5 (60 Marks)
Your final task is to conduct a proof-of-concept comparative analysis of two classification methods to demonstrate the classification and predictive ability of ML methods in modelling and predicting the price direction based on various technical indicators. Specifically, the task should conduct the following:
1. Select the closing prices from the OHLC stock price data downloaded from FACTSET (same as in Task-2) and create the following technical indicators:
(a) Moving Average: 5-day and 10-day, and their one-period lags
(b) Log returns and their one-period lag
(c) MACD (default values for nFast, nSlow and nSig)
(d) Exponential Moving Average: 5-day and 10-day
(e) Momentum: 5-day
(f) Volatility: 5-day
(g) A price direction indicator based on the 3-day lagged price: 1 if Pt ≥ Pt−3, 0 otherwise
(Hint: use the TTR and quantmod packages.)
(14 marks)
2. Combine the indicators in a data frame and visualise the data using
(a) A time series plot, and
(b) Box plots of the indicators categorised by price direction
(6 marks)
3. Create a 70:30 training and testing sample from the dataset and conduct a classification exercise using Logistic Regression. The analysis should include the following:
(a) Training on the training sample using ‘timeslice’ sampling. Use at least 250 days as the window size and 14 days as the prediction horizon in a fixed window.
(b) Data pre-processing to standardise the data.
(c) Prediction on the test set and the corresponding confusion matrix.
(d) A brief discussion of the accuracy of the prediction based on the confusion matrix.
(20 marks)
4. Conduct the classification exercise (as in 3 above) using the k-Nearest Neighbours algorithm. The analysis should include the following:
(a) An odd-number grid search for the ‘k’ parameter from 1 to 99.
(b) Prediction on the test set and the corresponding confusion matrix.
(c) A brief discussion of the accuracy of the prediction based on the confusion matrix.
(15 marks)
5. Compare the performance of the Logistic Regression model and the k-NN model based on their accuracy and provide a recommendation for Mr Hofstadter. (Word limit: 150 words) (5 marks)

Your final report must include both Part-I and Part-II and must contain the output from the analysis conducted in R. Final code and data files must be submitted via the relevant links on iLearn.
**End of Assignment Questions**
Answered 7 days after Mar 30, 2021 (Macquarie University)


Abr Writing answered on Apr 06 2021
Financial Data Science Task-1
Assignment on Financial Data Science
06/04/2021
Data Science Concepts & Descriptive Analysis
Task-1
Question 1
When you hear data scientists rattle off a million algorithms while explaining their experiments, or go into depth about their use of TensorFlow, it is easy to believe that a layperson will never master Data Science. Big Data can seem like yet another cosmic puzzle, locked away in an ivory tower with a small group of modern-day alchemists and magicians. At the same time, you hear everywhere about the pressing need to become data-driven.
The catch is that we used to have only small amounts of well-structured data. With the global Internet, we are now immersed in never-ending flows of structured, unstructured and semi-structured data. This empowers us to better understand technological, commercial and social systems, but it also necessitates the development of new techniques and technologies.
Data Science is simply a 21st-century extension of mathematics and statistics that have been practised for centuries. In essence, it is the same skill of obtaining knowledge and optimising procedures from the available information. The goal is still the same, whether it is a simple Excel spreadsheet or a database of 100 million records: to find meaning. Data Science differs from conventional analytics in that it aims to forecast future patterns, not merely describe past ones.
Question 2
It is often beneficial for data scientists to pursue a well-defined data analysis workflow while dealing with big data. The data science workflow process is important regardless of whether a data scientist wishes to conduct research with the goal of telling a story by data visualization or building a data model. A typical workflow for data science programs means that all departments within an enterprise are on the same page, preventing any additional delays.
Any data science project’s ultimate aim is to create a useful data product. A data product is the functional result produced at the conclusion of a data science project. A data product can be anything — hold on, no special punctuation — a dashboard, a search engine, or anything else that helps businesses make better decisions. Data scientists therefore adopt a formalised, step-by-step workflow to achieve their end goal of generating data products. A data product should aid in the resolution of a business problem; the lifecycle of data science initiatives should put a greater focus on data products rather than on the process itself.
The data science project lifecycle is similar to the CRISP-DM lifecycle, which specifies the following six standard phases for data mining projects:
· Business Understanding
· Data Understanding
· Data Preparation
· Modeling
· Evaluation
· Deployment
The lifecycle of data science ventures is merely a modification of the CRISP-DM workflow process.
1. Data Acquisition
2. Data Preparation
3. Hypothesis and Modelling
4. Evaluation and Interpretation
5. Deployment
6. Operations
7. Optimization
Task-2
Question 1
Loading the daily Open, High, Low, Close (OHLC) prices and trading volume for the company stock from 01-July-2018 to 28-Feb-2021. The spreadsheet is read with the readxl package; skip = 15 skips the header rows of the FACTSET export.
library(readxl)
price <- read_xlsx("PriceHistory.xlsx",
skip = 15)
Question 2
First, we extracted the closing prices and log returns of the closing prices from the data.
data <- data.frame(
date=price$Date,
close=price$Price,
log.return=c(NA,
log(
price$Price[-1] /
price$Price[-nrow(price)]) *
100)
)
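As a quick sanity check, the log-return construction above is equivalent to base R's `diff(log(...))`. A minimal sketch on hypothetical prices (the vector `p` is illustrative, not the FACTSET series):

```r
p <- c(100, 102, 101, 105)  # hypothetical closing prices

# Construction used above: log of the ratio of consecutive prices, in percent
r1 <- log(p[-1] / p[-length(p)]) * 100

# Equivalent shortcut: first difference of log prices
r2 <- diff(log(p)) * 100

all.equal(r1, r2)  # TRUE
```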
The following plot shows the Distribution of Closing Price and Percentage log return of closing Price from 01-July-2018 to 28-Feb-2021.
par(mfrow=c(1,2))
hist(data$close,
main="Closing Price",
xlab = "close",
freq=FALSE)
lines(density(data$close),
col='red',
lwd=3)
abline(v = c(mean(data$close),
median(data$close)),
col=c("green", "blue"),
lty=c(2,2),
lwd=c(3, 3))
hist(data$log.return[2:nrow(data)],
main="Percent log return of Price",
ylim = c(0, 0.40),
xlab = "log.return",
freq=FALSE)
lines(density(data$log.return[2:nrow(data)]),
col='red',
lwd=3)
abline(v = c(mean(data$log.return[2:nrow(data)]),
median(data$log.return[2:nrow(data)])),
col=c("green", "blue"),
lty=c(2,2),
lwd=c(3, 3))
The following table shows the descriptive statistics for both the closing price and the percentage log return of the closing price.
library(e1071)  # assumed source of skewness()/kurtosis(); its kurtosis()
                # reports excess kurtosis, consistent with the negative
                # value obtained for the price series
data.stats <- data.frame(
skewness = c(
skewness(data$close),
skewness(data$log.return[2:nrow(data)])
),
kurtosis = c(
kurtosis(data$close),
kurtosis(data$log.return[2:nrow(data)])
),
min = c(
min(data$close),
min(data$log.return[2:nrow(data)])
),
mean = c(
mean(data$close),
mean(data$log.return[2:nrow(data)])
),
max = c(
max(data$close),
max(data$log.return[2:nrow(data)])
),
median = c(
median(data$close),
median(data$log.return[2:nrow(data)])
),
sd = c(
sd(data$close),
sd(data$log.return[2:nrow(data)])
),
p.shapiro.test = c(
shapiro.test(data$close)$p.value,
shapiro.test(data$log.return[2:nrow(data)])$p.value
)
)
row.names(data.stats) = c("close", "log.return")
as.data.frame(
t(
data.stats
)
)
                       close    log.return
skewness        2.035124e-01 -9.903467e-01
kurtosis       -9.399552e-01  1.148348e+01
min             2.139000e+01 -1.341081e+01
mean            2.823146e+01  3.807186e-02
max             3.675000e+01  9.369987e+00
median          2.800000e+01  1.141556e-01
sd              3.905009e+00  1.727001e+00
p.shapiro.test  9.371045e-11  2.650683e-24
From the skewness measures above, we can see that the percentage log-return series has a moderately to highly negatively skewed distribution, while the closing-price distribution is almost symmetric, as is also evident from the histogram. From both the histograms and the kurtosis measures, we can observe that the closing-price distribution is platykurtic while the percentage log-return distribution is leptokurtic, i.e. the return series has fat tails. Since the p-values of the Shapiro-Wilk tests are well below 0.05, the distributions of both series are significantly different from a normal distribution.
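The skewness and kurtosis figures above can be reproduced from first principles without any add-on package. A minimal base-R sketch on simulated data (`x` stands in for the return series; the assumption that the reported kurtosis is excess kurtosis, i.e. raw kurtosis minus 3, is consistent with the negative value obtained for the price series):

```r
set.seed(1)
x <- rnorm(500)  # stand-in for the percentage log-return series

m  <- mean(x)
s2 <- mean((x - m)^2)  # population variance (divide by n, not n - 1)

skewness        <- mean((x - m)^3) / s2^(3/2)
excess.kurtosis <- mean((x - m)^4) / s2^2 - 3  # roughly 0 for a normal sample

c(skewness = skewness, excess.kurtosis = excess.kurtosis)
```

Values near 0 for both statistics are what a normal sample produces; the large positive excess kurtosis of the actual return series (11.48) is what marks it as leptokurtic.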
Task-3
Question 1
library(tidyr)    # gather()
library(ggplot2)
library(dplyr)    # %>%
data %>%
gather(key = "price", value = "value", -date) %>%
ggplot(aes(x = date, y = value)) +
geom_line(aes(color = price,
linetype = price)) +
scale_color_manual(values = c("darkred", "steelblue"))
Question 2
data.half.year <- data.frame(
date = price$Date,
open = price$Open,
high = price$High,
low = price$Low,
close = price$Price
)
data.half.year <- data.half.year[
data.half.year$date >= as.Date(
"2020-09-01",
"%Y-%m-%d"
),
]
Question (a)
data.half.year %>%
gather(key = "price", value = "value", -date) %>%
ggplot(aes(x=date, y = value)) +
geom_line(aes(color = price,
linetype = price))
Question (b)
library(tidyquant)  # geom_candlestick(), geom_ma(), geom_bbands(), theme_tq()
data.half.year %>%
ggplot(aes(x = date, y = close)) +
geom_candlestick(aes(open = open,
high = high,
low = low,
close = close)) +
labs(
title = "Candlestick Chart",
y = "Closing Price",
x = "Date") +
theme_tq()
Question (c)
data.half.year %>%
ggplot(aes(x = date, y = close)) +
geom_candlestick(aes(open = open,
high = high,
low = low,
close = close)) +
labs(
title = "Candlestick Chart",
y = "Closing Price",
x = "Date") +
#### i. 5-day Simple Moving Average
geom_ma(ma_fun = SMA,
n = 5,
linetype = 5) +
#### ii. 10-day Exponential Moving Average
geom_ma(ma_fun = EMA,
n = 10,
linetype = 5) +
#### iii. 5-day Momentum: a price difference rather than a price level,
#### so it cannot be overlaid on the price axis; it is charted separately.
#### iv. Bollinger Bands
geom_bbands(ma_fun = SMA,
sd = 2,
n = 20,
linetype = 4,
size = 1,
alpha = 0.2,
fill = palette_light()[[1]],
color_bands = palette_light()[[1]],
color_ma = palette_light()[[2]]) +
theme_tq()
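The 5-day momentum (indicator iii) is left off the candlestick overlay above because it is a price difference rather than a price level. It can be computed and plotted in its own panel; a minimal base-R sketch, equivalent to `TTR::momentum(close, n = 5)` (the `close` vector here is simulated, not the FACTSET data):

```r
set.seed(42)
close <- 100 + cumsum(rnorm(30))  # hypothetical closing prices

# 5-day momentum: today's close minus the close 5 trading days earlier
n    <- 5
mom5 <- c(rep(NA, n), diff(close, lag = n))

plot(mom5, type = "l", xlab = "Day", ylab = "5-day momentum")
abline(h = 0, lty = 2)  # zero line: above = upward pressure, below = downward
```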
Question 3
The stock dataset includes 93612 observations for the organisation in question. The data spans from September 1, 2020, to February 28, 2021, and has five columns: Date, Open, High, Low and Close. I began by gathering the summaries and reviewing the table. The next step is to understand how the stock has traded in terms of value and interest, so the first task is to plot the volume and price against time. I first plotted volume against time, then OHLC against time, one by one. These plots reveal the specifics, oddities and general behaviour, and I zoomed in on individual timelines to get a more accurate view.
As a financial analyst, one must be able to grasp and use financial terms. For trend, momentum, volatility and volume, I looked into the different types of metrics used in financial analysis. One example from each category of metric has been plotted in order to explain and represent the data in financial terms. The Moving Average Convergence Divergence (MACD) has been plotted for trend, the Relative Strength Index (RSI) for momentum, Bollinger Bands for volatility, and the Rate of Change in Volume (ROCV) for volume. Another relevant metric is the Moving Average, which has been plotted and is used to determine the general uptrend or downtrend. All of the graphs listed above are plotted by first subsetting the data for a single organisation and then, if necessary, for a certain interval.
Classification Models & Application
Task-4
Question 1
Question (a)
In a nutshell, machine learning is the process of using learning algorithms and optimization techniques to automatically learn a highly accurate predictive or classifier model, or to discover hidden patterns in datasets [3].
Machine learning algorithms primarily produce the following types of output:
· Two-class and multi-class classification (supervised)
· Regression: univariate, multivariate, etc. (supervised)
· Anomaly detection (supervised and unsupervised)
· Clustering (unsupervised)
· Recommendation systems (a.k.a. recommendation engines)
Question (b)
Some of the algorithms [3] used for two-class and multi-class classification are:
· Decision Tree: the algorithm aims to predict a target variable based on a series of input variables and their characteristics. The method creates a tree structure by performing a sequence of binary (yes/no) splits from the root node, through multiple decision nodes (internal nodes), to the leaf nodes.
· SVM (Support Vector Machine): focused on the concept of determining the best hyperplane for dividing a dataset into two groups.
· Naive Bayes: a family of probabilistic algorithms that use Bayes’ Theorem and probability theory to predict the label or class of the given data points.
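To make the classification idea concrete before the Task-5 analysis, here is a toy sketch of a two-class logistic regression in base R. The data are simulated, so this is illustrative only, not the stock model:

```r
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# Class 1 when a noisy linear signal is positive, class 0 otherwise
y  <- as.integer(x1 + 0.5 * x2 + rnorm(n, sd = 0.5) > 0)

fit  <- glm(y ~ x1 + x2, family = binomial)  # logistic regression
prob <- predict(fit, type = "response")      # P(y = 1 | x1, x2)
pred <- as.integer(prob > 0.5)               # classify at the 0.5 cut-off

table(actual = y, predicted = pred)  # confusion matrix
mean(pred == y)                      # in-sample accuracy
```

Task-5 follows the same pattern, with the technical indicators as predictors, the 3-day price-direction indicator as `y`, and a proper train/test split in place of the in-sample accuracy shown here.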
Question (c)
Deep learning for forecasting stock market...