Answer To: Financial Data Science Task-1 (Session-1, 2021)
Abr Writing answered on Apr 06 2021
Financial Data Science Task-1
Assignment on Financial Data Science
06/04/2021
Data Science Concepts & Descriptive Analysis
Task-1
Question 1
When you hear data scientists rattle off a million algorithms while explaining their experiments, or go into depth about their use of TensorFlow, it is easy to believe that a layperson will never master Data Science. Big Data can seem like yet another cosmic puzzle, locked away in an ivory tower with a small group of modern-day alchemists and magicians. At the same time, you hear everywhere about the pressing need to become data-driven.
The catch is that we used to have only a small number of well-structured files. Thanks to the global Internet, we are now immersed in never-ending flows of structured, semi-structured, and unstructured data. This empowers us to understand technological, commercial, and social systems far better, but it also necessitates new techniques and technologies.
Data Science is simply a 21st-century extension of the mathematics that has been practiced for centuries. In essence, it is the same skill of extracting knowledge and optimizing processes from the information available. Whether the source is a simple Excel spreadsheet or a database of 100 million records, the goal is still the same: to find meaning. Where Data Science differs from conventional analytics is that it aims not only to test assumptions but also to forecast future patterns.
Question 2
It is often beneficial for data scientists to follow a well-defined data analysis workflow when dealing with big data. The data science workflow matters regardless of whether the data scientist's goal is to tell a story through data visualization or to build a data model. A standard workflow for data science programs keeps all departments within an enterprise on the same page and prevents avoidable delays.
Any data science project's ultimate aim is to create a useful data product: the functional output produced at the conclusion of the project. A data product can be anything, a dashboard, a search engine, or any other artifact that helps a business make better decisions, and it should aid in the resolution of a business problem. To reach that end goal, data scientists adopt a formalized, step-by-step workflow. The lifecycle of data science initiatives should therefore put greater focus on the data product than on the process itself.
The data science project lifecycle closely mirrors the CRISP-DM lifecycle, which specifies the following six standard phases for data mining projects:
· Business Understanding
· Data Understanding
· Data Preparation
· Modeling
· Evaluation
· Deployment
The lifecycle of data science projects is essentially a modification of the CRISP-DM workflow:
1. Data Acquisition
2. Data Preparation
3. Hypothesis and Modelling
4. Evaluation and Interpretation
5. Deployment
6. Operations
7. Optimization
Task-2
Question 1
Loading the daily Open, High, Low, Close (OHLC) prices and trading volume for the company's stock from 01-July-2018 to 28-Feb-2021.
library(readxl)  # read_xlsx() for importing the Excel price history

# Skip the 15 metadata rows above the column headers
price <- read_xlsx("PriceHistory.xlsx", skip = 15)
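A quick sanity check of the import is worthwhile. This is a supplementary sketch, assuming the sheet's columns are named Date, Price, Open, High, and Low, as they are used below:
# Inspect the structure and confirm the sample period
str(price)
range(price$Date)   # expected: 01-July-2018 through 28-Feb-2021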
Question 2
First, we extract the closing prices and the percentage log returns of the closing prices from the data, where the return on day t is r_t = 100 * ln(P_t / P_(t-1)).
data <- data.frame(
  date = price$Date,
  close = price$Price,
  # percentage log return: 100 * ln(P_t / P_(t-1)); NA for the first day
  log.return = c(NA,
                 100 * log(price$Price[-1] / price$Price[-nrow(price)]))
)
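As a cross-check (not part of the original solution), the same series can be computed with diff(), which here gives the day-over-day difference on the log scale:
# 100 * diff(log(p)) is algebraically identical to the expression above
stopifnot(isTRUE(all.equal(data$log.return[-1],
                           100 * diff(log(price$Price)))))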
The following plots show the distribution of the closing price and of the percentage log return of the closing price from 01-July-2018 to 28-Feb-2021.
par(mfrow = c(1, 2))  # two panels side by side

# Histogram and kernel density of the closing price;
# green dashed line = mean, blue dashed line = median
hist(data$close, main = "Closing Price", xlab = "close", freq = FALSE)
lines(density(data$close), col = "red", lwd = 3)
abline(v = c(mean(data$close), median(data$close)),
       col = c("green", "blue"), lty = 2, lwd = 3)

# Histogram and kernel density of the percentage log return
# (drop the first row, whose return is NA)
ret <- data$log.return[2:nrow(data)]
hist(ret, main = "Percent log return of Price",
     ylim = c(0, 0.40), xlab = "log.return", freq = FALSE)
lines(density(ret), col = "red", lwd = 3)
abline(v = c(mean(ret), median(ret)),
       col = c("green", "blue"), lty = 2, lwd = 3)
The following table shows the descriptive statistics for both the closing price and the percentage log return of the closing price:
library(e1071)  # assumed source of skewness() and kurtosis() (excess kurtosis)

data.stats <- data.frame(
  skewness       = c(skewness(data$close), skewness(ret)),
  kurtosis       = c(kurtosis(data$close), kurtosis(ret)),
  min            = c(min(data$close),    min(ret)),
  mean           = c(mean(data$close),   mean(ret)),
  max            = c(max(data$close),    max(ret)),
  median         = c(median(data$close), median(ret)),
  sd             = c(sd(data$close),     sd(ret)),
  p.shapiro.test = c(shapiro.test(data$close)$p.value,
                     shapiro.test(ret)$p.value)
)
row.names(data.stats) <- c("close", "log.return")
as.data.frame(t(data.stats))
                        close    log.return
skewness         2.035124e-01 -9.903467e-01
kurtosis        -9.399552e-01  1.148348e+01
min              2.139000e+01 -1.341081e+01
mean             2.823146e+01  3.807186e-02
max              3.675000e+01  9.369987e+00
median           2.800000e+01  1.141556e-01
sd               3.905009e+00  1.727001e+00
p.shapiro.test   9.371045e-11  2.650683e-24
From the skewness figures above, we can see that the percentage log return has a moderately to highly negatively skewed distribution, while the closing price distribution is nearly symmetric, as the histograms above also show. From both the histograms and the kurtosis measure, the closing price distribution is platykurtic whereas the percentage log return distribution is leptokurtic. Since the p-values of the Shapiro-Wilk test are below 0.05 for both series, both distributions differ significantly from the normal distribution.
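To corroborate the Shapiro-Wilk result visually, normal Q-Q plots can be added. This is a supplementary sketch, reusing the ret vector defined above:
# Points bending away from the reference line indicate non-normality;
# the heavy tails of the log returns are especially visible
par(mfrow = c(1, 2))
qqnorm(data$close, main = "Q-Q: Closing Price")
qqline(data$close, col = "red")
qqnorm(ret, main = "Q-Q: Percent Log Return")
qqline(ret, col = "red")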
Task-3
Question 1
library(tidyr)    # gather()
library(dplyr)    # %>% pipe
library(ggplot2)

# Plot the closing price and the percentage log return on one chart
data %>%
  gather(key = "price", value = "value", -date) %>%
  ggplot(aes(x = date, y = value)) +
  geom_line(aes(color = price, linetype = price)) +
  scale_color_manual(values = c("darkred", "steelblue"))
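gather() still works but has since been superseded in tidyr by pivot_longer(). For reference, an equivalent reshaping (assuming tidyr >= 1.0):
# Same long-format reshaping with the newer tidyr interface
data %>%
  pivot_longer(cols = -date, names_to = "price", values_to = "value")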
Question 2
# Rebuild an OHLC data frame for the candlestick charts
data.half.year <- data.frame(
  date  = as.Date(price$Date),  # coerce the readxl datetime to Date
  open  = price$Open,
  high  = price$High,
  low   = price$Low,
  close = price$Price
)

# Keep only the last six months of the sample
data.half.year <- data.half.year[
  data.half.year$date >= as.Date("2020-09-01", "%Y-%m-%d"),
]
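The same subset can be taken with dplyr::filter(); this alternative sketch also confirms the retained window:
# Equivalent subsetting with dplyr (yields the same rows as above)
data.half.year %>% filter(date >= as.Date("2020-09-01"))

# Confirm the retained window
range(data.half.year$date)   # should start on or after 01-Sep-2020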
Question (a)
# Line chart of the open, high, low, and close series over the subset period
data.half.year %>%
  gather(key = "price", value = "value", -date) %>%
  ggplot(aes(x = date, y = value)) +
  geom_line(aes(color = price, linetype = price))
Question (b)
library(tidyquant)  # geom_candlestick(), geom_ma(), geom_bbands(), theme_tq()

data.half.year %>%
  ggplot(aes(x = date, y = close)) +
  geom_candlestick(aes(open = open, high = high,
                       low = low, close = close)) +
  labs(title = "Candlestick Chart",
       y = "Closing Price",
       x = "Date") +
  theme_tq()
Question (c)
data.half.year %>%
  # map the OHLC aesthetics once at the top level so the
  # candlestick, moving-average, and band layers all inherit them
  ggplot(aes(x = date, y = close,
             open = open, high = high, low = low, close = close)) +
  geom_candlestick() +
  labs(title = "Candlestick Chart",
       y = "Closing Price",
       x = "Date") +
  #### i. 5-day simple moving average
  geom_ma(ma_fun = SMA, n = 5, linetype = 5) +
  #### ii. 10-day exponential moving average
  geom_ma(ma_fun = EMA, n = 10, linetype = 5) +
  #### iii.
  #### iv. Bollinger Bands: 20-day SMA with 2-standard-deviation bands
  geom_bbands(ma_fun = SMA, sd = 2, n = 20,
              linetype = 4, size = 1, alpha = 0.2,
              fill = palette_light()[[1]],
              color_bands = palette_light()[[1]],
              color_ma = palette_light()[[2]]) +
  theme_tq()
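Under the hood, geom_ma() and geom_bbands() delegate to the TTR package (attached when tidyquant is loaded), so the raw indicator values can also be inspected directly. A brief sketch:
library(TTR)

sma5  <- SMA(data.half.year$close, n = 5)    # i.  5-day simple moving average
ema10 <- EMA(data.half.year$close, n = 10)   # ii. 10-day exponential moving average
bb    <- BBands(data.half.year[, c("high", "low", "close")],
                n = 20, sd = 2)              # iv. Bollinger Bands (dn, mavg, up, pctB)
tail(cbind(sma5, ema10, bb))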
Question 3
The stock dataset includes 93,612 observations for the organization in question. The data spans September 1, 2020, to February 28, 2021, and has five columns: Date, Open, High, Low, and Close. I began by gathering the summaries and reviewing the table. The next step is to understand how the stock has traded in terms of value and interest, so the first plots map volume and price against time: first volume against time, then each of the OHLC series one by one. Those plots reveal the specifics, anomalies, and general behavior, and I zoomed in on individual timelines to get a more accurate view.
As a financial analyst, one must be able to grasp and use financial terms, so I looked into the types of metrics used in financial analysis for trend, momentum, volatility, and volume. One example of each category has been plotted to explain and represent the data in financial terms: the Moving Average Convergence Divergence (MACD) for trend, the Relative Strength Index (RSI) for momentum, Bollinger Bands for volatility, and the Rate of Change in Volume (ROCV) for volume. Another relevant metric is the moving average, which has been plotted and is used to determine the general uptrend or downtrend. All of the graphs listed above are produced by first subsetting the data to a single organization and then, if necessary, to a certain interval; a sketch of how these indicators can be computed follows.
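For reference, these metrics are all available through TTR. This is a hedged sketch: the Volume column name is an assumption about the source spreadsheet, and ROC() of volume stands in for the ROCV metric named above:
library(TTR)

macd <- MACD(data$close, nFast = 12, nSlow = 26, nSig = 9)  # trend
rsi  <- RSI(data$close, n = 14)                             # momentum
rocv <- ROC(price$Volume)                                   # volume: rate of change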
Classification Models & Application
Task-4
Question 1
Question (a)
In a nutshell, machine learning is the process of using learning algorithms and optimization techniques to automatically learn a highly accurate predictive or classification model, or to discover hidden patterns in datasets [3].
Machine learning algorithms are primarily applied to the following types of tasks:
· Two-class and multi-class classification (supervised)
· Regression: univariate, multivariate, etc. (supervised)
· Anomaly detection (supervised and unsupervised)
· Clustering (unsupervised)
· Recommendation systems (also known as recommendation engines)
Question (b)
Some of the algorithms [3] used for two-class and multi-class classification are listed below; a minimal fitting example follows the list.
· Decision Tree: The algorithm aims to predict a target variable from a set of input variables and their characteristics. It builds a tree structure through a sequence of binary (yes/no) splits, running from the root node through multiple internal decision nodes down to the leaf nodes.
· SVM (Support Vector Machine): Based on the idea of finding the hyperplane that best separates a dataset into two classes.
· Naive Bayes: A family of probabilistic algorithms that apply Bayes' Theorem and probability theory to predict the label or class of a given data point.
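For illustration, each of these classifiers can be fit in a few lines of R. This is a minimal sketch on the built-in iris data; the choice of the rpart and e1071 packages is an assumption for demonstration, not part of the assignment:
library(rpart)   # decision trees
library(e1071)   # svm() and naiveBayes()

# Fit each classifier to predict Species from the four flower measurements
tree.fit <- rpart(Species ~ ., data = iris, method = "class")
svm.fit  <- svm(Species ~ ., data = iris, kernel = "radial")
nb.fit   <- naiveBayes(Species ~ ., data = iris)

# Compare in-sample accuracy of the three models
mean(predict(tree.fit, iris, type = "class") == iris$Species)
mean(predict(svm.fit, iris) == iris$Species)
mean(predict(nb.fit, iris) == iris$Species)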
Question (c)
Deep learning for forecasting stock market...