Answer To: it is a r assignment - mainly it a report and r code file, analysis has to be done based on...
Aakarsh answered on Jun 06 2021
#' ---
#' title: 'Analysis Filming Permits'
#' output: word_document
#' ---
#'
## ----setup, include=FALSE------------------------------------------------
knitr::opts_chunk$set(echo = TRUE)
# Packages listed for the analysis; this chunk only installs missing ones, it does not attach them
list.of.packages <- c("ggplot2", "grid", "gridExtra", "corrplot", "hexbin", "dplyr",
                      "reshape2", "tidyverse", "rattle", "zipcode", "car")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
#' ## Introduction
#'
#' The Office of Film, Theatre and Broadcasting issues permits to productions filming on location in the City. Filming requires permission from the New York Mayor's Office of Media and Entertainment. The dataset provided contains basic details of each event: its location, the start and end date-times of the shoot, the date the permit was entered, and the category and subcategory of the production. Statistical and graphical analysis is performed to gather relevant insights from the data.
#'
#' The analysis below comprises numerical and visual insights, each with a brief written explanation. The important categorical attributes are Borough, Category, Subcategory and Zipcodes. The numerical ones are the duration of the shooting and the time required to get a permit. Categorical variables will be compared on the basis of the numerical ones. Let's begin with the analysis.
#'
#' ### Duration Of Filming
#'
#' The first rows of the data, with all fields, are shown below.
## ----Statistical summary, message=FALSE, echo=FALSE, warning=FALSE-------
file_permit <- read.csv("film-permits-5ayrq4bh.csv")
head(file_permit)
#' The available time attributes are StartDateTime and EndDateTime, from which the duration of a filming event can be calculated. A new field, duration, is therefore created as the difference in hours between the two date-times.
#'
#'
#' #### Duration summary
## ----Date formatting ,message=FALSE, echo=FALSE, warning=FALSE-----------
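# Parse the timestamp columns as UTC date-times, passing the format by name
ts_format <- "%Y-%m-%dT%H:%M:%OS"
file_permit$StartDateTime <- as.POSIXct(file_permit$StartDateTime, tz = "UTC", format = ts_format)
file_permit$EndDateTime <- as.POSIXct(file_permit$EndDateTime, tz = "UTC", format = ts_format)
file_permit$EnteredOn <- as.POSIXct(file_permit$EnteredOn, tz = "UTC", format = ts_format)
# Filming duration in whole hours (as.integer drops the fractional part)
file_permit$duration <- as.integer(difftime(file_permit$EndDateTime, file_permit$StartDateTime, units = "hours"))
# Lead time in whole days between permit entry and the start of filming
file_permit$LeadTime <- as.integer(difftime(file_permit$StartDateTime, file_permit$EnteredOn, units = "days"))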
summary(file_permit$duration)
#' Duration is measured in hours. The mean is about 19 hours, the minimum is less than 1 hour and the maximum is 3528 hours.
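#'
#' A minimal sketch of why a shoot shorter than one hour is stored as 0 in the duration field: the field is built with as.integer(), which drops the fractional part of the hour difference. The two timestamps below are made up for illustration and are not taken from the permits file; the names example_start, example_end, exact_hours and whole_hours are hypothetical.
## ----Duration truncation sketch, message=FALSE, warning=FALSE------------
# Hypothetical timestamps in the same format as the permit data (not real records)
example_start <- as.POSIXct("2021-01-01T08:00:00", tz = "UTC", format = "%Y-%m-%dT%H:%M:%OS")
example_end <- as.POSIXct("2021-01-01T08:45:00", tz = "UTC", format = "%Y-%m-%dT%H:%M:%OS")
# Exact difference in hours, keeping the fraction
exact_hours <- as.numeric(difftime(example_end, example_start, units = "hours"))
# Same difference converted the way the duration field is built
whole_hours <- as.integer(difftime(example_end, example_start, units = "hours"))
exact_hours  # 0.75
whole_hours  # 0
#' If fractional hours matter for the analysis, as.numeric() could be used in place of as.integer(); that would be a change to the approach used above, not a description of it.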
#'
#'
#' #### BoxPlot Duration
#'
## ----Duration boxplot, message=FALSE, echo=FALSE, warning=FALSE----------
boxplot(file_permit$duration)
boxplot(file_permit$duration[file_permit$duration<30])
#'
#' The first boxplot contains so many outliers that it is hard to read, so the second one shows only durations below 30 hours for a clearer view. It shows that the 3rd quartile lies within 20 hours, i.e. 75% of the plotted durations are at most 20 hours.
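#'
#' The quartile read off the plot can also be checked numerically. The sketch below is one possible way to do that, assuming the duration field created earlier; the helper name summarise_duration and its cutoff argument are hypothetical and not part of the original analysis. It reports the 3rd quartile for all durations and for those below the cutoff, since the second boxplot only covers the latter.
## ----Duration quartile sketch, message=FALSE, warning=FALSE--------------
# Hypothetical helper: 3rd quartile overall and below a cutoff, plus the share below it
summarise_duration <- function(x, cutoff = 30) {
  x <- x[!is.na(x)]            # ignore missing durations
  below <- x[x < cutoff]       # the subset shown in the second boxplot
  c(q3_all = unname(quantile(x, probs = 0.75)),
    q3_below_cutoff = unname(quantile(below, probs = 0.75)),
    share_below_cutoff = mean(x < cutoff))
}
# Apply to the permits data only if it has been loaded in this session
if (exists("file_permit")) {
  print(summarise_duration(file_permit$duration))
  # Alternative to subsetting: draw the full data but hide the outlier points
  boxplot(file_permit$duration, outline = FALSE, ylab = "Duration (hours)")
}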
#'
##...