CIND 123 Winter 2020 - Assignment #1

CIND 719 Assignment 1

Complete this assignment using your understanding of Hadoop, HDFS, Hive, HiveQL and Hive DDL to perform data management, storage, retrieval, and analysis. Do not answer questions using any other software or method.

Dataset

Download two csv files (stations_data.csv and trip_data.csv) from the course shell (Lab Resources & Datasets). Both files contain data collected from the second year of Bay Area Bike Share's operation, from 9/1/14 to 8/31/15. The schema of each file is given below.

• station_data.csv
- station id: station ID number
- name: name of station
- lat: latitude
- long: longitude
- dockcount: number of total docks at station
- landmark: city (San Francisco, Redwood City, Palo Alto, Mountain View, San Jose)
- installation: original date that station was installed

• trip_data.csv
- Trip ID: numeric ID of bike trip
- Duration: time of trip in seconds
- Start Date: start date of trip with date and time, in PST
- Start Station: station name of start station (corresponds to 'name' in the station_data.csv dataset)
- Start Terminal: numeric reference for start station (corresponds to 'station id' in the station_data.csv dataset)
- End Date: end date of trip with date and time, in PST
- End Station: station name for end station (corresponds to 'name' in the station_data.csv dataset)
- End Terminal: numeric reference for end station (corresponds to 'station id' in the station_data.csv dataset)
- Bike #: numeric ID of bike used
- Subscription Type: 'Subscriber' = annual or 30-day member; 'Customer' = 24-hour or 3-day member
- Zip Code: home zip code of subscriber (customers can choose to manually enter a zip code at the kiosk; however, this data is unreliable)

Study the datasets and understand how they are related before you attempt the following questions.

Questions

1. Find the 'most popular' bike, i.e. the bike that has made the highest number of trips. (1.5 pts)
2. Find the number of trips made by each subscription type. (1.5 pts)
3. Build a table that shows which stations are connected, and the minimum duration between them. You can use either station id or station name. Save this table as a comma-separated text file in '/user/assignment1/stationlist.csv' in HDFS. Include the directory listing of the output directory and the first five lines of the output file in your submission. (3 pts)
4. Find the number of trips originating from each landmark. Your output should include the landmark name and the number of trips originating from it. (3 pts)
5. Find the number of trips crossing landmarks, i.e. trips that originate in one landmark and end in another. Your output should include the originating and ending landmark names and the number of trips between them. (6 pts)

Assignment Submission Instructions

Prepare a report of your findings.
- For each question, provide the steps, commands, and queries in both text and image form.
  o Take clear and readable screenshots of the shell commands along with the outputs.

Submit the report to the Assignment 1 folder under Assessment/Assignments in your course shell.
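
As a possible starting point for the Dataset section above, the two file schemas could be declared in Hive roughly as follows. This is only a sketch under stated assumptions, not a required layout: the table names (stations, trips), the column names, and the column types are all illustrative choices, and the header-skipping property assumes each csv file begins with a header row.

```sql
-- Sketch only: table names, column names, and types are assumptions
-- derived from the schema description, not given by the assignment.

-- Station file. 'long' is renamed to 'lon' here purely as a precaution
-- against clashing with a keyword.
CREATE TABLE IF NOT EXISTS stations (
  station_id   INT,
  name         STRING,
  lat          DOUBLE,
  lon          DOUBLE,
  dockcount    INT,
  landmark     STRING,
  installation STRING   -- kept as text; the date format is not Hive's default
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- Assumes the first line of the file is a header row.
TBLPROPERTIES ('skip.header.line.count'='1');

-- Trip file. Start/End Terminal correspond to station_id above.
CREATE TABLE IF NOT EXISTS trips (
  trip_id           INT,
  duration          INT,     -- seconds
  start_date        STRING,  -- date and time, in PST, kept as text
  start_station     STRING,
  start_terminal    INT,
  end_date          STRING,
  end_station       STRING,
  end_terminal      INT,
  bike_no           INT,
  subscription_type STRING,  -- 'Subscriber' or 'Customer'
  zip_code          STRING   -- free-form kiosk entry, so not numeric
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');
```

Plain comma-delimited parsing does not handle quoted fields that contain commas, so it is worth inspecting the files first; if such fields exist, a CSV SerDe may be a better fit. Once the files are in HDFS, they could be loaded into these tables with LOAD DATA INPATH.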