Practical Data Science with Python COSC 2670/2738 Assignment 1 (Part 2) Assessment Type Individual Due Date 23:59, the 3rd of May, 2020 Marks 20 Please read all the following information before...

This assignment has two tasks: Task 1 is a set of presentation slides, and Task 2 is a PDF report based on the survey I worked on in the assignment.


Practical Data Science with Python COSC 2670/2738
Assignment 1 (Part 2)

Assessment Type: Individual
Due Date: 23:59, the 3rd of May, 2020
Marks: 20

Please read all the following information before attempting your assignment. This is an individual assignment. You may not collude with any other people, or plagiarise their work. Each student is expected to present the results of his/her own thinking and writing. Never copy other students’ work (even if they “explain it to you first”) and never give your written work to others. Keep any conversation high-level and never show your solution to others. Never copy from the Web or any other resource. Remember you are meant to generate the solution to the questions by yourself. Suspected collusion or plagiarism will be dealt with according to RMIT policy.

In the submission (your PDF file) you will be required to certify that the submitted solution represents your own work only by agreeing to the following statement:

I certify that this is all my own original work. If I took any parts from elsewhere, then they were non-essential parts of the assignment, and they are clearly attributed in my submission. I will show I agree to this honor code by typing “Yes”:

A sample format for this requirement is provided; please find it in Canvas -> Assignments -> Assignment1Part2.

Tasks

This is Part 2 of Assignment 1, and it includes two tasks. It is independent of your Assignment 1, so your current Assignment 1 will not affect this Part 2.

Task 1: An oral presentation of the work in Assignment 1 (10%)

The presentation should briefly describe:
• How to prepare the data?
• How to explore the data?
• What are the results from your analysis?

The presentation should be a maximum of 10 minutes. Your presentation slides should be:
• Microsoft PowerPoint slides (with audio inserted for each slide by using: Insert -> Audio -> Record Audio),
• or you can create your own presentation slides (e.g. PDF version), in which case please submit your own recording of your presentation as well.

Task 2: Short answer question (10%)

The questions in the survey can be divided into two parts:
• one is about people’s attitude or opinion about Star Wars movies, including:
  – Have you seen any of the 6 films in the Star Wars franchise?
  – Do you consider yourself to be a fan of the Star Wars film franchise?
  – Which of the following Star Wars films have you seen? Please select all that apply. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
  – Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
  – Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. (Han Solo, Luke Skywalker, Princess Leia Organa, Anakin Skywalker, Obi Wan Kenobi, Emperor Palpatine, Darth Vader, Lando Calrissian, Boba Fett, C-3P0, R2-D2, Jar Jar Binks, Padme Amidala, Yoda)
  – Which character shot first?
  – Are you familiar with the Expanded Universe?
  – Do you consider yourself to be a fan of the Expanded Universe?
  – Do you consider yourself to be a fan of the Star Trek franchise?
• the other is about people’s demographics, including:
  – Gender
  – Age
  – Household Income
  – Education
  – Location (Census Region)

We would like to build a classifier (or some classifiers, for example one classifier per demographic feature), which can classify people’s demographics (gender, age, household income, education, location (census region)) based on their attitude or opinion about Star Wars movies. Please describe how to build this classifier (or these classifiers) by using the data collected in the survey (the data provided in Assignment 1). Please note that this is a short-answer question, and no coding work is required.

Your submission must be a PDF document of at most 6 pages (single-column format, including figures and references) with a font size between 10 and 12 points. Penalties will apply if the report does not satisfy the requirement.

What to Submit, When, and How

The assignment is due at 23:59, the 3rd of May, 2020. Assignments submitted after this time will be subject to standard late submission penalties. You need to submit the following files:
• your presentation slides and the oral audio presentation as required in Task 1.
• your Assignment1 Part2.pdf file, which includes your answers to Task 2.

They must be submitted as ONE single zip file, named as your student number (for example, 1234567.zip if your student ID is s1234567). The zip file must be submitted in Canvas: Assignments/Assignment 1 (Part 2). Please do NOT submit other unnecessary files.

{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "#Task 1: Data Preparation\n", "# \"You will start by loading the CSV data from the file (using appropriate pandas functions) and checking whether the loaded data is equivalent to the data in the source CSV file.\n", "# Then, you need to clean the data by using the knowledge we taught in the lectures. 
You need to deal with all the potential issues/errors in the data appropriately (such as: typos, extra whitespaces, sanity checks for impossible values, and missing values etc). \"\n", "\n", "# Please structure code as follows: \n", "# always provide one line of comments to explain the purpose of the code, e.g. load the data, checking the equivalent to original data, checking typos (do this for each other types of errors)\n", "\n", "#Code goes after this line by adding cells" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "#reading csv file using pandas\n", "import pandas as pd\n", "starwars = pd.read_csv(\"starwars.csv\", encoding=\"ISO-8859-1\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1187, 38)\n" ] }, { "data": { "text/plain": [ "Index(['RespondentID',\n", " 'Have you seen any of the 6 films in the Star Wars franchise?',\n", " 'Do you consider yourself to be a fan of the Star Wars film franchise?',\n", " 'Which of the following Star Wars films have you seen? 
Please select all that apply.',\n", " 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',\n", " 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',\n", " 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',\n", " 'Unnamed: 14',\n", " 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',\n", " 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',\n", " 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',\n", " 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',\n", " 'Unnamed: 28', 'Which character shot first?',\n", " 'Are you familiar with the Expanded Universe?',\n", " 'Do you consider yourself to be a fan of the Expanded Universe?Œæ',\n", " 'Do you consider yourself to be a fan of the Star Trek franchise?',\n", " 'Gender', 'Age', 'Household Income', 'Education',\n", " 'Location (Census Region)'],\n", " dtype='object')" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#to print the shape of file and data type of columns\n", "print(starwars.shape)\n", "starwars.columns" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "#to find null values from RespondentID column\n", "starwars = starwars[starwars['RespondentID'].notnull()]" ] }, { "cell_type": "code"
Answered Same Day, May 09, 2021, COSC2670

Answer To: Practical Data Science with Python COSC 2670/2738 Assignment 1 (Part 2) Assessment Type Individual...

Pushpendra answered on May 10 2021
Project Task-2: Classification of Movie Data Analysis

Feature Engineering:

Feature engineering is the process of using domain knowledge to extract features from raw
data via data mining techniques. These features can be used to improve the performance
of machine learning algorithms.
As part of feature engineering, feature selection is the process of manually or
automatically selecting those features in the data that contribute most to the prediction
variable or output. Having irrelevant features in your data can decrease the accuracy of
many models.
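
As a minimal sketch of this step, the categorical survey answers can be turned into numeric columns with one-hot encoding before any classifier is trained. The small table and the shortened column names (seen_any, fan, shot_first) below are made up for illustration; they are not the real survey columns.

```python
import pandas as pd

# Hypothetical miniature of the survey: three opinion questions (features)
# and one demographic column (a possible target). Not the real data.
survey = pd.DataFrame({
    "seen_any": ["Yes", "Yes", "No", "Yes"],
    "fan": ["Yes", "No", "No", "Yes"],
    "shot_first": ["Han", "Greedo", "I don't understand this question", "Han"],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# One-hot encode the opinion answers so that a classifier can use them.
features = pd.get_dummies(survey[["seen_any", "fan", "shot_first"]])

# The demographic column is kept aside as the label to predict.
target = survey["Gender"]

print(features.columns.tolist())
```

Each answer option becomes its own 0/1 column, which is also the non-negative form that a chi-squared test expects.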

Feature selection techniques in Python with the scikit-learn library:

1) Remove the features which have low variance. This can be done by setting a threshold
value with VarianceThreshold in the sklearn library.

2) Remove the features which are highly correlated with one another, since they carry
largely the same information. Correlation can be positive or negative, so compare
absolute values.

3) Univariate Feature Selection (e.g. ANOVA F-test or chi-squared test):

o Statistical tests can be used to select those features that have the strongest
relationship with the output variable.
o Use the chi-squared (chi2) statistical test for non-negative features, such as encoded
categorical answers, to select the best features from the dataset.
4) Recursive Feature Elimination:
o The Recursive Feature Elimination (or RFE) works by recursively removing attributes and
building a model on those attributes that remain. It uses the model accuracy to identify
which attributes (and combination of attributes) contribute the most to predicting the
target attribute.
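
The four techniques above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is a synthetic 0/1 matrix standing in for encoded survey answers, and the thresholds (0.0 for variance, 0.9 for correlation), k=4, and the choice of LogisticRegression inside RFE are arbitrary picks, not values from the assignment.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for one-hot encoded survey answers (0/1 values).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8))
X[:, 7] = 0                      # a constant column with zero variance
y = X[:, 0] | X[:, 1]            # label depends only on the first two columns

# 1) Drop features whose variance does not exceed the threshold.
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X)

# 2) Flag pairs of remaining features with high absolute correlation.
corr = np.abs(np.corrcoef(X_var, rowvar=False))
high_pairs = np.argwhere(np.triu(corr, k=1) > 0.9)

# 3) Keep the k features with the highest chi-squared score.
kbest = SelectKBest(score_func=chi2, k=4)
X_chi = kbest.fit_transform(X_var, y)

# 4) Recursive feature elimination around a simple linear model.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X_var, y)

print(X_var.shape, X_chi.shape, rfe.support_)
```

In scikit-learn, RFE ranks features by the fitted estimator's coefficients or feature importances, so the estimator passed to it must expose one of those.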
Training Model:
First, fit the machine learning algorithm on the training dataset. Using predictions on
this same dataset to evaluate performance gives an optimistic estimate, so some data
should be held out for evaluation.
Split into Train and Test Sets:
o When a large amount of data and complex models make training times very long, it is
typical to use a simple split of the data into training and test datasets (or training
and validation datasets), for example with the Python scikit-learn machine learning library.
o Use 70% for training and the remaining 30% of the data for validation. Where the
library supports it, the validation dataset can be passed to the fit() function.
o The key terms to understand are:
1) Training Dataset
2) Validation Dataset
3) Test Dataset
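
A minimal sketch of the 70/30 split described above, assuming synthetic data from make_classification in place of the encoded survey and a small decision tree as an arbitrary example model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the encoded survey features and one label.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 30% of the rows; stratify keeps the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Accuracy on the training rows is usually optimistic; the held-out
# score is the more honest estimate.
train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {val_acc:.2f}")
```

For the survey task, one such model could be trained per demographic column, each with its own split.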
Training Dataset:

The sample of data used to fit the model. This is the actual dataset used to train the
model; the model sees and learns from this data.

Validation Dataset:

The sample of data used to provide an unbiased evaluation of a model fit on the training dataset
while tuning model hyperparameters. The evaluation becomes more biased as skill on the
validation dataset is incorporated into the model configuration.

The validation set is used to evaluate a given model frequently during development.

Use this data to fine-tune the model hyperparameters, so the model occasionally sees this data,
but it never “learns” from it.

We use the validation set results and update higher level hyperparameters. So the validation set
in a way affects a model, but indirectly....
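
The role of the three datasets can be sketched as follows. This is an illustration only: the data is synthetic, and the k-nearest-neighbours model, the candidate values of k, and the 60/20/20 split are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the encoded survey features and one label.
X, y = make_classification(n_samples=600, n_features=10, random_state=1)

# First set aside a test set, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)

# Tune the hyperparameter on the validation set only.
best_k, best_score = None, -1.0
for k in (1, 3, 5, 9, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# The test set is touched once, after the hyperparameter is fixed.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(f"best k: {best_k}, validation: {best_score:.2f}, test: {test_score:.2f}")
```

The model is always fitted on the training rows; the validation rows only decide which k to keep, which is the indirect influence described above.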