This project needs to be done in a Jupyter Notebook. I need a Jupyter Notebook file with the code, and a PDF file that answers the questions and includes snippets of the code used to answer them.
I am attaching a file with the instructions and Table 1.1, and another file with a different dataset that needs to be used for the other questions.
Please note that for Question 1 (a, b, c, d, etc.) the table that needs to be used is Table 1.1.
For the other questions I am going to attach a different dataset.
o This is an open book exam under the following guidelines.
o You can use Lecture and Tutorial notes to answer your questions; however, you should express the answers in your own words, based on your own understanding and description.
o The Urkund database will be used to check this exam for plagiarism.
o The final assessment will be uploaded to your Moodle page.
o If you are using pen and paper for this assignment, take a picture of your completed work with your mobile phone camera or digital camera and submit the snapshots on Moodle. Please ensure that you write below each drawing the question number it corresponds to. The page number must be clearly marked on each snapshot.
o Use a single-column document layout; the font for the body of the text should be 12 point Arial or Calibri.
o Write your student ID, name, module name and undergraduate year at the top of the first page.
o Clearly write the question number along with each answer in the assignment.
Attempt all questions; the marks for each question are stated separately.
Question 1:
Can you describe the importance of data cleaning before data exploration? Identify which attributes in Table 1.1 have outliers, inappropriate values or missing values.
Table 1.1
Customer ID | Zip Code | Gender | Income | Age | Marital Status | Transaction Amount
C10001 | 10037 | F | 45,000 | B | M | 6000
C10002 | K2S7P7 | M | -50,000 | 34 | W | 5000
C10003 | 80123 | | 10,000,000 | 40 | S | 2000
C10004 | 8767 | F | 60,000 | 46 | S | 1500
C10005 | 44202 | M | 99,999 | 47 | 2 | 0.3000
C10005 | S2D4 | F | 65,000 | 51 | D | 6000
a) Refer to the income attribute of the six customers in Table 1.1, before pre-processing. Find the mean (average) income before data pre-processing. Do you think that mean value is correct? Now, remove the incorrect values from the income column by using your own strategy and again calculate the mean value after pre-processing. Does this value provide a logical understanding of the income column? Explain briefly.
b) Explain why zip codes should be treated as a text variable rather than a numeric one.
c) What is an outlier? Why do we need to treat outliers carefully in datasets? Can you identify some outliers among the attributes in Table 1.1?
d) Remove outliers, duplicate values and missing data values from Table 1.1. Present Table 1.1 after pre-processing of all columns, in a format suitable for data exploration.
(10, 5, 5, 5, 5 = 30 marks)
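As a starting point for parts (a) and (b), the sketch below computes the mean income before and after cleaning, and shows what happens when a zip code is stored as a number. It is only one possible approach: the choice of which incomes count as "incorrect" (the negative value, the very large value, and 99,999 read as a placeholder code) is an assumption, and the zip code "08767" and the helper `as_number` are hypothetical, used for illustration only.

```python
from statistics import mean

# Income values as listed for Table 1.1, before any pre-processing.
raw_income = [45_000, -50_000, 10_000_000, 60_000, 99_999, 65_000]
mean_before = mean(raw_income)

# Assumed cleaning strategy (one of several defensible choices):
#   -50,000     a negative income is not meaningful
#   10,000,000  far outside the range of the other values
#   99,999      looks like a placeholder code for a missing value
suspect = {-50_000, 10_000_000, 99_999}
clean_income = [value for value in raw_income if value not in suspect]
mean_after = mean(clean_income)

print(f"mean before cleaning: {mean_before:,.2f}")
print(f"mean after cleaning:  {mean_after:,.2f}")


def as_number(code):
    """Hypothetical helper: try to store a zip code as an integer."""
    try:
        return int(code)
    except ValueError:
        return None  # codes containing letters cannot be stored at all


# A leading zero is silently lost, and alphanumeric codes fail outright.
print(as_number("08767"))   # hypothetical zip code with a leading zero
print(as_number("K2S7P7"))  # alphanumeric code from Table 1.1
```

A different strategy, for example replacing the suspect incomes with the median instead of dropping them, would give a different cleaned mean, so the written answer should state which strategy was used and why.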
Question 2:
What is the difference between Independent Component Analysis (ICA) and Non-negative Matrix Factorisation (NNMF)? Use any dataset [other than those used in the class or labs] to illustrate the difference between the two methods. Express your understanding in your own words.
(300 - 350 words, 35 marks)
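One way the contrast could be illustrated in the notebook is sketched below, assuming scikit-learn and NumPy are available. The data here is synthetic and stands in for whichever dataset is eventually chosen; the variable names are illustrative. The point it shows is structural: ICA returns components with unconstrained sign, while NNMF (called `NMF` in scikit-learn) returns factors that are non-negative by construction.

```python
import numpy as np
from sklearn.decomposition import FastICA, NMF

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not a dataset from the assignment):
# two non-negative hidden sources mixed with non-negative weights.
sources = rng.uniform(0.0, 1.0, size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.3, 1.0, 0.8]])
X = sources @ mixing  # shape (200, 3), every entry >= 0

# ICA looks for statistically independent components.
# The data is centred first, so the recovered signals take both signs.
ica = FastICA(n_components=2, random_state=0)
S_ica = ica.fit_transform(X)

# NMF factorises X as approximately W @ H with W >= 0 and H >= 0.
# It requires the input X itself to be non-negative.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = nmf.fit_transform(X)
H = nmf.components_

print("ICA output contains negative values:", bool((S_ica < 0).any()))
print("NMF factors are all non-negative:", bool((W >= 0).all() and (H >= 0).all()))
```

With a real dataset, the same two fits could be compared by plotting the recovered components side by side, which tends to make the additive, parts-based nature of the NMF factors easier to see.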
Question 3:
Consider any dataset of your choice [other than those used in the class or labs] and show the benefits of feature selection for finding the data relevant to the problem domain. Provide a supporting graph or figure that shows the correlation or ranking of the different features.
(300 - 350 words, 35 marks)
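A minimal sketch of a correlation-based feature ranking is given below, assuming NumPy is available. The data is synthetic and the feature names `x0`, `x1`, `x2` are hypothetical; with a real dataset the columns would replace them. Absolute Pearson correlation with the target is only one simple filter method and misses non-linear relationships.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in features (an assumption, not assignment data).
x0 = rng.normal(size=n)  # strongly related to the target
x1 = rng.normal(size=n)  # moderately related to the target
x2 = rng.normal(size=n)  # unrelated noise
y = 3.0 * x0 + 1.5 * x1 + rng.normal(size=n)

features = {"x0": x0, "x1": x1, "x2": x2}

# Score each feature by the absolute Pearson correlation with the target.
scores = {name: abs(np.corrcoef(column, y)[0, 1])
          for name, column in features.items()}

# Rank features from most to least correlated.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: |r| = {scores[name]:.3f}")
```

The `scores` dictionary can be passed to a bar chart (for example with matplotlib) to produce the supporting figure the question asks for.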
Note: Use Harvard-style references in the assignment. Provide a Jupyter notebook for the Python code, or include the code in the assignment.