Assignment overview:
In recent years, advances in machine learning have opened the door to intelligent health care data prediction and decision-making. A variety of machine learning algorithms can learn iteratively from data to uncover hidden patterns and predict future events. Successful applications such as individualized diagnosis and prognosis, hospital readmission prediction, and personalized medicine can lead to improvements in medical practice and health care experiences.
Your final assignment uses two health care datasets: the mammographic masses dataset and the GBD dataset. The goal of this project is to follow the data science analysis pipeline: pose interesting questions of your own choosing, acquire the data, perform data manipulations, design your visualizations, build your predictive models, and present the results in a report format.
Classification -- Mammographic masses dataset
Step 1: Get your dataset: You will use one health care dataset called the Mammographic Mass Data Set (retrieve it from http://archive.ics.uci.edu/ml/datasets/mammographic+mass).
Step 2: You will raise two interesting questions on the dataset and prepare to answer them in your following analysis via data manipulation, visualization, predictive modeling, etc.
Step 3: Data manipulation and cleaning: Observe your dataset and pre-process the data if necessary and justify.
Step 4: Exploratory data analysis: Perform initial investigations on the data using summary statistics and visualizations.
Step 5: You will select at least two classification methods and apply them to the dataset for predictive modeling. The performance of the different models should be evaluated.
Step 6: Analyze the results
Step 7: Document all your findings
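As a rough illustration of Steps 3 to 5, the sketch below cleans a data frame, fits two classifiers, and compares their accuracy on a held-out split. It is only one possible approach, not the required solution. The data are synthetic stand-ins generated in the script; the column names (BIRADS, Age, Shape, Margin, Density, Severity), the 70/30 split, the choice of logistic regression and a decision tree, and leaving BIRADS out of the predictors are all assumptions made for the example. The rpart package is assumed to be installed.

```r
library(rpart)  # decision trees; assumed to be installed

set.seed(42)
n <- 400

# Synthetic stand-in with an assumed column layout. With the real file you
# would read it instead, e.g. read.csv(path, header = FALSE, na.strings = "?"),
# where `path` is wherever you saved the download (hypothetical).
masses <- data.frame(
  BIRADS  = sample(1:5, n, replace = TRUE),
  Age     = round(rnorm(n, mean = 55, sd = 14)),
  Shape   = sample(1:4, n, replace = TRUE),
  Margin  = sample(1:5, n, replace = TRUE),
  Density = sample(1:4, n, replace = TRUE)
)
# Made-up relationship so that the classifiers have something to learn.
lin <- -6 + 0.05 * masses$Age + 0.6 * masses$Shape + 0.4 * masses$Margin
masses$Severity <- rbinom(n, 1, plogis(lin))

# Inject some missing values to mimic an incomplete dataset.
masses$Age[sample(n, 10)] <- NA
masses$Density[sample(n, 20)] <- NA

# Step 3 (one option): drop incomplete rows; imputation is an alternative.
clean <- na.omit(masses)
clean$Severity <- factor(clean$Severity, levels = c(0, 1),
                         labels = c("benign", "malignant"))

# Hold out 30% of the rows for evaluation.
idx       <- sample(nrow(clean), size = floor(0.7 * nrow(clean)))
train_set <- clean[idx, ]
test_set  <- clean[-idx, ]

# Model 1: logistic regression (models the probability of "malignant").
fit_glm  <- glm(Severity ~ Age + Shape + Margin + Density,
                data = train_set, family = binomial)
prob_glm <- predict(fit_glm, newdata = test_set, type = "response")
pred_glm <- factor(ifelse(prob_glm > 0.5, "malignant", "benign"),
                   levels = levels(test_set$Severity))

# Model 2: classification tree.
fit_tree  <- rpart(Severity ~ Age + Shape + Margin + Density,
                   data = train_set, method = "class")
pred_tree <- predict(fit_tree, newdata = test_set, type = "class")

# Step 5: compare the two models on the same held-out rows.
acc <- c(logistic = mean(pred_glm == test_set$Severity),
         tree     = mean(pred_tree == test_set$Severity))
print(table(predicted = pred_glm, actual = test_set$Severity))
print(acc)
```

Accuracy alone can be misleading when the classes are unbalanced, so a fuller evaluation might also report sensitivity and specificity from the confusion table, or use cross-validation instead of a single split.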
Clustering -- GBD dataset
Step 1: Get your dataset: You will use one health care dataset, the Global Burden of Disease Study (GBD) Data Set, from LMS.
NOTE: IHME GBD data 2017_F_csv is the GBD data of females in 2017; IHME GBD data 2017_M_csv is the GBD data of males in 2017. YOU ONLY NEED TO SELECT ONE OF THEM FOR THE FOLLOWING ANALYSIS.
Background of GBD: http://www.healthdata.org/gbd/about
Data retrieved from:
http://ghdx.healthdata.org/gbd-results-tool
http://ghdx.healthdata.org/record/ihme-data/global-health-spending-1995-2017
http://ghdx.healthdata.org/record/ihme-data/gbd-2017-socio-demographic-index-sdi-1950%E2%80%932017
Step 2: You will raise two interesting questions on the dataset and prepare to answer them in your following analysis via data manipulation, visualization, clustering, etc.
Step 3: Data manipulation and cleaning: Observe your dataset and pre-process the data if necessary and justify.
Step 4: Exploratory data analysis: Perform initial investigations on the data using summary statistics and visualizations.
Step 5: You will select at least two clustering methods to identify groups of countries in the dataset. The performance of the different models should be evaluated.
Step 6: Analyze the results
Step 7: Document all your findings
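To make the clustering step more concrete, the sketch below groups countries with two methods, k-means and Ward hierarchical clustering, and compares them using the average silhouette width. It is an illustration under stated assumptions rather than the expected solution: the data are synthetic, the column names (location, death_rate, health_spend, sdi) are hypothetical and will not match the LMS files, and the choice of two clusters is arbitrary. The cluster package is assumed to be installed.

```r
library(cluster)  # silhouette(); assumed to be installed

set.seed(1)
n <- 60

# Synthetic country-level table with hypothetical column names.
countries <- data.frame(
  location     = paste0("Country_", seq_len(n)),
  death_rate   = c(rnorm(30, 500, 50),   rnorm(30, 900, 80)),
  health_spend = c(rnorm(30, 3000, 400), rnorm(30, 300, 80)),
  sdi          = c(rnorm(30, 0.80, 0.05), rnorm(30, 0.45, 0.08))
)

# Standardize the features so that no single unit dominates the distances.
X <- scale(countries[, c("death_rate", "health_spend", "sdi")])
d <- dist(X)  # Euclidean distances between countries

# Method 1: k-means with several random starts.
km <- kmeans(X, centers = 2, nstart = 25)

# Method 2: hierarchical clustering (Ward linkage), cut into 2 groups.
hc        <- hclust(d, method = "ward.D2")
hc_labels <- cutree(hc, k = 2)

# Evaluation: average silhouette width (closer to 1 means tighter,
# better-separated clusters).
sil_km <- mean(silhouette(km$cluster, d)[, "sil_width"])
sil_hc <- mean(silhouette(hc_labels, d)[, "sil_width"])
print(c(kmeans = sil_km, hclust = sil_hc))

# Cross-tabulate the two labelings to see how far the methods agree.
print(table(kmeans = km$cluster, hclust = hc_labels))
```

In practice the number of clusters would itself need justification, for example by comparing silhouette widths or within-cluster sums of squares over a range of k.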
What you need to submit:
R file
An essential part of your project is your R code. Your R file should record the steps you took to develop your solutions and obtain the final data analysis results. Make sure your code matches the findings you put in the report. For example, if there are three separate plots in the report, your code should produce exactly the same three plots.
Report
You also need to submit an in-depth report. The following components and discussions might be considered in your report:
Overview of the project: Provide an overview of the project, the goals, and the motivation for it. Keep in mind that this will be read by people seeing your project for the first time.
Dataset: Describe the background of the dataset and provide summary statistics.
Interesting questions: What questions are you trying to answer? Do any questions evolve throughout the project? Are there any new questions you consider in the course of your analysis? ...
Data manipulation and cleaning: Were any data pre-processing steps performed, and why? Are there any questions that can be answered via data manipulation? ...
Exploratory data analysis: What visualizations did you use to look at your data in different ways? Were any outliers detected? ...
Predictive modelling: What are the various machine learning methods you considered? Justify the decisions you made. What are the main ideas of the selected methods? How do you build the models? Are there any concerns when designing your model? ...
Final analysis: What did you learn about the data? Which method statistically outperformed the rest? Have you found the answers to the questions you raised? How can you justify your answers? ... Present your results engagingly using text and visualizations.
Conclusion: Are there any limitations of your study? What is your future work?