Question

The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals. It includes over 50 features representing diabetic patient and hospital outcomes. A detailed description of all the attributes is provided in Table 1 of Beata Strack et al., “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014 (https://www.hindawi.com/journals/bmri/2014/781670/). The dependent variable is “Readmitted”: whether a patient will be readmitted the next year and how long they will stay in hospital. In the original dataset, it represents days to inpatient readmission and includes three possible values (>30, ...).

Tasks: (the sentences in bold indicate what you need to include in your Word document; you don’t need to submit anything else)

Download and install the WEKA tool from http://www.cs.waikato.ac.nz/ml/weka/

Download the sample data from the D2L site (diabetes.zip) and unzip it. You will find a training dataset, diabetic_training.csv, and a test dataset, diabetic_test.csv.

Use Weka -> Explorer -> Preprocess -> Open file to open the training dataset (diabetic_training.csv), remove the two variables “encounter_id” and “patient_nbr” (these IDs shouldn’t be used in modeling), and then save it as diabetic_training.arff. Then open the test dataset (diabetic_test.csv), again remove “encounter_id” and “patient_nbr”, and save it as diabetic_test.arff.
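The same column removal can also be done outside the Weka GUI. The following is a minimal Python sketch, not part of the assignment, that copies a CSV while leaving out the two ID columns. The function name drop_columns, the column some_feature and the sample rows are made up for illustration; the sketch does not write ARFF, so the result would still need to be saved as .arff from Weka.

```python
import csv
import io

# The two identifier columns named in the task.
ID_COLUMNS = ("encounter_id", "patient_nbr")

def drop_columns(src, dst, columns=ID_COLUMNS):
    """Copy CSV rows from src to dst, leaving out the named columns."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    # Positions of the columns we want to keep.
    keep = [i for i, name in enumerate(header) if name not in columns]
    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])
    return [header[i] for i in keep]

# Tiny made-up sample standing in for diabetic_training.csv.
# For the real file, pass open("diabetic_training.csv", newline="") instead.
sample = io.StringIO(
    "encounter_id,patient_nbr,some_feature,readmitted\n"
    "1,100,A,>30\n"
    "2,101,B,>30\n"
)
out = io.StringIO()
kept = drop_columns(sample, out)
print(kept)
print(out.getvalue())
```

Running the same function on both the training and the test file keeps the two datasets aligned on the same set of columns.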

-------- You don’t need to submit anything for the above tasks -------------------------------

Open diabetic_training.arff. Use three variable selection methods: 1) a filter-based method based on “information gain”, 2) a filter-based method with “Chi-squared attribute evaluation”, and 3) a wrapper-based method with the J48 decision tree. Please try to combine the results you obtained from these three different methods.
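To make the two filter criteria concrete, here is a minimal Python sketch of how information gain and the chi-squared statistic can be computed for one nominal attribute against the class label. It is an illustration of the textbook formulas, not Weka's implementation: it handles nominal values only, the function names are hypothetical, and the toy data is made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels):
    """Class entropy minus the weighted entropy after splitting on the attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def chi_squared(values, labels):
    """Pearson chi-squared statistic of the attribute/class contingency table."""
    total = len(labels)
    observed = Counter(zip(values, labels))
    value_totals = Counter(values)
    label_totals = Counter(labels)
    stat = 0.0
    for v, nv in value_totals.items():
        for y, ny in label_totals.items():
            expected = nv * ny / total  # count expected under independence
            stat += (observed[(v, y)] - expected) ** 2 / expected
    return stat

# Made-up toy data: one attribute that predicts the class perfectly,
# and one that carries no information about it.
labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]
uninformative = ["a", "b", "a", "b"]

print(information_gain(informative, labels), chi_squared(informative, labels))
print(information_gain(uninformative, labels), chi_squared(uninformative, labels))
```

Both criteria score each attribute independently of the others, which is why filter methods run quickly and return a ranking rather than a subset.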

In your Word document, show me the outputs of these three variable selection methods. We want to select 10 variables. Please try to combine the variables selected using the different methods and show the 10 variables you think should be selected. Please briefly explain how you combined the results of the different methods.

Remove the variables that haven’t been selected from diabetic_training.arff and save it as diabetic_training2.arff. Then open diabetic_test.arff. Again, remove the variables that haven’t been selected and save it as diabetic_test2.arff. (you don’t need to submit anything for this task)

Open diabetic_training2.arff. Then we fit three models:

Use the training dataset (diabetic_training2.arff) with 10 variables to fit a neural network model (MultilayerPerceptron in Weka). Please let Weka automatically split your training data into training vs. validation (70% vs. 30%), and show me the results, including recall, precision, F1-score and accuracy.
Warning: It will take quite some time to fit a neural network model.

Using the training dataset (diabetic_training2.arff), fit an SVM model (SMO in Weka). Please let Weka automatically split your training data into training vs. validation (70% vs. 30%). Please show me the results, including recall, precision, F1-score and accuracy.

Using the training dataset (diabetic_training2.arff), run a logistic regression model (Logistic in Weka, rather than SimpleLogistic). Please let Weka automatically split your training data into training vs. validation (70% vs. 30%). Please show me the results, including recall, precision, F1-score and accuracy.

-------- When you fit these models, please just use the default hyperparameters; in real practice, you need to tune the algorithm hyperparameters. --------
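Weka reports the four requested metrics in its classifier output. As a reference for interpreting them, the following minimal Python sketch computes per-class precision, recall and F1-score plus overall accuracy from actual and predicted labels. The function name classification_report and the labels pos and neg are made up for illustration; Weka additionally reports weighted averages, which this sketch omits.

```python
from collections import Counter

def classification_report(actual, predicted):
    """Per-class precision, recall and F1, plus overall accuracy."""
    labels = sorted(set(actual) | set(predicted))
    pairs = Counter(zip(actual, predicted))  # confusion-matrix cells
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(pairs[(a, label)] for a in labels if a != label)
        fn = sum(pairs[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(pairs[(l, l)] for l in labels) / len(actual)
    return report, accuracy

# Made-up validation labels for a two-class problem.
actual = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]

report, accuracy = classification_report(actual, predicted)
for label, scores in report.items():
    print(label, scores)
print("accuracy", accuracy)
```

Accuracy alone can be misleading when the classes are imbalanced, which is one reason the task asks for precision, recall and F1-score as well.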

Recommend the best model among the three. Please briefly justify your recommendation.

Now that you have selected which algorithm you want to use, we want to fit and assess the final model using the training dataset (diabetic_training2.arff) and the test dataset (diabetic_test2.arff). Please show me the results, including recall, precision, F1-score and accuracy.

005_pbd9a5o-xtvntdbq.csv 005_glh9a5o-ruualpda.csv 005_dbl9a5o-iursscje.doc

David · Accepted Answer

Data Mining
[Diabetes Readmission]

Contents
Attribute selection
Filter-based method based on “information gain”
Filter-based method with Chi-squared evaluation
Wrapper-based method with the J48 decision tree
Selected Features
Modeling
SMO
Logistic
Neural Network
Best Model
Testing with Test data
Attribute selection
We conducted attribute selection using three methods:
Filter-based method based on “information gain”
Filter-based method with Chi-squared evaluation
Wrapper-based method with the J48 decision tree
The wrapper-based method took a long time to execute.
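The long run time follows from how wrapper methods work: every candidate subset is scored by training and evaluating the classifier. The Python sketch below illustrates this with a simple greedy forward search that counts evaluations. It is an assumption for illustration only: the search strategy used in Weka for this answer is not stated, the function forward_selection is hypothetical, and the toy score function stands in for training and validating a J48 tree.

```python
def forward_selection(attributes, score, k):
    """Greedy forward search: repeatedly add the attribute that helps most.

    score(subset) stands in for training and validating a classifier
    on that subset, which is the expensive step in a real wrapper.
    """
    selected = []
    remaining = list(attributes)
    evaluations = 0
    while remaining and len(selected) < k:
        best_attr, best_score = None, float("-inf")
        for attr in remaining:
            evaluations += 1  # one model fit per candidate subset
            s = score(selected + [attr])
            if s > best_score:
                best_attr, best_score = attr, s
        selected.append(best_attr)
        remaining.remove(best_attr)
    return selected, evaluations

# Made-up per-attribute usefulness; a real score would come from a classifier.
weights = {"a": 0.1, "b": 0.5, "c": 0.3, "d": 0.2}

def toy_score(subset):
    return sum(weights[attr] for attr in subset)

selected, evaluations = forward_selection(weights, toy_score, k=2)
print(selected, evaluations)
```

Even this simple search needs one model fit per remaining attribute in every round, whereas a filter method scores each attribute once without fitting any model.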
