Decision Trees
Due: Friday, Feb 28th at 6pm

The purpose of this assignment is to give you experience with data cleaning and decision trees. You can work in pairs on this assignment.

Getting started

Once you have followed these instructions, your repository will contain a directory named pa4. That directory will include:

- transform.py: skeleton code for Task 1.
- test_transform.py: test code for Task 1.
- decision_tree.py: skeleton code for Task 2.
- test_decision_tree.py: test code for Task 2.
- data: a directory with sample data sets.

To pick up the data for this assignment, change to the data directory and run the following on the Linux command-line:

$ ./get_files.sh

Pima Indian Data

The Pima Indian Data Set (http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes), which is from the UC Irvine Machine Learning Repository, contains anonymized information on women from the Pima Indian Tribe (https://en.wikipedia.org/wiki/Pima_people). This information was collected by the National Institute of Diabetes and Digestive and Kidney Diseases to study diabetes, which is prevalent among members of this tribe. The data set has 768 entries, each of which contains the following attributes:

1. Number of pregnancies
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Has diabetes (1 means yes, 0 means no)

Task 1: Data Cleaning

Your first task is to clean and then transform the raw Pima data into a training set and a testing set. We have seeded your pa4 directory with a file named transform.py. Your task is to complete the function clean, which takes four arguments:

1. the name of a raw Pima Indians Diabetes data file,
2. a filename for the training data,
3. a filename for the testing data, and
4. a seed for use with train_test_split.

Your function should clean and transform the raw data as described below, split the resulting data into training and testing sets, and save the split data in CSV files.

Each row in the raw file contains an observation. The raw attribute values are floating point numbers. For every attribute but the first and the last, a zero should be interpreted as missing data. The "Triceps skin fold thickness (mm)" and "2-Hour serum insulin (mu U/ml)" columns have a lot of missing data, so you should eliminate them when you process the data. Also, you should remove any observation with a value of zero for plasma glucose concentration, diastolic blood pressure, or body mass index.

Once the data is cleaned, you will need to convert the numeric data into categorical data. We have included a dictionary named BOUNDS in transform.py that specifies, for each category, a list of category boundaries and a list of category labels. For example, the categories for plasma glucose level are represented with the following pair of lists: [0.1, 95, 141, float("inf")] and ["low", "medium", "high"]. Together, these lists specify that a plasma glucose level between 0.1 (inclusive) and 95 (exclusive) should be labeled as "low", a level between 95 (inclusive) and 141 (exclusive) should be labeled as "medium", and a level of 141 or higher should be labeled as "high". Note: the Python expression float('inf') evaluates to positive infinity; for all floating point values x, x < float('inf').

Finally, once the data is cleaned and transformed, you should randomly split the observations into two sets, training and testing, using the sklearn.model_selection function train_test_split, with the specified seed for the random_state parameter. The training set should contain roughly 90% of the transformed data, with the remainder going into the testing set.

The raw data includes a header row, which should be suitably modified and included in both output files. Do not include the row index in the output files. Pandas is ideally suited for this task; one possible shape for the function is sketched below.
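The following is a minimal sketch of how clean might be structured with pandas. The column names and the BOUNDS layout shown here are hypothetical placeholders; the real header strings and the actual BOUNDS dictionary live in transform.py and your data file, and may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical names -- check transform.py and the raw file's header
# for the real column names and the real BOUNDS structure.
DROP_COLS = ["Triceps skin fold thickness (mm)", "2-Hour serum insulin (mu U/ml)"]
ZERO_IS_MISSING = ["Plasma glucose", "Diastolic blood pressure", "Body mass index"]
BOUNDS = {
    "Plasma glucose": ([0.1, 95, 141, float("inf")], ["low", "medium", "high"]),
    # ... one (boundaries, labels) entry per remaining numeric column ...
}

def clean(raw_filename, train_filename, test_filename, seed):
    df = pd.read_csv(raw_filename)

    # Drop the two columns dominated by missing data.
    df = df.drop(columns=DROP_COLS)

    # Remove observations with a zero (i.e., missing) value in key columns.
    for col in ZERO_IS_MISSING:
        df = df[df[col] != 0]

    # Convert numeric columns to categories. right=False makes each bin
    # [b_i, b_{i+1}), matching the inclusive/exclusive rule above.
    for col, (bounds, labels) in BOUNDS.items():
        df[col] = pd.cut(df[col], bins=bounds, labels=labels, right=False)

    # Roughly 90% training, 10% testing, reproducible via the seed.
    train, test = train_test_split(df, train_size=0.9, random_state=seed)
    train.to_csv(train_filename, index=False)
    test.to_csv(test_filename, index=False)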
Testing Task 1

We have provided test code for Task 1 in test_transform.py.

Decision Trees

As we discussed in lecture, decision trees are a data structure used to solve classification problems. Here is a sample decision tree that labels tax payers as potential cheaters or non-cheaters. This tree, for example, would classify a single person who did not get a refund and makes $85,000 a year as a possible cheater.

We briefly summarize the algorithm for building decision trees below. See the chapter on classification and decision trees from Introduction to Data Mining by Tan, Steinbach, and Kumar (https://www-users.cs.umn.edu/~kumar001/dmbook/ch3_classification.pdf) for a more detailed description.

Definitions

Before we describe the decision tree algorithm, we need to define a few formulas. Let $S \subseteq A_1 \times A_2 \times \cdots \times A_k$ be a multiset of observations, $r$ a "row" or "observation" in $S$, $A \in \{A_1, \ldots, A_k\}$ an attribute, and $r[A]$ the value of attribute $A$ in row $r$. Let $|S|$ denote the number of observed elements in $S$ (including repetition of the same element). We use the following definitions:

$$S_{A=j} = \{r \in S \mid r[A] = j\}$$

$$p(S, A, j) = \frac{|S_{A=j}|}{|S|}$$

to describe the subset of the observations in $S$ that have value $j$ for attribute $A$, and the fraction of the observations in $S$ that have value $j$ for attribute $A$.

Decision Tree Algorithm

Given a multiset of observations $S$, a target attribute $T$ (that is, the label we are trying to predict), and a set, ATTR, of possible attributes to split on, the basic algorithm to build a decision tree, based on Hunt's algorithm, works as follows:

1. Create a tree node, N, with its class label set to the value from the target attribute that occurs most often:

   $$\operatorname*{argmax}_{v \in \mathrm{values}(T)} p(S, T, v)$$

   where $\mathrm{values}(T)$ is the set of possible values for attribute $T$ and argmax yields the value $v$ that maximizes the function. For interior nodes, the class label will be used when a traversal encounters an unexpected value for the split attribute.

2. If all the observations in $S$ are from the same target class, ATTR is the empty set, or the remaining observations share the same values for the attributes in ATTR, return the node N.

3. Find the attribute $A$ from ATTR that yields the largest gain ratio (defined below), set the split attribute for tree node N to be $A$, and set the children of N to be decision trees computed from the subsets $S_{A=j}$ obtained by splitting on $A$, with $T$ as the target class and the remaining attributes (ATTR − {A}) as the set of possible split attributes. The edge from N to the child computed from the subset $S_{A=j}$ should be labeled $j$. Stop the recursion if the largest gain ratio is zero.
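As a small illustration of step 1, here is one way (a sketch, not required code) to compute the majority class label with pandas. Series.mode returns all tied modes in sorted order, which happens to match the string tie-breaking rule dictated in Task 2 below.

```python
import pandas as pd

labels = pd.Series(["yes", "no", "no", "yes", "no", "yes"])

# mode() returns every most-frequent value in ascending sorted order,
# so taking the first entry picks "no" over "yes" on a 3-3 tie.
majority = labels.mode().iloc[0]
print(majority)  # -> "no"
```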
We use the term gain to describe the increase in purity with respect to attribute $T$ that can be obtained by splitting the observations in $S$ according to the values of attribute $A$. (In less formal terms, we want to identify the attribute that will do the best job of splitting the data into groups whose members share the same value for the target attribute.) There are multiple ways to define impurity; we'll use the Gini coefficient in this assignment:

$$\mathrm{gini}(S, A) = 1 - \sum_{j \in \mathrm{values}(A)} p(S, A, j)^2$$

Given that definition, we can define gain formally as:

$$\mathrm{gain}(S, A, T) = \mathrm{gini}(S, T) - \sum_{j \in \mathrm{values}(A)} p(S, A, j) \cdot \mathrm{gini}(S_{A=j}, T)$$

We might see a large gain merely because splitting on an attribute produces many small subsets. To protect against this problem, we will compute a ratio of the gain from splitting on an attribute to the split information for that attribute:

$$\mathrm{gain\_ratio}(S, A, T) = \frac{\mathrm{gain}(S, A, T)}{\mathrm{split\_info}(S, A)}$$

where split information is defined as:

$$\mathrm{split\_info}(S, A) = -\sum_{j \in \mathrm{values}(A)} p(S, A, j) \log p(S, A, j)$$
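To make the formulas concrete, here is a small sketch of how gini, gain, and gain_ratio might be computed with pandas. The function names and the use of the natural log are our choices for illustration, not requirements of the assignment.

```python
import numpy as np
import pandas as pd

def gini(values: pd.Series) -> float:
    # gini(S, A) = 1 - sum_j p(S, A, j)^2
    p = values.value_counts(normalize=True)
    return 1 - (p ** 2).sum()

def gain_ratio(df: pd.DataFrame, attr: str, target: str) -> float:
    # p(S, A, j) for every observed value j of the candidate attribute.
    p = df[attr].value_counts(normalize=True)
    p = p[p > 0]  # guard against zero-count categorical levels

    # gain(S, A, T) = gini(S, T) - sum_j p(S, A, j) * gini(S_{A=j}, T)
    remainder = sum(p[j] * gini(df.loc[df[attr] == j, target]) for j in p.index)
    gain = gini(df[target]) - remainder

    # split_info(S, A) = -sum_j p(S, A, j) * log p(S, A, j)
    split_info = -(p * np.log(p)).sum()
    return gain / split_info if split_info > 0 else 0.0
```

Note that when an attribute takes a single value, split_info is zero; returning a gain ratio of zero in that case meshes with the rule that the recursion stops when the largest gain ratio is zero.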
Task 2: Building and Using Decision Trees

We have seeded your pa4 directory with a file named decision_tree.py. This file includes a main block that processes the expected command-line arguments (filenames for the training and testing data) and then calls a function named go. Your task is to implement go and any necessary auxiliary functions. Your go function should build a decision tree from the training data and then return a list (or pandas Series) of the classifications obtained by using the decision tree to classify each observation in the testing data.

Your program must be able to handle any data set that:

1. has a header row,
2. has categorical attributes, and
3. has the (binary) target attribute in the last column.

You should use all the columns except the last one as attributes when building the decision tree.

You could break ties in steps 1 and 3 of the algorithm arbitrarily, but to simplify the process of testing we will dictate a specific method. In step 1, if both classes occur the same number of times, choose the value that occurs earlier in the natural ordering for strings. For example, if "yes" occurs six times and "no" occurs six times, choose "no", because "no" < "yes". In the unlikely event that the gain ratio is the same for two attributes a1 and a2, where a1 < a2, choose a1.

You must define a Python class to represent the nodes of the decision tree; one possible shape for such a class is sketched below. We strongly encourage you to use pandas for this task as well. It is well suited to the task of computing the different metrics (gini, gain, etc.).
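The following node class and traversal helper are purely illustrative; the attribute names and the classify function are our own choices, not dictated by the assignment.

```python
class Node:
    """One node of the decision tree."""

    def __init__(self, label):
        self.label = label        # majority class label at this node (step 1)
        self.split_attr = None    # attribute this node splits on; None for a leaf
        self.children = {}        # maps an attribute value (edge label) to a child Node

def classify(node, row):
    # Walk from the root, following the edge labeled with the row's value
    # for the split attribute. Fall back to the current node's class label
    # at a leaf or on an unexpected attribute value (see step 1 above).
    while node.split_attr is not None:
        child = node.children.get(row[node.split_attr])
        if child is None:
            break
        node = child
    return node.label
```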
Testing Task 2

We have provided test code for Task 2 in test_decision_tree.py.

Grading

- Completeness: 65%
- Correctness: 10%
- Design: 15%
- Style: 10%

Obtaining your test score

The completeness part of your score will be determined using automated tests. To get your score for the automated tests, simply run the following from the Linux command-line. (Remember to leave out the $ prompt when you type the commands.)

$ py.test
$ ../common/grader.py

Notice that we're running py.test without the -k or -x options: we want it to run all the tests. If you're still failing some tests and don't want to see the output from all the failed tests, you can add the --tb=no option when running py.test:

$ py.test --tb=no
$ ../common/grader.py

Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style.