Decision Trees (due Friday, Feb 28th at 6pm)

Please notify my last expert to see if he can help! We have worked together a couple of times. Order no. is 50416.


Due: Friday, Feb 28th at 6pm

The purpose of this assignment is to give you experience with data cleaning and decision trees. You can work in pairs on this assignment.

Getting started

Once you have followed these instructions, your repository will contain a directory named pa4. That directory will include:

transform.py: skeleton code for Task 1.
test_transform.py: test code for Task 1.
decision_tree.py: skeleton code for Task 2.
test_decision_tree.py: test code for Task 2.
data: a directory with sample data sets.

To pick up the data for this assignment, change to the data directory and run the following on the Linux command-line:

$ ./get_files.sh

Pima Indian Data

The Pima Indian Data Set, which is from the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes), contains anonymized information on women from the Pima Indian Tribe (https://en.wikipedia.org/wiki/Pima_people). This information was collected by the National Institute of Diabetes and Digestive and Kidney Diseases to study diabetes, which is prevalent among members of this tribe. The data set has 768 entries, each of which contains the following attributes:

1. Number of pregnancies
2. Plasma glucose concentration from a 2-hour oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Has diabetes (1 means yes, 0 means no)

Task 1: Data Cleaning

Your first task is to clean and then transform the raw Pima data into a training set and a testing set. We have seeded your pa4 directory with a file named transform.py. Your task is to complete the function clean, which takes four arguments:

1. the name of a raw Pima Indians Diabetes data file,
2. a filename for the training data,
3. a filename for the testing data, and
4. a seed for use with train_test_split.

Your function should clean and transform the raw data as described below, split the resulting data into training and testing sets, and save the split data in CSV files.

Each row in the raw file contains an observation. The raw attribute values are floating point numbers. For every attribute but the first and the last, a zero should be interpreted as missing data. The "Triceps skin fold thickness (mm)" and "2-Hour serum insulin (mu U/ml)" columns have a lot of missing data, so you should eliminate them when you process the data. Also, you should remove any observation with a value of zero for plasma glucose concentration, diastolic blood pressure, or body mass index.

Once the data is cleaned, you will need to convert the numeric data into categorical data. We have included a dictionary named BOUNDS in transform.py that specifies, for each category, a list of category boundaries and a list of category labels. For example, the categories for plasma glucose level are represented with the following pair of lists: [0.1, 95, 141, float("inf")] and ["low", "medium", "high"]. Together, these lists specify that a plasma glucose level between 0.1 (inclusive) and 95 (exclusive) should be labeled as "low", a level between 95 (inclusive) and 141 (exclusive) should be labeled as "medium", and a level of 141 or higher should be labeled as "high". Note: the Python expression float('inf') evaluates to positive infinity; for all floating point values x, x < float('inf').
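The skeleton's helper functions are not shown here, but the binning convention above (inclusive lower bound, exclusive upper bound) maps directly onto pandas.cut with right=False. Below is a minimal sketch of the conversion, assuming BOUNDS maps a column name to a (boundaries, labels) pair as described; the dictionary key and the helper name to_categorical are illustrative, not taken from the skeleton:

import pandas as pd

# Assumed shape of the BOUNDS dictionary from transform.py; the key
# string is hypothetical.
BOUNDS = {"Plasma glucose level": ([0.1, 95, 141, float("inf")],
                                   ["low", "medium", "high"])}

def to_categorical(df, column):
    '''Replace a numeric column with its category labels (sketch).
    right=False makes every bin inclusive on the left and exclusive
    on the right, so 95.0 lands in "medium" and 141.0 in "high".'''
    boundaries, labels = BOUNDS[column]
    return pd.cut(df[column], bins=boundaries, labels=labels, right=False)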
Finally, once the data is cleaned and transformed, you should randomly split the observations into two sets (training and testing) using the sklearn.model_selection function train_test_split, passing the specified seed as the random_state parameter. The training set should contain roughly 90% of the transformed data, with the remainder going into the testing set.

The raw data includes a header row, which should be suitably modified and included in both output files. Do not include the row index in the output files. Pandas is ideally suited for this task.
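As a concrete illustration of this last step, here is a hedged sketch of how clean might finish, assuming the cleaned and transformed rows are already in a DataFrame df; the helper name split_and_save is illustrative:

from sklearn.model_selection import train_test_split

def split_and_save(df, training_filename, testing_filename, seed):
    '''Split df roughly 90/10 and write both pieces as CSV (sketch).'''
    train, test = train_test_split(df, train_size=0.9, random_state=seed)
    # Keep the (suitably modified) header but drop the row index.
    train.to_csv(training_filename, index=False)
    test.to_csv(testing_filename, index=False)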
all="" the="" observations="" in="" are="" from="" the="" same="" target="" class,="" is="" the="" empty="" set,="" or="" the="" remaining="" ob-="" servations="" share="" the="" same="" values="" for="" the="" attributes="" in="" ,="" return="" the="" node="" n.="" 3.="" find="" the="" attribute="" from="" that="" yields="" the="" largest="" gain="" ratio="" (defined="" below),="" set="" the="" split="" attribute="" for="" tree="" node="" n="" to="" be="" ,and="" set="" the="" children="" of="" n="" to="" be="" decision="" trees="" computed="" from="" the="" subsets="" ob-="" tained="" by="" splitting="" on="" with="" t="" as="" the="" target="" class="" and="" the="" remaining="" attributes="" (attr="" -="" {a})="" as="" the="" set="" of="" possible="" split="" attributes.="" the="" edge="" from="" n="" to="" the="" child="" computed="" from="" the="" subset="" should="" be="" labeled="" .="" stop="" the="" recursion="" if="" the="" largest="" gain="" ratio="" is="" zero.="" we="" use="" the="" term="" gain="" to="" describe="" the="" increase="" in="" purity="" with="" respect="" to="" attribute="" that="" can="" be="" obtained="" by="" splitting="" the="" observations="" in="" according="" to="" the="" value="" of="" attribute="" .="" (in="" less="" formal="" terms,="" we="" want="" to="" iden-="" tify="" the="" attribute="" that="" will="" do="" the="" best="" job="" of="" splitting="" the="" data="" into="" groups="" such="" that="" more="" of="" the="" members="" share="" the="" same="" value="" for="" the="" target="" attribute.)="" there="" are="" multiple="" ways="" to="" define="" impurity,="" we’ll="" use="" the="" gini="" coefficient="" in="" this="" assignment:="" given="" that="" definition,="" we="" can="" define="" gain="" formally="" as:="" we="" might="" see="" a="" large="" gain="" merely="" because="" splitting="" on="" an="" attribute="" produces="" many="" small="" subsets.="" to="" pro-="" tect="" against="" this="" problem,="" we="" will="" compute="" a="" ratio="" of="" the="" gain="" from="" splitting="" on="" an="" attribute="" to="" the="" split="" in-="" formation="" for="" that="" attribute:="" where="" split="" information="" is="" defined="" as:="" s="×" .="" .="" .="" ×a1="" a2="" ak="" r="" s="" a="" ∈="" {="" ,="" .="" .="" .="" ,="" }a1="" ak="" r[a]="" a="" |s|="" s="" sa="j" p(s,="" a,="" j)="=" {r="" ∈="" s="" |="" r[a]="j}" |="" |sa="j" |s|="" s="" a="" s="" a="" s="" t="" attr="" t="" p(s,="" t,="" v)argmax="" v∈values(t)="" values(t)="" t="" v="" s="" attr="" attr="" a="" attr="" a="" s="" a="" sa="j" j="" t="" s="" a="" gini(s,="" a)="1" −="" p(s,="" a,="" jσj∈values(a)="" )="" 2="" gain(s,="" a,="" t)="gini(S," t)="" −="" p(s,="" a,="" j)="" ∗="" gini(="" ,="" t)∑="" j∈values(a)="" sa="j" gain_ratio(s,="" a,="" t)="gain(S," a,="" t)="" split_info(s,="" a)="" 2/23/2020="" 4/5="" task="" 2:="" building="" and="" using="" decision="" trees="" we="" have="" seeded="" your="" pa4="" directory="" with="" a="" file="" named="" decision_tree.py.="" this="" file="" includes="" a="" main="" block="" that="" processes="" the="" expected="" command-line="" arguments–filenames="" for="" the="" training="" and="" testing="" data–and="" then="" calls="" a="" function="" named="" go.="" your="" task="" is="" to="" implement="" go="" and="" any="" necessary="" auxiliary="" functions.="" your="" go="" function="" should="" build="" a="" decision="" tree="" from="" the="" training="" data="" and="" then="" return="" a="" list="" (or="" pandas="" series)="" of="" 
the="" classifications="" obtained="" by="" using="" the="" decision="" tree="" to="" classify="" each="" observation="" in="" the="" testing="" data.="" your="" program="" must="" be="" able="" to="" handle="" any="" data="" set="" that:="" 1.="" has="" a="" header="" row,="" 2.="" has="" categorical="" attributes,="" and="" 3.="" in="" which="" the="" (binary)="" target="" attribute="" appears="" in="" the="" last="" column.="" you="" should="" use="" all="" the="" columns="" except="" the="" last="" one="" as="" attributes="" when="" building="" the="" decision="" tree.="" you="" could="" break="" ties="" in="" steps="" 1="" and="" 3="" of="" the="" algorithm="" arbitrarily,="" but="" to="" simplify="" the="" process="" of="" testing="" we="" will="" dictate="" a="" specific="" method.="" in="" step="" 1,="" choose="" the="" value="" that="" occurs="" earlier="" in="" the="" natural="" ordering="" for="" strings,="" if="" both="" classes="" occur="" the="" same="" number="" of="" times.="" for="" example,="" if="" "yes"="" occurs="" six="" times="" and="" "no"="" occurs="" six="" times,="" choose="" "no",="" because="" "no"="">< "yes".="" in="" the="" unlikely="" event="" that="" the="" gain="" ratio="" for="" two="" at-="" tributes="" a1="" and="" a2,="" where="" a1="">< a2, is the same, chose a1. you must define a python class to represent the nodes of the decision tree. we strongly encourage you to use pandas for this task as well. it is well suited to the task of computing the different metrics (gini, gain, etc). testing task 2 we have provided test code for task 2 in test_decision_tree.py. grading completeness: 65% correctness: 10% design: 15% style: 10% obtaining your test score the completeness part of your score will be determined using automated tests. to get your score for the automated tests, simply run the following from the linux command-line. (remember to leave out the $ prompt when you type the command.) $ py.test $ ../common/grader.py notice that we’re running py.test without the -k or -x options: we want it to run all the tests. if you’re still failing some tests, and don’t want to see the output from all the failed tests, you can add the --tb=no option when running py.test: $ py.test --tb=no $ ../common/grader.py split_info(s, a) = −( p(s, a, j) ∗ log p(s, a, j))∑ j∈values(a) programming assignments will be graded according to a general rubric. 
specifically, we will assign points for completeness a2,="" is="" the="" same,="" chose="" a1.="" you="" must="" define="" a="" python="" class="" to="" represent="" the="" nodes="" of="" the="" decision="" tree.="" we="" strongly="" encourage="" you="" to="" use="" pandas="" for="" this="" task="" as="" well.="" it="" is="" well="" suited="" to="" the="" task="" of="" computing="" the="" different="" metrics="" (gini,="" gain,="" etc).="" testing="" task="" 2="" we="" have="" provided="" test="" code="" for="" task="" 2="" in="" test_decision_tree.py.="" grading="" completeness:="" 65%="" correctness:="" 10%="" design:="" 15%="" style:="" 10%="" obtaining="" your="" test="" score="" the="" completeness="" part="" of="" your="" score="" will="" be="" determined="" using="" automated="" tests.="" to="" get="" your="" score="" for="" the="" automated="" tests,="" simply="" run="" the="" following="" from="" the="" linux="" command-line.="" (remember="" to="" leave="" out="" the="" $="" prompt="" when="" you="" type="" the="" command.)="" $="" py.test="" $="" ../common/grader.py="" notice="" that="" we’re="" running="" py.test="" without="" the="" -k="" or="" -x="" options:="" we="" want="" it="" to="" run="" all="" the="" tests.="" if="" you’re="" still="" failing="" some="" tests,="" and="" don’t="" want="" to="" see="" the="" output="" from="" all="" the="" failed="" tests,="" you="" can="" add="" the="" --tb="no" option="" when="" running="" py.test:="" $="" py.test="" --tb="no" $="" ../common/grader.py="" split_info(s,="" a)="−(" p(s,="" a,="" j)="" ∗="" log="" p(s,="" a,="" j))∑="" j∈values(a)="" programming="" assignments="" will="" be="" graded="" according="" to="" a="" general="" rubric.="" specifically,="" we="" will="" assign="" points="" for="">


Kshitij answered on Feb 26, 2021
pa4/ci.yml
compile_and_lint:
  stage: build
  script:
    - python3 -m py_compile pa4/*.py
    - pylint -E pa4/*.py

run_tests:
  stage: test
  script:
    - cd pa4/ && py.test -v
  after_script:
    - cd pa4/ && ../common/grader.py
pa4/data/get_files.sh
echo "Getting PA4 files..."
wget -nv -O pa4-files.tgz https://www.classes.cs.uchicago.edu/archive/2020/winter/30122-1/pa4-files.tgz
tar xvzf pa4-files.tgz
pa4/data/README.txt
# CAPP30122 W'20: Building decision trees assignment
ex.csv -- Example from chapter on Classification and
Decision Trees from Introduction to Data Mining by Tan, Steinbach, and Kumar
(http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf)
pima-indians-diabetes.csv -- This data set contains anonymized
information on women from the Pima Indian Tribe.
See http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
for more information
README.txt -- this file
pa4/decision_tree.py
'''
CAPP30122 W'20: Building decision trees
Your name
'''
import math
import sys

import numpy as np
import pandas as pd
class DecisionTree:
    '''A node in a decision tree, built with Hunt's algorithm.'''

    def __init__(self, data, attributes):
        self.data = data
        self.attributes = sorted(attributes)
        # The (binary) target attribute is the last column.
        self.__target = self.data.columns[-1]
        # Label the node with the most common target value. Sorting the
        # values first means np.argmax breaks ties in favor of the string
        # that comes earlier in the natural ordering, as required.
        values = sorted(self.data[self.__target].unique())
        counts = [sum(self.data[self.__target] == value) for value in values]
        self.__label = values[np.argmax(counts)]
        self.__split_attr = None
        self.__children = {}
    @staticmethod
    def __rate(data, attribute, value):
        '''p(S, a, j): the fraction of rows in data with the given
        value for attribute.'''
        return sum(data[attribute] == value) / data.shape[0]
    def __attr_gini(self, data, attribute):
        '''gini(S, a) = 1 - sum over j of p(S, a, j)^2.'''
        gini = 1
        for value in data[attribute].unique():
            gini -= self.__rate(data, attribute, value) ** 2
        return gini
    def __attr_gain_ratio(self, attribute):
        '''gain_ratio(S, a, T): gain divided by split information, or 0
        when the split information is 0.'''
        gain = self.__attr_gini(self.data, self.__target)
        split_info = 0
        attr = self.data[attribute]
        for value in attr.unique():
            rate = self.__rate(self.data, attribute, value)
            gain -= rate * self.__attr_gini(self.data[attr == value],
                                            self.__target)
            split_info -= rate * math.log(rate)
        if split_info == 0:
            return 0
        return gain / split_info
    def find_best_split(self):
        '''Split on the attribute with the largest gain ratio. np.argmax
        returns the first maximum, so ties go to the attribute that comes
        earlier in the (sorted) attribute list.'''
        ratios = [self.__attr_gain_ratio(attr) for attr in self.attributes]
        self.__split_attr = self.attributes[np.argmax(ratios)]
    def is_leaf(self):
        '''A node is a leaf when every row has the same target class, no
        split attributes remain, or the remaining attribute columns are
        all constant.'''
        return any([self.data[self.__target].nunique() == 1,
                    not self.attributes,
                    all(self.data[self.attributes].apply(
                        lambda col: col.nunique() == 1))])
    def train(self):
        '''Recursively build the tree, stopping at leaves and when the
        best available gain ratio is zero.'''
        if self.is_leaf():
            return self
        self.find_best_split()
        if self.__attr_gain_ratio(self.__split_attr) == 0:
            return self
        for edge in self.data[self.__split_attr].unique():
            sub_data = self.data[self.data[self.__split_attr] == edge]
            sub_attr = [a for a in self.attributes if a != self.__split_attr]
            self.__children[edge] = DecisionTree(sub_data, sub_attr).train()
        return self
    def classify(self, row):
        '''Walk the tree to classify one observation, falling back to
        this node's label at a leaf or on an unexpected value for the
        split attribute.'''
        if not self.__children or row[self.__split_attr] not in self.__children:
            return self.__label
        return self.__children[row[self.__split_attr]].classify(row)
def go(training_filename, testing_filename):
    '''
    Construct a decision tree using the training data and then apply
    it to the testing data.

    Inputs:
        training_filename (string): the name of the file with the
            training data
        testing_filename (string): the name of the file with the
            testing data

    Returns (list of strings or pandas series of strings): result of
        applying the decision tree to the testing data.
    '''
    train = pd.read_csv(training_filename, dtype=str)
    test = pd.read_csv(testing_filename, dtype=str)
    # Use every column except the last (the target) as a split attribute.
    trained_tree = DecisionTree(train, list(train.columns[:-1])).train()
    output = []
    for _, row in test.iterrows():
        output.append(trained_tree.classify(row))
    return output
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: python3 {} <training filename> <testing filename>".format(sys.argv[0]))
        sys.exit(1)
    for label in go(sys.argv[1], sys.argv[2]):
        print(label)
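Assuming the CSV files produced in Task 1 are available, the script can then be invoked as, for example, python3 decision_tree.py training.csv testing.csv (the filenames here are illustrative); it prints one predicted label per testing observation.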