Decision Trees
Due: Friday, Feb 28th at 6pm

The purpose of this assignment is to give you experience with data cleaning and decision trees. You can work in pairs on this assignment.

Getting started

Once you have followed these instructions, your repository will contain a directory named pa4. That directory will include:

- transform.py: skeleton code for Task 1.
- test_transform.py: test code for Task 1.
- decision_tree.py: skeleton code for Task 2.
- test_decision_tree.py: test code for Task 2.
- data: a directory with sample data sets.

To pick up the data for this assignment, change to the data directory and run the following on the Linux command-line:

$ ./get_files.sh

Pima Indian Data

The Pima Indian Data Set (http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes), which is from the UC Irvine Machine Learning Repository, contains anonymized information on women from the Pima Indian Tribe (https://en.wikipedia.org/wiki/Pima_people). This information was collected by the National Institute of Diabetes and Digestive and Kidney Diseases to study diabetes, which is prevalent among members of this tribe. The data set has 768 entries, each of which contains the following attributes:

1. Number of pregnancies
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Has diabetes (1 means yes, 0 means no)

Task 1: Data Cleaning

Your first task is to clean and then transform the raw Pima data into a training set and a testing set. We have seeded your pa4 directory with a file named transform.py. Your task is to complete the function clean, which takes four arguments:

1. the name of a raw Pima Indians Diabetes data file,
2. a filename for the training data,
3. a filename for the testing data, and
4. a seed for use with train_test_split.

Your function should clean and transform the raw data as described below, split the resulting data into training and testing sets, and save the split data in CSV files.

Each row in the raw file contains an observation. The raw attribute values are floating point numbers. For every attribute but the first and the last, a zero should be interpreted as missing data. The "Triceps skin fold thickness (mm)" and "2-Hour serum insulin (mu U/ml)" columns have a lot of missing data, so you should eliminate them when you process the data. Also, you should remove any observation with a value of zero for plasma glucose concentration, diastolic blood pressure, or body mass index.

Once the data is cleaned, you will need to convert the numeric data into categorical data. We have included a dictionary named BOUNDS in transform.py that specifies, for each category, a list of category boundaries and a list of category labels. For example, the categories for plasma glucose level are represented with the following pair of lists: [0.1, 95, 141, float("inf")] and ["low", "medium", "high"]. Together, these lists specify that a plasma glucose level between 0.1 (inclusive) and 95 (exclusive) should be labeled as "low", a level between 95 (inclusive) and 141 (exclusive) should be labeled as "medium", and a level of 141 or higher should be labeled as "high". Note: the Python expression float('inf') evaluates to positive infinity; for all floating point values x, x < float('inf').

Finally, once the data is cleaned and transformed, you should randomly split the observations into two sets, training and testing, using the sklearn.model_selection function train_test_split, with the specified seed for the random_state parameter. The training set should contain roughly 90% of the transformed data, with the remainder going into the testing set.

The raw data includes a header row, which should be suitably modified and included in both output files. Do not include the row index in the output files. Pandas is ideally suited for this task; one possible shape for the function is sketched below.
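The following is a minimal sketch of how clean might be structured with pandas. The column names and the BOUNDS layout shown here are hypothetical placeholders; the real header strings and the actual BOUNDS dictionary live in transform.py and your data file, and may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical names -- check transform.py and the raw file's header
# for the real column names and the real BOUNDS structure.
DROP_COLS = ["Triceps skin fold thickness (mm)", "2-Hour serum insulin (mu U/ml)"]
ZERO_IS_MISSING = ["Plasma glucose", "Diastolic blood pressure", "Body mass index"]
BOUNDS = {
    "Plasma glucose": ([0.1, 95, 141, float("inf")], ["low", "medium", "high"]),
    # ... one (boundaries, labels) entry per remaining numeric column ...
}

def clean(raw_filename, train_filename, test_filename, seed):
    df = pd.read_csv(raw_filename)

    # Drop the two columns dominated by missing data.
    df = df.drop(columns=DROP_COLS)

    # Remove observations with a zero (i.e., missing) value in key columns.
    for col in ZERO_IS_MISSING:
        df = df[df[col] != 0]

    # Convert numeric columns to categories. right=False makes each bin
    # [b_i, b_{i+1}), matching the inclusive/exclusive rule above.
    for col, (bounds, labels) in BOUNDS.items():
        df[col] = pd.cut(df[col], bins=bounds, labels=labels, right=False)

    # Roughly 90% training, 10% testing, reproducible via the seed.
    train, test = train_test_split(df, train_size=0.9, random_state=seed)
    train.to_csv(train_filename, index=False)
    test.to_csv(test_filename, index=False)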
Testing Task 1

We have provided test code for Task 1 in test_transform.py.

Decision Trees

As we discussed in lecture, decision trees are a data structure used to solve classification problems. Here is a sample decision tree that labels tax payers as potential cheaters or non-cheaters. This tree, for example, would classify a single person who did not get a refund and makes $85,000 a year as a possible cheater.

We briefly summarize the algorithm for building decision trees below. See the chapter on classification and decision trees from Introduction to Data Mining by Tan, Steinbach, and Kumar (https://www-users.cs.umn.edu/~kumar001/dmbook/ch3_classification.pdf) for a more detailed description.

Definitions

Before we describe the decision tree algorithm, we need to define a few formulas. Let $S \subseteq A_1 \times A_2 \times \cdots \times A_k$ be a multiset of observations, $r$ a "row" or "observation" in $S$, $A \in \{A_1, \ldots, A_k\}$ an attribute, and $r[A]$ the value of attribute $A$ in row $r$. Let $|S|$ denote the number of observed elements in $S$ (including repetition of the same element). We use the following definitions:

$$S_{A=j} = \{r \in S \mid r[A] = j\}$$

$$p(S, A, j) = \frac{|S_{A=j}|}{|S|}$$

to describe the subset of the observations in $S$ that have value $j$ for attribute $A$, and the fraction of the observations in $S$ that have value $j$ for attribute $A$.

Decision Tree Algorithm

Given a multiset of observations $S$, a target attribute $T$ (that is, the label we are trying to predict), and a set, ATTR, of possible attributes to split on, the basic algorithm to build a decision tree, based on Hunt's algorithm, works as follows:

1. Create a tree node, N, with its class label set to the value from the target attribute that occurs most often:

   $$\operatorname*{argmax}_{v \in \mathrm{values}(T)} p(S, T, v)$$

   where $\mathrm{values}(T)$ is the set of possible values for attribute $T$ and argmax yields the value $v$ that maximizes the function. For interior nodes, the class label will be used when a traversal encounters an unexpected value for the split attribute.

2. If all the observations in $S$ are from the same target class, ATTR is the empty set, or the remaining observations share the same values for the attributes in ATTR, return the node N.

3. Find the attribute $A$ from ATTR that yields the largest gain ratio (defined below), set the split attribute for tree node N to be $A$, and set the children of N to be decision trees computed from the subsets $S_{A=j}$ obtained by splitting on $A$, with $T$ as the target class and the remaining attributes (ATTR − {A}) as the set of possible split attributes. The edge from N to the child computed from the subset $S_{A=j}$ should be labeled $j$. Stop the recursion if the largest gain ratio is zero.
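As a small illustration of step 1, here is one way (a sketch, not required code) to compute the majority class label with pandas. Series.mode returns all tied modes in sorted order, which happens to match the string tie-breaking rule dictated in Task 2 below.

```python
import pandas as pd

labels = pd.Series(["yes", "no", "no", "yes", "no", "yes"])

# mode() returns every most-frequent value in ascending sorted order,
# so taking the first entry picks "no" over "yes" on a 3-3 tie.
majority = labels.mode().iloc[0]
print(majority)  # -> "no"
```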
We use the term gain to describe the increase in purity with respect to attribute $T$ that can be obtained by splitting the observations in $S$ according to the values of attribute $A$. (In less formal terms, we want to identify the attribute that will do the best job of splitting the data into groups whose members share the same value for the target attribute.) There are multiple ways to define impurity; we'll use the Gini coefficient in this assignment:

$$\mathrm{gini}(S, A) = 1 - \sum_{j \in \mathrm{values}(A)} p(S, A, j)^2$$

Given that definition, we can define gain formally as:

$$\mathrm{gain}(S, A, T) = \mathrm{gini}(S, T) - \sum_{j \in \mathrm{values}(A)} p(S, A, j) \cdot \mathrm{gini}(S_{A=j}, T)$$

We might see a large gain merely because splitting on an attribute produces many small subsets. To protect against this problem, we will compute a ratio of the gain from splitting on an attribute to the split information for that attribute:

$$\mathrm{gain\_ratio}(S, A, T) = \frac{\mathrm{gain}(S, A, T)}{\mathrm{split\_info}(S, A)}$$

where split information is defined as:

$$\mathrm{split\_info}(S, A) = -\sum_{j \in \mathrm{values}(A)} p(S, A, j) \log p(S, A, j)$$
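To make the formulas concrete, here is a small sketch of how gini, gain, and gain_ratio might be computed with pandas. The function names and the use of the natural log are our choices for illustration, not requirements of the assignment.

```python
import numpy as np
import pandas as pd

def gini(values: pd.Series) -> float:
    # gini(S, A) = 1 - sum_j p(S, A, j)^2
    p = values.value_counts(normalize=True)
    return 1 - (p ** 2).sum()

def gain_ratio(df: pd.DataFrame, attr: str, target: str) -> float:
    # p(S, A, j) for every observed value j of the candidate attribute.
    p = df[attr].value_counts(normalize=True)
    p = p[p > 0]  # guard against zero-count categorical levels

    # gain(S, A, T) = gini(S, T) - sum_j p(S, A, j) * gini(S_{A=j}, T)
    remainder = sum(p[j] * gini(df.loc[df[attr] == j, target]) for j in p.index)
    gain = gini(df[target]) - remainder

    # split_info(S, A) = -sum_j p(S, A, j) * log p(S, A, j)
    split_info = -(p * np.log(p)).sum()
    return gain / split_info if split_info > 0 else 0.0
```

Note that when an attribute takes a single value, split_info is zero; returning a gain ratio of zero in that case meshes with the rule that the recursion stops when the largest gain ratio is zero.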
Task 2: Building and Using Decision Trees

We have seeded your pa4 directory with a file named decision_tree.py. This file includes a main block that processes the expected command-line arguments (filenames for the training and testing data) and then calls a function named go. Your task is to implement go and any necessary auxiliary functions. Your go function should build a decision tree from the training data and then return a list (or pandas Series) of the classifications obtained by using the decision tree to classify each observation in the testing data.

Your program must be able to handle any data set that:

1. has a header row,
2. has categorical attributes, and
3. has the (binary) target attribute in the last column.

You should use all the columns except the last one as attributes when building the decision tree.

You could break ties in steps 1 and 3 of the algorithm arbitrarily, but to simplify the process of testing we will dictate a specific method. In step 1, if both classes occur the same number of times, choose the value that occurs earlier in the natural ordering for strings. For example, if "yes" occurs six times and "no" occurs six times, choose "no", because "no" < "yes". In the unlikely event that the gain ratio is the same for two attributes a1 and a2, where a1 < a2, choose a1.

You must define a Python class to represent the nodes of the decision tree; one possible shape for such a class is sketched below. We strongly encourage you to use pandas for this task as well. It is well suited to the task of computing the different metrics (gini, gain, etc.).
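The following node class and traversal helper are purely illustrative; the attribute names and the classify function are our own choices, not dictated by the assignment.

```python
class Node:
    """One node of the decision tree."""

    def __init__(self, label):
        self.label = label        # majority class label at this node (step 1)
        self.split_attr = None    # attribute this node splits on; None for a leaf
        self.children = {}        # maps an attribute value (edge label) to a child Node

def classify(node, row):
    # Walk from the root, following the edge labeled with the row's value
    # for the split attribute. Fall back to the current node's class label
    # at a leaf or on an unexpected attribute value (see step 1 above).
    while node.split_attr is not None:
        child = node.children.get(row[node.split_attr])
        if child is None:
            break
        node = child
    return node.label
```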
Testing Task 2

We have provided test code for Task 2 in test_decision_tree.py.

Grading

- Completeness: 65%
- Correctness: 10%
- Design: 15%
- Style: 10%

Obtaining your test score

The completeness part of your score will be determined using automated tests. To get your score for the automated tests, simply run the following from the Linux command-line. (Remember to leave out the $ prompt when you type the commands.)

$ py.test
$ ../common/grader.py

Notice that we're running py.test without the -k or -x options: we want it to run all the tests. If you're still failing some tests and don't want to see the output from all the failed tests, you can add the --tb=no option when running py.test:

$ py.test --tb=no
$ ../common/grader.py

Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style.