I have attached a rough guideline for the midterm. Handle the features one column at a time. If you attempt to deal with all of the features at once, you'll notice that the errors you get aren't useful for determining their origin. Following are a few things worth mentioning.
1. Someone noticed that there are duplicate rows in a data table. Consider doing something about this.
2. Someone mentioned that if you handle the train and test sets separately, you get a different number of features. This is entirely possible when a category does not appear in one of the sets. There are two ways to handle this; the following is one idea.
- Create a Master table that combines the train and test sets. There is only one way to do this.
- Before binding, you'll need to create an index variable so that the Master table can eventually be separated back into train and test.
- Do the feature manipulation on the Master table.
- Separate the Master table back into train and test.
This also avoids needing to do any feature manipulation twice. melt and dcast may prove useful.
3. You may decide that a particular feature needs to be removed entirely. This is fine. However, you need to add comments justifying the removal. Comments in general are something I'll be checking in this assignment. In general, I want you to use all features.
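Points 1 and 2 above can be sketched roughly as follows. This is only an illustration on made-up data: the tables, the column names (id, rarity, price) and the helper column is_train are assumptions, not the actual midterm files.

```r
library(data.table)

# Toy stand-ins for the real train and test tables (hypothetical columns).
train <- data.table(id = c(1, 2, 2, 3),
                    rarity = c("common", "rare", "rare", "common"),
                    price = c(0.10, 2.50, 2.50, 0.25))
test  <- data.table(id = c(4, 5),
                    rarity = c("mythic", "common"))

# Point 1: drop exact duplicate rows.
train <- unique(train)

# Point 2: tag each table so the master table can be split apart again.
train[, is_train := 1L]
test[, is_train := 0L]

# Bind the rows; fill = TRUE pads columns missing from one table (price) with NA.
master <- rbind(train, test, fill = TRUE)

# Feature work happens once, on the master table (here: a shared set of levels).
master[, rarity := factor(rarity)]

# Separate back into train and test and drop the helper column.
train <- master[is_train == 1L][, is_train := NULL]
test  <- master[is_train == 0L][, c("is_train", "price") := NULL]
```

Because the factor is built on the master table, train and test end up with the same levels even though "mythic" only appears in test and "rare" only appears in train.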
Midterm Assignment

Your goal for this assignment is to predict the prices of Magic: The Gathering cards. Whereas in previous assignments you had features and needed predictions, here you are not provided features directly. For this assignment I require the following.

- Note: There are many “missing values.” Many of these are not missing but rather not applicable. For instance, many spell/land cards do not have a power and toughness. So sometimes an empty data cell does not convey missing information and should be handled accordingly.
- You don’t have a train and test set directly, so you must generate them with the given start files.
- The “types” feature is an important feature which may take more than one value. The way it is given is not model ready, so it must be engineered such that each row can convey whether a card has a type and which types it has.
- Create at least 3 novel features. While each needs to be novel, it need not be complicated. For instance, the square of the converted mana cost may be of interest. That is, instead of cmc you could use cmc^2. Now that I have used this as an example, it cannot be used as a novel feature (nor can any cmc^z).
- For modeling I wish for you to use both LASSO and ridge regression. Have a script for each and include both in your run.R file with comments on which is which. Note that the difference between these two files is changing 2ish lines of code, so focus on one of the two, then copy and paste.

Example for Colors with pictures

1. temp DT <- split the strings in types (use tstrsplit). This line of code will turn your table from [picture] to this: [picture]. The NA’s are fine for now.
2. Add in the id column to temp DT. The ids are missing, so append them to get [picture].
3. temp2 DT <- melt the previous table (hence why you need step 2). Melt the table and set the id variable as “id”. (The top 5 in my example aren’t multi colored, but if you search you will find them.)
4. Get rid of the appropriate NAs. You’ll need an extra column for the cast, so here is what that looks like: [picture]. The entire column is just 1s.
5. Cast temp2 DT using the id variable. The cast (dcast) needs an equation, so id~value. dcast requires an aggregation technique. Use the default. Set value.var to the new column made in the previous step (in red): value.var = "new column name".
6. Join that casted table into either a master table or train and test. You will have NA’s. Some rows will be entirely NA’s. That means the card is colorless. Fix it by making that row all 0’s.
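The six steps of the colors example can be sketched roughly as follows. This is a toy illustration, not the actual midterm data: the cards table, its colors column, the comma separator and the helper column name present are all assumptions made for the sketch.

```r
library(data.table)

# Hypothetical toy data: "colors" holds zero or more values separated by commas.
cards <- data.table(id = 1:4,
                    colors = c("Red", "Blue,Green", NA, "Red,Blue"))

# Step 1: split the strings into one column per position (short rows get NA).
temp <- as.data.table(tstrsplit(cards$colors, ",", fixed = TRUE))

# Step 2: append the ids so melt has an id variable to work with.
temp[, id := cards$id]

# Step 3: melt to long form, one row per (id, position).
temp2 <- melt(temp, id.vars = "id")

# Step 4: drop the NA rows and add a column of 1s for the cast.
temp2 <- temp2[!is.na(value)]
temp2[, present := 1L]

# Step 5: cast back to wide form, one column per color.
wide <- dcast(temp2, id ~ value, value.var = "present")

# Step 6: join onto the original table, then turn every remaining NA into 0.
# Colorless cards (id 3 here) come back as a row of all NA before the fix.
cards <- merge(cards, wide, by = "id", all.x = TRUE)
color_cols <- setdiff(names(wide), "id")
for (col in color_cols) {
  set(cards, i = which(is.na(cards[[col]])), j = col, value = 0L)
}
```

In this toy table each card lists a color at most once, so dcast has nothing to aggregate. The same pattern should carry over to the “types” feature once the separator and column names are adjusted to the real data.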