The zip folder contains the testing files, a util Java file, and the DataSet Java file, which is the one you are going to code. The DataSet file is supposed to load a CSV file, organize the information into a 2D array, and display the unique values found in it. There are pictures to help explain it.
ITI 1121. Introduction to Computing II
Winter 2021
Assignment 1
(Last modified on January 13, 2021)
Deadline: February 5, 2021, 11:30 pm

Learning objectives
• Edit, compile and run Java programs
• Utilize arrays to store information
• Apply basic object-oriented programming concepts
• Understand the university policies for academic integrity

Introduction
This year, we are going to implement, through a succession of assignments, a simplified version of a useful machine learning technique called decision tree classification. If you are not familiar with decision trees and are curious to know what they are, you may wish to have a quick look at the following Wikipedia page: https://en.wikipedia.org/wiki/Decision_tree_learning. For Assignment 1, however, you are not going to do anything that is specific to decision trees; you can complete Assignment 1 without any knowledge of decision trees! We will get to decision trees only in Assignments 2 and 3.

If you find the above Wikipedia page overwhelming, fear not! As we go along, we will provide you with simple and accessible material to read on decision tree classification. Ultimately, the version of decision tree classification that you implement, despite still being extremely useful, leaves out many of the complexities of more advanced implementations (for example, handling “unknown” values in your training data).

As far as the current assignment (Assignment 1) is concerned, we have modest goals: we would like to read an input file, which will (in future assignments) constitute the training data for our learning algorithm, and perform some basic tasks that are prerequisites to virtually any type of machine learning. Specifically, you will be implementing the following tasks in Assignment 1:
• Task 1. Parsing comma-separated values (CSV) from a given data file and populating appropriate data structures in memory
• Task 2. Extracting certain summary data (metadata) about the characteristics of the input data; this metadata will come in handy for the construction of decision trees in future assignments.

These two tasks are best illustrated with a simple example. Suppose we have a CSV file named weather.csv with the content shown in Figure 1.¹ The data is simply a table. The first (non-empty) row in the file provides the names of the table columns in a comma-separated format. Each column represents an attribute (also called a feature). The remaining (non-empty) rows are the datapoints. In our example, each datapoint is a historical observation about weather conditions (in terms of outlook, temperature in Fahrenheit, humidity and wind), and whether it has been possible to “play” a certain tournament (for example, cricket) outside. What a machine learning algorithm can do here is to “learn from examples” and help decide / predict whether one can play a tournament on a given day according to the weather conditions on that day.

Now, going back to Task 1 and Task 2, below is what each of these tasks would do with the data in Figure 1.

¹ This example is borrowed from “Data Mining: Practical Machine Learning Tools and Techniques” 3rd Ed. (2011) by Ian H. Witten, Eibe Frank and Mark A. Hall.

outlook,temperature,humidity,windy,play
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
Figure 1: Example CSV file (weather.csv)
attributeNames: String[5] (5 attributes)
outlook   temperature  humidity  windy  play

matrix: String[14][5] (14 examples (datapoints))
sunny     85           85        FALSE  no
sunny     80           90        TRUE   no
overcast  83           86        FALSE  yes
rainy     70           96        FALSE  yes
rainy     68           80        FALSE  yes
rainy     65           70        TRUE   no
overcast  64           65        TRUE   yes
sunny     72           95        FALSE  no
sunny     69           70        FALSE  yes
rainy     75           80        FALSE  yes
sunny     75           70        TRUE   yes
overcast  72           90        TRUE   yes
overcast  81           75        FALSE  yes
rainy     71           91        TRUE   no
Figure 2: Results of parsing our example input file

• Task 1 will parse the input data and build the conceptual memory representation shown in Figure 2. More precisely, we get (1) an instance variable, attributeNames (discussed later), instantiated with a String array of length 5 and containing the column names, and (2) an instance variable, matrix (also discussed later), instantiated with a two-dimensional String array (of size 14×5) and populated with the datapoints in the file.
• Task 2 will identify the unique values observed in each column. If all the values in a column happen to be numeric, then the column is found to be of numeric type. Otherwise, the column will be of nominal type, meaning that the values in the column are to be treated as labels without any quantitative value associated with them.
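As a rough illustration of Task 2, the sketch below collects the unique values of one column in order of first appearance and classifies the column, using fixed-size arrays only. The class and method names are hypothetical (they are not part of the provided files). Treating a value as numeric whenever Double.parseDouble accepts it is an assumption, not the assignment's official definition (that method also accepts forms such as 1e3 or NaN), and missing values are not handled here.

```java
import java.util.Arrays;

// Hypothetical helper class, used only for illustration.
class ColumnSummarySketch {

    // Returns the distinct values of one column, in order of first appearance.
    static String[] uniqueValues(String[][] matrix, int column) {
        // A column can have at most one unique value per row.
        String[] buffer = new String[matrix.length];
        int count = 0;
        for (String[] row : matrix) {
            String value = row[column];
            boolean seen = false;
            for (int i = 0; i < count; i++) {
                if (buffer[i].equals(value)) {
                    seen = true;
                    break;
                }
            }
            if (!seen) {
                buffer[count++] = value;
            }
        }
        // Shrink to the number of unique values actually found.
        return Arrays.copyOf(buffer, count);
    }

    // Assumption: a column is numeric when every value parses as a double.
    static boolean isNumeric(String[] values) {
        for (String value : values) {
            try {
                Double.parseDouble(value);
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[][] matrix = {
            {"sunny", "85"}, {"sunny", "80"}, {"overcast", "83"}, {"rainy", "80"}
        };
        for (int col = 0; col < 2; col++) {
            String[] unique = uniqueValues(matrix, col);
            String type = isNumeric(unique) ? "numeric" : "nominal";
            System.out.println(type + ": " + Arrays.toString(unique));
        }
    }
}
```

The temporary buffer is sized by the number of rows, so no growable structure is needed; the result is copied into an array of exactly the right length at the end.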
For our example file, the column types and the set of unique values for each column would be as shown in Figure 3. Note that, for this assignment, you do not need to sort the numeric value sets in either ascending or descending order. This becomes necessary only in future assignments.

1) outlook (nominal): ['sunny', 'overcast', 'rainy']
2) temperature (numeric): [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 81, 71]
3) humidity (numeric): [85, 90, 86, 96, 80, 70, 65, 95, 75, 91]
4) windy (nominal): ['FALSE', 'TRUE']
5) play (nominal): ['no', 'yes']
Figure 3: Derived column types and unique value sets for our example input file

outlook,temperature,humidity,windy,play
sunny, 85, 85,FALSE,no
sunny,80,90,TRUE, no
overcast,83,86,FALSE, yes
rainy,70,96, FALSE,yes
rainy,68,80,FALSE, yes
rainy,65,70,TRUE,no
overcast,64,65, TRUE,yes
sunny,72,95,FALSE,no
sunny, 69,70,FALSE,yes
rainy,75, 80,FALSE,yes
sunny,75,70, TRUE,yes
overcast,72,90, TRUE,yes
overcast,81,75,FALSE,yes
rainy, 71,91,TRUE,no
Figure 4: Example with surrounding spaces and empty lines that need to be ignored (empty lines not shown)

Important Considerations (Read Carefully!)
While the assignment is conceptually simple, there are some important considerations that you need to pay careful attention to in your implementation.

Determining the size of the arrays to instantiate: You will be storing the attribute names and datapoints using two instance variables that are respectively declared as follows:
    private String[] attributeNames;
    private String[][] matrix;
One problem that you have to deal with is how to instantiate these variables. To do so, you need to know the number of attributes (columns) and the number of datapoints. You can know the former number only after counting the attribute names on the first (non-empty) line of the file. As for the latter (number of datapoints), you can only know this once you have traversed the entire file.
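One way to obtain these two numbers is a counting pass over the input before any array is allocated. The sketch below is only an illustration under stated assumptions: the class and method names are hypothetical (not part of the provided files), the input is read through a Scanner so that the method can be tried on a plain string, and commas inside single quotes are not counted as separators (a rule this assignment describes further on; the ASCII apostrophe is assumed as the quote character).

```java
import java.util.Scanner;

// Hypothetical helper class, used only for illustration.
class CsvCountSketch {

    // Returns {number of datapoints, number of attributes}.
    static int[] countRowsAndColumns(Scanner in) {
        int rows = 0;
        int columns = 0;
        boolean headerSeen = false;
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.isEmpty()) {
                continue; // empty lines are treated as non-existent
            }
            if (!headerSeen) {
                columns = countFields(line); // first non-empty line: attribute names
                headerSeen = true;
            } else {
                rows++; // every other non-empty line is a datapoint
            }
        }
        return new int[] { rows, columns };
    }

    // Counts comma-separated fields, skipping commas inside single quotes.
    static int countFields(String line) {
        int fields = 1;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\'') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields++;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String csv = "outlook,temperature,play\n\nsunny,85,no\n   \nrainy,70,yes\n";
        int[] counts = countRowsAndColumns(new Scanner(csv));
        System.out.println(counts[0] + " datapoints, " + counts[1] + " attributes");
    }
}
```

With the two counts in hand, the arrays could be created as new String[columns] and new String[rows][columns]; with a real file, the Scanner would be built from a File and reopened before the arrays are filled.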
Later on in the course, we will see “expandable” data structures such as linked lists, which do not have a fixed size and allow elements to be added to them as you go along. For this assignment, you are expressly forbidden from using lists or similar data structures with non-fixed sizes. Instead, you are expected to work with fixed-size arrays.

For this assignment, the easiest way to instantiate the arrays is through a two-pass strategy. This means that you will go over the input file twice. In the first pass, you merely count the number of columns and datapoints. With these numbers known, you can instantiate attributeNames and matrix. Then, in a second pass, you can populate (the now-instantiated) attributeNames and matrix.

Note that, as illustrated in Figure 2, you are expected to instantiate matrix as a row × column array, as opposed to a column × row array. While the latter strategy is correct too, you are asked to use the former (that is, row × column) as a convention throughout this assignment.

Removing blank spaces and empty lines: The blank spaces surrounding attribute names and values should be discarded. For example, consider the input file in Figure 4.
This file is the same as the one in Figure 1, only with some surrounding spaces added. These surrounding spaces need to be trimmed and ignored. The same goes for empty lines. Empty lines can be treated as non-existent.

Supporting commas in attribute names and values: Since commas are used as separators (delimiters), it is important to provide a way to support commas within attribute names and attribute values. To do so, you will need to implement an escape sequence mechanism. You will do so using single quotes ('). More precisely, commas are to be treated as regular characters if a text segment is enclosed in single quotes. To illustrate, consider the example input in Figure 5. The metadata information derived from this input is shown in Figure 6. Although Figure 6 does not show it explicitly, the values to store in attributeNames and matrix are also affected when escape sequences with single quotes are present in the input file.

outlook,'temperature, in fahrenheit',humidity,windy,play
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
'rainy, mild',70,96,FALSE,yes
'rainy, mild',68,80,FALSE,yes
'rainy, heavy',65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
'rainy, mild',75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
'rainy, heavy',71,91,TRUE,no
Figure 5: Example usage of commas in attribute names and values

1) outlook (nominal): ['sunny', 'overcast', 'rainy, mild', 'rainy, heavy']
2) temperature, in fahrenheit (numeric): [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 81, 71]
3) humidity (numeric): [85, 90, 86, 96, 80, 70, 65, 95, 75, 91]
4) windy (nominal): ['FALSE', 'TRUE']
5) play (nominal): ['no', 'yes']
Figure 6: Metadata derived from the input shown in Figure 5

Missing attribute values: There may be situations where not all attribute values are known (for example, due to incomplete data collection). In such cases, the attribute values in question may be left empty.
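The trimming, quoting, and empty-value rules can be combined in one character-by-character splitter. The following is a minimal sketch, not the required implementation: the class and method names are hypothetical, the quote character is assumed to be the ASCII apostrophe, and two choices are assumptions of this sketch only, namely dropping the surrounding quotes from stored values and keeping an empty field as the empty string.

```java
// Hypothetical helper class, used only for illustration.
class CsvSplitSketch {

    // Splits one line on the commas that are outside single quotes.
    static String[] splitLine(String line) {
        // Count the fields first so that a fixed-size array can be allocated.
        int fields = 1;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\'') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields++;
            }
        }

        // Cut the line at each unquoted comma.
        String[] result = new String[fields];
        int index = 0;
        int start = 0;
        inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\'') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                result[index++] = clean(line.substring(start, i));
                start = i + 1;
            }
        }
        result[index] = clean(line.substring(start));
        return result;
    }

    // Trims surrounding spaces, then drops one pair of enclosing single quotes.
    static String clean(String raw) {
        String s = raw.trim();
        if (s.length() >= 2 && s.charAt(0) == '\'' && s.charAt(s.length() - 1) == '\'') {
            s = s.substring(1, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        String[] parts = splitLine("'rainy, mild', 70 ,,FALSE,yes");
        for (String part : parts) {
            System.out.println("[" + part + "]");
        }
    }
}
```

In this sketch a field with nothing between two commas comes back as an empty string, which a caller could replace with a marker value if a different representation is preferred.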
Your implementation needs to be able to handle empty (missing) attribute values. You can choose to represent missing values with a special value, for example 'MISSING'. Alternatively, you can choose to represent missing values with an empty string (''). To illustrate, consider the input file in Figure 7, where some values are missing. The metadata derived from this input file is shown in Figure 8. Here, we have chosen to represent missing values with the empty string. For this assignment, you